【Microsoft Paper】Agent AI, Holistic Intelligence, Large Foundation Model (LFM)

Today I read a paper from Microsoft Research published in February this year: "Position Paper: Agent AI Towards a Holistic Intelligence". My wife 🌱 sent it to me a while ago. The paper proposes a theory of an Agent AI system that can be applied across multiple domains and serves as a foundation model for interoperability and embodied operation. By leveraging multimodal data gathered through interactions in diverse environments, Agent AI operates in both physical and virtual worlds. The authors present it as a promising way to unify infrastructure and to enable a wide range of applications and capabilities within one system, and it is increasingly seen as a hopeful path towards Holistic Intelligence (HI).

To me it feels like an attempt to build one all-encompassing model and to push it forward by relying on scaling laws.

Agent AI Paradigm

The paper mentions an Agent AI paradigm used to support embodied multimodal generalist agent systems. This paradigm includes five main modules:

The agent in its environment and its perception, including task planning and observation
Agent learning
Memory
Action
Cognition and consciousness

The tight integration of these components contributes to the development of holistic intelligence. A key difference from previous interactive strategies is that, after training, the agent's actions directly influence task planning, so the agent can plan its next actions without waiting for feedback from the environment.
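The interplay of these modules can be sketched as a toy loop. This is only an illustration under my own assumptions, not the paper's implementation: the names ToyAgent, Memory, and plan_table are hypothetical, and "learning" is replaced by a hand-written lookup table standing in for a trained policy. The point it shows is that the next action is planned from the instruction and the agent's own previous action, without fresh environment feedback.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Stores what the agent has observed and done so far."""
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)


class ToyAgent:
    """Hypothetical agent wiring together perception, memory, planning, and action."""

    def __init__(self, plan_table):
        # Assumption: plan_table stands in for a trained policy that maps
        # (instruction, previous action) -> next action.
        self.plan_table = plan_table
        self.memory = Memory()

    def perceive(self, raw_observation):
        # Perception: normalize the raw observation and remember it.
        observation = raw_observation.strip().lower()
        self.memory.observations.append(observation)
        return observation

    def plan_next(self, instruction):
        # Planning uses the agent's own last action, not new environment feedback.
        last = self.memory.actions[-1] if self.memory.actions else None
        return self.plan_table.get((instruction, last), "stop")

    def act(self, instruction, max_steps=10):
        # Action: roll out the plan step by step, recording each action in memory.
        for _ in range(max_steps):
            action = self.plan_next(instruction)
            self.memory.actions.append(action)
            if action == "stop":
                break
        return list(self.memory.actions)


plan_table = {
    ("make tea", None): "boil water",
    ("make tea", "boil water"): "add tea leaves",
    ("make tea", "add tea leaves"): "pour water",
}
agent = ToyAgent(plan_table)
agent.perceive("  Kitchen with kettle ")
trajectory = agent.act("make tea")
print(trajectory)
```

A real system would replace the lookup table with a learned model and would still re-observe the environment when the plan fails; the sketch only isolates the "plan from your own actions" idea.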

Agent AI Consciousness

Agent AI may go beyond the simple synergy of its components and even involve a form of "consciousness." In a recent, ambitious attempt to look for consciousness in AI on the basis of neuroscience, researchers discussed agency and embodiment as indicators of consciousness.

The paper's Agent AI predicts optimal actions based on language (i.e., text instructions), sensory inputs, and action history, and it achieves agency by generating goal-directed actions. It also learns from the relationship between its actions and their outcomes in the environment, which realizes the principle of embodiment. The authors therefore suggest that various aspects of Agent AI consciousness could potentially be quantified, which points to its relevance for disciplines such as neuroscience, biology, physics, biophysics, cognitive science, healthcare, and moral philosophy.

Agent AI Model

Agent AI Transformer

The paper gives an overview of a foundation model framework for interactive agents. Its transformer is designed to process multimodal information at different levels of abstraction, which helps the model understand context comprehensively and makes its actions more coherent. Learning across a variety of task domains and applications further improves the model's adaptability and effectiveness.
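One common way to feed several modalities into a single transformer is to flatten them into one sequence of tagged tokens. The sketch below illustrates only that general idea; the token layout, the function names, and the patch identifiers are my assumptions, and the post does not describe the paper's actual tokenization.

```python
def build_token_sequence(instruction, image_patches, past_actions):
    """Flatten three modalities into one tagged sequence (hypothetical layout)."""
    sequence = [("text", word) for word in instruction.split()]
    sequence += [("vision", patch_id) for patch_id in image_patches]
    sequence += [("action", action) for action in past_actions]
    return sequence


def modality_counts(sequence):
    """Count how many tokens each modality contributes."""
    counts = {}
    for modality, _ in sequence:
        counts[modality] = counts.get(modality, 0) + 1
    return counts


sequence = build_token_sequence(
    "pick up the cup",          # language instruction
    ["patch_0", "patch_1"],     # placeholder visual tokens
    ["move_arm"],               # action history
)
print(modality_counts(sequence))
```

In a real model each tagged token would be mapped to an embedding vector before entering the transformer; here the tags simply make the mixed sequence visible.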

Agent AI Learning Strategies

Reinforcement Learning (RL)
Imitation Learning (IL)
Traditional RGB (learning agent behavior directly from RGB, i.e. red, green, blue, image inputs)
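For readers who want to see the difference between the first two strategies, here is a minimal sketch using textbook versions of each: behavior cloning for imitation learning and a single tabular Q-learning update for reinforcement learning. These are standard techniques chosen for illustration, not the paper's training recipe, and the states, actions, and hyperparameters (alpha, gamma) are made up.

```python
from collections import Counter, defaultdict


def behavior_cloning(demos):
    """Imitation learning in its simplest form: copy the expert's most frequent action per state."""
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward reward + discounted best next value."""
    best_next = max(q[next_state].values(), default=0.0)
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]


# Imitation learning: learn from expert demonstrations (state, action).
demos = [
    ("door closed", "open door"),
    ("door closed", "open door"),
    ("door closed", "wait"),
    ("door open", "walk through"),
]
policy = behavior_cloning(demos)

# Reinforcement learning: learn from a reward signal after acting.
q = defaultdict(lambda: defaultdict(float))
value = q_update(q, "door closed", "open door", reward=1.0, next_state="door open")
print(policy, value)
```

Imitation learning needs demonstrations but no reward, while reinforcement learning needs a reward but no demonstrations; that trade-off is the reason both show up in the list.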

Agent AI Application Tasks

Robotics

Robots are representative agents that need to interact effectively with their environment. This section of the paper introduces the key elements of efficient robot operation, reviews research that applies the latest large foundation models, and shares insights from recent studies.

Gaming

Gaming provides a unique sandbox for testing the agent behavior of large foundation models, pushing the limits of their collaborative and decision-making abilities. The paper describes three areas in particular that highlight the agents' ability to interact with human players and other agents, as well as their ability to take meaningful actions in the environment.

Interactive Healthcare

In healthcare, Agent AI can help both patients and doctors by using large foundation models to understand user intent, retrieve clinical knowledge, and follow ongoing interactions between people, among other uses.

Interactive Multimodal Tasks

The integration of vision and language understanding is fundamental to Agent AI. Therefore, the development of Agent AI is closely related to the performance of multimodal tasks, including image captioning, visual question answering, video language generation, and video understanding.