Microsoft's Magma: The First Foundation Model for Multimodal AI Agents

As artificial intelligence shifts from single-modal systems toward multi-modal fusion, Magma arrives as the first foundation model designed specifically for multi-modal agent tasks. It not only perceives the real world through multi-modal input but also executes goal-oriented actions. By transferring knowledge from large amounts of visual and linguistic data, Magma integrates language, spatial, and temporal intelligence, enabling it to handle complex tasks in both digital and physical environments.

Magma Pre-training Process

Magma's pre-training pipeline is clear and efficient: all text in the training data is encoded into discrete tokens, while image and video data from different domains pass through a shared visual encoder. The resulting discrete and continuous tokens are then fed jointly into a large language model (LLM), which produces three types of output: language, spatial, and action.
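
To make this data flow concrete, here is a minimal PyTorch sketch of the pipeline as described; the module names (vision_encoder, text_embedding, llm) are illustrative stand-ins, not Magma's actual components.

```python
import torch
import torch.nn as nn

class MagmaStylePipeline(nn.Module):
    """Illustrative sketch of the described pre-training flow, not Magma's
    real code: text becomes discrete tokens, images/videos from any domain
    pass through one shared vision encoder, and both token streams are fed
    jointly into a decoder-only LLM."""

    def __init__(self, llm: nn.Module, vision_encoder: nn.Module,
                 text_embedding: nn.Embedding):
        super().__init__()
        self.llm = llm                        # emits language/spatial/action outputs
        self.vision_encoder = vision_encoder  # shared across UI, robot, and video data
        self.text_embedding = text_embedding  # maps discrete token ids to embeddings

    def forward(self, text_ids: torch.LongTensor, images: torch.Tensor):
        text_tokens = self.text_embedding(text_ids)   # discrete -> embeddings
        visual_tokens = self.vision_encoder(images)   # continuous visual tokens
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        # One set of LLM outputs covers all three target types: language
        # tokens, spatial marks (SoM), and action traces (ToM).
        return self.llm(inputs_embeds=sequence)
```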

This pre-training approach lets Magma achieve seamless integration of vision, language, and action, paving the way for more general and intelligent AI agents.

Technical Details in Depth

🖼️ Set-of-Mark (SoM) and Action Localization

The Set-of-Mark (SoM) prompting method allows the model to efficiently perform action localization in image space, applicable to various scenarios: user interface (UI) screenshots (e.g., recognizing clickable buttons), robotic operations (identifying the position of a robotic arm), and human action videos (recognizing key points on the human body). The model achieves unified action localization by predicting numerical marks in the image space.
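
As a concrete illustration, the sketch below overlays SoM-style numeric marks on candidate regions (e.g., clickable elements from an external proposer); the function and the sample prompt are assumptions for illustration, not Magma's actual tooling.

```python
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image,
                  boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw numbered Set-of-Mark labels over candidate regions so the model
    can localize an action by naming a mark instead of regressing pixels."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0 + 3, y0 + 3), str(idx), fill="red")
    return marked

# Hypothetical usage: the model answers with a mark index, not coordinates.
# marked = overlay_marks(screenshot, candidate_boxes)
# prompt = "Which mark should be clicked to open Settings? Answer with its number."
```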

🎬 Trace-of-Mark (ToM) and Action Planning

Trace-of-Mark (ToM) provides video supervision for robot manipulation and human actions, enabling the model to understand dynamic changes in a video and anticipate future states before acting. Compared with frame-by-frame prediction, this method uses far fewer marks to capture longer time spans and action-related dynamics, effectively filtering out environmental noise.
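
One way to build ToM-style supervision targets from pre-computed point tracks is sketched below; the function name, stride, horizon, and noise threshold are all assumptions for illustration, not Magma's published recipe.

```python
def trace_of_mark_targets(tracks: dict[int, list[tuple[int, int]]],
                          horizon: int = 8, stride: int = 4) -> dict[int, str]:
    """Serialize each mark's future trajectory into a compact text target.
    tracks[k] holds mark k's (x, y) position in every future frame; a stride
    keeps the token count low while still covering a long time span, and
    near-static marks (background) are dropped as environmental noise."""
    targets = {}
    for mark_id, points in tracks.items():
        future = points[:horizon * stride:stride]      # sparse long-range trace
        x0, y0 = future[0]
        if max(abs(x - x0) + abs(y - y0) for x, y in future) < 2:
            continue                                   # barely moves -> skip
        targets[mark_id] = " ".join(f"({x},{y})" for x, y in future)
    return targets
```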

📚 Sources of Pre-training Data

Magma's pre-training data is rich and diverse, covering instructional videos, robotic manipulation, UI navigation, and multi-modal understanding tasks. SoM is applied across all of these data types to unify action localization across modalities, while ToM targets the video and robotics data.

🎮 Magma Game Intelligence

Magma demonstrates strong multi-modal gaming ability, for example automatically moving up, down, left, and right to collect green blocks in a 2D game, showcasing its real-time perception and action-planning skills.

⚔️ Magma Agent vs. Other Mainstream Models

In zero-shot evaluation of agent tasks, Magma can competently handle all tasks without domain-specific fine-tuning, outperforming mainstream models such as LLaVA-OneVision, Qwen2-VL, and GPT-4o mini, demonstrating comprehensive agent versatility.

📱 User Interface Navigation

Magma can precisely complete complex UI navigation tasks, including checking the weather in Seattle, turning on airplane mode, and sending messages to specific contacts. Through efficient fine-tuning on the Mind2Web and AITW datasets, Magma has achieved outstanding performance in web and mobile UI navigation.
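
For those who want to reproduce this locally, here is a hedged inference sketch against the published Hugging Face checkpoint (assumed to be microsoft/Magma-8B); the exact prompt format and processor interface may differ, so consult the model card before relying on it.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed checkpoint id; trust_remote_code pulls in Magma's custom classes.
model_id = "microsoft/Magma-8B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

screenshot = Image.open("phone_home_screen.png")  # hypothetical UI screenshot
prompt = "Turn on airplane mode. Which mark should be tapped next?"

inputs = processor(images=screenshot, text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(outputs[0], skip_special_tokens=True))
```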

🦾 Robotic Operations

In evaluations on the real WidowX robot, Magma's few-shot fine-tuned performance far exceeds OpenVLA's, with strong generalization in manipulating soft objects and in pick-and-place tasks, both in-domain and out-of-domain.

Experiments on the LIBERO benchmark likewise show a significantly higher success rate for Magma, and removing the SoM and ToM marks from pre-training noticeably degrades performance.

In zero-shot tests on the Google Robot and Bridge setups in the SimplerEnv simulator, Magma exhibits strong cross-domain and cross-embodiment generalization.

🧠 Spatial Reasoning

Spatial reasoning tasks often challenge even top-tier models like GPT-4o, yet Magma still performs excellently despite a smaller scale of pre-training data, demonstrating notable spatial intelligence.

📹 Multi-modal Understanding

On zero-shot video question answering (QA) benchmarks, Magma matches or even surpasses advanced models such as VideoLLaMA2 and ShareGPT4Video while using less video instruction-tuning data, further evidence of its efficiency and strong generalization.

Example Demonstrations

The following are typical applications of the Magma model in real multi-modal video dialogue scenarios, demonstrating its multi-modal understanding, prediction, and interaction capabilities.

♟️ Game Advice Scenario

  • User: I really want to win this game. Can you suggest my next move to maximize my chances of winning?
  • Magma: It is recommended to move your pawn to the d4 position, which will maximize your chances of winning the game.

🛋️ Leisure Advice Scenario

  • User: I want to relax here for a few hours. Can you suggest some things I can do?
  • Magma: You can watch TV, read a book, or enjoy the beautiful view of the autumn forest outside the window.

🛒 Store Scene Detailed Description

  • User: Please provide a detailed description of the video content, paying particular attention to the main characters and their actions, and predict what might happen next.
  • Magma: The video scene takes place at a convenience store checkout counter. The "Pick 5 for $8" promotional sign is prominently displayed, and the counter is filled with various snacks and a payment terminal. An employee wearing a black shirt with white sleeves is operating the payment terminal, and a customer places a red shopping basket full of snacks on the counter. The employee continues to operate the terminal and helps the customer organize the items in the basket. Next, the employee is likely to complete the payment for the customer and hand over the items.

🌱 Home Plant Arrangement Scene

  • User: Please describe the main actions in the video in detail.
  • Magma: The video begins with a close-up of a hand reaching towards a red potted plant placed on a wooden bookshelf. The plant has green and yellow striped leaves, and next to it is a book titled "YOUNG" and a stacked wooden bowl. The hand carefully adjusts the position of the plant to ensure it is properly placed. In the background, a beige watering can and radiator pipes can be seen. The hand then makes another slight adjustment to the plant's position and slowly moves away, with the camera gradually widening to show a red high stool and a wooden tabletop, on which a smartphone and a red cup are placed. Finally, the view expands to reveal the checkered floor and other plants in the room.

🧽 Kitchen Cleaning Scene

  • User: Please describe the main actions in the video in detail.
  • Magma: The video shows a close-up of the kitchen sink area, where a hand is washing a black cup under running water. Nearby, there is dish soap and a blue toothbrush. The person then picks up a blue sponge and carefully cleans the inside of the cup. After completing the cleaning, the cup and sponge are placed back beside the sink.

🥤 Beverage Selection Recognition

  • User: What beverage did the person in the video choose?
  • Magma: The person chose a soy milk drink labeled "SAVE SOY."

🍵 Subsequent Action Prediction

  • User: Can you predict what the person will do next after the video?
  • Magma: The person is likely to pour hot water into the cup and continue preparing the tea.

☕ Coffee Machine Interaction Scene

  • User: Could you describe what the person in the video is doing?
  • Magma: The person is interacting with a coffee machine, pressing different buttons, possibly selecting the type of coffee they want to make.

Try It Out

https://huggingface.co/spaces/microsoft/Magma-Gaming

https://huggingface.co/spaces/microsoft/Magma-UI
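
The Spaces can also be driven programmatically; a minimal sketch with gradio_client is shown below (endpoint names vary per Space, so inspect them before calling anything):

```python
from gradio_client import Client

# Connect to a hosted demo; view_api() lists the Space's callable
# endpoints and their parameters, which we deliberately do not assume.
client = Client("microsoft/Magma-UI")
client.view_api()
```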