DeepSeek Janus Series: Unified Multimodal Understanding and Generation Models

Compared with other closed-source and open-source solutions currently available, there is still a gap (see the details below ⬇️).

Brief introduction to the three models

Janus: Decoupling visual encoding for unified multimodal understanding and generation

Janus is an innovative autoregressive framework dedicated to unifying multimodal understanding and generation. Its main advantages are:

Decoupled visual encoding: Separating the visual encoding paths alleviates the conflict between generation and understanding while still using a single, unified Transformer architecture.
Flexibility: The decoupled design makes the framework more flexible, enabling it to surpass earlier unified models on multimodal tasks and match task-specific models.

Technical features:

Simplified design: Reduces architectural complexity.
Efficiency: Performs well across multiple tasks, making it a strong candidate for next-generation multimodal models.

Model download links and paper addresses:

Janus-1.3B: https://huggingface.co/deepseek-ai/Janus-1.3B
Paper: https://arxiv.org/abs/2410.13848

JanusFlow: Harmonizing autoregression and rectified flow

JanusFlow unifies autoregressive language modeling and rectified flow in a single model. Its technical highlights include:

Minimalist architecture: No complex architectural modifications are required; rectified flow can be trained directly within the large language model framework.
Competitive with specialized models: Achieves performance comparable or even superior to specialized models on vision and language tasks.
Ahead of unified methods: Significantly surpasses the benchmark performance of existing unified methods.

Model download links and paper addresses:

JanusFlow-1.3B: https://huggingface.co/deepseek-ai/JanusFlow-1.3B
Paper: https://arxiv.org/abs/2411.07975

Janus-Pro: A unified multimodal framework with data and model scaling

Janus-Pro is the advanced version of Janus, integrating the following core improvements:
Optimized training strategy: Significantly improves generalization ability and stability.
Expanded training data: Enriches the scenarios and diversity of multimodal inputs.
Larger model size: Enhances the model's understanding ability and generation stability.
These improvements enable Janus-Pro to excel in the following areas:
Multimodal understanding: Achieves industry-leading performance on complex tasks.
Text-to-image instruction following: Executes complex instructions more efficiently and generates stable, high-quality images.
Model download links and paper addresses:
Janus-Pro-1B: https://huggingface.co/deepseek-ai/Janus-Pro-1B
Janus-Pro-7B: https://huggingface.co/deepseek-ai/Janus-Pro-7B
Paper: https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf
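
For readers who prefer to fetch the checkpoints from a script rather than the browser, one option is the `huggingface_hub` package. The sketch below assumes that package is installed and that network access is available; `MODEL_REPOS`, `repo_url`, and `download_checkpoint` are names made up for this example, while the repository IDs are copied from the links in this article.

```python
# Repository IDs copied from the download links in this article.
MODEL_REPOS = {
    "Janus-1.3B": "deepseek-ai/Janus-1.3B",
    "JanusFlow-1.3B": "deepseek-ai/JanusFlow-1.3B",
    "Janus-Pro-1B": "deepseek-ai/Janus-Pro-1B",
    "Janus-Pro-7B": "deepseek-ai/Janus-Pro-7B",
}


def repo_url(name: str) -> str:
    """Return the Hugging Face page for a model name used in this article."""
    return f"https://huggingface.co/{MODEL_REPOS[name]}"


def download_checkpoint(name: str) -> str:
    """Download every file of a checkpoint and return the local directory.

    Hypothetical helper: needs `pip install huggingface_hub` and network access.
    """
    # Imported lazily so the lookup helpers work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=MODEL_REPOS[name])


if __name__ == "__main__":
    for model_name in MODEL_REPOS:
        print(model_name, "->", repo_url(model_name))
```

Calling `download_checkpoint("Janus-Pro-1B")` would pull the smallest Janus-Pro checkpoint; the 7B model is a considerably larger download.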

Image understanding trial

Task: Understand this meme

Janus

This meme humorously contrasts two types of AI assistants: on the left is someone using an OpenAI assistant, with a camera mounted on their head, implying surveillance bias; on the right is someone using a DeepSee assistant, holding a camera pointed at the screen, implying focus on visual analysis and data collection. The meme satirizes OpenAI's emphasis on surveillance, while DeepSee focuses more on visual analysis.

JanusFlow

OpenAI and DeepSeek are two AI models designed to help humans complete various tasks. OpenAI is a general-purpose large language model that can be used for chatbots, language translation, and text generation in a wide range of applications; whereas DeepSeek focuses more on searching for specific information from large datasets. These two models each have their own strengths and weaknesses, often combined to accomplish more complex tasks.

Janus-Pro

This image is a humorous meme that visually metaphorically compares the two AI systems "OpenAI" and "DeepSeek".
Left Side (OpenAI):
A person wearing a helmet with multiple cameras or lenses, aiming at a target. This symbolizes OpenAI being more "focused" or "precise" on specific tasks or goals.
Right Side (DeepSeek):
A person aiming with a gun, but the target is invisible. This implies that although DeepSeek "aims" in a certain direction, the specific goal or result is not yet clear.
Overall Meaning:
The meme humorously implies that OpenAI is more focused and directed towards a specific goal, while DeepSeek is more ambiguous or uncertain in its approach.

Task: Spot the differences

Correct answer:

Janus

No difference (Incorrect ❌)

JanusFlow

The image shows a girl sitting at a table with a bowl of ice cream and an ice cream cone. There are two differences between the two images:
The girl's hair is slightly different in the second image. In the first image, her hair is straight, while in the second image, her hair is slightly wavy. (Incorrect ❌)
The girl is wearing a pink bow in her hair in the second image, which is not present in the first image. (Correct ✅)

Janus-Pro

The difference between the two images is the color of the ice cream scoops. In the left image, the ice cream scoops are green, while in the right image, they are white. (Incorrect ❌)

Task: Solve math problems

Note: ChatGPT o1 and o1 Pro can solve it.

Janus

To solve this math problem, we need to determine the value of \( h_1(-2) \).
First, we need to find the value of \( h_1(-2) \).
Given the equation:
\[ A^2 = 24^2 - 48t + 2, h_1(-2) = -9 \]
We can substitute \( A^2 = 24^2 - 48t \) into the given equation:
\[ 24^2 - 48t + 2 = -9 \]
Next, we need to solve the equation for \( t \):
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t + 2 = -9 \]
\[ 24^2 - 48t +

(Incorrect ❌)

JanusFlow

To solve this math problem, we need to find the value of \( h_1(1) \) given that \( h_1'(1) = 24t^2 - 48t + 2 \) and \( h_1(-2) = -9 \).
First, we need to find \( h_1'(1) \):
\[ h_1'(1) = 24t^2 - 48t + 2 \]
Next, we need to find \( h_1(-2) \):
\[ h_1(-2) = -9 \]
Now, we can substitute \( h_1'(1) \) and \( h_1(-2) \) into the equation:
\[ 24t^2 - 48t + 2 - 9 = -9 \]
Simplifying the equation:
\[ 24t^2 - 47t + 1 = -9 \]
Subtract 1 from both sides:
\[ 24t^2 - 47t + 1 - 1 = -9 \]
\[ 24t^2 - 46t + 2 = -9 \]
Now, we need to find \( h_1(1) \):
\[ h_1(1) = 24t^2 - 46t + 2 \]
Substitute \( h_1(1) \) back into the equation we derived:
\[ 24t^2 - 46t + 2 = 24t^2 - 46t + 2 \]
Simplify the equation:
\[ 24t^2 - 46t + 2 = 24t^2 - 46t + 2 \]
This equation is already correct, so there is no need to solve it further. The solution is:
\[ h_1(1) = 24t^2 - 46t + 2 \]

(Incorrect ❌)

Janus-Pro

We have the equation \(24^2 - 48t + 2h = -9\) and \(h(-2) = -4\).
First, let's simplify the equation \(24^2 - 48t + 2h = -9\).
\[
\begin{align*}
24^2 - 48t + 2h &= -9 \\
(24)^2 - 48t + 2h &= -9 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48t + 2h + 9 &= 0 \\
(24)^2 - 48

(Incorrect ❌)

Task: Count how many cars are in the parking lot

Correct answer: 18 cars (ChatGPT o1 Pro previously answered this incorrectly)

Janus

There are 20 cars in this parking lot.

(Incorrect ❌)

JanusFlow

There are 21 cars in the parking lot.

(Incorrect ❌)

Janus-Pro

There are 14 cars in the parking lot.

(Incorrect ❌)
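
To put the three counting answers side by side, the short script below computes how far each one is from the correct answer. It only restates the numbers reported above; the variable and function names are made up for this illustration.

```python
GROUND_TRUTH = 18  # correct number of cars stated for this task

# Counts reported by each model in the trial above.
predictions = {"Janus": 20, "JanusFlow": 21, "Janus-Pro": 14}


def absolute_errors(preds: dict, truth: int) -> dict:
    """Return the absolute difference between each prediction and the truth."""
    return {model: abs(count - truth) for model, count in preds.items()}


errors = absolute_errors(predictions, GROUND_TRUTH)
for model, err in errors.items():
    print(f"{model}: answered {predictions[model]}, off by {err}")
```

On this single image, Janus is off by 2, JanusFlow by 3, and Janus-Pro by 4, so none of the models counts correctly and the newest model is not the closest.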

Image generation trial

Prompt/keywords:

The image features an intricately designed eye set against a circular backdrop adorned with ornate swirl patterns that evoke both realism and surrealism. At the center of attention is a strikingly vivid blue iris surrounded by delicate veins radiating outward from the pupil to create depth and intensity. The eyelashes are long and dark, casting subtle shadows on the skin around them which appears smooth yet slightly textured as if aged or weathered over time.

Above the eye, there's a stone-like structure resembling part of classical architecture, adding layers of mystery and timeless elegance to the composition. This architectural element contrasts sharply but harmoniously with the organic curves surrounding it. Below the eye lies another decorative motif reminiscent of baroque artistry, further enhancing the overall sense of eternity encapsulated within each meticulously crafted detail.

Overall, the atmosphere exudes a mysterious aura intertwined seamlessly with elements suggesting timelessness, achieved through the juxtaposition of realistic textures and surreal artistic flourishes. Each component—from the intricate designs framing the eye to the ancient-looking stone piece above—contributes uniquely towards creating a visually captivating tableau imbued with enigmatic allure.

Janus

JanusFlow

The demo has bugs and did not run successfully.

Janus-Pro

Prompt/keywords:

A charming, cute girl with big sparkling eyes, soft pastel-colored hair (e.g., pink, lavender, or mint green), wearing a stylish outfit with subtle frills and bows, standing in a dreamy background filled with soft lighting, cherry blossoms, and gentle gradients. The atmosphere is cheerful and heartwarming, with warm, glowing highlights and delicate details in the surroundings, anime-inspired style

Janus

JanusFlow

The demo has bugs and did not run successfully.

Janus-Pro

More image examples

Janus

JanusFlow

Janus-Pro

Evaluation

Janus

Benchmark test performance

Visual generation results

JanusFlow

Benchmark test performance

Visual generation results

Janus-Pro

Average performance in four multimodal understanding benchmark tests

Performance in text-to-image generation instruction-following benchmark tests

Technical framework

Janus

Unlike previous methods that typically assume visual understanding and generation share the same visual encoder, Janus decouples visual encoding into independent modules for visual understanding and visual generation. “Und. Encoder” and “Gen. Encoder” are the abbreviations for “Understanding Encoder” and “Generation Encoder,” respectively.
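
The routing idea in this description can be illustrated with a toy example: two separate visual encoders feed one shared backbone, and the task decides which encoder is used. Everything here is hypothetical (the class name, the string-based stand-ins for the encoders and the Transformer); it mirrors only the decoupling described above and is not DeepSeek's implementation.

```python
from typing import Callable, List


class DecoupledVisualFrontEnd:
    """Toy model of Janus-style routing: two visual encoders, one shared backbone."""

    def __init__(
        self,
        und_encoder: Callable[[str], List[str]],
        gen_encoder: Callable[[str], List[str]],
        backbone: Callable[[List[str]], List[str]],
    ) -> None:
        self.und_encoder = und_encoder  # "Und. Encoder": features for understanding
        self.gen_encoder = gen_encoder  # "Gen. Encoder": tokens for generation
        self.backbone = backbone        # the single unified Transformer

    def forward(self, image: str, task: str) -> List[str]:
        # Choose the encoding path by task; the backbone is shared either way.
        if task == "understanding":
            visual_tokens = self.und_encoder(image)
        elif task == "generation":
            visual_tokens = self.gen_encoder(image)
        else:
            raise ValueError(f"unknown task: {task}")
        return self.backbone(visual_tokens)


# Stand-in components that only tag their input so the routing is visible.
model = DecoupledVisualFrontEnd(
    und_encoder=lambda img: [f"und({img})"],
    gen_encoder=lambda img: [f"gen({img})"],
    backbone=lambda toks: [f"transformer[{t}]" for t in toks],
)

print(model.forward("cat.png", "understanding"))
print(model.forward("cat.png", "generation"))
```

The point of the design is visible in the two calls: the same backbone processes both tasks, but neither encoder has to serve both roles.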

Janus' three-stage training process:

Stage I: The goal is to establish conceptual connections between vision and language in the embedding space and to give the model preliminary visual generation capabilities. The visual encoders and the LLM are frozen during this stage; only the understanding adapter, generation adapter, and image head are updated.
Stage II: The model is trained on multimodal corpora so that it learns both multimodal understanding and generation. The LLM is unfrozen, and training uses pure text data, multimodal understanding data, and visual generation data. Visual generation training starts with ImageNet-1k and then expands to open-domain text-to-image data.
Stage III: The model is fine-tuned on instruction tuning data to enhance its instruction-following and conversational abilities. All parameters except the generation encoder are unfrozen. The data mix includes pure text dialogues, multimodal understanding data, and visual generation data, keeping the model versatile across scenarios.
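
The three stages above amount to a schedule of which modules receive gradient updates. A minimal sketch of that schedule follows; the module names are labels invented for this example, and the Stage II entry assumes that both visual encoders stay frozen, which the description above does not state explicitly (it only says the LLM is unfrozen).

```python
# Hypothetical module labels for the components named in the text.
MODULES = [
    "und_encoder",
    "gen_encoder",
    "und_adapter",
    "gen_adapter",
    "llm",
    "image_head",
]

TRAINABLE = {
    # Stage I: only the two adapters and the image head are updated.
    "stage1": {"und_adapter", "gen_adapter", "image_head"},
    # Stage II: the LLM is unfrozen as well.
    # Assumption: both visual encoders remain frozen in this stage.
    "stage2": {"und_adapter", "gen_adapter", "image_head", "llm"},
    # Stage III: everything except the generation encoder is trained.
    "stage3": set(MODULES) - {"gen_encoder"},
}


def frozen_modules(stage: str) -> list:
    """Return the modules that do NOT receive updates in the given stage."""
    return sorted(set(MODULES) - TRAINABLE[stage])


for stage in ("stage1", "stage2", "stage3"):
    print(stage, "frozen:", frozen_modules(stage))
```

In a real training script, a table like this would typically drive per-module `requires_grad` flags; here it only makes the progression visible: the frozen set shrinks from stage to stage, and the generation encoder is never trained.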

JanusFlow

JanusFlow architecture: For visual understanding, the LLM generates responses through autoregressive prediction. For image generation, the LLM starts from Gaussian noise at t = 0 and iteratively updates z_t by predicting velocity vectors until t = 1. For simplicity, the VAE encoder, the skip connections used in generation, and the linear layer after f_enc are omitted.
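
The generation loop described here (start from noise at t = 0, repeatedly add a predicted velocity, stop at t = 1) is an Euler integration of an ODE. The sketch below shows that loop on a three-dimensional toy vector. The velocity function is a hand-written stand-in for the LLM's prediction, chosen so that the straight path ends at a fixed `TARGET`; all names and the step count are assumptions made for illustration.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    z0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with fixed Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # current time; the last step starts at t = 1 - dt
        v = velocity(z, t)  # in JanusFlow this prediction comes from the LLM
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Toy "data" point that the flow should reach at t = 1.
TARGET = [1.0, -2.0, 0.5]


def toy_velocity(z: Vector, t: float) -> Vector:
    """Velocity of the straight-line path from the current z to TARGET."""
    return [(x - zi) / (1.0 - t) for x, zi in zip(TARGET, z)]


random.seed(0)
z0 = [random.gauss(0.0, 1.0) for _ in TARGET]  # Gaussian noise at t = 0
z1 = euler_sample(toy_velocity, z0, num_steps=30)
print([round(v, 6) for v in z1])
```

Because the toy velocity always points straight at `TARGET`, the final state matches it up to floating-point error. A learned velocity field is only approximately straight, which is why the number of steps matters in practice.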

JanusFlow’s three-stage training process:

Stage I: Train the randomly initialized linear layers, generation encoder, and generation decoder so that these new modules work together with the pretrained LLM and SigLIP encoder. This stage serves as initialization for the new components.
Stage II: Train the entire model, with the visual encoder frozen, on three types of data: multimodal understanding data, image generation data, and pure text data. Training initially emphasizes multimodal understanding and later increases the proportion of image generation data to meet the convergence needs of diffusion-style generation training.
Stage III: Fine-tune the model on instruction tuning data (dialogues, task-specific dialogues, and high-quality text-to-image generation examples). The SigLIP encoder is unfrozen in this stage, improving instruction following for both multimodal understanding and image generation.

Janus-Pro

Janus-Pro architecture: Visual encoding is decoupled into an encoder for multimodal understanding (Understanding Encoder, abbreviated “Und. Encoder”) and an encoder for visual generation (Generation Encoder, abbreviated “Gen. Encoder”).

Compared to Janus, Janus-Pro optimizes the three training stages as follows:

Stage I: Increase the number of training steps so that the model is trained sufficiently on the ImageNet dataset. Even with the LLM parameters fixed, the model can effectively model pixel dependencies and generate reasonable images from class names.
Stage II: Drop the ImageNet data and train directly on ordinary text-to-image data, generating images from dense descriptions. This improves training efficiency and overall performance.
Stage III: In supervised fine-tuning, adjust the ratio of multimodal data, pure text data, and text-to-image data from 7:3:10 to 5:1:4. Slightly reducing the proportion of text-to-image data lets the model keep strong visual generation capabilities while improving multimodal understanding performance.
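
To see what the ratio change means in terms of data share, the ratios can be normalized into fractions. The snippet below does only that arithmetic; the dictionary keys are shorthand labels chosen for this example.

```python
def normalize(ratio: dict) -> dict:
    """Convert a ratio such as 7:3:10 into fractions that sum to 1."""
    total = sum(ratio.values())
    return {name: part / total for name, part in ratio.items()}


# Supervised fine-tuning data ratios quoted in the text.
janus_mix = normalize({"multimodal": 7, "pure_text": 3, "text_to_image": 10})
janus_pro_mix = normalize({"multimodal": 5, "pure_text": 1, "text_to_image": 4})

for name in janus_mix:
    before = janus_mix[name]
    after = janus_pro_mix[name]
    print(f"{name}: {before:.0%} -> {after:.0%}")
```

Under this reading, text-to-image data drops from 50% to 40% of the mix, multimodal data rises from 35% to 50%, and pure text falls from 15% to 10%.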