Omost: the latest project from the author of ControlNet, using LLMs for image synthesis

The author of ControlNet has recently launched a new research project called Omost.

https://github.com/lllyasviel/Omost

Omost aims to transform the encoding capabilities of large language models (LLMs) into image generation capabilities (more precisely, image synthesis).

The name has two meanings:

  1. After each use of Omost, your image is "almost" complete;
  2. "O" stands for "omni" (multimodal), and "most" means we hope to get the most out of it.

Omost's LLMs write code that composes image content on a virtual canvas, and that canvas can then be rendered through a specially implemented image generator to produce the final image.
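To make the idea concrete, here is a minimal sketch of the kind of canvas-composition code such an LLM might emit. The `Canvas` class below is a hypothetical stand-in written for illustration; Omost's actual Canvas agent exposes richer parameters (detailed descriptions, tags, colors, and so on).

```python
# Hypothetical, simplified stand-in for Omost's virtual Canvas agent.
# The LLM's job is to emit code in roughly this style; a separate image
# generator then renders the resulting canvas into pixels.

class Canvas:
    def __init__(self):
        self.global_description = None
        self.regions = []

    def set_global_description(self, description):
        # One overall description of the whole scene.
        self.global_description = description

    def add_local_description(self, location, description):
        # Each call places one element of the scene on the canvas.
        self.regions.append({"location": location, "description": description})


# Code in this style is what the LLM would write for a prompt like
# "a dog and a cat":
canvas = Canvas()
canvas.set_global_description("a dog and a cat sitting together")
canvas.add_local_description(location="left", description="a fluffy dog")
canvas.add_local_description(location="right", description="a curious cat")
```

The key design point is that the LLM never touches pixels: it only produces a structured, code-shaped scene description, which a downstream renderer turns into an image.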

Currently, the author provides three pretrained LLMs based on variants of Llama 3 and Phi-3 (for specific model descriptions, please refer to the model notes at the end of the page). All models were trained on the following data mixture:

  1. Real labeled data from multiple datasets including Open-Images;
  2. Data extracted from automatically labeled images;
  3. Reinforcement data from DPO (Direct Preference Optimization, with "whether the code can compile on Python 3.10" as a direct preference);
  4. A small amount of fine-tuning data produced with the multimodal capabilities of OpenAI's GPT-4o.
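The DPO signal in item 3 above is unusually mechanical: a model output is preferred simply when it is valid Python. A minimal sketch of how such a preference pair could be derived is shown below; the function names are illustrative, not Omost's actual training code, and Python's built-in `compile()` is used as the syntax check.

```python
# Sketch: using "does the generated code compile?" as a DPO preference signal.
# `compiles` and `rank_pair` are hypothetical helpers for illustration.

def compiles(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<llm_output>", "exec")
        return True
    except SyntaxError:
        return False

def rank_pair(sample_a: str, sample_b: str):
    """Order two LLM outputs as (chosen, rejected) when exactly one compiles."""
    a_ok, b_ok = compiles(sample_a), compiles(sample_b)
    if a_ok and not b_ok:
        return sample_a, sample_b
    if b_ok and not a_ok:
        return sample_b, sample_a
    return None  # no usable preference if both or neither compile

good = "canvas = Canvas()\ncanvas.render()"
bad = "canvas = Canvas(\ncanvas.render()"  # unbalanced parenthesis
pair = rank_pair(good, bad)  # (chosen, rejected) = (good, bad)
```

Because the check is fully automatic, preference pairs can be mined from model samples at scale without human labeling.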

Through these pre-trained models, users can efficiently generate and synthesize image content.

You can run the demo at https://huggingface.co/spaces/lllyasviel/Omost to experience the entire Omost process.


  1. Input a sentence as the prompt:

     a dog and a cat

  2. Omost then starts to work, writing code that describes the image.

  3. The image is then rendered.

  4. You can continue the conversation to refine the result, for example:

     the dog is a Teddy dog

  5. Omost expands the description further.

  6. Finally, the image is rendered again.
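The walkthrough above is an iterative loop: each user message extends the conversation, the LLM rewrites the canvas code, and the renderer produces an updated image. A minimal sketch of that loop follows; `generate_canvas_code` and `render` are hypothetical placeholders standing in for the Omost model and its image generator.

```python
# Sketch of the interactive refinement loop, with placeholder functions
# in place of the real Omost LLM and renderer.

def generate_canvas_code(history):
    # Placeholder: a real system would prompt the Omost LLM with the
    # full conversation and receive canvas-composition code back.
    return "# canvas code reflecting: " + "; ".join(history)

def render(canvas_code):
    # Placeholder: a real system would execute the canvas code and
    # run the image generator on the resulting canvas.
    return f"<image rendered from {canvas_code!r}>"

history = []
for user_message in ["a dog and a cat", "the dog is a Teddy dog"]:
    history.append(user_message)       # step 1/4: the user's prompt or refinement
    code = generate_canvas_code(history)  # steps 2/5: Omost writes canvas code
    image = render(code)               # steps 3/6: the image is rendered
```

Each refinement regenerates the canvas code from the whole conversation, so later instructions ("the dog is a Teddy dog") modify the existing scene rather than starting over.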