Story visualization, the task of generating a coherent sequence of images from a narrative, has made significant progress with the emergence of text-to-image models, especially diffusion models. However, maintaining semantic consistency, generating high-quality fine-grained interactions, and keeping computation tractable remain challenging in long story visualization (i.e., generating up to 100 frames).

Introduction 🦖
To address these challenges, we propose Story-Adapter, a training-free and computationally efficient framework for long story visualization. Story-Adapter iteratively refines each generated image, conditioning on the text prompt together with all images generated in the previous iteration, which are aggregated through a Global Reference Cross-Attention (GRCA) module. Extensive experiments validate its superiority in improving semantic consistency and the ability to generate fine-grained interactions, especially in long story scenarios.

Framework 🤖
The figure below illustrates the proposed iterative paradigm: initialization, the iterative process within Story-Adapter, and the implementation of Global Reference Cross-Attention (GRCA). Story-Adapter first visualizes each image solely from its textual prompt and uses all generated results as reference images for subsequent rounds. During the iterative process, Story-Adapter integrates GRCA into Stable Diffusion (SD): for the i-th image in each round, GRCA aggregates the information from all reference images during denoising via cross-attention. The results of each round then serve as the reference images that guide the dynamic update of the story visualization in the next round.
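For concreteness, here is a minimal PyTorch sketch of this loop. It is an illustration under stated assumptions, not the repository's implementation: `sd_generate` and `image_encoder` are hypothetical stand-ins for a GRCA-augmented SD denoising pipeline and an image encoder, and the `GRCA` module isolates only the cross-attention aggregation step.

```python
# Illustrative sketch of the iterative paradigm; `sd_generate` and
# `image_encoder` are hypothetical placeholders, not the repo's API.
import torch
from torch import nn


class GRCA(nn.Module):
    """Global Reference Cross-Attention (sketch).

    Aggregates information from all reference-image embeddings into the
    denoising features of the current image via cross-attention.
    """

    def __init__(self, dim: int, ref_dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)      # queries from denoising features
        self.to_k = nn.Linear(ref_dim, dim, bias=False)  # keys from reference embeddings
        self.to_v = nn.Linear(ref_dim, dim, bias=False)  # values from reference embeddings
        self.to_out = nn.Linear(dim, dim)

    def forward(self, hidden: torch.Tensor, refs: torch.Tensor) -> torch.Tensor:
        # hidden: (B, N, dim)     features of the i-th image at one denoising step
        # refs:   (B, M, ref_dim) embeddings of ALL reference images, concatenated
        b, n, _ = hidden.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            return t.view(b, -1, self.heads, t.shape[-1] // self.heads).transpose(1, 2)

        q = split_heads(self.to_q(hidden))
        k = split_heads(self.to_k(refs))
        v = split_heads(self.to_v(refs))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        # Residual add: inject the reference stream into the original features.
        return hidden + self.to_out(out)


def story_adapter(prompts, sd_generate, image_encoder, rounds: int = 10):
    """Outer iterative loop (sketch).

    prompts:       one text prompt per frame of the story
    sd_generate:   assumed callable running SD denoising; refs=None means
                   text-only, otherwise GRCA attends to `refs` at each step
    image_encoder: assumed callable mapping an image to (1, M, ref_dim) tokens
    """
    # Initialization: visualize every frame from its text prompt alone.
    images = [sd_generate(p, refs=None) for p in prompts]

    for _ in range(rounds):
        # All results of the previous round become global references.
        refs = torch.cat([image_encoder(img) for img in images], dim=1)
        # Regenerate each frame; GRCA aggregates all references during
        # denoising, and the new results seed the next round.
        images = [sd_generate(p, refs=refs) for p in prompts]

    return images
```

In the actual pipeline, a module like `GRCA` would sit alongside the text cross-attention layers inside the SD U-Net; the sketch above only isolates the aggregation it performs.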

Performance 🎨
Conventional-length story visualization
Long-form story visualization

Comparison 📊
Compared with existing story visualization methods, Story-Adapter is training-free, computationally efficient, and scales to long stories, and therefore better meets the requirements for effective story visualization.

Trial use 🚀
Try Story-Adapter in Google Colab: https://colab.research.google.com/drive/1sFbw0XlCQ6DBRU3s2n_F2swtNmHoicM-