
Story-Adapter generates coherent images for long stories without any training

Story visualization, the task of generating a sequence of coherent images from a narrative, has made significant progress with the emergence of text-to-image models, especially diffusion models. However, maintaining semantic consistency, generating high-quality fine-grained interactions, and keeping computation tractable remain challenges in long story visualization (i.e., generating up to 100 frames).

Introduction 🦖

To address these challenges, the authors propose Story-Adapter, a training-free and computationally efficient iterative framework for long story visualization. Experiments demonstrate its superiority in improving semantic consistency and in generating fine-grained interactions, especially in long story scenarios.

Framework 🤖

The figure below illustrates the proposed iterative paradigm: initialization, the iterative process within Story-Adapter, and the implementation of Global Reference Cross-Attention (GRCA). Story-Adapter first visualizes each image from the story's textual prompt alone and uses all generated results as reference images for subsequent rounds. During the iterative process, Story-Adapter integrates GRCA into Stable Diffusion (SD). For the i-th image in each round, GRCA aggregates the information streams from all reference images during denoising via cross-attention. All results of each iteration then serve as reference images that guide the dynamic refinement of the story visualization in the next round.
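To make this concrete, here is a minimal PyTorch sketch of the iterative loop and of GRCA as plain cross-attention over the concatenated tokens of all reference images. This is a sketch under stated assumptions, not the authors' implementation: the helper names (`sd_generate`, `encode_image`, `grca_denoise`), the token shapes, and the number of rounds are all illustrative, and in the real system the GRCA layers would sit inside the frozen SD UNet's denoising steps rather than behind a single function call.

```python
# Hedged sketch of Story-Adapter's iterative paradigm and GRCA.
# All helper names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalReferenceCrossAttention(nn.Module):
    """Cross-attention from the current frame's latent tokens to the
    concatenated tokens of ALL reference images (the 'global' part)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)  # queries: current latent
        self.to_k = nn.Linear(dim, dim, bias=False)  # keys: reference tokens
        self.to_v = nn.Linear(dim, dim, bias=False)  # values: reference tokens
        self.to_out = nn.Linear(dim, dim)

    def forward(self, latent: torch.Tensor, ref_tokens: torch.Tensor) -> torch.Tensor:
        # latent:     (B, L, dim)   flattened spatial tokens of the frame being denoised
        # ref_tokens: (B, N*R, dim) R tokens for each of the N reference frames
        b, l, dim = latent.shape
        h, d = self.num_heads, dim // self.num_heads
        q = self.to_q(latent).view(b, l, h, d).transpose(1, 2)
        k = self.to_k(ref_tokens).view(b, -1, h, d).transpose(1, 2)
        v = self.to_v(ref_tokens).view(b, -1, h, d).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # softmax(QK^T / sqrt(d)) V
        return self.to_out(out.transpose(1, 2).reshape(b, l, dim))


def story_adapter(prompts, sd_generate, encode_image, grca_denoise, rounds=3):
    """Iterative paradigm: round 0 is text-only SD; every later round
    re-denoises each frame while attending to ALL previous results."""
    # Initialization: visualize every frame from its text prompt alone.
    images = [sd_generate(p) for p in prompts]
    for _ in range(rounds):
        # All of the previous round's results become global references.
        refs = torch.cat([encode_image(img) for img in images], dim=1)
        # Re-denoise each frame, with GRCA attending to every reference.
        images = [grca_denoise(p, refs) for p in prompts]
    return images
```

Note that, in this reading, GRCA is a single cross-attention over reference tokens, so its cost grows only linearly with the number of reference frames, which would help explain why the approach stays computationally feasible at up to 100 frames.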

Performance 🎨

Conventional-length story visualization

Long-form story visualization

Comparison 📊

Compared with previous approaches, Story-Adapter delivers stronger semantic consistency and finer-grained interactions, and thus better meets the requirements for effective story visualization.

Trial use 🚀

A Colab notebook is available for trying Story-Adapter:

https://colab.research.google.com/drive/1sFbw0XlCQ6DBRU3s2n_F2swtNmHoicM-