Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization






*Equal Contribution

Abstract

Story visualization, the task of generating coherent images based on a narrative, has seen significant advancements with the emergence of text-to-image models, particularly diffusion models. However, maintaining semantic consistency, generating high-quality fine-grained interactions, and ensuring computational feasibility remain challenging, especially in long story visualization (_i.e._, up to 100 frames). In this work, we introduce Story-Iter, a new training-free iterative paradigm to enhance long story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach features a novel external iterative paradigm, extending beyond the internal iterative denoising steps of diffusion models, to continuously refine each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free global reference cross-attention (GRCA) module that models all reference frames with global embeddings, ensuring semantic consistency in long sequences. By progressively incorporating holistic visual context and text constraints, our iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step by step. Extensive experiments on the official story visualization dataset and our long story benchmark demonstrate that Story-Iter achieves state-of-the-art performance in long story visualization (up to 100 frames), excelling in both semantic consistency and fine-grained interactions.

Story-Iter Architecture

Story-Iter framework. Illustration of the proposed iterative paradigm, which consists of initialization, iterations in Story-Iter, and the implementation of Global Reference Cross-Attention (GRCA). Story-Iter first visualizes each image based only on the text prompt of the story and uses all results as reference images for the next round. In the iterative paradigm, Story-Iter inserts GRCA into Stable Diffusion (SD). In the i-th iteration of each image visualization, GRCA aggregates the information flow of all reference images during the denoising process through cross-attention. All results from this iteration are then used as reference images to guide the dynamic update of the story visualization in the next iteration.
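The control flow described in this caption can be sketched as a toy NumPy program. This is a minimal illustration under stated assumptions, not the released implementation: the function names (`grca`, `generate_frame`, `story_iter`), the `ref_weight` mixing factor, the token counts, and the averaging "denoiser" are all hypothetical stand-ins. In particular, each frame's latent tokens are reused directly as its reference embedding, and the attention has no learned projections. Only the overall structure follows the text: a text-only initialization round, followed by rounds in which every frame attends to all frames from the previous round.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context):
    """Single-head scaled dot-product cross-attention without learned projections."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context


def grca(latent_tokens, text_tokens, reference_embeddings, ref_weight=0.5):
    """Hypothetical stand-in for GRCA: text attention plus attention over ALL references.

    ref_weight is an assumed mixing factor, not a value from the source.
    """
    text_out = cross_attention(latent_tokens, text_tokens)
    if not reference_embeddings:
        # Initialization round: no references yet, text guidance only.
        return text_out
    # Concatenate the tokens of every reference frame into one global context.
    global_refs = np.concatenate(reference_embeddings, axis=0)
    ref_out = cross_attention(latent_tokens, global_refs)
    return text_out + ref_weight * ref_out


def generate_frame(text_tokens, reference_embeddings, rng, steps=4, num_tokens=8):
    """Toy 'denoising' loop: pulls random latent tokens toward the attention output."""
    latent = rng.standard_normal((num_tokens, text_tokens.shape[-1]))
    for _ in range(steps):
        latent = 0.5 * latent + 0.5 * grca(latent, text_tokens, reference_embeddings)
    return latent


def story_iter(prompts, num_iterations=3, seed=0):
    """External iterative loop: each round conditions on all frames of the previous round."""
    rng = np.random.default_rng(seed)
    # Initialization: every frame is generated from its text prompt alone.
    frames = [generate_frame(p, [], rng) for p in prompts]
    for _ in range(num_iterations):
        references = frames  # all results of the previous round
        frames = [generate_frame(p, references, rng) for p in prompts]
    return frames


# Usage: five "prompts", each a (4 tokens x 16 dims) array of made-up text embeddings.
demo_rng = np.random.default_rng(1)
demo_prompts = [demo_rng.standard_normal((4, 16)) for _ in range(5)]
demo_frames = story_iter(demo_prompts, num_iterations=2)
print(len(demo_frames), demo_frames[0].shape)
```

Note that the reference set is rebuilt from the full previous round rather than fixed once, which is the difference from fixed-reference methods that the abstract highlights. In a real pipeline the frames would be images produced by the diffusion model and the reference tokens would come from an image encoder.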

Regular-length Story Visualization

A story of "Pigeon" visualized by our Story-Iter
A story of "Dinosaur and Traveler" visualized by our Story-Iter
A story of "Boy" visualized by our Story-Iter
A story of "Pepper" visualized by our Story-Iter
A story of "Girl" visualized by our Story-Iter
A story of "Animal Rescuer" visualized by our Story-Iter
A story of "City Monkey" visualized by our Story-Iter
A story of "Old Man and Monkey" visualized by our Story-Iter
A story of "The Boy's Journey" visualized by our Story-Iter
A story of "A Day for a Girl" visualized by our Story-Iter
A story of "Rain" visualized by our Story-Iter
A story of "Fruit" visualized by our Story-Iter

Long Story Visualization

A story of "Little Red Riding Hood" visualized by our Story-Iter
A story of "Emperor and the Nightingale" visualized by our Story-Iter
A story of "Robinson Crusoe" visualized by our Story-Iter
A story of "Snowman" visualized by our Story-Iter
A story of "Loyal Dog" visualized by our Story-Iter
A story of "The Tortoise and the Hare" visualized by our Story-Iter
A story of "Winnie the Pooh" visualized by our Story-Iter
A story of "Pirate" visualized by our Story-Iter
A story of "Lonely Me" visualized by our Story-Iter
A story of "The Prince and the Princess" visualized by our Story-Iter

Qualitative Comparison of Different Methods

Qualitative comparison of story visualization methods. AR-LDM and StoryGen generate coherent image sequences, but their quality degrades as the story grows longer due to accumulated autoregressive errors. StoryDiffusion and Story-Iter both perform well, though StoryDiffusion struggles with subject consistency and shows flaws in its ID images owing to its high computational demands. Story-Iter better meets the requirements for effective story visualization.

BibTeX

If you find our work helpful for your research, please consider citing it 📃


@misc{mao2024story_adapter,
  title={{Story-Adapter: A Training-free Iterative Framework for Long Story Visualization}},
  author={Mao, Jiawei and Huang, Xiaoke and Xie, Yunfei and Chang, Yuanqi and Hui, Mude and Xu, Bingjie and Zhou, Yuyin},
  journal={arXiv},
  volume={abs/2410.06244},
  year={2024},
}