VISTA: A Test-Time Self-Improving Video Generation Agent
1Google    2National University of Singapore
* This work was done while Do Xuan Long was a Student Researcher at Google.
📧 Contact: xuanlong.do@u.nus.edu, xingchenw@google.com, soarik@google.com

Abstract

Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation by refining prompts in an iterative loop. VISTA first decomposes a user's idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to a 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA's outputs in 66.4% of comparisons.

Overview of VISTA


VISTA is a modular, configurable framework for optimizing text-to-video generation. Given a user video prompt P, it produces an optimized video V* and its refined prompt P* in two phases, (i) Initialization and (ii) Self-Improvement, modeled on how a person improves a video by iteratively revising its prompt. In (i), the prompt is parsed and planned into variants that are used to generate candidate videos (Step 1), and the best video-prompt pair is then selected (Step 2). In (ii), the system generates multi-dimensional, multi-agent critiques of the selected video (Step 3), refines the prompt accordingly (Step 4), produces new videos, and reselects the champion pair (Step 2). Phase (ii) repeats until a stopping criterion is met or the maximum number of iterations is reached.
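The control flow of the two phases can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: every function and parameter name below (`vista_loop`, `select_champion`, `plan`, `generate`, `prefer`, `critique`, `refine`, `should_stop`, `max_iters`) is hypothetical, videos are stand-in strings, and the pairwise tournament is simplified to a sequential knockout. Keeping the incumbent champion among the candidates at each reselection is also an assumption.

```python
from typing import Callable, List, Tuple

# (prompt, video); a "video" is just a string in this toy sketch.
Pair = Tuple[str, str]


def select_champion(pairs: List[Pair], prefer: Callable[[Pair, Pair], Pair]) -> Pair:
    """Step 2 (simplified): sequential knockout using a pairwise judge."""
    champion = pairs[0]
    for challenger in pairs[1:]:
        champion = prefer(champion, challenger)
    return champion


def vista_loop(
    user_prompt: str,
    plan: Callable[[str], List[str]],                # Step 1: prompt -> prompt variants
    generate: Callable[[str], str],                  # hypothetical text-to-video call
    prefer: Callable[[Pair, Pair], Pair],            # hypothetical pairwise judge
    critique: Callable[[Pair], List[str]],           # Step 3: critiques of the champion
    refine: Callable[[str, List[str]], List[str]],   # Step 4: prompt + critiques -> new prompts
    should_stop: Callable[[Pair, int], bool],        # hypothetical stopping criterion
    max_iters: int = 5,
) -> Pair:
    # Phase (i): Initialization
    candidates = [(p, generate(p)) for p in plan(user_prompt)]
    champion = select_champion(candidates, prefer)

    # Phase (ii): Self-Improvement
    for iteration in range(max_iters):
        if should_stop(champion, iteration):
            break
        feedback = critique(champion)
        new_prompts = refine(champion[0], feedback)
        # Assumption: the current champion competes against the new candidates.
        candidates = [champion] + [(p, generate(p)) for p in new_prompts]
        champion = select_champion(candidates, prefer)
    return champion


if __name__ == "__main__":
    # Toy stand-ins for the planner, generator, judge, critics, and refiner.
    best_prompt, best_video = vista_loop(
        "a spaceship entering hyperdrive",
        plan=lambda p: [p, p + " (wide shot)"],
        generate=lambda p: f"<video for: {p}>",
        prefer=lambda a, b: a if len(a[0]) >= len(b[0]) else b,  # toy judge: longer prompt wins
        critique=lambda pair: ["add camera detail"],
        refine=lambda p, notes: [p + "; " + n for n in notes],
        should_stop=lambda pair, it: False,
        max_iters=2,
    )
    print(best_prompt)
```

Passing each step in as a callable mirrors the modular, configurable design described above: any planner, generator, judge, or critic could be swapped without touching the loop itself.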

Example Videos Generated by Veo 3

Direct Prompting (Original Prompt): A spaceship entering hyperdrive, stars streaking past as it accelerates.
VISTA (Ours): Create an 8-second, live-action, realistic sci-fi action video. The sequence initiates with a medium shot of a sleek, advanced spaceship...
Direct Prompting (Original Prompt): A single ice cube placed in a warm drink, slowly melting and sending gentle ripples through the liquid as it transforms.
VISTA (Ours): **Video Production Brief: Melting Ice Cube (8 Seconds)** - To create an 8-second video illustrating the User Prompt: "A single ice cube placed in a warm drink, slowly melting and sending gentle ripples through the liquid as it transforms."...
Direct Prompting (Original Prompt): The video is an educational animation designed for young children, teaching compound words through a visual and auditory puzzle format...
VISTA (Ours): Create an 8-second animated educational video for young children, designed as an engaging visual and auditory puzzle to teach compound words...
Direct Prompting (Original Prompt): An 8-second video begins outdoors on a bright, sunny day. A bearded man in a red cap, blue-tinted sunglasses, purple hoodie, and black headphones addresses the camera...
VISTA (Ours): The 8-second video opens outdoors on a vibrant, sunlit day. For the initial 5.5 seconds, the camera presents a static, chest-up, eye-level view of a man...
Direct Prompting (Original Prompt): An aerial view of a lush, green forest with a river winding through it, highlighting the contrast between the dense foliage and the clear water.
VISTA (Ours): Produce an 8-second establishing shot video centered on an aerial view of a lush, green forest traversed by a winding river...
Direct Prompting (Original Prompt): The person's forehead creased with worry as he listened to bad news.
VISTA (Ours): Generate an 8-second photorealistic video depicting a man's intense emotional reaction upon hearing distressing news. The video commences with a medium shot...
Direct Prompting (Original Prompt): Craft an 8-second video showcasing the unbridled joy of children at play in a supermarket environment. A grey shopping cart holds a mischievous toddler...
VISTA (Ours): From the comprehensive JSON video script, specific directives, and general constraints detailed below, create an 8-second video. The video captures a heartwarming and amusing interaction between a toddler seated in a shopping cart...
Direct Prompting (Original Prompt): A couple runs through a sudden downpour, laughing and splashing in puddles as they try to find shelter.
VISTA (Ours): Produce an 8-second live-action video depicting the scenario: "A couple runs through a sudden downpour, laughing and splashing in puddles as they try to find shelter."...

Citation

If you find our work useful, please consider citing:

@article{long2025vista,
  title={VISTA: A Test-Time Self-Improving Video Generation Agent},
  author={Long, Do Xuan and Wan, Xingchen and Nakhost, Hootan and Lee, Chen-Yu and Pfister, Tomas and Arik, Sercan O},
  journal={arXiv preprint arXiv:2510.15831},
  year={2025},
  url={https://arxiv.org/abs/2510.15831}
}

We acknowledge that some of the prompts above are from MovieGenVideo (Meta).

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.