VISTA: A Test-Time Self-Improving Video Generation Agent

VISTA: A Test-Time Self-Improving Video Generation Agent

Do Xuan Long^1,2*, Xingchen Wan¹, Hootan Nakhost¹, Chen-Yu Lee¹, Tomas Pfister¹, Sercan Ö. Arık¹

¹Google ²National University of Singapore
* This work was done while Do Xuan Long was a Student Researcher at Google.

📧 Contact: xuanlong.do@u.nus.edu, xingchenw@google.com, soarik@google.com

Abstract

Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user's idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA's outputs in 66.4% of comparisons.

Overview of VISTA

VISTA is a modular, configurable framework for optimizing text-to-video generation. Given a user video prompt P, it produces an optimized video V* and its refined prompt P* through two phases: (i) Initialization and (ii) Self-Improvement, inspired by the human video optimization process via prompting. During (i), the prompt is parsed and planned into variants to generate candidate videos (Step 1), after which the best video-prompt pair is selected (Step 2). In (ii), the system generates multi-dimensional, multi-agent critiques (Step 3), refines the prompt (Step 4), produces new videos, and reselects the champion pair (Step 2). This phase continues until a stopping criterion is met or the maximum number of iterations is reached.

Example Videos Generated by Veo 3

Direct Prompting (Original Prompt): A spaceship entering hyperdrive, stars streaking past as it accelerates.

VISTA (Ours): Create an 8-second, live-action, realistic sci-fi action video. The sequence initiates with a medium shot of a sleek, advanced spaceship...

Direct Prompting (Original Prompt): A single ice cube placed in a warm drink, slowly melting and sending gentle ripples through the liquid as it transforms.

VISTA (Ours): **Video Production Brief: Melting Ice Cube (8 Seconds)** - To create an 8-second video illustrating the User Prompt: "A single ice cube placed in a warm drink, slowly melting and sending gentle ripples through the liquid as it transforms."... **Video Production Brief: Melting Ice Cube (8 Seconds)** **Objective:** To create an 8-second video illustrating the User Prompt: "A single ice cube placed in a warm drink, slowly melting and sending gentle ripples through the liquid as it transforms." **Core Concept & Visuals:** The video will be a singular, continuous close-up observation (0s-8s). The setting is an elegant, clear glass filled with a warm, inviting amber-toned liquid, potentially emitting a subtle wisp of steam. The background should be softly blurred and warm-toned (e.g., a wooden surface or cozy interior), ensuring the primary focus remains on the liquid and the ice. A pristine, translucent ice cube, initially solid and angular, is to be gently introduced into the warm liquid. Upon immersion, it must visibly begin to melt and diminish in size. This melting process will generate delicate, concentric ripples that gracefully spread outwards across the liquid's surface from the point where the ice cube makes contact. The ice cube acts as the initiator of change and transformation, while the warm drink serves as the reactive medium. **Audio Specifications:** The sound design should be subtle and natural: - A soft, distinct 'clink' as the ice cube touches the glass. - A very subtle, almost imperceptible 'fizz' or 'crackle' as melting begins. - Gentle, natural 'sloshing' or 'lapping' sounds accompanying the expanding ripples. - A low, ambient hum conveying warmth (e.g., distant gentle kitchen sounds or a subtle, comforting room tone). **Cinematography:** The camera will start with a medium close-up, positioned slightly above the glass rim to capture the ice cube's introduction. A smooth, seamless transition will then occur to a tight, static macro shot. This macro shot will precisely focus on the ice cube's point of contact with the liquid and the immediate area where ripples form and spread, emphasizing the subtle movements and transformations. **Atmosphere:** The desired emotional tone is serene, contemplative, mesmerizing, and tranquil. **Technical Directives:** - The video must adhere to real-world physics; no cartoon or animated elements unless explicitly specified in the User Prompt. - No multiple scene transitions are permitted, other than the single specified camera movement. - No on-screen captions or subtitles. - No human voice-over, unless the overarching message explicitly demands it. - The video must commence smoothly, avoiding any abrupt or jarring openings. - The ending must be complete and polished, with all actions fully delivered, avoiding sudden or unfinished cuts.

Direct Prompting (Original Prompt): The video is an educational animation designed for young children, teaching compound words through a visual and auditory puzzle format... {'overall_content': 'The video is an educational animation designed for young children, teaching compound words through a visual and auditory puzzle format. Each segment presents two individual words with corresponding cartoon images, which then combine to form a new compound word with a new image, all while a cheerful cartoon cat bounces at the bottom of the screen.', 'theme': 'The central theme is early childhood education, specifically focusing on vocabulary building and the concept of compound words in an engaging and interactive manner.', 'tone': "The overall tone is cheerful, educational, and highly engaging, designed to capture and maintain a child's attention with bright colors, simple animations, and a friendly voiceover.", 'scenes': [{'timestamp': '0-2.9', 'duration_seconds': 2.9, 'scene_type': 'Educational Word Puzzle: Ice Cream', 'characters': 'A cute, orange, cartoon cat with large eyes, rosy cheeks, and wearing red polka-dotted overalls, continuously bounces and sways at the bottom of the screen.', 'actions': "The scene begins with the word 'Ice' and an image of a blue ice cube appearing in a pink-framed white board at the top. A plus sign appears next to it. Then, the word 'Cream' and an image of a pink and yellow ice cream swirl appear. Finally, the two words and images merge to form 'Icecream' with an image of a pink ice cream cone. The cat character bounces rhythmically throughout.", 'dialogues': "A clear, enthusiastic voiceover pronounces: 'Ice', 'Cream', 'Icecream'.", 'visual_environment': 'The background is a vibrant sky blue, adorned with sparkling white star shapes that subtly twinkle. The main visual focus is a large, rectangular white board with a thick pink border, positioned at the top of the screen.', 'camera': 'Static shot, focusing on the top board and the bouncing cat. Elements within the board appear and disappear with quick cuts.', 'sounds': "Upbeat, cheerful background music plays continuously. A 'pop' sound effect accompanies the appearance of each word/image, and a 'ding' sound effect signifies the successful formation of the compound word. The voiceover is prominent.", 'moods': 'Joyful, educational, and stimulating.'}, {'timestamp': '2.9-5.8', 'duration_seconds': 2.9, 'scene_type': 'Educational Word Puzzle: Snowman', 'characters': 'The same cheerful, orange cartoon cat continues to bounce and sway at the bottom of the screen.', 'actions': "The background transitions from blue to a bright pink. The word 'Snow' and a blue snowflake appear in the top board, followed by a plus sign. Then, the word 'Man' and an image of a cartoon man (similar to the cat character but with human features and a shirt/tie) appear. These elements then combine to form 'Snowman' with an image of a classic snowman with a blue scarf and carrot nose. The cat maintains its rhythmic bouncing.", 'dialogues': "The voiceover pronounces: 'Snow', 'Man', 'Snowman'.", 'visual_environment': 'The background shifts to a bright, solid pink, still featuring the twinkling white stars. The pink-framed white board remains at the top.', 'camera': 'Static shot, with elements appearing and transforming within the frame.', 'sounds': "The cheerful background music continues. 'Pop' sound effects for new elements, 'ding' for the compound word, and the clear voiceover.", 'moods': 'Engaging, playful, and informative.'}, {'timestamp': '5.8-8.8', 'duration_seconds': 3.0, 'scene_type': 'Educational Word Puzzle: Popcorn', 'characters': 'The consistent orange cartoon cat continues its lively bouncing.', 'actions': "The background changes to a vivid green. The word 'Pop' and a colorful circular pop-it toy appear in the top board, followed by a plus sign. Next, the word 'Corn' and an image of three yellow corn cobs appear. These then merge to display 'Popcorn' with an image of a striped bucket filled with popcorn. The cat's bouncing is synchronized with the music.", 'dialogues': "The voiceover states: 'Pop', 'Corn', 'Popcorn'.", 'visual_environment': 'The background is now a bright green, maintaining the twinkling white stars. The pink-framed white board is still the central display area.', 'camera': 'Static, with dynamic content changes within the board.', 'sounds': "Continuous cheerful music, 'pop' and 'ding' sound effects, and the voiceover.", 'moods': 'Lively, educational, and fun.'}]}

VISTA (Ours): Create an 8-second animated educational video for young children, designed as an engaging visual and auditory puzzle to teach compound words... Create an 8-second animated educational video for young children, designed as an engaging visual and auditory puzzle to teach compound words. The core mechanism involves clearly presenting two individual words, each with its unique cartoon image, before they smoothly combine into a new compound word accompanied by its own distinct image. Throughout the video, a cheerful orange cartoon cat, characterized by large eyes, rosy cheeks, and red polka-dotted overalls, consistently bounces at the screen's bottom. The scene opens with a static shot of a pink-framed white board set against a vibrant sky blue background adorned with twinkling white stars. For the initial 2.9 seconds, 'Ice' and a blue ice cube gracefully appear on the board, followed by a smoothly scaling plus sign. Subsequently, 'Cream' and a pink and yellow ice cream swirl fluidly emerge. These elements then gently transform and merge to reveal 'Icecream' alongside a pink ice cream cone. An enthusiastic voiceover pronounces 'Ice', 'Cream', 'Icecream', complemented by upbeat music, 'pop' sound effects for new elements, and a 'ding' for the successful compound word. From 2.9 to 5.8 seconds, the background transitions to a bright pink. The puzzle continues as 'Snow' and a blue snowflake fluidly appear, followed by a scaling plus sign. Next, 'Man' and a cartoon man (resembling the cat but with human features and a shirt/tie) fluidly emerge. These seamlessly combine with a gentle transformation to form 'Snowman' with a classic snowman image. The voiceover guides with 'Snow', 'Man', 'Snowman', maintaining the cheerful music and sound effects. The final 3 seconds (5.8-8.8s) feature a vivid green background. 'Pop' and a colorful circular pop-it toy fluidly appear, followed by a scaling plus sign. Then, 'Corn' and three yellow corn cobs fluidly emerge. These elements, 'Pop' with its toy and 'Corn' with its cobs, seamlessly merge and gently transform into 'Popcorn' with a striped bucket of popcorn. The voiceover concludes with 'Pop', 'Corn', 'Popcorn'. The camera remains static, ensuring focus on the dynamic board content and the ever-present cat, delivering a lively, educational, and fun experience.

Direct Prompting (Original Prompt): An 8-second video begins outdoors on a bright, sunny day. A bearded man in a red cap, blue-tinted sunglasses, purple hoodie, and black headphones addresses the camera...

VISTA (Ours): The 8-second video opens outdoors on a vibrant, sunlit day. For the initial 5.5 seconds, the camera presents a static, chest-up, eye-level view of a man... {'prompt_name': "Trivia Master's Outdoor Riddle", 'prompt_description': 'The 8-second video opens outdoors on a vibrant, sunlit day. For the initial 5.5 seconds, the camera presents a static, chest-up, eye-level view of a man. The composition is artfully balanced, utilizing the shallow depth of field to enhance the subject\'s presence and create a visually engaging frame. He sports a full beard, a red baseball cap, black sunglasses with striking blue reflective lenses, a purple hoodie, and black over-ear headphones. Behind him, a gently blurred scene of a serene suburban street, complete with residential homes and abundant green foliage, stretches beneath a clear, light blue sky. The bright, sunny outdoor setting is bathed in a warm, inviting light, creating a subtly cinematic atmosphere. The reflections in his sunglasses are impeccably clean, free of any camera equipment or unwanted details, preserving visual immersion.\n\nA clear, warm, and resonant male voice speaks directly to the viewer. Faint, yet subtly varied, ambient street noise, including distant traffic and occasional natural outdoor sounds (e.g., a faint bird chirp or rustle of leaves), is audible in the background, adding to the immersive realism. The man asks, "Which comedian is known for their deadpan delivery?" Simultaneously, a sleek, black rectangular text box with white lettering gracefully emerges from the bottom, positioned subtly at the screen\'s center-bottom, displaying "WHICH COMEDIAN IS KNOWN FOR THEIR DEADPAN DELIVERY?". After a brief, expectant pause, he declares, "Jeff Dye." The text overlay then smoothly shifts: the question fades out as "JEFF \"DYE\"" fades in, maintaining its discreet size and placement. A subtle, knowing smile plays on his lips as he finishes.\n\nAt the 5.5-second mark, the video transitions from the live-action segment to the outro, marked by a quick fade to black followed by a fade in. The final 2.5 seconds showcase a pristine, minimalist white background. "Master of Puns" in black text, accompanied by a colorful 3D glasses emoji, is centered above a black rectangular "SUBSCRIBE!" button with a subtle grey border. A crisp \'pop\' sound effect signals the video\'s conclusion.'}

Direct Prompting (Original Prompt): An aerial view of a lush, green forest with a river winding through it, highlighting the contrast between the dense foliage and the clear water.

VISTA (Ours): Produce an 8-second establishing shot video centered on an aerial view of a lush, green forest traversed by a winding river...

Direct Prompting (Original Prompt): The person's forehead creased with worry as he listened to bad news.

VISTA (Ours): Generate an 8-second photorealistic video depicting a man's intense emotional reaction upon hearing distressing news. The video commences with a medium shot...

Direct Prompting (Original Prompt): Craft an 8-second video showcasing the unbridled joy of children at play in a supermarket environment. A grey shopping cart holds a mischievous toddler...

VISTA (Ours): From the comprehensive JSON video script, specific directives, and general constraints detailed below, create an 8-second video. The video captures a heartwarming and amusing interaction between a toddler seated in a shopping cart... From the comprehensive JSON video script, specific directives, and general constraints detailed below, create an 8-second video. **Video Script (JSON Format):** {'overall_content': 'The video captures a heartwarming and amusing interaction between a toddler seated in a shopping cart and an older child in a supermarket aisle. The toddler repeatedly drops a red box onto the floor, and the older child patiently and playfully picks it up and returns it, creating a loop of innocent fun and laughter.', 'theme': 'The central theme is the simple joy found in repetitive play, sibling interaction, and the infectious happiness of young children.', 'tone': 'The overall tone is joyful, lighthearted, and amusing, filled with the innocent delight of childhood play.', 'scenes': [{'timestamp': '0-1.5', 'duration_seconds': 1.5, 'scene_type': 'Playful drop and retrieve in a supermarket aisle', 'characters': 'A young toddler (likely a boy) with dark hair, wearing a patterned jacket, blue pants, red Mickey Mouse shoes, and striped socks, is seated in a grey shopping cart. An older child (a girl) with long dark hair, wearing a light-colored top, is on the floor. A partially visible adult is in the background.', 'actions': 'The toddler, with a mischievous grin, holds a red rectangular box and deliberately drops it onto the grey tiled floor. The box tumbles and lands with a soft thud. The older child quickly bends down to pick up the fallen box.', 'dialogues': "A distinct 'thud' sound as the box hits the floor. The toddler lets out a short, joyful vocalization.", 'visual_environment': 'The scene is set in a brightly lit supermarket aisle. Tall metal shelves filled with various products, including cleaning supplies and household items, line both sides of the aisle. The floor is light grey tile. The shopping cart is grey metal with a green handle.', 'camera': "The camera is held at a slightly low angle, focusing on the toddler in the shopping cart and the floor directly in front of it. It's a mostly static shot, capturing the action clearly, though with slight handheld movement.", 'sounds': "The primary sound is the 'thud' of the box, accompanied by the toddler's soft vocalizations. Faint ambient supermarket sounds are present.", 'moods': 'Amused, playful, innocent.'}, {'timestamp': '1.5-3.5', 'duration_seconds': 2.0, 'scene_type': 'Repetitive playful interaction', 'characters': 'The same toddler is still seated in the shopping cart, now holding the red box again. The older child is seen handing the box back to the toddler and then looking up at them.', 'actions': 'The toddler, with an even wider smile, immediately drops the red box again, watching it fall. The older child, having just returned the box, bends down once more to retrieve it. The toddler looks down at the older child, anticipating the pick-up.', 'dialogues': "Another clear 'thud' as the box lands. The toddler emits a louder, more excited giggle.", 'visual_environment': 'The supermarket aisle remains the same, with shelves of products visible. The bright fluorescent lighting illuminates the scene.', 'camera': 'The camera maintains its slightly low-angle position, with a slight, natural shake from being handheld, keeping the toddler and the immediate floor area in focus.', 'sounds': "The sound of the falling box and the toddler's infectious laughter are prominent. Background chatter and ambient supermarket noise continue.", 'moods': 'Joyful, repetitive, amusing.'}, {'timestamp': '3.5-5.5', 'duration_seconds': 2.0, 'scene_type': 'Continued playful loop with a smile', 'characters': 'The toddler is still in the cart, holding the box. The older child is now on the floor, having just picked up the box, and looks towards the camera with a bright smile.', 'actions': 'The older child, having just picked up the box, turns their head towards the camera and smiles broadly, acknowledging the playful situation. Shortly after, the toddler, with a look of pure delight, drops the red box for the third time.', 'dialogues': "The distinct 'thud' of the box is heard. The toddler's laughter is continuous and infectious. An adult voice can be heard saying 'Oh my god'.", 'visual_environment': "The supermarket setting is consistent, with shelves of products and the tiled floor. The lighting remains bright. The older child's direct gaze and smile towards the camera add a personal touch and are a key visual cue of the positive mood.", 'camera': 'The camera remains mostly static and low-angled, capturing the full interaction.', 'sounds': "The sound of the falling box and the toddler's joyful laughter dominate the audio.", 'moods': 'Happy, engaging, lighthearted.'}, {'timestamp': '5.5-7.5', 'duration_seconds': 2.0, 'scene_type': 'Final drop and triumphant laughter', 'characters': 'The toddler is still in the shopping cart, now laughing heartily. The older child is seen picking up the box one last time.', 'actions': 'The toddler drops the red box for the fourth time, this time with an even more pronounced, open-mouthed laugh. The older child bends down to pick it up. The toddler looks directly at the camera, beaming with pure joy.', 'dialogues': "The final 'thud' of the box. The toddler's loud, uninhibited, and infectious laughter fills the audio, clearly the highlight of the scene.", 'visual_environment': "The supermarket aisle, shelves, and lighting are unchanged. The focus remains on the children's interaction.", 'camera': "The camera maintains its mostly static, low-angle shot, perfectly framing the toddler's joyful expression and the repetitive action. The shot ends with the toddler's face full of laughter.", 'sounds': "The most prominent sound is the toddler's loud, joyful laughter, which is highly contagious. The 'thud' of the box is also present.", 'moods': 'Exuberant, joyful, triumphant, highly amusing.'}] **Specific Directives for Video Generation:** **[5.5s–7.5s]**: - **Scene type**: Final drop and triumphant laughter - **Camera Focus**: Maintains its mostly static, low-angle shot, perfectly framing the toddler's joyful expression and the repetitive action. The shot ends with the toddler's face full of laughter. **General Constraints for Video Output**: - The video must be non-cartoon obeying real-world physics, unless the User Prompt explicitly specifies it is cartoon/animated. - Do **not** generate the video with multiple scene transitions, unless the Video Details have multiple transitions. - Do **not** include any captions or subtitles. - Do **not** generate any human voice-over unless the overall message explicitly requires. - The video must begin smoothly—no abrupt or jarring openings. - The ending must be complete and polished, with all character dialogue and actions fully delivered. Avoid any sudden or unfinished cuts.

Direct Prompting (Original Prompt): A couple runs through a sudden downpour, laughing and splashing in puddles as they try to find shelter.

VISTA (Ours): Produce an 8-second live-action video depicting the scenario: "A couple runs through a sudden downpour, laughing and splashing in puddles as they try to find shelter."...

Citation

If you find our work useful, please consider citing:

@article{long2025vista,
  title={VISTA: A Test-Time Self-Improving Video Generation Agent},
  author={Long, Do Xuan and Wan, Xingchen and Nakhost, Hootan and Lee, Chen-Yu and Pfister, Tomas and Arik, Sercan O},
  journal={arXiv preprint arXiv:2510.15831},
  year={2025},
  url={https://arxiv.org/abs/2510.15831}
}

We acknowledge some of the prompts above being from MovieGenVideo (Meta).