CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

1KAIST    2Kyung Hee University    3AITRICS
*Equal contribution.   †Corresponding author.
CollabVR concept: VLM as planner, VGM as simulator, in a closed loop.

VLM as planner, VGM as simulator. A VLM is strong at reasoning but weak at visual simulation; a VGM simulates short clips but lacks reasoning, which causes long-horizon drift and mid-clip simulation errors. CollabVR couples them in a closed loop where the VLM plans the immediate next action, inspects each generated clip, and routes test-time compute across qualitatively distinct recovery strategies (re-generation, action splitting) matched to the diagnosed failure.

Highlights

  • Closed-loop, step-level coupling. The VLM plans one action at a time, watches the clip the VGM produces, and decides whether to regenerate or split — replacing both upfront plans and post-hoc critiques.
  • Consistent gains across open- and closed-source VGMs. On Gen-ViRe, CollabVR lifts VBVR-Wan2.2 from 0.391 → 0.531 (+0.140) and Veo 3.1 from 0.481 → 0.550 (+0.069), with the largest improvements on long-horizon Planning and Algorithmic tasks.
  • Better cost–quality frontier. Rather than throwing more samples at the same prompt, CollabVR routes compute across qualitatively distinct recovery strategies (verifier-guided re-generation, action splitting) — reaching a strictly better cost–quality point than Pass@k and prior video test-time scaling baselines on both VGMs.

Abstract

Recent Thinking with Video approaches use Video Generation Models (VGMs) for visual reasoning by producing temporally coherent Chain-of-Frames as reasoning artifacts. Even strong VGMs, however, exhibit two recurring failure modes on goal-directed tasks: long-horizon drift when a single prompt specifies a multi-step task, and mid-clip simulation errors that propagate through subsequent frames. Both stem from the same gap: the VGM offers a strong short-horizon visual prior but no explicit reasoning on top of it, a role naturally filled by Vision-Language Models (VLMs). Where to place the VLM, however, is non-trivial: upfront plans commit before any frame is generated, and post-hoc critiques over whole videos intervene too late. We propose CollabVR, a closed-loop framework that couples the VLM with the VGM at step-level granularity: the VLM plans the immediate next action, inspects the clip the VGM generates, and routes test-time compute across qualitatively distinct recovery strategies (re-generation, action splitting) matched to the diagnosed failure. On Gen-ViRe and VBVR-Bench, CollabVR consistently improves both open-source and closed-source VGMs over single-inference, Pass@k, and prior video test-time scaling baselines at matched compute, with the largest gains on the hardest tasks. It also yields further improvements on top of a reasoning-fine-tuned VGM, indicating that step-level VLM supervision is orthogonal to, and stackable with, reasoning-oriented fine-tuning.

Method

CollabVR overall architecture.

Closed-loop architecture. At each step, the VLM observes the current frame and decides one immediate action; the VGM renders a short clip realizing that action; the VLM then verifies whether the resulting last frame matches the planned target. Verification outcomes route compute across two complementary modules: M1 (progressive planning) chooses the per-state action granularity by adaptively selecting the number of sub-steps N, and M2 (verification + re-generation) rejects mid-state deviations and regenerates the offending clip up to a budget M. The two modules address different failure modes — long-horizon drift versus mid-clip error — and are activated independently per state.
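To make the per-state routing concrete, the sketch below shows one way Module 2 could be realized in Python: render a clip, verify its last frame against the planned target, evolve the prompt and retry up to the budget M, and surface the diagnosed failure mode to the outer loop on exhaustion. All helper names (vgm_render, vlm_verify, evolve_prompt) and the verdict fields are our own placeholders, not the paper's API; only the control flow follows the description above.

    def generate_verified_clip(action, frame, vgm_render, vlm_verify,
                               evolve_prompt, budget_M=3):
        # Module 2: render a clip for `action`, verify its last frame against
        # the planned target, and retry with an evolved prompt on reject.
        failure = None
        for _ in range(budget_M):
            clip = vgm_render(prompt=action, first_frame=frame)
            verdict = vlm_verify(clip=clip, target=action)  # accept/reject + failure mode d
            if verdict.accepted:
                return clip, None                           # accepted: nothing to route
            failure = verdict.failure                       # diagnosed failure mode d
            action = evolve_prompt(action, failure)         # prompt evolution before re-sampling
        return None, failure                                # budget M exhausted: route recovery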

Algorithm: CollabVR procedure.

Algorithm. At each planning step t, the VLM emits an action a_t conditioned on the current frame, the task prompt, and the history. The VGM renders a clip c_t, which the verifier accepts or rejects together with a diagnosed failure mode d. On reject, the action is updated by prompt evolution and the clip is re-sampled, up to M attempts; if all M attempts fail, the framework routes to a recovery strategy selected according to d. On accept, the clip is appended and the loop continues until task completion or the cap N_max is reached. Both N and the recovery path are decided online, per state.
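Putting the pieces together, here is a hedged sketch of the outer loop as we read the algorithm, reusing generate_verified_clip from the previous sketch as gen_clip. vlm_plan, is_done, and split_action are assumed placeholders; taking only the first sub-step after a split is a simplification of the recovery routing, not the authors' exact procedure.

    def collabvr(task, first_frame, vlm_plan, is_done, split_action,
                 gen_clip, n_max=12):
        # Outer loop: one action per state, verified before the loop advances.
        frame, history, video = first_frame, [], []
        while len(history) < n_max and not is_done(task, frame):
            action = vlm_plan(task, frame, history)   # immediate next action only
            clip, failure = gen_clip(action, frame)   # M2: render + verify, <= M retries
            if clip is None:                          # all M attempts rejected
                # Route by the diagnosed failure mode d; for a granularity
                # failure, M1 splits the action into finer sub-steps.
                action = split_action(action, failure)[0]
                clip, failure = gen_clip(action, frame)
                if clip is None:
                    break                             # unrecoverable at this state
            video.append(clip)
            frame = clip.last_frame                   # advance from the realized frame
            history.append(action)
        return video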

Pre-planning vs. progressive planning, with cost-performance plot.

Pre-planning vs. progressive planning. A natural extension of Chain-of-Thought to video is pre-planning: the VLM decomposes the task into N milestone prompts upfront, and the VGM generates one clip per milestone. Pre-planning commits the full plan before any frame exists, so it cannot adapt to what the generator actually produces, and the right N is hard to determine from the prompt alone. CollabVR's progressive planning instead emits one action at a time and inspects the realized clip before continuing, so both later sub-steps and N adapt to the VGM's output. At matched cost, progressive planning yields a +13% relative gain over pre-planning on Gen-ViRe with VBVR-Wan2.2 (Module 1 only).
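The difference is easiest to see side by side. The sketch below contrasts the two regimes with the same placeholder helpers as above: pre-planning never looks at a rendered frame before committing its next prompt, while progressive planning re-plans from each realized last frame, so N emerges from the loop itself.

    def pre_planning(task, frame, vlm_decompose, vgm_render, n=4):
        milestones = vlm_decompose(task, n)           # all N prompts committed upfront
        clips = []
        for prompt in milestones:                     # no feedback between clips
            clip = vgm_render(prompt=prompt, first_frame=frame)
            clips.append(clip)
            frame = clip.last_frame
        return clips

    def progressive_planning(task, frame, vlm_plan, vgm_render, is_done, n_max=8):
        clips, history = [], []
        while len(clips) < n_max and not is_done(task, frame):
            action = vlm_plan(task, frame, history)   # next action conditioned on what
            clip = vgm_render(prompt=action,          # the VGM actually produced
                              first_frame=frame)
            clips.append(clip)
            history.append(action)
            frame = clip.last_frame                   # N is decided online, per state
        return clips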

Main Results

We evaluate CollabVR on two benchmarks: Gen-ViRe (six-category visual reasoning over goal-directed video tasks) and VBVR-Bench. Across open-source (VBVR-Wan2.2, Cosmos-Predict2.5) and closed-source (Veo 3.1) VGMs, CollabVR delivers consistent gains over single-inference, Pass@k, and prior video test-time scaling baselines at matched compute.

Performance-cost trade-off on Gen-ViRe.

Performance–cost trade-off on Gen-ViRe. Pass@k resampling plateaus quickly with cost, and prior video test-time scaling (VideoTPO) trades extra budget for only modest improvement. CollabVR reaches a markedly higher score at lower budget on both VBVR-Wan2.2 and Veo 3.1, placing it at a strictly better point on the cost–quality frontier.

Table 1: Gen-ViRe main results.

Gen-ViRe (Table 1). CollabVR improves both VBVR-Wan2.2 (0.391 → 0.531, +0.140) and Veo 3.1 (0.481 → 0.550, +0.069) over the single-inference Pass@1 baseline. Gains are most pronounced on long-horizon categories (Planning, Algorithmic), where the closed-loop M1+M2 routing turns failures into correctable signals rather than letting them cascade.

Table 2: VBVR-Bench main results.

VBVR-Bench (Table 2). On VBVR-Bench, CollabVR consistently outperforms baselines on open-source VGMs (VBVR-Wan2.2, Cosmos-Predict2.5), with the largest gains on categories that require multi-step spatial and transformation reasoning.

Qualitative Comparisons

Pre-planning vs. Progressive Planning (Ours)

A direct comparison of pre-planning (the VLM commits to N milestone prompts upfront) versus CollabVR's progressive planning (the VLM emits one action at a time and adapts N to what the VGM produces). Same task, same VGM, matched cost.

"The scene shows a network of nodes connected by directed edges (edges with arrows indicating direction) with a green starting node, a red ending node, and a blue triangular agent positioned at the green starting node. The agent can only move along edges in the direction they point, moving from one node to an adjacent node each step. Move the blue triangular agent from the green starting node to the red ending node along the path with the minimum number of steps."

Pre-planning

Progressive planning (Ours)

BibTeX

@article{kim2026collabvr,
  title   = {CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models},
  author  = {Kim, Joowon and Shin, Seungho and Park, Joonhyung and Yang, Eunho},
  journal = {arXiv preprint},
  year    = {2026}
}