CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

1KAIST    2Kyung Hee University    3AITRICS
*Equal contribution.   †Corresponding author.
CollabVR concept: VLM as planner, VGM as simulator, in a closed loop.

VLM as planner, VGM as simulator. A VLM is strong at reasoning but weak at visual simulation; a VGM simulates short clips but lacks reasoning, which causes long-horizon drift and mid-clip simulation errors. CollabVR couples them in a closed loop where the VLM plans the immediate next action, inspects each generated clip, and routes test-time compute across qualitatively distinct recovery strategies (re-generation, action splitting) matched to the diagnosed failure.

Highlights

  • Closed-loop, step-level coupling. The VLM plans one action at a time, watches the clip the VGM produces, and decides whether to regenerate or split — replacing both upfront plans and post-hoc critiques.
  • Consistent gains across open- and closed-source VGMs. On Gen-ViRe, CollabVR lifts VBVR-Wan2.2 from 0.391 → 0.531 (+0.140) and Veo 3.1 from 0.481 → 0.550 (+0.069), with the largest improvements on long-horizon Planning and Algorithmic tasks.
  • Better cost–quality frontier. Rather than throwing more samples at the same prompt, CollabVR routes compute across qualitatively distinct recovery strategies (verifier-guided re-generation, action splitting) — reaching a strictly better cost–quality point than Pass@k and prior video test-time scaling baselines on both VGMs.

Abstract

Recent Thinking with Video approaches use Video Generation Models (VGMs) for visual reasoning by producing temporally coherent Chain-of-Frames as reasoning artifacts. Even strong VGMs, however, exhibit two recurring failure modes on goal-directed tasks: long-horizon drift when a single prompt specifies a multi-step task, and mid-clip simulation errors that propagate through subsequent frames. Both stem from the same gap: the VGM offers a strong short-horizon visual prior but no explicit reasoning on top of it, a role naturally filled by Vision-Language Models (VLMs). Where to place the VLM, however, is non-trivial: upfront plans commit before any frame is generated, and post-hoc critiques over whole videos intervene too late. We propose CollabVR, a closed-loop framework that couples the VLM with the VGM at step-level granularity: the VLM plans the immediate next action, inspects the clip the VGM generates, and routes test-time compute across qualitatively distinct recovery strategies (re-generation, action splitting) matched to the diagnosed failure. On Gen-ViRe and VBVR-Bench, CollabVR consistently improves both open-source and closed-source VGMs over single-inference, Pass@k, and prior video test-time scaling baselines at matched compute, with the largest gains on the hardest tasks. It also yields further improvements on top of a reasoning-fine-tuned VGM, indicating that step-level VLM supervision is orthogonal to, and stackable with, reasoning-oriented fine-tuning.

Method

CollabVR overall architecture.

Closed-loop architecture. At each step, the VLM observes the current frame and decides one immediate action; the VGM renders a short clip realizing that action; the VLM then verifies whether the resulting last frame matches the planned target. Verification outcomes route compute across two complementary modules: M1 (progressive planning) chooses the per-state action granularity by adaptively selecting the number of sub-steps N, and M2 (verification + re-generation) rejects mid-state deviations and regenerates the offending clip up to a budget M. The two modules address different failure modes — long-horizon drift versus mid-clip error — and are activated independently per state.
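To make the per-state routing concrete, the sketch below shows one way Module 2 could be realized in Python: render a clip, verify its last frame against the planned target, evolve the prompt and retry up to the budget M, and surface the diagnosed failure mode to the outer loop on exhaustion. All helper names (vgm_render, vlm_verify, evolve_prompt) and the verdict fields are our own placeholders, not the paper's API; only the control flow follows the description above.

    def generate_verified_clip(action, frame, vgm_render, vlm_verify,
                               evolve_prompt, budget_M=3):
        # Module 2: render a clip for `action`, verify its last frame against
        # the planned target, and retry with an evolved prompt on reject.
        failure = None
        for _ in range(budget_M):
            clip = vgm_render(prompt=action, first_frame=frame)
            verdict = vlm_verify(clip=clip, target=action)  # accept/reject + failure mode d
            if verdict.accepted:
                return clip, None                           # accepted: nothing to route
            failure = verdict.failure                       # diagnosed failure mode d
            action = evolve_prompt(action, failure)         # prompt evolution before re-sampling
        return None, failure                                # budget M exhausted: route recovery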

Algorithm: CollabVR procedure.

Algorithm. At each planning step t, the VLM emits an action a_t conditioned on the current frame, the task prompt, and the history. The VGM renders a clip c_t, which the verifier accepts or rejects together with a diagnosed failure mode d. On reject, the action is updated by prompt evolution and the clip is re-sampled, up to M attempts; if all M attempts fail, the framework routes to a recovery strategy selected according to d. On accept, the clip is appended and the loop continues until task completion or the cap N_max is reached. Both N and the recovery path are decided online, per state.
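Putting the pieces together, here is a hedged sketch of the outer loop as we read the algorithm, reusing generate_verified_clip from the previous sketch as gen_clip. vlm_plan, is_done, and split_action are assumed placeholders; taking only the first sub-step after a split is a simplification of the recovery routing, not the authors' exact procedure.

    def collabvr(task, first_frame, vlm_plan, is_done, split_action,
                 gen_clip, n_max=12):
        # Outer loop: one action per state, verified before the loop advances.
        frame, history, video = first_frame, [], []
        while len(history) < n_max and not is_done(task, frame):
            action = vlm_plan(task, frame, history)   # immediate next action only
            clip, failure = gen_clip(action, frame)   # M2: render + verify, <= M retries
            if clip is None:                          # all M attempts rejected
                # Route by the diagnosed failure mode d; for a granularity
                # failure, M1 splits the action into finer sub-steps.
                action = split_action(action, failure)[0]
                clip, failure = gen_clip(action, frame)
                if clip is None:
                    break                             # unrecoverable at this state
            video.append(clip)
            frame = clip.last_frame                   # advance from the realized frame
            history.append(action)
        return video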

Pre-planning vs. progressive planning, with cost-performance plot.

Pre-planning vs. progressive planning. A natural extension of Chain-of-Thought to video is pre-planning: the VLM decomposes the task into N milestone prompts upfront, and the VGM generates one clip per milestone. Pre-planning commits the full plan before any frame exists, so it cannot adapt to what the generator actually produces, and the right N is hard to determine from the prompt alone. CollabVR's progressive planning instead emits one action at a time and inspects the realized clip before continuing, so both later sub-steps and N adapt to the VGM's output. At matched cost, progressive planning yields a +13% relative gain over pre-planning on Gen-ViRe with VBVR-Wan2.2 (Module 1 only).
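The difference is easiest to see side by side. The sketch below contrasts the two regimes with the same placeholder helpers as above: pre-planning never looks at a rendered frame before committing its next prompt, while progressive planning re-plans from each realized last frame, so N emerges from the loop itself.

    def pre_planning(task, frame, vlm_decompose, vgm_render, n=4):
        milestones = vlm_decompose(task, n)           # all N prompts committed upfront
        clips = []
        for prompt in milestones:                     # no feedback between clips
            clip = vgm_render(prompt=prompt, first_frame=frame)
            clips.append(clip)
            frame = clip.last_frame
        return clips

    def progressive_planning(task, frame, vlm_plan, vgm_render, is_done, n_max=8):
        clips, history = [], []
        while len(clips) < n_max and not is_done(task, frame):
            action = vlm_plan(task, frame, history)   # next action conditioned on what
            clip = vgm_render(prompt=action,          # the VGM actually produced
                              first_frame=frame)
            clips.append(clip)
            history.append(action)
            frame = clip.last_frame                   # N is decided online, per state
        return clips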

Main Results

We evaluate CollabVR on two benchmarks: Gen-ViRe (six-category visual reasoning over goal-directed video tasks) and VBVR-Bench. Across open-source (VBVR-Wan2.2, Cosmos-Predict2.5) and closed-source (Veo 3.1) VGMs, CollabVR delivers consistent gains over single-inference, Pass@k, and prior video test-time scaling baselines at matched compute.

Performance-cost trade-off on Gen-ViRe.

Performance–cost trade-off on Gen-ViRe. Pass@k resampling plateaus quickly with cost, and prior video test-time scaling (VideoTPO) trades extra budget for only modest improvement. CollabVR reaches a markedly higher score at lower budget on both VBVR-Wan2.2 and Veo 3.1, placing it at a strictly better point on the cost–quality frontier.

Table 1: Gen-ViRe main results.

Gen-ViRe (Table 1). CollabVR improves both VBVR-Wan2.2 (0.391 → 0.531, +0.140) and Veo 3.1 (0.481 → 0.550, +0.069) over the single-inference Pass@1 baseline. Gains are most pronounced on long-horizon categories (Planning, Algorithmic), where the closed-loop M1+M2 routing turns failures into correctable signals rather than letting them cascade.

Table 2: VBVR-Bench main results.

VBVR-Bench (Table 2). On VBVR-Bench, CollabVR consistently outperforms baselines on open-source VGMs (VBVR-Wan2.2, Cosmos-Predict2.5), with the largest gains on categories that require multi-step spatial and transformation reasoning.

Qualitative Comparisons

Pre-planning vs. Progressive Planning (Ours)

A direct comparison of pre-planning (the VLM commits to N milestone prompts upfront) versus CollabVR's progressive planning (the VLM emits one action at a time and adapts N to what the VGM produces). Same task, same VGM, matched cost.

"The scene shows a network of nodes connected by directed edges (edges with arrows indicating direction) with a green starting node, a red ending node, and a blue triangular agent positioned at the green starting node. The agent can only move along edges in the direction they point, moving from one node to an adjacent node each step. Move the blue triangular agent from the green starting node to the red ending node along the path with the minimum number of steps."

Pre-planning

Progressive planning (Ours)

BibTeX

@article{kim2026collabvr,
  title   = {CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models},
  author  = {Kim, Joowon and Shin, Seungho and Park, Joonhyung and Yang, Eunho},
  journal = {arXiv preprint},
  year    = {2026}
}