Vid-Gen Video Generation Lab

Advanced Video Generation with High Quality, Faithfulness, and Physical Intelligence

We are redefining the boundaries of video synthesis by bridging the gap between complex human intent and physically consistent output. Our research focuses on interpreting sophisticated, multi-modal instructions to generate ultra-high-definition visual content. By embedding causal logic and world physics into the generative process, we ensure every frame is not only visually stunning but also faithful to the laws of the real world.

Research Papers

Track 01

High-Fidelity Visual Synthesis

Synthesize cinematic-quality videos with exceptional clarity, temporal stability, and ultra-high resolution.

Track 02

Complex Intent Interpretation

Decode intricate human intent from sophisticated, multi-modal instructions into precise generative guidance.

Track 03

Physical and Causal Consistency

Embed structural world physics and causal logic to ensure every dynamic scene respects real-world laws.

2025 · Any-condition control

Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

We propose Any2Caption, a novel framework for controllable video generation from any condition by leveraging MLLMs to interpret diverse inputs into dense, structured captions.

  • Decouples condition interpretation from downstream video synthesis.
  • Supports text, image, video, region, motion, and camera controls in one framework.
  • Plug-and-play compatibility with various off-the-shelf SOTA video generators without fine-tuning.
  • Superior multi-condition reasoning that enhances both generation fidelity and physical consistency.
  • Introduces Any2CapIns, a large-scale instruction-tuning dataset for any-condition captioning.
Overview figure for Any2Caption
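
Below is a minimal sketch of this decoupled design, assuming the two-stage split described above: a captioner (stage 1) maps arbitrary conditions to one dense, structured caption, and any off-the-shelf text-to-video generator (stage 2) consumes that caption unchanged. The StructuredCaption schema, function names, and toy stand-ins are hypothetical illustrations, not the paper's actual interface or caption format.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class StructuredCaption:
    """Hypothetical dense-caption schema; the paper's actual fields may differ."""
    scene: str = ""    # subjects and setting
    motion: str = ""   # subject/object dynamics
    camera: str = ""   # camera trajectory and framing
    style: str = ""    # visual style and lighting

    def to_prompt(self) -> str:
        parts = (self.scene, self.motion, self.camera, self.style)
        return " ".join(p for p in parts if p)

def interpret_conditions(
    captioner: Callable[[dict[str, Any]], StructuredCaption],
    conditions: dict[str, Any],
) -> StructuredCaption:
    """Stage 1: an MLLM-based captioner maps any mix of conditions
    (text, image, video, region, motion, camera) to one structured
    caption. Only this stage is trained."""
    return captioner(conditions)

def generate_video(
    generator: Callable[[str], Any],
    captioner: Callable[[dict[str, Any]], StructuredCaption],
    conditions: dict[str, Any],
) -> Any:
    """Stage 2: any off-the-shelf text-to-video generator consumes the
    caption as a plain prompt, so it needs no fine-tuning (plug-and-play)."""
    return generator(interpret_conditions(captioner, conditions).to_prompt())

# Toy stand-ins so the sketch runs end to end.
def toy_captioner(conditions: dict[str, Any]) -> StructuredCaption:
    return StructuredCaption(
        scene=str(conditions.get("text", "")),
        camera="slow dolly-in" if "camera" in conditions else "",
    )

def toy_generator(prompt: str) -> str:
    return f"<video conditioned on: {prompt!r}>"

print(generate_video(toy_generator, toy_captioner,
                     {"text": "a red fox running through snow",
                      "camera": {"trajectory": "dolly-in"}}))
```

Confining all condition understanding to stage 1 is what makes stage 2 plug-and-play: swapping the generator requires no retraining of either component.
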
2025 · Instruction reasoning

ReaDe: A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators.

  • Uses a reason-then-describe pipeline to resolve ambiguity before generation.
  • Combines reasoning-augmented supervision with reward-based refinement.
  • Improves instruction fidelity across single-condition, multi-condition, and unseen inputs.
Overview figure for ReaDe
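
As a rough illustration of the reason-then-describe idea, the sketch below makes two calls to the same interpreter model: one to analyse and disambiguate the raw instruction, one to emit the final actionable specification. The prompt templates and the toy interpreter are assumptions for illustration, not ReaDe's actual prompts; the paper additionally trains the interpreter with reasoning-augmented supervision and reward-based refinement, which this inference-only sketch omits.

```python
from typing import Callable

# Illustrative templates -- not the paper's actual prompts.
REASON_TEMPLATE = (
    "Analyse the instruction step by step: identify subjects, actions, "
    "layout, and camera behaviour, and explicitly resolve anything "
    "ambiguous or conflicting.\n\nInstruction: {instruction}"
)
DESCRIBE_TEMPLATE = (
    "Using the analysis, write one precise, unambiguous specification that "
    "a video generator can follow directly.\n\nAnalysis: {analysis}\n\n"
    "Instruction: {instruction}"
)

def reason_then_describe(
    interpreter: Callable[[str], str],  # any instruction-following LM
    instruction: str,
) -> str:
    """Reason first (resolve ambiguity), then describe (emit the spec)."""
    analysis = interpreter(REASON_TEMPLATE.format(instruction=instruction))
    return interpreter(DESCRIBE_TEMPLATE.format(
        analysis=analysis, instruction=instruction))

# Toy interpreter so the sketch executes; swap in a real LM call in practice.
def toy_interpreter(prompt: str) -> str:
    return f"[LM output for: {prompt[:40]}...]"

spec = reason_then_describe(toy_interpreter, "make it look dreamy but realistic")
print(spec)  # feed this specification to any downstream video generator
```

Because the interpreter only reads and writes text, the same specification can drive any downstream generator, which is what makes the approach model-agnostic.
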
Submitted · Benchmark + enhancement

Vid-PRE: Video Prompt Reasoner and Enhancer

From Evaluation to Enhancement: Benchmarking and Improving Think-with-Video Reasoning for Video Generative Models

VWG-Bench diagnoses whether video generators truly reason under rules and goals, and Vid-PRE improves this capability through a model-agnostic prompt rewriter trained with text-only rewards.

  • Builds VWG-Bench to evaluate reasoning-heavy video generation across 9 dimensions and 38 tasks.
  • Separates fluency, rule adherence, and goal realization with a three-level judge protocol.
  • Uses Vid-PRE to rewrite prompts with text-only rewards, improving reasoning without changing generators.
Overview figure for Vid-PRE and VWG-Bench
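
The text-only reward idea can be illustrated with a best-of-n sketch: a rewriter proposes candidate prompts and a text-only judge scores them for rule adherence and goal realization, with no video generated during scoring. Note that Vid-PRE trains the rewriter against such a reward rather than selecting at inference time, and the function names and toy scorer below are hypothetical.

```python
import random
from typing import Callable

def rewrite_prompt(
    rewriter: Callable[[str], str],      # any LM that proposes rewrites
    judge: Callable[[str, str], float],  # text-only scorer (rules + goal)
    task: str,
    n_samples: int = 4,
) -> str:
    """Best-of-n sketch of the text-only reward: sample candidate rewrites
    and keep the one the judge scores highest. No video is rendered while
    scoring, which keeps the signal cheap and generator-agnostic."""
    candidates = [rewriter(task) for _ in range(n_samples)]
    return max(candidates, key=lambda c: judge(task, c))

# Toy stand-ins so the sketch runs; real versions would be LM calls.
def toy_rewriter(task: str) -> str:
    detail = random.choice(["step by step",
                            "respecting all stated rules",
                            "with the goal state visible in the final frame"])
    return f"{task}, rendered {detail}"

def toy_judge(task: str, rewritten: str) -> float:
    # Crude proxy for rule adherence and goal realization.
    return float("rules" in rewritten) + float("goal" in rewritten)

random.seed(0)
print(rewrite_prompt(toy_rewriter, toy_judge,
                     "a ball must bounce exactly three times"))
```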

Survey

Coming soon · Vid-Gen Survey

A forthcoming synthesis of unified video foundation models for comprehension and generation, spanning taxonomy, method landscape, benchmarks, and open challenges.

Survey paper · In preparation

Unified video foundation models for comprehension and generation

Taxonomy · Method landscape · Benchmarks · Open problems