Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason about Dynamics in the Physical 4D World

Yuzhi Huang*♠1   Kairun Wen*1   Rongxin Gao*1   Dongxuan Liu1   Yibin Lou3   Jie Wu2   Jing Xu7 Jian Zhang1   Zheng Yang1   Yunlong Lin1   Chenxin Li4   Panwang Pan1   Junbin Lu5   Jingyan Jiang6 Xinghao Ding1   Yue Huang† 1   Zhi Wang2
1XMU    2THU    3SUSTech    4CUHK    5UW    6SZTU    7JNU
*Equal Contribution     †Corresponding author     ♠Project lead
🌟 CVPR 2026 🌟
Spatio-temporal dynamics reasoning benchmark Dyn-Bench
Figure 1: Spatio-temporal dynamics reasoning benchmark Dyn-Bench. It rigorously evaluates multimodal large language models on their ability to perceive, track, and reason about dynamic content (i.e., object motion, evolving scenes, and camera motion) in the 4D world.

Highlight

  1. Novel 4D Dynamic Perception Dimension: Beyond static visual understanding, we systematically propose “Thinking in Dynamics,” requiring models to perceive, track, and reason about spatio-temporal dynamics in evolving scenes.
  2. Large-Scale Benchmark — Dyn-Bench: We develop Dyn-Bench, a scalable evaluation platform integrating diverse real-world and synthetic 4D video data, comprising 1k videos, 7k VQA pairs, and 3k dynamic object grounding pairs.
  3. Spatio-Temporal Inconsistencies in Existing Models: In-depth probing of general, spatial, and region-level MLLMs reveals that models struggle to simultaneously handle spatio-temporal reasoning and dynamic object grounding, often producing logically inconsistent interpretations.
  4. Effective Structured Enhancement Strategies: Conventional prompting strategies (e.g., CoT) yield limited gains; we propose Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM) to significantly enhance dynamics perception and spatio-temporal reasoning.



Demo Video


    Dyn-Bench

    Benchmark Overview

    We present Dyn-Bench, a large-scale benchmark for quantitatively evaluating the spatio-temporal reasoning abilities of MLLMs on fine-grained dynamic-object understanding. Dyn-Bench consists of 1k dynamic video scenes with 7k visual question answering pairs and 3k grounding annotations, collected from four 2D video segmentation and four 4D dynamic scene datasets spanning diverse environments, motion patterns, and camera trajectories. As illustrated in Fig. 2, the benchmark is structured into three complementary levels: Dynamic Inter-Object Perception, Dynamic Object-Scene Tracking, and Dynamic Camera-Object Reasoning, each integrating spatio-temporal reasoning and dynamic object grounding tasks. An overview of dataset statistics is provided in Fig. 3.

    benchmark category
    Figure 2: Benchmark curation pipeline. The pipeline integrates dynamic video datasets from multiple sources, followed by multimodal completion with geometry and mask extraction. Data filter strategy ensures consistency and motion stability, complemented by human validation for quality assurance. Finally, spatial-temporal VQA and grounding pairs are generated with structured textual cognitive maps.
    benchmark category
    Figure 3: Benchmark Statistics. Top: Distribution of tasks across three levels. Bottom: VQA pairs distribution across datasets.

    Dyn-Bench Construction

    We develop a robust pipeline to construct Dyn-Bench.

    1. Data Collection and Filtering: We construct Dyn-Bench by collecting dynamic videos from four 2D video segmentation datasets (DAVIS, SA-V, DynPose-100K, and YouTube-VIS) and four 4D dynamic-scene datasets (DynamicReplica, PointOdyssey, Spring, and Total-Recon). These datasets provide instance masks, depth maps, and camera poses, enabling accurate question-answer generation and object category annotation. Missing annotations are completed using existing pipelines to ensure cross-modal consistency. To ensure data reliability, we employ a multi-criteria data filter strategy assessing geometric stability, motion smoothness, image sharpness, and depth consistency, supported by VLM-based quality evaluation. Low-quality videos are discarded to maintain visual and geometric fidelity. Filtering statistics are shown in Tab. 1.
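    The multi-criteria filter can be sketched as simple per-video scores. The criterion names follow the text above, but the scoring functions and thresholds below are our own illustrative assumptions, not the benchmark's actual implementation.

```python
import numpy as np

def sharpness(frame_gray: np.ndarray) -> float:
    """Variance of a Laplacian approximation: higher means sharper."""
    lap = (-4 * frame_gray[1:-1, 1:-1]
           + frame_gray[:-2, 1:-1] + frame_gray[2:, 1:-1]
           + frame_gray[1:-1, :-2] + frame_gray[1:-1, 2:])
    return float(lap.var())

def motion_smoothness(centroids: np.ndarray) -> float:
    """Penalize jerky object trajectories via mean frame-to-frame acceleration."""
    accel = np.diff(centroids, n=2, axis=0)  # second finite difference
    return float(1.0 / (1.0 + np.linalg.norm(accel, axis=1).mean()))

def keep_video(frames_gray, centroids, min_sharp=50.0, min_smooth=0.2) -> bool:
    """Accept a clip only if all criteria pass (thresholds are illustrative)."""
    sharp = np.mean([sharpness(f) for f in frames_gray])
    smooth = motion_smoothness(centroids)
    return sharp >= min_sharp and smooth >= min_smooth
```

    Geometric-stability and depth-consistency checks would slot in the same way, each as an additional score compared against a threshold before a clip is kept.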

    2. Question-Answer Generation: Based on the filtered video collection, we employ a Spatio-Temporal Textual Cognitive Map (ST-TCM) in conjunction with Qwen3-VL to construct dynamic-object-centered VQA tasks. The benchmark evaluates MLLMs across three complementary dimensions: Dynamic Inter-Object Perception, focusing on how models perceive and interpret motion interactions and spatial relations among multiple dynamic objects (e.g., approach, occlusion, or overtaking); Dynamic Object-Scene Tracking, capturing how individual objects are temporally tracked and evolve within continuously changing scenes (e.g., entering, leaving, or undergoing functional transitions); and Dynamic Camera-Object Reasoning, assessing how camera motion influences the perceived geometry, depth, and temporal consistency of dynamic objects (e.g., relative translation, rotation, or event order). Each VQA dimension is paired with a corresponding object grounding task that associates the referenced dynamic objects with their instance segmentation masks. Dimension-specific prompting strategies and ST-TCM configurations are applied to Qwen3-VL to ensure focused spatio-temporal reasoning.

    3. Spatio-Temporal Textual Cognitive Map Construction: To capture fine-grained object motion and interactions in dynamic scenes, we construct a Spatio-Temporal Textual Cognitive Map (ST-TCM) for each filtered video. Given per-frame RGB-D inputs and segmentation masks, 3D object trajectories are reconstructed to obtain geometric attributes such as position, size, and orientation in world coordinates. We then model inter-object and camera-object relations based on spatial proximity and motion continuity, capturing dynamic behaviors such as interaction and relative movement. All geometric and spatial cues are translated into textual descriptions through a rule-based template system, integrating object geometry, motion, and relational dynamics into a unified spatio-temporal representation. This structured textual form serves as input to Qwen3-VL-235B for dynamic-object-centered visual question answering and grounding. Detailed implementation procedures are provided in the supplementary material.
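    As a concrete illustration, lifting a masked object into world coordinates and templating one sentence of the cognitive map might look as follows. The pinhole backprojection is standard; the function names, template wording, and speed/direction rule are illustrative choices, not the paper's exact system.

```python
import numpy as np

def backproject(u, v, depth, K):
    """Pixel (u, v) at given depth -> 3D point in camera frame (pinhole model)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def object_world_position(mask, depth_map, K, cam_to_world):
    """Centroid of the masked region, lifted into world coordinates."""
    vs, us = np.nonzero(mask)
    u, v = us.mean(), vs.mean()
    d = depth_map[mask].mean()
    p_cam = np.append(backproject(u, v, d, K), 1.0)  # homogeneous coords
    return (cam_to_world @ p_cam)[:3]

def describe_motion(name, traj, dt=1.0):
    """Rule-based template: one textual cue of the cognitive map."""
    disp = traj[-1] - traj[0]
    speed = np.linalg.norm(disp) / (dt * (len(traj) - 1))
    direction = "right" if disp[0] > 0 else "left"
    return f"The {name} moves {direction} at roughly {speed:.1f} m/s."
```

    Repeating this per frame yields the object trajectories, and pairwise distances between trajectories drive the inter-object relation templates.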

    4. Human Quality Control: To ensure the reliability and perceptual validity of the filtered videos and generated annotations, we conduct an additional round of human verification covering video quality, mask consistency, VQA accuracy, and dynamic object category identification. Annotators assess camera stability, motion smoothness, and scene complexity to confirm visual quality, and examine segmentation masks to verify temporal coherence and consistent object identity across frames. The generated VQA and grounding pairs are also reviewed to ensure accurate object reference, reasoning-level alignment, and consistency with visual evidence. A summary of the multi-stage filtering and verification process is presented in Tab. 1.

    benchmark category

    Table 1. Dataset statistics across Dyn-Bench filtering stages.

    Evaluation on Dyn-Bench

    Evaluation Setup

    Benchmark Models. We evaluate three categories of MLLMs: general, spatial, and region-level models. General MLLMs (e.g., GPT-4o, Qwen3-VL) and spatial MLLMs (SpaceR, VST, Spatial-SSRL, SpatialLadder) lack explicit dynamic object grounding and are therefore evaluated only on spatio-temporal reasoning. In contrast, region-level MLLMs (Sa2VA, UniPixel, VideoGLaMM) are assessed on both spatio-temporal reasoning and dynamic object grounding to measure their fine-grained dynamic understanding.

    To benchmark performance on spatio-temporal reasoning, we include two chance baselines: Chance Level (Random), obtained by uniformly sampling answers for multiple-choice questions, and Chance Level (Frequency), obtained by predicting the most frequent answer per task. All models are evaluated in a zero-shot setting using their default instruction templates to ensure consistent and fair comparison.
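    The two chance baselines can be reproduced in a few lines. The question-record fields (`options`, `answer`, `task`) are hypothetical names used for illustration only.

```python
from collections import Counter
import random

def chance_random(questions, seed=0):
    """Chance Level (Random): uniformly sample one of each question's options."""
    rng = random.Random(seed)
    hits = sum(rng.choice(q["options"]) == q["answer"] for q in questions)
    return hits / len(questions)

def chance_frequency(questions):
    """Chance Level (Frequency): always predict the most frequent answer per task."""
    by_task = {}
    for q in questions:
        by_task.setdefault(q["task"], []).append(q["answer"])
    hits = 0
    for q in questions:
        majority = Counter(by_task[q["task"]]).most_common(1)[0][0]
        hits += q["answer"] == majority
    return hits / len(questions)
```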

    Metric Design. For the three VQA task levels, we adopt a multiple-choice answering format and use Accuracy (ACC) as the primary evaluation metric, following standard practice. ACC is computed by exact matching over the model's selected multiple-choice option.

    For the corresponding object grounding tasks, we employ the video object segmentation metric \(\mathcal{J}\)&\(\mathcal{F}\), which averages region similarity (\(\mathcal{J}\)) and boundary accuracy (\(\mathcal{F}\)).
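    A minimal sketch of the metric: \(\mathcal{J}\) is mask IoU and \(\mathcal{F}\) is an F-measure over boundary pixels. Note that the official DAVIS implementation matches boundaries within a tolerance band; for brevity this sketch compares raw boundary maps directly.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection-over-union of predicted and ground-truth masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary_map(mask):
    """Pixels whose 4-neighborhood crosses the mask border."""
    m = mask.astype(bool)
    pad = np.pad(m, 1, mode="edge")
    return (m != pad[:-2, 1:-1]) | (m != pad[2:, 1:-1]) | \
           (m != pad[1:-1, :-2]) | (m != pad[1:-1, 2:])

def boundary_accuracy(pred, gt):
    """F: F-measure between the two boundary maps (no tolerance band here)."""
    bp, bg = boundary_map(pred), boundary_map(gt)
    if not bp.any() and not bg.any():
        return 1.0
    prec = (bp & bg).sum() / max(bp.sum(), 1)
    rec = (bp & bg).sum() / max(bg.sum(), 1)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def j_and_f(pred, gt):
    """The reported J&F score: mean of region and boundary terms."""
    return 0.5 * (region_similarity(pred, gt) + boundary_accuracy(pred, gt))
```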

    Main Results

    Tab. 2 and Tab. 3 report the overall performance on Dyn-Bench, and corresponding radar results are shown across the three VQA levels. Our findings are summarized as follows:

    General MLLM. Proprietary models such as GPT-4o and GPT-5 maintain strong performance in spatio-temporal reasoning, particularly excelling in inter-object understanding through accurate modeling of actions and relational dynamics. In contrast, open-source models have rapidly narrowed the gap, with large-scale systems like Qwen3-VL-235B achieving comparable or even superior overall results, and smaller variants such as LLaVA-OneVision-1.5-8B and Qwen3-VL-32B delivering competitive accuracy despite reduced parameter counts. Overall, proprietary models tend to dominate relational and motion-oriented reasoning, while open-source models demonstrate more balanced generalization across object- and scene-level understanding.

    Spatial MLLMs. Compared with general MLLMs, spatial models exhibit stronger performance on geometry-dependent object–scene reasoning, highlighting the value of explicit spatial priors. Within this category, SpaceR-7B attains the strongest overall performance, with VST-7B-RL and SpatialLadder-3B following closely. However, despite their strengths in static and relational spatial tasks, spatial MLLMs remain weaker than both general and region-level models on camera–object interaction and motion-centric reasoning, indicating that spatial priors alone are insufficient for modeling dynamic 4D scenes.

    Region-level MLLMs. Models in this category deliver the strongest performance on object-centric spatio-temporal reasoning and dynamic object grounding, enabled by their integration of fine-grained regional cues and localized feature alignment. UniPixel-7B provides the best overall spatio-temporal reasoning within this group, while Sa2VA-based variants achieve the highest grounding accuracy across diverse dynamic settings. Relative to both general and spatial MLLMs, these models exhibit pronounced advantages in motion understanding and relational dynamics, indicating that region-level grounding supplies robust structural priors that enhance temporal coherence and support more reliable interpretation of complex dynamic scenes.



    | Methods | Rank | Avg. | Inter-Object: Act. & Obj. Desc. | Inter-Object: Move. & Temp. Dyn. | Inter-Object: Spatial Rel. & Change | Object-Scene: Mov. Patterns & Traj. | Object-Scene: Spatial Rel. & Comp. | Object-Scene: Scene Focus & Dyn. | Camera-Object: Cam. Motion & Orient. | Camera-Object: Cam-Obj. Interaction | Camera-Object: Temp. & Visual Change |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | Baseline | | | | | | | | | | | |
    | Chance Level (Random) | - | - | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
    | Chance Level (Frequency) | - | - | 12.3 | 21.6 | 31.5 | 29.1 | 27.7 | 13.8 | 33.5 | 10.1 | 25.6 |
    | Proprietary Models (API) | | | | | | | | | | | |
    | GPT-4o | 3 | 50.1 | 56.1 | 38.7 | 44.6 | 63.1 | 59.1 | 68.8 | 47.2 | 42.0 | 49.2 |
    | GPT-5 | 2 | 59.5 | 68.6 | 47.3 | 48.1 | 71.7 | 65.9 | 73.0 | 60.9 | 58.4 | 58.4 |
    | Gemini-2.5 Pro | 1 | 59.8 | 69.7 | 48.0 | 50.5 | 67.8 | 59.9 | 65.6 | 60.7 | 54.9 | 51.8 |
    | Open-source Models | | | | | | | | | | | |
    | InternVL3-14B | 7 | 53.7 | 65.3 | 47.0 | 49.7 | 67.8 | 69.2 | 77.7 | 37.9 | 44.9 | 46.9 |
    | InternVL3-38B | 5 | 54.2 | 68.2 | 44.8 | 48.4 | 71.1 | 67.3 | 76.7 | 41.2 | 48.9 | 44.6 |
    | InternVL3.5-8B | 11 | 50.3 | 66.6 | 41.0 | 44.8 | 63.5 | 62.7 | 69.6 | 41.5 | 40.1 | 46.4 |
    | InternVL3.5-38B | 10 | 50.8 | 65.0 | 40.7 | 42.4 | 60.2 | 62.3 | 72.3 | 45.5 | 42.9 | 44.6 |
    | Qwen2.5-VL-7B | 9 | 51.6 | 61.7 | 42.8 | 48.0 | 69.1 | 67.0 | 73.7 | 43.8 | 39.6 | 42.8 |
    | Qwen2.5-VL-32B | 4 | 56.0 | 71.5 | 52.2 | 53.8 | 71.5 | 67.6 | 75.3 | 42.1 | 46.7 | 47.9 |
    | Qwen2.5-VL-72B | 8 | 51.8 | 65.5 | 41.1 | 43.7 | 60.4 | 57.8 | 68.3 | 49.2 | 42.6 | 55.0 |
    | Qwen3-VL-8B | 3 | 61.4 | 70.8 | 52.6 | 53.6 | 75.0 | 71.2 | 82.4 | 55.4 | 52.6 | 60.0 |
    | Qwen3-VL-32B | 2 | 62.7 | 73.7 | 56.2 | 53.4 | 74.6 | 73.1 | 80.2 | 58.2 | 54.3 | 56.9 |
    | Qwen3-VL-235B | 1 | 65.3 | 76.4 | 55.8 | 55.6 | 77.8 | 76.1 | 84.1 | 59.8 | 59.0 | 60.2 |
    | LLaVA-OneVision-1.5-4B | 12 | 49.9 | 50.5 | 48.5 | 50.3 | 65.2 | 64.8 | 63.9 | 39.9 | 36.6 | 46.1 |
    | LLaVA-OneVision-1.5-8B | 6 | 53.8 | 60.9 | 47.7 | 53.4 | 74.4 | 69.6 | 75.4 | 41.0 | 37.0 | 51.6 |
    | Spatial MLLMs | | | | | | | | | | | |
    | SpaceR-7B | 1 | 56.5 | 66.6 | 49.2 | 52.7 | 72.2 | 67.8 | 78.2 | 50.3 | 40.0 | 55.5 |
    | VST-7B-RL | 2 | 55.7 | 68.6 | 48.4 | 51.9 | 73.0 | 70.7 | 79.4 | 45.1 | 39.1 | 52.9 |
    | Spatial-SSRL-7B | 4 | 45.9 | 54.5 | 40.0 | 48.1 | 68.5 | 65.9 | 73.8 | 35.8 | 36.7 | 37.7 |
    | SpatialLadder-3B | 3 | 53.6 | 60.8 | 46.1 | 49.2 | 70.0 | 70.9 | 77.1 | 38.2 | 42.0 | 51.9 |
    | Region-level MLLMs | | | | | | | | | | | |
    | UniPixel-3B | 2 | 55.4 | 63.3 | 47.2 | 53.2 | 71.7 | 70.2 | 77.7 | 43.2 | 43.6 | 52.0 |
    | UniPixel-7B | 1 | 58.1 | 64.4 | 50.2 | 54.7 | 76.1 | 70.4 | 79.7 | 47.3 | 47.3 | 55.7 |
    | VideoGLaMM | 7 | 30.7 | 35.6 | 34.4 | 35.0 | 34.6 | 38.2 | 39.3 | 22.7 | 21.2 | 25.9 |
    | Sa2VA-InternVL2.5-8B | 6 | 49.4 | 61.0 | 42.4 | 45.7 | 66.1 | 62.8 | 71.9 | 36.6 | 36.4 | 47.2 |
    | Sa2VA-InternVL3-14B | 3 | 53.6 | 55.9 | 48.9 | 53.2 | 72.0 | 70.2 | 74.6 | 38.1 | 39.6 | 53.6 |
    | Sa2VA-Qwen2.5-VL-7B | 4 | 50.3 | 58.6 | 39.3 | 52.9 | 67.6 | 62.1 | 70.5 | 38.8 | 39.1 | 49.3 |
    | Sa2VA-Qwen3-VL-4B | 5 | 49.8 | 60.8 | 39.3 | 46.2 | 67.2 | 62.0 | 73.2 | 41.0 | 44.5 | 36.8 |
    Table 2. Spatio-temporal reasoning evaluation on Dyn-Bench. Ranks are assigned within each model category. In the original rendering, the top three performers in each task category are highlighted from Dark (highest) to Light (third highest), and overall model rankings are shaded the same way.
    | Models | Average \(\mathcal{J}\) | Average \(\mathcal{F}\) | Average \(\mathcal{J}\)&\(\mathcal{F}\) | Inter-Object \(\mathcal{J}\) | Inter-Object \(\mathcal{F}\) | Inter-Object \(\mathcal{J}\)&\(\mathcal{F}\) | Object-Scene \(\mathcal{J}\) | Object-Scene \(\mathcal{F}\) | Object-Scene \(\mathcal{J}\)&\(\mathcal{F}\) | Camera-Object \(\mathcal{J}\) | Camera-Object \(\mathcal{F}\) | Camera-Object \(\mathcal{J}\)&\(\mathcal{F}\) |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|
    | UniPixel-3B | 40.6 | 41.2 | 40.9 | 37.7 | 38.3 | 38.1 | 45.8 | 46.6 | 46.2 | 38.2 | 38.8 | 38.5 |
    | UniPixel-7B | 64.4 | 66.0 | 65.2 | 65.4 | 66.6 | 66.0 | 70.1 | 72.0 | 71.1 | 57.8 | 59.4 | 58.6 |
    | VideoGLaMM | 55.4 | 63.8 | 59.6 | 54.8 | 63.0 | 58.9 | 61.4 | 69.8 | 65.6 | 49.9 | 58.7 | 54.3 |
    | Sa2VA-InternVL2.5-8B | 74.2 | 77.1 | 75.6 | 75.4 | 77.8 | 76.8 | 78.6 | 81.9 | 80.2 | 68.5 | 71.7 | 70.1 |
    | Sa2VA-InternVL3-14B | 70.5 | 74.1 | 72.2 | 72.7 | 76.1 | 74.4 | 74.2 | 77.9 | 76.0 | 64.5 | 68.2 | 66.3 |
    | Sa2VA-Qwen2.5-VL-7B | 71.1 | 74.5 | 72.8 | 71.9 | 74.9 | 73.4 | 74.1 | 77.7 | 75.9 | 67.3 | 70.8 | 69.1 |
    | Sa2VA-Qwen3-VL-4B | 66.8 | 70.4 | 68.6 | 66.5 | 70.0 | 68.3 | 73.4 | 77.0 | 75.2 | 60.4 | 64.1 | 62.2 |
    Table 3. Dynamic object grounding evaluation of Region-level MLLMs on Dyn-Bench. Top three performers in each task category are highlighted from Dark (highest) to Light (third highest).

    How MLLMs Think in Dynamics Textually

    To investigate how MLLMs think in dynamics textually, we first analyze GPT-4o's self-explanations on failure cases from Dyn-Bench to examine its Chain-of-Thought (CoT) behavior in dynamic settings. We then introduce the Spatio-Temporal Textual Cognitive Map (ST-TCM) as an auxiliary input to qualitatively assess its effect on reasoning. Finally, we conduct an ablation study on three key components, namely temporal semantics, spatial geometry, and motion dynamics, to identify which factors most effectively enhance spatio-temporal reasoning.

    Self-Explanations in Dynamic Reasoning

    Self-explanation refers to an MLLM's ability to articulate intermediate reasoning while generating responses. We examine GPT-4o's self-explanations on Dyn-Bench and present representative success and failure cases in Fig. 4 to illustrate its reasoning strengths and limitations.

    Case Studies. Fig. 4 compares GPT-4o's self-explanations in a failure and a success case. In the failure example (Fig. 4, left), the model generates linguistically fluent but physically inconsistent reasoning. When estimating how a white car catches up with a gray-clothed pedestrian, it relies on visual heuristics such as apparent size change across frames rather than metric reasoning, resulting in inaccurate temporal estimation. This reflects a gap between linguistic coherence and physical grounding. In contrast, the success case (Fig. 4, right) demonstrates structured reasoning that integrates motion and relational cues to estimate relative velocities and produce a temporally consistent prediction. These findings suggest that incorporating explicit spatio-temporal cues enables more coherent and causally grounded reasoning about dynamic events.

    Error Analysis. GPT-4o's errors in dynamic reasoning can be broadly categorized into three fundamental types:

    1. Temporal reasoning errors, where the model fails to maintain event order or motion continuity, interpreting sequences as discrete frames rather than continuous processes;
    2. Spatial grounding errors, arising from limited geometric understanding and resulting in inaccurate distance or position estimation;
    3. Relational reasoning errors, reflecting persistent difficulty in capturing causal or interactional dependencies between objects.

    These errors indicate that GPT-4o lacks structured temporal, spatial, and relational representations, ultimately constraining its ability to reason about motion in a physically coherent manner.

    failure case
    Figure 4: Examples of how an MLLM thinks in dynamics as revealed by self-explanations. The model demonstrates strong chain-of-thought reasoning and semantic understanding but lacks 3D scene modeling and time awareness without explicit spatio-temporal guidance.

    Textual Guidance in Dynamics

    To further investigate how ST-TCM enhances spatio-temporal reasoning and dynamic object grounding, we conduct an ablation over its three components: temporal semantics (T), motion dynamics (M), and spatial geometry (S). We evaluate two representative models under distinct paradigms: the general MLLM Qwen3-VL-8B for spatio-temporal reasoning and the region-level MLLM UniPixel-3B for dynamic object grounding.

    As shown in Tab. 4, incorporating ST-TCM components consistently improves both models, though their optimal configurations differ. For spatio-temporal reasoning, motion and spatial cues (M+S) produce the largest pairwise gains, highlighting the role of object movement and geometric structure in achieving stable temporal reasoning, whereas temporal cues alone are insufficient. For dynamic object grounding, motion and spatial cues again contribute most, with spatial cues mainly refining object–trajectory alignment. Combining all three components (T + M + S) attains the best overall performance.

    | Configuration | Reasoning (Qwen3-VL-32B): Inter-Object | Reasoning: Object-Scene | Reasoning: Camera-Object | Reasoning: Avg | Grounding (Sa2VA-InternVL2.5-8B): Inter-Object \(\mathcal{J}\) | \(\mathcal{F}\) | \(\mathcal{J}\)&\(\mathcal{F}\) | Object-Scene \(\mathcal{J}\) | \(\mathcal{F}\) | \(\mathcal{J}\)&\(\mathcal{F}\) | Camera-Object \(\mathcal{J}\) | \(\mathcal{F}\) | \(\mathcal{J}\)&\(\mathcal{F}\) |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | w/o TCM | 59.0 | 76.7 | 56.2 | 62.8 | 74.8 | 76.6 | 75.2 | 78.4 | 81.7 | 80.0 | 70.3 | 73.5 | 71.9 |
    | w/ T only | 59.3 | 76.4 | 56.3 | 62.9 | 76.6 | 79.3 | 78.0 | 79.7 | 82.9 | 81.3 | 73.0 | 76.1 | 74.6 |
    | w/ M only | 64.3 | 77.1 | 53.5 | 63.3 | 76.8 | 79.6 | 78.2 | 79.7 | 83.0 | 81.4 | 73.8 | 77.3 | 75.5 |
    | w/ S only | 66.1 | 78.7 | 60.1 | 67.2 | 76.9 | 79.7 | 78.3 | 79.9 | 83.2 | 81.5 | 74.8 | 78.5 | 76.4 |
    | w/ T + M | 63.8 | 76.7 | 54.0 | 63.3 | 77.0 | 79.8 | 78.4 | 79.8 | 83.1 | 81.4 | 73.8 | 77.3 | 75.5 |
    | w/ T + S | 67.0 | 78.5 | 59.6 | 67.1 | 76.9 | 79.7 | 78.3 | 80.0 | 83.3 | 81.8 | 74.9 | 78.6 | 76.7 |
    | w/ M + S | 68.4 | 78.8 | 59.4 | 67.5 | 77.1 | 79.9 | 78.5 | 80.1 | 83.5 | 81.6 | 75.3 | 78.9 | 77.1 |
    | w/ T + M + S | 69.2 | 79.1 | 60.5 | 68.3 | 77.3 | 80.2 | 78.8 | 80.2 | 83.6 | 81.9 | 75.4 | 79.1 | 77.3 |
    Table 4. Ablation of Spatio-Temporal Textual Cognitive Map. Top three performers in each column are highlighted from Dark (highest) to Light (third highest). T, M, and S denote temporal semantics, motion dynamics, and spatial geometry, respectively.


    How MLLMs Think in Dynamics Visually

    To examine how MLLMs think in dynamics visually, we conduct qualitative and quantitative analyses to study how explicit visual guidance affects motion understanding. As shown in Fig. 5, we design two input strategies to guide model attention toward dynamic regions.

    1. Masked Frames Only overlays object segmentation masks on each frame, directing attention to moving entities while maintaining temporal continuity.
    2. Mask-Guided Fusion combines raw frames with their corresponding masks, integrating complementary cues from appearance and motion.

    These strategies explicitly ground visual perception in motion-centric regions, enhancing spatio-temporal alignment and relational reasoning. We evaluate Qwen3-VL-8B under these configurations, using the Raw Video setting as baseline.
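    A plausible implementation of the two input strategies, assuming uint8 RGB frames and boolean instance masks; the overlay color, blend weight, and dimming factor below are illustrative choices rather than the paper's exact settings.

```python
import numpy as np

def mask_guided_fusion(frame, mask, color=(255, 0, 0), alpha=0.4):
    """Alpha-blend a colored mask overlay onto the raw (H, W, 3) frame."""
    out = frame.astype(np.float32).copy()
    overlay = np.array(color, dtype=np.float32)
    out[mask] = (1 - alpha) * out[mask] + alpha * overlay
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

def masked_frames_only(frame, mask, dim=0.2):
    """Keep masked regions at full brightness; strongly dim the background."""
    out = frame.astype(np.float32) * dim
    out[mask] = frame[mask]
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)
```

    Applied per frame before sampling, the first variant preserves appearance everywhere while tinting dynamic regions, whereas the second suppresses background appearance cues entirely.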

    visual_v3
    Figure 5: Mask-Guided Input Comparison.

    The results in Tab. 5 show that mask-guided visual grounding enhances the model's ability to capture dynamic object interactions and maintain temporal consistency. Masked Frames Only offers only minor gains, suggesting limited value from isolated localization cues. In contrast, the Mask-Guided Fusion setting improves all categories, with the largest gains in Inter-Object and Camera-Object reasoning, which demand fine-grained motion and relational understanding. These results show that integrating appearance and motion cues strengthens object grounding and yields more coherent spatio-temporal reasoning.

    | Configuration | Inter-Object | Object-Scene | Camera-Object | Avg |
    |---|---|---|---|---|
    | Raw Video | 38.9 | 74.5 | 55.6 | 53.8 |
    | Masked Frames Only | 39.4 | 74.3 | 54.9 | 53.8 |
    | Mask-Guided Fusion | 41.8 | 77.0 | 60.0 | 57.1 |
    Table 5. Quantitative Comparison of Mask-Guided Inputs.


    Discussion and Future Work

    We study how MLLMs think in dynamics by introducing Dyn-Bench, a comprehensive benchmark designed to evaluate both object-level and scene-level spatio-temporal reasoning and grounding across three representative model categories: general, spatial, and region-level MLLMs. Through this dual textual–visual assessment, we provide a systematic examination of how models perceive, track, and interpret dynamic content in the physical 4D world, including their ability to capture motion patterns, maintain temporal consistency, and model multi-entity interactions.

    Our experiments show that the Spatio-Temporal Textual Cognitive Map substantially enhances temporal coherence and relational reasoning by imposing structured linguistic abstraction over dynamic events, while mask-guided visual grounding further strengthens motion perception, improves fine-grained object continuity, and mitigates temporal drift. Taken together, these findings suggest that reliable dynamic understanding in MLLMs emerges from a synergistic coupling between high-level temporal semantics and localized region-level grounding.

    Looking ahead, future spatio-temporal MLLMs should integrate dynamic-object perception and temporal reasoning more tightly, motivating unified architectures that jointly model motion dynamics, relational structure, and higher-level temporal cognition to achieve more coherent, physically grounded reasoning in complex and continuously evolving environments.



    BibTeX

            @misc{wen2025dynamicverse,
                title={DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling}, 
                author={Kairun Wen and Yuzhi Huang and Runyu Chen and Hui Zheng and Yunlong Lin and Panwang Pan and Chenxin Li and Wenyan Cong and Jian Zhang and Junbin Lu and Chenguo Lin and Dilin Wang and Zhicheng Yan and Hongyu Xu and Justin Theiss and Yue Huang and Xinghao Ding and Rakesh Ranjan and Zhiwen Fan},
                year={2025},
                eprint={2512.03000},
                archivePrefix={arXiv},
                primaryClass={cs.CV},
                url={https://arxiv.org/abs/2512.03000}, 
            }