TL;DR MosaicMem is a hybrid spatial memory for video world models that bridges explicit 3D memory and implicit latent frames. It retrieves spatially aligned 3D patches to preserve persistent scene structure, improving camera consistency while supporting dynamic scene modeling, long-horizon navigation, and memory-based editing.
Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory is still a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even when given correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while letting the model inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence over implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
MosaicMem lifts video patches into 3D space and gathers them at the target viewpoint, stitching them together like a mosaic. Our model simultaneously supports text-driven dynamics and free camera navigation. Navigation is jointly controlled by MosaicMem retrieval and PRoPE conditioning. The retrieved mosaic patches are flattened and concatenated with the token sequence as conditioning inputs, while residual alignment errors are corrected through warping.
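The lift-and-retrieve step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, patch size, and pinhole-camera assumptions are ours, and the real system additionally handles occlusion, alignment, and warping.

```python
import numpy as np

def lift_patches(depth, K, cam_to_world, patch=16):
    """Lift the center of each image patch into world space (hypothetical sketch).

    depth: (H, W) depth map; K: (3, 3) intrinsics; cam_to_world: (4, 4) pose.
    Returns an (N, 3) array of 3D anchors, one per patch.
    """
    H, W = depth.shape
    vs, us = np.meshgrid(np.arange(patch // 2, H, patch),
                         np.arange(patch // 2, W, patch), indexing="ij")
    z = depth[vs, us]
    # Homogeneous pixel coordinates scaled by depth, then unprojected.
    pix = np.stack([us * z, vs * z, z], axis=-1).reshape(-1, 3)
    cam_pts = pix @ np.linalg.inv(K).T
    return cam_pts @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]

def retrieve_visible(world_pts, K, world_to_cam, hw):
    """Indices of memory patches whose anchors land inside the queried view."""
    cam = world_pts @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    in_front = cam[:, 2] > 1e-6
    proj = cam @ K.T
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)
    H, W = hw
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < W) &
              (uv[:, 1] >= 0) & (uv[:, 1] < H))
    return np.nonzero(in_front & inside)[0]
```

The retrieved patches would then be flattened and appended to the token sequence as conditioning; anchors projecting outside the query frustum are simply dropped, leaving those regions for the model to inpaint.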
Similar to Genie3, our system supports promptable world events. These events allow users to dynamically modify the generated world, thereby enriching the interactive experience beyond simple navigation control.
We can extract spatial memory from different scenes, stitch the memories together, and apply spatial transformations such as translation to synthesize scenes that could not exist in the real world.
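Because the memory lives in a shared 3D space, stitching reduces to rigidly transforming one scene's memory anchors before merging them. A minimal sketch, with hypothetical function names and assuming memories are stored as point arrays:

```python
import numpy as np

def transform_memory(points, R, t):
    """Apply a rigid transform (rotation R, translation t) to memory anchors."""
    return points @ R.T + t

def stitch_memories(mem_a, mem_b, offset):
    """Translate scene B's anchors so they sit beside scene A, then merge."""
    shifted_b = transform_memory(mem_b, np.eye(3), offset)
    return np.concatenate([mem_a, shifted_b], axis=0)
```

After merging, retrieval treats the composite as a single scene, so the agent can navigate across the seam between the two original environments.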
The scenes on the left and right come from different spatial memories; some are virtual, some are real. The agent can freely explore the synthesized imaginary environment.
The scenes on the top and bottom come from different spatial memories. This allows us to generate impossible, Inception-like environments that could not exist in reality.
Compared with explicit-memory baselines (represented here by GEN3C), MosaicMem can depict rich prompt-driven dynamics. In the examples below, the first half of each clip is generated by GEN3C, and the second half is continued by MosaicMem.
Compared with implicit-memory baselines (represented here by Context-as-Memory), MosaicMem can follow user-specified camera motion instructions almost perfectly.
(Video resolution has been compressed for optimized web loading)
Initial View
Revisit View
Initial View
Revisit View
Initial View
Forward Exploration
Initial View
Forward Exploration
Demonstration of performance on long-term memory tasks, comparing ground truth with model outputs.
@article{mosaicmem2026,
title={MosaicMem: Hybrid Spatial Memory for Controllable Video World Models},
author={Wei Yu and Runjia Qian and Yumeng Li and Liquan Wang and Songheng Yin and Sri Siddarth Chakaravarthy P and Dennis Anthony and Yang Ye and Yidi Li and Weiwei Wan and Animesh Garg},
year={2026},
eprint={2603.17117},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.17117},
}