Jiahao Lu1,
Jiayi Xu1,
Wenbo Hu2†,
Ruijie Zhu2,
Chengfeng Zhao1,
Sai-Kit Yeung1,
Ying Shan2,
Yuan Liu1†
1 HKUST
2 ARC Lab, Tencent PCG
Track4World estimates dense 3D scene flow for every pixel between arbitrary frame pairs of a monocular video in a global feedforward manner, enabling efficient, dense 3D tracking of every pixel in a world-centric coordinate system.
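As a conceptual illustration only (not Track4World's API), dense 3D scene flow between a frame pair reduces to a per-pixel difference of world-space point maps; the array names and shapes below are toy assumptions for exposition:

```python
import numpy as np

# Conceptual sketch only (NOT Track4World's API): given per-pixel world-space
# point maps for two frames, the dense 3D scene flow is the per-pixel
# displacement between them. Shapes and values here are toy assumptions.
H, W = 4, 6
points_i = np.zeros((H, W, 3))        # frame i point map, world coordinates
points_j = np.full((H, W, 3), 0.5)    # frame j point map, world coordinates
scene_flow = points_j - points_i      # (H, W, 3) per-pixel 3D displacement
print(scene_flow.shape)               # -> (4, 6, 3)
```

Track4World produces such flow fields for arbitrary frame pairs in a single feedforward pass, rather than chaining adjacent-frame estimates.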
Clone the repository with submodules to ensure all dependencies are included:
git clone --recursive https://github.com/TencentARC/Track4World.git
cd Track4World
We provide an installation script tested with CUDA 12.1 and Python 3.11.
# Create and activate environment
conda create -n track4world python=3.11
conda activate track4world
# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install -r requirements.txt
We utilize several external repositories. Please run the following commands to set them up correctly:
# Install utils3d
git clone https://github.com/jiah-cloud/utils3d.git
# Setup Pi3 (Sparse checkout)
git clone --no-checkout https://github.com/yyfz/Pi3.git track4world/nets/external/pi3_repo
cd track4world/nets/external/pi3_repo
git sparse-checkout init
git sparse-checkout set pi3
git checkout main
find . -maxdepth 1 -type f -exec rm -f {} \;
mv pi3 ../pi3
cd ../../../..
# Setup Grounded-SAM-2
git clone https://github.com/IDEA-Research/Grounded-SAM-2.git submodules
cd submodules
pip install -e .
pip install --no-build-isolation -e grounding_dino
cd ..
Download the pre-trained model weights and place them in the checkpoints/ directory.
mkdir -p checkpoints
# Download SAM2 weights
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt -O ./checkpoints/sam2.1_hiera_large.pt
# Download Track4World weights
wget https://huggingface.co/TencentARC/Track4World/resolve/main/track4world_da3.pth -O ./checkpoints/track4world_da3.pth
wget https://huggingface.co/TencentARC/Track4World/resolve/main/track4world_pi3.pth -O ./checkpoints/track4world_pi3.pth
wget https://huggingface.co/TencentARC/Track4World/resolve/main/track4world_moge.pth -O ./checkpoints/track4world_moge.pth
- Manual Download: HuggingFace Link
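An optional sanity check (filenames taken from the download commands above) that each checkpoint landed as a non-empty file before you run the demos:

```python
# Optional sanity check: each checkpoint from the commands above should
# exist as a non-empty file under checkpoints/.
from pathlib import Path

CKPTS = ["sam2.1_hiera_large.pt", "track4world_da3.pth",
         "track4world_pi3.pth", "track4world_moge.pth"]

def check_checkpoints(root="checkpoints"):
    """Return the list of checkpoint names that are missing or empty."""
    missing = []
    for name in CKPTS:
        p = Path(root) / name
        if not p.is_file() or p.stat().st_size == 0:
            missing.append(name)
    return missing
```

Run `check_checkpoints()` from the repo root; an empty list means all four weights are in place.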
Run the following commands to perform tracking and reconstruction on the provided demo video (demo_data/cat.mp4).
Reconstructs 3D motion based on the geometry of the first frame.
python demo.py \
--mp4_path demo_data/cat.mp4 \
--mode 3d_ff \
--Ts -1 \
--save_base_dir results/cat
Performs dense 3D tracking for every pixel across all frames.
Option A: Camera-Centric Coordinate System
python demo.py \
--mp4_path demo_data/cat.mp4 \
--coordinate world_depthanythingv3 \
--mode 3d_efep \
--Ts -1 \
--ckpt_init checkpoints/track4world_da3.pth \
--save_base_dir results/cat
Option B: World-Centric Coordinate System
For world-centric reconstruction, you can also directly run Step 2 to obtain world-centric 3D tracking results. However, for better visualization, especially to clearly separate foreground and background objects, it is recommended to first segment dynamic objects using DINO and SAM2 in Step 1. You can use either world_depthanythingv3 or world_pi3 as the world coordinate system.
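The pairing between --coordinate and --ckpt_init below is an inference from the commands and checkpoint filenames in this README, not an official API; treat it as an assumption:

```python
# Inferred pairing (assumption based on the checkpoint filenames in this
# README): each world coordinate backbone uses its matching checkpoint.
COORD_TO_CKPT = {
    "world_depthanythingv3": "checkpoints/track4world_da3.pth",
    "world_pi3": "checkpoints/track4world_pi3.pth",
}

def ckpt_for(coordinate):
    return COORD_TO_CKPT[coordinate]

print(ckpt_for("world_pi3"))  # -> checkpoints/track4world_pi3.pth
```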
# 1. DINO + SAM2 Segmentation
# Use --text-prompt to specify the dynamic objects in your video (e.g., "cat.", "person.", "car.").
python scripts/run_dino_sam2.py \
--video-path demo_data/cat.mp4 \
--sam2-checkpoint checkpoints/sam2.1_hiera_large.pt \
--output-dir results/cat \
--text-prompt "cat."
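Grounding DINO text prompts are conventionally lowercase, period-terminated phrases; a small hypothetical helper (not part of this repo) for composing a prompt covering several dynamic objects:

```python
# Hypothetical helper (not in the repo): Grounding DINO prompts are
# conventionally period-terminated phrases, e.g. "cat. person. car."
def build_text_prompt(classes):
    return " ".join(c.strip().lower().rstrip(".") + "." for c in classes)

print(build_text_prompt(["cat", "person", "car"]))  # -> cat. person. car.
```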
# 2. Run Track4World 3D EFEP
python demo.py \
--mp4_path demo_data/cat.mp4 \
--coordinate world_depthanythingv3 \
--mode 3d_efep \
--Ts -1 \
--ckpt_init checkpoints/track4world_da3.pth \
--save_base_dir results/cat
Performs standard 2D tracking in image space.
python demo.py \
--mp4_path demo_data/cat.mp4 \
--mode 2d \
--Ts -1 \
--save_base_dir results/cat
Visualize the dense 4D trajectories and reconstructed scenes using the generated output files.
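If you prefer to inspect the generated .ply files programmatically, here is a minimal ASCII-PLY vertex reader; the assumption that the outputs are vertex-only ASCII PLY is mine, and the repository's visualization scripts remain the authoritative consumers:

```python
# Minimal ASCII-PLY vertex reader (assumptions: ASCII format, vertex-only
# files; the repo's visualization scripts are the authoritative readers).
def read_ply_vertices(path):
    with open(path) as f:
        lines = f.read().splitlines()
    n, props = 0, []
    for i, line in enumerate(lines):
        tok = line.split()
        if tok[:2] == ["element", "vertex"]:
            n = int(tok[2])                  # vertex count from the header
        elif tok[:1] == ["property"]:
            props.append(tok[-1])            # property names, e.g. x, y, z
        elif tok == ["end_header"]:
            body = lines[i + 1 : i + 1 + n]  # one whitespace-separated row per vertex
            break
    return [dict(zip(props, map(float, row.split()))) for row in body]
```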
Visualize First Frame 3D Tracking:
python visualization/vis_3d_ff.py --ply_dir results/cat/3d_ff_output
Visualize Dense Tracking (Every Pixel):
# Camera Centric Visualization
python visualization/vis_3d_efep.py --ply_dir results/cat/3d_efep_output
# World Centric Visualization (Foreground-Background Separation, Static Background)
python visualization/vis_3d_efep_world.py --ply_dir results/cat/3d_efep_output
For detailed instructions on how to evaluate the model on standard benchmarks (Sintel, KITTI, Kubric, etc.), please refer to the evaluation guide:
👉 Evaluation Guide (evaluation/eval.md)
If you find Track4World useful for your research or applications, please consider citing our paper:
@article{lu2026track4world,
title = {Track4World: Feedforward World-Centric Dense 3D Tracking of All Pixels},
author = {Jiahao Lu and Jiayi Xu and Wenbo Hu and Ruijie Zhu and Chengfeng Zhao and Sai-Kit Yeung and Ying Shan and Yuan Liu},
journal = {arXiv preprint arXiv:2603.02573},
year = {2026}
}
Our codebase is built upon MoGe, Alltracker, Pi3, and Depth Anything 3. We also gratefully acknowledge TrackingWorld and VGGT for their excellent work!


