kdh8156/EgoX-EgoPriorRenderer

EgoX EgoPrior Rendering from ViPE Results

This codebase provides tools to generate ego prior videos for EgoX. For the EgoX model itself, please refer to the EgoX GitHub repository.

ViPE provides point cloud rendering functionality to visualize the 3D reconstruction results. This is particularly useful for analyzing the spatial structure and quality of the estimated depth maps and camera poses.

👀 Installation

To ensure reproducibility, we recommend creating the runtime environment with conda.

# Create a new conda environment and install 3rd-party dependencies
conda env create -f envs/base.yml
conda activate egox-egoprior
pip install -r envs/requirements.txt
pip install "git+https://github.com/facebookresearch/[email protected]" --no-build-isolation
pip install git+https://github.com/microsoft/MoGe.git

# Build the project and install it into the current environment
# Omit the -e flag to install the project as a regular package
pip install --no-build-isolation -e .

👀 Prerequisites

Before running the rendering commands, ensure you have completed the ViPE inference on your video using the provided script:

# First, run ViPE inference
./scripts/infer_vipe.sh

ViPE Inference Arguments

The scripts run ViPE inference with various parameters. Below are the key CLI arguments used:

Core Arguments

  • --start_frame <int>: Starting frame number (default: 0)

  • --end_frame <int>: Ending frame number (inclusive, default: process all frames)

  • --assume_fixed_camera_pose: Flag to assume camera pose is fixed throughout the video (⚠️ Since EgoX is trained on the Ego-Exo4D dataset where exocentric view camera poses are fixed, you must provide exocentric videos with fixed camera poses as input during inference)

  • --pipeline <str>: Pipeline configuration to use (we used lyra for EgoX)

    • Available pipelines: default, lyra, lyra_no_vda, no_vda, etc.
    • default: Uses UniDepthV2 for depth estimation
    • lyra: Uses MoGE2 for depth estimation with VDA enabled for better temporal depth consistency
    • lyra_no_vda / no_vda: Disables Video Depth Anything (VDA) for reduced GPU memory usage
  • --use_exo_intrinsic_gt "<intrinsics_matrix>": Use ground truth exocentric camera intrinsics instead of ViPE-estimated intrinsics (e.g., when GT intrinsics are known, as in Ego-Exo4D)

    • Takes a 3x3 intrinsics matrix in JSON format: [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    • Automatically sets optimize_intrinsics=False when provided
    • The GT intrinsics are scaled based on current frame resolution (using cy ratio)
    • Example: --use_exo_intrinsic_gt "[[1000.0,0,960.0],[0,1000.0,540.0],[0,0,1]]"
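As a rough illustration of the cy-ratio scaling described above, the sketch below rescales a GT intrinsics matrix to a new frame height. It assumes the scaled principal point should land at half the frame height; the function name and exact formula are assumptions, not the repository's implementation.

```python
# Hypothetical sketch of cy-ratio intrinsics scaling; not the actual code.
# Assumes scaled cy should equal frame_height / 2, so
# scale = (frame_height / 2) / cy_gt.
import json

def scale_intrinsics_by_cy(gt_intrinsics_json, frame_height):
    K = json.loads(gt_intrinsics_json)        # [[fx,0,cx],[0,fy,cy],[0,0,1]]
    scale = (frame_height / 2.0) / K[1][2]    # cy ratio
    return [
        [K[0][0] * scale, 0.0, K[0][2] * scale],
        [0.0, K[1][1] * scale, K[1][2] * scale],
        [0.0, 0.0, 1.0],
    ]

K = scale_intrinsics_by_cy("[[1000.0,0,960.0],[0,1000.0,540.0],[0,0,1]]", 448)
```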

Visualizing ViPE Results

After ViPE inference, you can visualize the results using the built-in visualization tool:

vipe visualize vipe_results/YOUR_VIPE_RESULT

Visualization Options

  • --port <int>: Server port (default: 20540)

  • --use_mean_bg: Use mean background for visualization (Since EgoX is trained with fixed exocentric camera poses, this option helps visualize cleaner point clouds for static objects)

  • --ego_manual: Enable manual ego trajectory annotation mode. Use this option when you want to obtain ego trajectory directly from in-the-wild videos.

    Manual annotation workflow:

    1. For each frame, position the ego camera frustum to align with the appropriate head pose in the 3D view
    2. Fill in the ego_extrinsics field in meta.json using the ego camera extrinsics displayed in the top-right UI panel
    3. Repeat for all frames to build the complete ego trajectory
    4. See Appendix Fig. 8 in the paper for examples of frustum positioning aligned with head poses

    Important Note for In-the-Wild Videos:

    Since ego trajectories are manually annotated for in-the-wild videos, the final rendering results can vary significantly depending on how you position the ego camera frustums. Different annotation strategies may lead to different visual perspectives in the rendered ego-view videos.

    Below is a comparison showing EgoX generation results from two different ego trajectory annotations for the same exocentric input video (Ironman scene).

    [Comparison videos: Version 1 vs. Version 2]
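For step 2 of the annotation workflow above, a small helper like the following could append each annotated frame's 3x4 extrinsics to meta.json. The helper and its name are hypothetical; only the field layout follows the Camera Parameters Format section.

```python
# Hypothetical helper (not part of the repo) for filling the ego_extrinsics
# field in meta.json, one 3x4 world-to-camera matrix per annotated frame.
import json

def append_ego_extrinsics(meta_path, extrinsics_3x4, dataset_idx=0):
    with open(meta_path) as f:
        meta = json.load(f)
    entry = meta["test_datasets"][dataset_idx]
    entry.setdefault("ego_extrinsics", []).append(extrinsics_3x4)
    with open(meta_path, "w") as f:
        json.dump(meta, f, indent=2)
```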

The visualization tool provides an interactive 3D viewer where you can:

  • Inspect point clouds and camera poses
  • Validate depth map quality
  • Manually annotate ego trajectories for in-the-wild videos (with --ego_manual flag)

👀 Ego Prior Rendering

For convenient batch processing, use the provided rendering script:

./scripts/render_vipe.sh

This script executes the point cloud rendering with the following configuration:

  • --input_dir: ViPE inference results directory
  • --out_dir: Output directory for rendered results
  • --meta_json_path: JSON file that includes camera parameters
  • --point_size: Point cloud visualization size
  • --start_frame/--end_frame: Frame range (both inclusive)
  • --fish_eye_rendering: Enables fish-eye distortion rendering
  • --use_mean_bg: Uses mean background for rendering
  • --only_bg: Renders only the background point clouds (excluding dynamic instances' point clouds)

Camera Parameters Format

The meta.json file should contain camera intrinsics and extrinsics in the following format:

{
  "test_datasets": [
    {
      "exo_path": "./example/in_the_wild/videos/joker/exo.mp4",
      "ego_prior_path": "./example/in_the_wild/videos/joker/ego_Prior.mp4",
      "camera_intrinsics": [[fx, 0, cx], [0, fy, cy], [0, 0, 1]],
      "camera_extrinsics": [[r11, r12, r13, tx], [r21, r22, r23, ty], [r31, r32, r33, tz]],
      "ego_intrinsics": [[fx, 0, cx], [0, fy, cy], [0, 0, 1]],
      "ego_extrinsics": [
        [[r11, r12, r13, tx], [r21, r22, r23, ty], [r31, r32, r33, tz]],
        ...
      ]
    }
  ]
}

All extrinsics matrices are in world-to-camera format (3x4). The script will automatically convert them to 4x4 format by adding [0, 0, 0, 1] as the last row.
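The 3x4 to 4x4 conversion is straightforward; a minimal, purely illustrative sketch using numpy:

```python
# Minimal sketch of the 3x4 -> 4x4 world-to-camera conversion described
# above: append [0, 0, 0, 1] as the last row. Illustrative only.
import numpy as np

def to_homogeneous(extrinsics_3x4):
    E = np.asarray(extrinsics_3x4, dtype=np.float64)
    assert E.shape == (3, 4), "expect a 3x4 world-to-camera matrix"
    return np.vstack([E, [0.0, 0.0, 0.0, 1.0]])

E44 = to_homogeneous([[1, 0, 0, 0.1],
                      [0, 1, 0, 0.2],
                      [0, 0, 1, 0.3]])
```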

Manual Rendering Command

For manual execution or custom configurations, you can also run the rendering script directly:

python scripts/render_vipe_pointcloud.py \
  --input_dir vipe_results/YOUR_VIDEO_NAME \
  --meta_json_path /path/to/meta.json \
  --out_dir /path/to/output \
  --start_frame 0 \
  --end_frame 100 \
  --point_size 5.0 \
  --fish_eye_rendering \
  --use_mean_bg

Output Structure

The rendered results will be saved as MP4 videos (30 FPS) in the following structure:

example/egoexo4D/videos/
├── cmu_soccer_06_6_877_925/
│   ├── ego_Prior.mp4
│   └── exo.mp4
├── iiith_cooking_57_2_2451_2499/
│   ├── ego_Prior.mp4
│   └── exo.mp4
├── sfu_basketball014_4_1000_1048/
│   ├── ego_Prior.mp4
│   └── exo.mp4
└── ...

Each result is saved in a directory named after the input ViPE result (e.g., vipe_results/joker → joker/ego_Prior.mp4).

Example of Ego Prior Rendering

👀 Converting Depth Maps for EgoX Model

After ViPE inference, you need to convert the depth maps from .zip archives (containing .exr files) to .npy format that the EgoX model can process:

python scripts/convert_depth_zip_to_npy.py \
  --depth_path {EgoX_path}/vipe_results/YOUR_VIDEO/depth \
  --egox_depthmaps_path {EgoX_path}/example/egoexo4D/depth_maps

This script will:

  • Extract all .exr depth maps from the zip archive(s) in the specified directory
  • Convert them to .npy format
  • Save them to {egox_depthmaps_path}/{zip_filename}/ directory structure

Note: This conversion step is independent of EgoPrior rendering and is specifically required as a preprocessing step before feeding data into the EgoX model.
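A rough sketch of this conversion logic follows. The EXR decode step is injected as a callable because it depends on an EXR-capable reader (e.g., OpenCV's cv2.imread with cv2.IMREAD_UNCHANGED); all names here are illustrative, not the script's actual code.

```python
# Hedged sketch of the zip-of-EXR -> .npy conversion described above.
# decode_exr takes raw bytes of one .exr entry and returns a depth array.
import os
import zipfile
import numpy as np

def convert_depth_zip(zip_path, out_root, decode_exr):
    """Extract each .exr in zip_path, decode it, and save as .npy under
    out_root/<zip_filename>/ (mirroring the structure described above)."""
    stem = os.path.splitext(os.path.basename(zip_path))[0]
    out_dir = os.path.join(out_root, stem)
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".exr"):
                continue
            depth = decode_exr(zf.read(name))          # (H, W) depth map
            base = os.path.splitext(os.path.basename(name))[0]
            np.save(os.path.join(out_dir, base + ".npy"), depth)
```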

Performance Tips

  • Tuning ViPE inference: You can adjust temporal and spatial consistency in ViPE inference results by:
    • Changing the underlying models used internally by ViPE (e.g., switching depth estimation models)
    • Adjusting model sizes (e.g., using larger models for better quality or smaller models for faster processing)
    • Modifying pipeline configurations to balance between temporal consistency and 3D spatial consistency
  • Use the visualization tools (vipe visualize) to preview results before running extensive rendering jobs
  • The rendering quality depends on the depth estimation quality from the original ViPE inference

👀 EgoExo4D Training Data Preprocessing

To preprocess EgoExo4D data for training EgoX, we provide a comprehensive pipeline that automates ViPE inference and ego prior rendering across multiple takes.

🚀 Quick Start

To get started quickly with the example data:

bash data_preprocess/scripts/infer_vipe_all_takes.sh

This will process the example data in data_preprocess/example/ and generate ego prior videos. The script automatically:

  • Runs ViPE inference on all takes in the example dataset
  • Generates meta.json files from ego pose annotations
  • Renders ego prior videos for each camera
  • Selects the best camera based on quality metrics
  • Saves results to data_preprocess/data/{START_FRAME}_{END_FRAME}/best_ego_view_rendering/

For custom datasets, configure data_preprocess/scripts/config.sh with your data paths before running the script. See the sections below for detailed instructions.

Data Structure

The preprocessing pipeline expects the following directory structure:

your_data_directory/
├── takes/
│   ├── take_name_1/
│   │   └── frame_aligned_videos/
│   │       └── downscaled/
│   │           └── 448/
│   │               ├── cam01.mp4
│   │               ├── cam02.mp4
│   │               └── ...
│   └── take_name_2/
│       └── ...
├── annotations/
│   └── ego_pose/
│       └── test/
│           └── camera_pose/
│               ├── uuid_1.json
│               ├── uuid_2.json
│               └── ...
└── captures.json

Example Data: See data_preprocess/example/ for a minimal example of the required data structure with 3 sample takes.

Configuration

  1. Edit the configuration file (data_preprocess/scripts/config.sh):
# Paths
WORKING_DIR="/path/to/your/output/directory"  # Output directory
DATA_DIR="/path/to/your/egoexo4d/data"        # Input data directory (read-only)

# Frame range
START_FRAME=0
END_FRAME=48  # Or auto-calculated: END_FRAME=$((START_FRAME + 49 - 1))

# Rendering
POINT_SIZE="5.0"

# Multiprocessing
BATCH_SIZE=6  # Number of parallel processes (recommended: 6-8)
  2. Key Configuration Parameters:
    • WORKING_DIR: Directory where all output files (ViPE results, rendered videos, metadata) will be saved
    • DATA_DIR: Path to your EgoExo4D dataset directory containing takes/, annotations/, and captures.json
    • START_FRAME / END_FRAME: Frame range to process (default: 0-48 for 49 frames)
    • BATCH_SIZE: Number of takes to process in parallel

Running the Preprocessing Pipeline

After configuring config.sh, run the batch processing script:

cd /path/to/EgoX-EgoPriorRenderer
bash data_preprocess/scripts/infer_vipe_all_takes.sh

The script will:

  1. Load all takes from DATA_DIR/takes/
  2. Run ViPE inference for each camera in each take (using the lyra pipeline)
  3. Generate meta.json files automatically from ego_pose annotations
  4. Render ego prior videos for each camera
  5. Select the best camera based on rendering quality metrics
  6. Save final results to WORKING_DIR/data/{START_FRAME}_{END_FRAME}/best_ego_view_rendering/
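The batching and resume behavior above can be sketched roughly as follows. process_take is a stand-in for the real per-take pipeline (ViPE inference, meta.json generation, rendering, best-camera selection), and every name here is hypothetical; the actual script's mechanism may differ.

```python
# Illustrative sketch of BATCH_SIZE-way parallel take processing with
# resume support (takes whose result directories already exist are
# skipped). Threads are used since the real work would be subprocess-bound.
import os
from concurrent.futures import ThreadPoolExecutor

def process_take(take_name):
    # Stand-in for: ViPE inference -> meta.json -> rendering -> selection.
    return take_name

def run_all_takes(take_names, done_dir, batch_size=6):
    pending = [t for t in take_names
               if not os.path.isdir(os.path.join(done_dir, t))]
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        return list(pool.map(process_take, pending))
```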

Output Structure

The preprocessing pipeline generates the following output structure:

WORKING_DIR/
├── data/
│   └── {START_FRAME}_{END_FRAME}/
│       ├── best_ego_view_rendering/
│       │   ├── take_name_1/
│       │   │   ├── ego_Prior/
│       │   │   │   └── ego_Prior.mp4
│       │   │   ├── exo_GT/
│       │   │   │   └── frame_*.png
│       │   │   ├── ego_GT/
│       │   │   │   └── frame_*.png
│       │   │   └── metadata.json
│       │   └── take_name_2/
│       │       └── ...
│       ├── vipe_results/
│       │   └── take_name_1/
│       │       └── camera_result_subdir/
│       │           ├── pose/
│       │           ├── rgb/
│       │           ├── depth/
│       │           └── ...
│       └── meta_files/
│           └── meta_take_name_result_subdir.json
└── take_name_to_uuid_mapping.json

Advanced Options

You can also specify batch size via command-line arguments:

bash data_preprocess/scripts/infer_vipe_all_takes.sh --batch-size 8

Notes

  • The script automatically creates a UUID mapping file (take_name_to_uuid_mapping.json) from ego_pose annotations if it doesn't exist
  • Processing can be resumed: the script skips takes that already have completed results in best_ego_view_rendering/
  • Error logs are saved to WORKING_DIR/data/{START_FRAME}_{END_FRAME}/.error/ for debugging
  • The best camera selection is based on rendering quality metrics (frames with white pixels, total white pixels)
  • If you want to reproduce the train/val dataset, please refer to the dataset info, download data and metadata from EgoExo4D, and follow the preprocessing pipeline
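As a rough illustration of the white-pixel metric mentioned above (white pixels typically mark holes where no points were rendered, so fewer is better), the selection might look like the sketch below; the threshold and exact scoring are assumptions.

```python
# Illustrative sketch of white-pixel-based best-camera selection; the
# threshold and tie-breaking are assumptions, not the actual metric.
import numpy as np

def white_pixel_score(frames, thresh=250):
    """frames: list of (H, W, 3) uint8 rendered images.
    Returns (frames_with_white, total_white_pixels); lower is better."""
    frames_with_white = 0
    total_white = 0
    for img in frames:
        white = np.all(img >= thresh, axis=-1)   # pixels near pure white
        n = int(white.sum())
        total_white += n
        frames_with_white += int(n > 0)
    return frames_with_white, total_white

def select_best_camera(renders):
    """renders: dict cam_name -> list of frames; pick the lowest score."""
    return min(renders, key=lambda cam: white_pixel_score(renders[cam]))
```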

πŸ™ Acknowledgements

This ego prior rendering codebase for EgoX is built upon the ViPE (Video Pose Engine) project. We gratefully acknowledge their excellent work in video pose estimation and depth map generation. For more details, please visit the ViPE GitHub repository.

About

Official implementation of EgoX EgoPrior Renderer: Generate ego-view videos from exo-view input using ViPE-based 3D reconstruction and point cloud rendering
