World Scenario Video Generation#

This tool generates control videos from 3D scene annotations for Cosmos-Transfer2.5. It renders world models into videos by projecting 3D elements (polylines, polygons, and cuboids) onto camera views.

Supported input formats:

  • Parquet format: Structured scene annotations in parquet files

  • RDS-HQ format: NVIDIA’s internal format from the Cosmos-Drive-Dreams dataset

Additional Requirements#

In addition to the standard Transfer2.5 Prerequisites, you will need the following:

  • Python 3.10+

  • UV (for dependency management)

  • A GPU with EGL support (for headless OpenGL rendering)

  • 3D scene annotation data in Parquet or RDS-HQ format

Install Dependencies#

Use the following commands to install dependencies and activate the virtual environment:

uv sync
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Generate Control Videos#

The script automatically detects whether your input is in Parquet or RDS-HQ format.

The following command generates control videos (all seven cameras by default):

python scripts/generate_control_videos.py -i /path/to/{input_root} -o ./{save_root}

The following command generates control videos for specific cameras only:

python scripts/generate_control_videos.py -i {input_root}/ -o {save_root}/ \
    --cameras "camera:front:wide:120fov,camera:cross:right:120fov"

Command Options#

| Option | Default | Description |
|---|---|---|
| --cameras | all | A comma-separated list of camera names, or “all” for all seven cameras |

Available Cameras#

  • camera:front:wide:120fov

  • camera:front:tele:sat:30fov

  • camera:cross:right:120fov

  • camera:cross:left:120fov

  • camera:rear:left:70fov

  • camera:rear:right:70fov

  • camera:rear:tele:30fov
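
When memory is tight, one option is to render the cameras listed above in smaller batches by invoking the script several times with different --cameras values. The following is a minimal sketch that only builds the command lines. It assumes nothing beyond the -i, -o, and --cameras options documented on this page; the helper name build_batch_commands and the batch size are illustrative and not part of the tool.

```python
from typing import List

# Camera names exactly as listed in this document.
ALL_CAMERAS = [
    "camera:front:wide:120fov",
    "camera:front:tele:sat:30fov",
    "camera:cross:right:120fov",
    "camera:cross:left:120fov",
    "camera:rear:left:70fov",
    "camera:rear:right:70fov",
    "camera:rear:tele:30fov",
]


def build_batch_commands(
    input_root: str,
    save_root: str,
    cameras: List[str],
    batch_size: int = 2,
) -> List[List[str]]:
    """Hypothetical helper: one command line per batch of cameras."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    unknown = [name for name in cameras if name not in ALL_CAMERAS]
    if unknown:
        raise ValueError(f"Unknown camera names: {unknown}")

    commands = []
    for start in range(0, len(cameras), batch_size):
        batch = cameras[start:start + batch_size]
        commands.append([
            "python", "scripts/generate_control_videos.py",
            "-i", input_root,
            "-o", save_root,
            "--cameras", ",".join(batch),  # comma-separated, as documented
        ])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True). Whether separate runs can safely share one output directory is an assumption worth verifying on a small scene first.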

Complete Example#

The following end-to-end example uses Parquet input data:

# Download example data
wget -P assets https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/assets/multiview_example1.zip && unzip -oq assets/multiview_example1.zip -d assets

# Generate control videos for the example scene
python scripts/generate_control_videos.py -i assets/multiview_example1/scene_annotations -o outputs/multiview_example1_world_scenario_videos

Additional example datasets:

wget https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/assets/multiview_example2.zip
wget https://github.com/nvidia-cosmos/cosmos-dependencies/releases/download/assets/multiview_example3.zip

RDS-HQ Example#

To use data from the Cosmos-Drive-Dreams dataset in RDS-HQ format:

wget -P scripts https://raw.githubusercontent.com/nv-tlabs/Cosmos-Drive-Dreams/main/scripts/download.py
python scripts/download.py --odir ./assets/rdshq-data --limit 1
python scripts/generate_control_videos.py -i assets/rdshq-data -o outputs/rdshq-generated

Data Format#

Input Structure#

Parquet format:

scene_annotations_directory/
├── uuid.obstacle.parquet              (required)
├── uuid.calibration_estimate.parquet  (required)
├── uuid.egomotion_estimate.parquet    (required)
├── uuid.lane.parquet                  (optional)
├── uuid.lane_line.parquet             (optional)
└── ... (other optional parquet files)
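
Before running the script, it can help to confirm that every scene in the directory has its three required files. The following is a minimal sketch, assuming file names follow the uuid.kind.parquet pattern shown above and that the uuid itself contains no dots; find_missing_required is a hypothetical helper, not part of the tool.

```python
from pathlib import Path
from typing import Dict, List, Set

# Required file kinds, as listed in the input structure above.
REQUIRED_KINDS = ("obstacle", "calibration_estimate", "egomotion_estimate")


def find_missing_required(scene_dir: str) -> Dict[str, List[str]]:
    """Return {uuid: [missing required kinds]} for incomplete scenes."""
    present: Dict[str, Set[str]] = {}
    for path in Path(scene_dir).glob("*.parquet"):
        # "<uuid>.<kind>.parquet" -> "<uuid>.<kind>"
        stem = path.name[: -len(".parquet")]
        uuid, sep, kind = stem.partition(".")
        if sep:  # skip files that do not match the expected pattern
            present.setdefault(uuid, set()).add(kind)

    report: Dict[str, List[str]] = {}
    for uuid, kinds in sorted(present.items()):
        missing = [k for k in REQUIRED_KINDS if k not in kinds]
        if missing:
            report[uuid] = missing
    return report
```

An empty result means every scene found in the directory has all required files. A scene with no Parquet files at all is not detected by this check.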

RDS-HQ format: NVIDIA’s recording format containing sensor data and annotations. The script automatically extracts the required scene information.

Output Structure#

Both input formats produce the same output structure:

save_root/
└── uuid/
    ├── uuid.camera_front_wide_120fov.mp4
    ├── uuid.camera_front_tele_sat_30fov.mp4
    ├── uuid.camera_cross_right_120fov.mp4
    ├── uuid.camera_cross_left_120fov.mp4
    ├── uuid.camera_rear_left_70fov.mp4
    ├── uuid.camera_rear_right_70fov.mp4
    └── uuid.camera_rear_tele_30fov.mp4
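
Comparing the camera names with the file names above suggests that the colons in a camera name become underscores in the output file name. The following sketch reproduces that naming pattern so downstream code can locate a given video. The function name output_video_path is hypothetical, and the mapping is inferred from the listing rather than taken from the tool's source.

```python
from pathlib import Path


def output_video_path(save_root: str, uuid: str, camera: str) -> Path:
    """Build the expected output path for one scene and camera.

    Assumption: colons in the camera name map to underscores,
    as the output listing in this document suggests.
    """
    camera_tag = camera.replace(":", "_")
    return Path(save_root) / uuid / f"{uuid}.{camera_tag}.mp4"
```

For example, output_video_path("save_root", "uuid", "camera:front:wide:120fov") points at save_root/uuid/uuid.camera_front_wide_120fov.mp4.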

Rendered Elements#

The following elements are always rendered:

  • 3D bounding boxes for vehicles, pedestrians, and other dynamic objects (from the required obstacle.parquet file)

The following elements are optionally rendered if the corresponding Parquet file is provided:

  • Lane lines, lanes, and road boundaries

  • Crosswalks, road markings, and wait lines

  • Poles, traffic lights, and traffic signs

Troubleshooting#

| Issue | Solution |
|---|---|
| ModernGL/EGL errors | Install GPU drivers and EGL libraries (libGL.so.1, libEGL.so.1). On Ubuntu/Debian: apt install libegl1-mesa-dev libgl1-mesa-dri |
| Missing parquet files | Ensure the required files exist: obstacle.parquet, calibration_estimate.parquet, egomotion_estimate.parquet |
| Memory issues | Reduce the number of cameras processed simultaneously using --cameras |
| Invalid camera names | Run with --help to see valid options |
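
For ModernGL/EGL errors, a quick first check is whether the dynamic linker can locate the EGL and GL libraries at all. The sketch below uses the standard-library ctypes.util.find_library. The helper name find_gl_libraries is illustrative, and a successful lookup only shows that the libraries are installed, not that a headless context can actually be created on the GPU.

```python
import ctypes.util
from typing import Dict, Optional


def find_gl_libraries() -> Dict[str, Optional[str]]:
    """Look up the EGL and GL shared libraries; None means not found."""
    return {name: ctypes.util.find_library(name) for name in ("EGL", "GL")}


if __name__ == "__main__":
    for name, located in find_gl_libraries().items():
        status = located if located is not None else "NOT FOUND"
        print(f"lib{name}: {status}")
```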

Rendering Specifications#

Dynamic Objects#

Dynamic objects are rendered as solid 3D cuboids with light gray edges and front-to-back color gradients.

Object label mapping covers five categories:

  1. Car: automobile, other_vehicle, vehicle

  2. Truck: heavy_truck, bus, train_or_tram_car, trailer

  3. Pedestrian: person

  4. Cyclist: rider

  5. Others: protruding_object, animal, stroller
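
The label mapping above can be expressed as a lookup table. The following sketch mirrors the five categories exactly as listed. The names CATEGORY_TO_LABELS and categorize are illustrative, and returning None for an unlisted label is an assumption, since this page does not say how the tool treats unknown labels.

```python
from typing import Dict, List, Optional

# Categories and source labels exactly as listed in this document.
CATEGORY_TO_LABELS: Dict[str, List[str]] = {
    "Car": ["automobile", "other_vehicle", "vehicle"],
    "Truck": ["heavy_truck", "bus", "train_or_tram_car", "trailer"],
    "Pedestrian": ["person"],
    "Cyclist": ["rider"],
    "Others": ["protruding_object", "animal", "stroller"],
}

# Inverted view: source label -> rendering category.
LABEL_TO_CATEGORY: Dict[str, str] = {
    label: category
    for category, labels in CATEGORY_TO_LABELS.items()
    for label in labels
}


def categorize(label: str) -> Optional[str]:
    """Return the category for a label, or None if it is not listed (assumption)."""
    return LABEL_TO_CATEGORY.get(label)
```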

Lane Lines#

Lane lines are categorized into 15 types based on color (yellow, white, other) and style (solid, dashed, dotted, and solid-dashed combinations). For example, "yellow solid dashed" denotes a yellow solid line on the right and a yellow dashed line on the left, relative to the polyline direction.

Traffic Lights#

Traffic lights are rendered as cuboids with four states: Red, Yellow, Green, Unknown.

Map Elements#

Map elements use three geometry types:

  • Polylines: poles, road boundaries, wait lines

  • Polygons: crosswalks, road markings

  • Cuboids: traffic signs

Pipeline Overview#

Frame Rate Configuration#

The pipeline uses two configurable frame rates:

  • INPUT_POSE_FPS (default: 30 fps): Processing frame rate used for interpolation; determines how many frames are generated

  • TARGET_RENDER_FPS (default: 30 fps): Output video playback frame rate; determines playback speed

Source data is typically recorded at 10 Hz and is interpolated to the processing frame rate.
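
To illustrate how the two rates interact, the following sketch resamples a clip's time span at the processing rate and computes the resulting playback duration. It assumes timestamps in seconds and includes both endpoints; the tool's actual timestamp units and boundary handling may differ. The function names are illustrative, and only the two constant names come from this page.

```python
from typing import List

INPUT_POSE_FPS = 30     # processing rate: controls how many frames exist
TARGET_RENDER_FPS = 30  # playback rate: controls how fast they are shown


def resample_timestamps(t_start: float, t_end: float, fps: float) -> List[float]:
    """Evenly spaced timestamps (seconds) covering [t_start, t_end] at fps."""
    # Small epsilon guards against floating-point round-off at the end point.
    n_frames = int((t_end - t_start) * fps + 1e-9) + 1
    return [t_start + i / fps for i in range(n_frames)]


def playback_duration(n_frames: int, render_fps: float) -> float:
    """Length of the encoded video in seconds."""
    return n_frames / render_fps


if __name__ == "__main__":
    frames = resample_timestamps(0.0, 2.0, INPUT_POSE_FPS)
    print(len(frames), "frames,",
          playback_duration(len(frames), TARGET_RENDER_FPS), "seconds")
```

With both rates equal the video plays in real time. Under these assumptions, halving the playback rate while keeping the processing rate fixed would double the video length without changing the frame count.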

Processing Steps#

  1. Load camera calibration

  2. Parse and interpolate egomotion trajectory to processing frame rate

  3. Interpolate obstacle tracks to match egomotion timestamps

  4. Transform all geometries from world to camera coordinates

  5. Render each frame using OpenGL

  6. Encode output as MP4 video
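
Step 3 can be pictured as sampling each obstacle track at the egomotion timestamps. The sketch below linearly interpolates a track's 3D position at a query time and clamps outside the track's time range. It is a simplified illustration under those assumptions, not the tool's implementation: orientation would need a rotation-aware interpolation, and the real pipeline may handle out-of-range times differently.

```python
from bisect import bisect_right
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def interpolate_position(
    times: Sequence[float],
    positions: Sequence[Vec3],
    t: float,
) -> Vec3:
    """Linearly interpolate a position at time t.

    times must be strictly increasing and aligned with positions.
    Queries outside the range are clamped to the first or last sample.
    """
    if t <= times[0]:
        return positions[0]
    if t >= times[-1]:
        return positions[-1]
    hi = bisect_right(times, t)  # first sample strictly after t
    lo = hi - 1
    w = (t - times[lo]) / (times[hi] - times[lo])
    p0, p1 = positions[lo], positions[hi]
    return (
        p0[0] + w * (p1[0] - p0[0]),
        p0[1] + w * (p1[1] - p0[1]),
        p0[2] + w * (p1[2] - p0[2]),
    )
```

Calling this once per egomotion timestamp yields an obstacle position for every rendered frame.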

Coordinate Systems#

  • World coordinates: Right-handed system (x=forward, y=left, z=up)

  • Camera coordinates: Camera looks along the positive z-axis; x-axis is right, y-axis is down

  • FLU convention: Forward-Left-Up used for vehicle-to-camera transforms
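
The relationship between the two axis conventions can be sketched as follows. The first function re-expresses an FLU point in camera axes for an idealized camera located at the FLU origin and looking straight ahead. The second projects a camera-frame point to pixels with a simple pinhole model. Both are illustrative assumptions: the real pipeline applies the calibrated extrinsics and the camera model from the calibration data, which for wide-angle cameras is unlikely to be a plain pinhole.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def flu_to_camera_axes(p_flu: Vec3) -> Vec3:
    """Re-express (forward, left, up) as (right, down, forward).

    Assumes a camera at the FLU origin with no rotation relative to the vehicle.
    """
    forward, left, up = p_flu
    return (-left, -up, forward)


def project_pinhole(
    p_cam: Vec3, fx: float, fy: float, cx: float, cy: float
) -> Optional[Tuple[float, float]]:
    """Pinhole projection to pixel coordinates; None if behind the camera."""
    x, y, z = p_cam
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)
```

For instance, a point 10 m ahead, 1 m to the left, and 0.5 m up becomes (-1.0, -0.5, 10.0) in camera axes, so it projects left of and above the principal point (cx, cy).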

Next Steps#

The generated control videos serve as conditioning inputs for Cosmos-Transfer2.5 multiview inference. The rendered HD map visualizations provide spatial context for video generation tasks.