Model Reference#

This page details the options available when running inference with the Cosmos-Predict2.5 base models.

Auto Multiview Inference#

Multiview inference requires a minimum of 8 GPUs with at least 80GB memory each.

The following example runs multi-GPU inference with the example asset:

torchrun --nproc_per_node=8 examples/multiview.py -i assets/multiview/urban_freeway.json -o outputs/multiview_video2world --inference-type=video2world

All variants require a sample input video, but each variant uses it differently: Text2World does not use the video at all, Image2World uses only the first frame, and Video2World uses the first two frames.

Variant       Arguments
Text2World    -o outputs/multiview_text2world --inference-type=text2world
Image2World   -o outputs/multiview_image2world --inference-type=image2world
Video2World   -o outputs/multiview_video2world --inference-type=video2world
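
For example, the Image2World variant can be launched with the same torchrun invocation as above, swapping in the arguments from the table (this uses the same example asset; substitute your own input):

torchrun --nproc_per_node=8 examples/multiview.py -i assets/multiview/urban_freeway.json -o outputs/multiview_image2world --inference-type=image2world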

Example Outputs#

The following is an example of output from the Text2World variant:

Robot Action-Conditioned Inference#

The following example runs inference with the example asset:

python examples/action_conditioned.py -i assets/action_conditioned/basic/inference_params.json -o outputs/action_conditioned/basic

Note

Action-conditioned inference does not yet support multi-GPU execution.

Configuration#

Configuration is split into two parts (an illustrative sketch of both follows the list):

  1. Setup Arguments (ActionConditionedSetupArguments): Model-related configuration that typically stays the same across runs

    • model: Model variant to use (default: robot/multiview)

    • context_parallel_size: Context parallelism is not supported for the action-conditioned model; keep this set to 1

    • output_dir: Output directory for results

    • config_file: Model configuration file

  2. Inference Arguments (ActionConditionedInferenceArguments): Per-run parameters that can vary between runs

    • input_root: Root directory containing videos and annotations

    • input_json_sub_folder: Subdirectory containing JSON annotations

    • chunk_size: Action chunk size for processing

    • guidance: Guidance scale for generation

    • action_load_fn: Function to load action data

    • And many more…
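
The following sketch shows how these two groups might look side by side, written as plain dataclasses. It is illustrative only: the real class definitions live in the Cosmos-Predict2.5 codebase, and the output_dir and config_file values below are placeholders rather than actual defaults.

from dataclasses import dataclass

@dataclass
class ActionConditionedSetupArguments:
    model: str = "robot/multiview"          # model variant to use
    context_parallel_size: int = 1          # context parallelism is unsupported; keep at 1
    output_dir: str = "outputs/action_conditioned"  # placeholder output directory
    config_file: str = "config.yaml"        # placeholder model configuration file

@dataclass
class ActionConditionedInferenceArguments:
    input_root: str = "/path/to/input/data"         # videos and annotations
    input_json_sub_folder: str = "annotations"      # JSON annotation subdirectory
    chunk_size: int = 12                            # action chunk size for processing
    guidance: float = 7.0                           # guidance scale for generation
    action_load_fn: str = "cosmos_predict2.action_conditioned.load_default_action_fn"
    # ...plus many more per-run parameters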

JSON Configuration File#

The following is an example of a JSON configuration file:

{
  "name": "my_inference",
  "input_root": "/path/to/input/data",
  "input_json_sub_folder": "annotations",
  "save_root": "/path/to/output",
  "chunk_size": 12,
  "guidance": 7,
  "camera_id": "base",
  "start": 0,
  "end": 100,
  "action_load_fn": "cosmos_predict2.action_conditioned.load_default_action_fn"
}
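
Saved as, for example, my_inference.json, a config like this can be passed to the example script in the same way as the earlier command (the paths here are placeholders):

python examples/action_conditioned.py -i /path/to/my_inference.json -o outputs/action_conditioned/custom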

Custom Action Loading#

To use a custom action loading function, implement a function following this signature:

def custom_action_load_fn():
    # The outer function is a factory: it is called once and returns the per-sample load_fn.
    def load_fn(json_data: dict, video_path: str, args: ActionConditionedInferenceArguments) -> dict:
        # Your custom action loading logic here: build these three values
        # from json_data, video_path, and args.
        actions = ...      # numpy array of actions
        img_array = ...    # first frame of the video
        video_array = ...  # full video as an array
        return {
            "actions": actions,
            "initial_frame": img_array,
            "video_array": video_array,
            "video_path": video_path,
        }
    return load_fn
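
For instance, a loader that reads actions from the annotation JSON and decodes the conditioning video with OpenCV might look like the sketch below. The "actions" key in the annotation JSON and the choice of OpenCV are assumptions for illustration; adapt both to your own data layout.

import cv2  # any video decoder works; OpenCV is used here only as an example
import numpy as np

def custom_action_load_fn():
    def load_fn(json_data: dict, video_path: str, args) -> dict:
        # Assumed layout: the annotation JSON stores actions under an "actions" key.
        actions = np.asarray(json_data["actions"], dtype=np.float32)

        # Decode every frame of the conditioning video into an RGB array.
        cap = cv2.VideoCapture(video_path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        video_array = np.stack(frames)

        return {
            "actions": actions,
            "initial_frame": video_array[0],  # first frame
            "video_array": video_array,       # full video
            "video_path": video_path,
        }

    return load_fn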

You can then specify the function in your JSON config:

{
  "action_load_fn": "my_module.custom_action_load_fn"
}
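
Since action_load_fn is given as a dotted import path, the module that defines it (my_module in this example) will generally need to be importable in the environment where inference runs, for example by placing it on your PYTHONPATH.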

Example Outputs#