Reference#

This page contains additional details about running inference with Cosmos-Reason2.

Transformers#

Cosmos-Reason2 is included in transformers>=4.57.0.

You can run the minimal example as follows:

./scripts/inference_sample.py
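
For reference, below is a minimal sketch of what such a script can look like using the standard transformers image-text-to-text interface; the prompt, the path key in the message content, and the generation settings are illustrative assumptions, and scripts/inference_sample.py remains the supported example:

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Hedged sketch: assumes Cosmos-Reason2 follows the standard
# image-text-to-text interface in transformers>=4.57.0; see
# scripts/inference_sample.py for the supported example.
model_id = "nvidia/Cosmos-Reason2-2B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "assets/sample.png"},
            {"type": "text", "text": "Describe what is happening in this image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])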

Deployment#

For deployment and batch inference, we recommend using vLLM.

Inference Server#

Start the server:

uv run vllm serve nvidia/Cosmos-Reason2-2B \
  --allowed-local-media-path "$(pwd)" \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --port 8000

Arguments:

  • --max-model-len 8192: Cap the maximum context length (in tokens) to avoid out-of-memory (OOM) errors.

  • --media-io-kwargs '{"video": {"num_frames": -1}}': Remove the fixed frame count so the sampling FPS can be overridden per sample.

  • --reasoning-parser qwen3: Parse the reasoning trace out of the model response.

  • --port 8000: Server port. Change it if you encounter Address already in use errors.

Wait a few minutes for the server to start up. Once ready, it prints Application startup complete. Then open a new terminal to run the inference commands.

Caption a video as follows (sample output is in assets/outputs/caption.log):

uv run cosmos-reason2-inference online --port 8000 -i prompts/caption.yaml --videos assets/sample.mp4 --fps 4
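
Under the hood, the CLI talks to the server's OpenAI-compatible chat endpoint. Below is a minimal sketch of a direct query, assuming the openai Python package; the prompt text is illustrative, and the real prompt lives in prompts/caption.yaml:

import os

from openai import OpenAI

# Hedged sketch: query the vLLM server through its OpenAI-compatible API.
# file:// URLs work because the server was started with
# --allowed-local-media-path "$(pwd)".
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Cosmos-Reason2-2B",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {"url": "file://" + os.path.abspath("assets/sample.mp4")},
                },
                {"type": "text", "text": "Caption this video."},
            ],
        }
    ],
)
print(response.choices[0].message.content)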

Run embodied reasoning with verbose output (sample output is in assets/outputs/embodied_reasoning.log):

uv run cosmos-reason2-inference online -v --port 8000 -i prompts/embodied_reasoning.yaml --reasoning --images assets/sample.png
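
Because the server runs with --reasoning-parser qwen3, vLLM separates the reasoning trace from the final answer. Continuing the client sketch above, the trace is exposed as reasoning_content in current vLLM releases:

# The qwen3 parser splits the response into a trace and a final answer.
message = response.choices[0].message
print("reasoning:", message.reasoning_content)
print("answer:", message.content)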

To list available parameters:

uv run cosmos-reason2-inference online --help

Arguments:

  • --model nvidia/Cosmos-Reason2-2B: Model name or path.

Offline Inference#

Temporally caption a video and save the input frames to outputs/temporal_localization for debugging (sample output is in assets/outputs/temporal_localization.log):

uv run cosmos-reason2-inference offline -v --max-model-len 8192 -i prompts/temporal_localization.yaml --videos assets/sample.mp4 --fps 4 -o outputs/temporal_localization

To list available parameters:

uv run cosmos-reason2-inference offline --help
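
The offline command above maps onto vLLM's Python API. A hedged sketch of the underlying call, with an illustrative question in place of the prompt from prompts/temporal_localization.yaml:

import os

from vllm import LLM, SamplingParams

# Hedged sketch: batch inference without a server, using vLLM directly.
llm = LLM(model="nvidia/Cosmos-Reason2-2B", max_model_len=8192)
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {"url": "file://" + os.path.abspath("assets/sample.mp4")},
            },
            {"type": "text", "text": "At which timestamps does the key event occur?"},
        ],
    }
]
outputs = llm.chat(messages, sampling_params=sampling)
print(outputs[0].outputs[0].text)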

Quantization#

For model quantization, we recommend using llmcompressor; a sketch of the underlying API appears at the end of this section.

A quantization example is provided in scripts/quantize.py (sample output is in assets/outputs/quantize.log):

./scripts/quantize.py -o /tmp/cosmos-reason2/checkpoints

To list available parameters:

./scripts/quantize.py --help

Arguments:

  • --model nvidia/Cosmos-Reason2-2B: Model name or path.

  • --precision fp4: Precision to use for quantization.
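
As referenced above, here is a conceptual sketch of what scripts/quantize.py does with llmcompressor; the NVFP4 scheme, the ignore list, and the data-free call are assumptions matched to --precision fp4, not necessarily the script's exact settings:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "nvidia/Cosmos-Reason2-2B"
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Quantize linear layers to FP4, keeping the output head in full precision.
# Assumption: some schemes require a calibration dataset; consult the
# llmcompressor docs for the scheme you choose.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("/tmp/cosmos-reason2/checkpoints", save_compressed=True)
processor.save_pretrained("/tmp/cosmos-reason2/checkpoints")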