# Reference
This page contains additional details about running inference with Cosmos-Reason2.
## Transformers

Cosmos-Reason2 is included in `transformers>=4.57.0`. You can run the minimal example as follows:

```bash
./scripts/inference_sample.py
```
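Under the hood, that script is a standard Transformers vision-language loop. Below is a minimal sketch of the same flow, assuming the usual image-text-to-text auto classes and chat-template video inputs; the prompt and generation settings are illustrative, not copied from `scripts/inference_sample.py`:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "nvidia/Cosmos-Reason2-2B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

# A single video turn; the question is illustrative.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "assets/sample.mp4"},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding.
output = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(output[0])
```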
## Deployment

For deployment and batch inference, we recommend using `vllm`.
### Inference Server

Start the server:

```bash
uv run vllm serve nvidia/Cosmos-Reason2-2B \
  --allowed-local-media-path "$(pwd)" \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --port 8000
```
Arguments:

- `--max-model-len 8192`: Maximum model length, limited to avoid OOM.
- `--media-io-kwargs '{"video": {"num_frames": -1}}'`: Allow overriding the FPS per sample.
- `--reasoning-parser qwen3`: Parse the reasoning trace.
- `--port 8000`: Server port. Change it if you encounter `Address already in use` errors.
Wait a few minutes for the server to start up. Once it is ready, it prints `Application startup complete.`. Open a new terminal to run the inference commands below.
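If you want to script the wait rather than watch the log, vLLM's API server exposes a `/health` endpoint that returns HTTP 200 once the engine is ready. A small polling sketch (the URL assumes the default `--port 8000` above):

```python
import time
from urllib.request import urlopen

# Poll vLLM's /health endpoint instead of watching the log output.
while True:
    try:
        if urlopen("http://localhost:8000/health", timeout=2).status == 200:
            break
    except OSError:
        pass  # server not up yet
    time.sleep(5)
print("server ready")
```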
Caption a video as follows (sample output is in `assets/outputs/caption.log`):

```bash
uv run cosmos-reason2-inference online --port 8000 -i prompts/caption.yaml --videos assets/sample.mp4 --fps 4
```
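The CLI is a client over vLLM's OpenAI-compatible API, so you can also issue requests directly. A minimal sketch, assuming local `file://` video URLs are accepted (this is what `--allowed-local-media-path` permits); the prompt text is illustrative, not the contents of `prompts/caption.yaml`:

```python
import os
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; any API key is accepted.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

video_url = f"file://{os.path.abspath('assets/sample.mp4')}"
response = client.chat.completions.create(
    model="nvidia/Cosmos-Reason2-2B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url}},
                {"type": "text", "text": "Caption this video."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```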
Run embodied reasoning with verbose output (sample output is in `assets/outputs/embodied_reasoning.log`):

```bash
uv run cosmos-reason2-inference online -v --port 8000 -i prompts/embodied_reasoning.yaml --reasoning --images assets/sample.png
```
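With `--reasoning-parser qwen3` enabled on the server, chat completion responses carry the parsed reasoning trace separately from the final answer. Continuing the client sketch above (vLLM exposes the trace as a `reasoning_content` field on the message; whether a trace appears depends on the prompt):

```python
# Continuing the OpenAI-client sketch above.
message = response.choices[0].message
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```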
To list available parameters:

```bash
uv run cosmos-reason2-inference online --help
```

Arguments:

- `--model nvidia/Cosmos-Reason2-2B`: Model name or path.
### Offline Inference

Temporally caption a video and save the input frames to `outputs/temporal_localization` for debugging (sample output is in `assets/outputs/temporal_localization.log`):

```bash
uv run cosmos-reason2-inference offline -v --max-model-len 8192 -i prompts/temporal_localization.yaml --videos assets/sample.mp4 --fps 4 -o outputs/temporal_localization
```
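Offline mode runs vLLM's in-process engine instead of a server. A rough sketch of the equivalent raw vLLM usage, assuming `LLM.chat` accepts the same `video_url` content parts as the online API and that `allowed_local_media_path` is forwarded to the engine; the prompt is illustrative:

```python
import os
from vllm import LLM, SamplingParams

# allowed_local_media_path mirrors --allowed-local-media-path on the server
# (assumed to be accepted by the engine for file:// inputs).
llm = LLM(
    model="nvidia/Cosmos-Reason2-2B",
    max_model_len=8192,
    allowed_local_media_path=os.getcwd(),
)
sampling = SamplingParams(temperature=0.6, max_tokens=512)

video_url = f"file://{os.path.abspath('assets/sample.mp4')}"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": video_url}},
            {"type": "text", "text": "When does the main action start and end?"},
        ],
    }
]

outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```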
To list available parameters:

```bash
uv run cosmos-reason2-inference offline --help
```
## Quantization

For model quantization, we recommend using `llmcompressor`.

A quantization example is in `scripts/quantize.py` (sample output is in `assets/outputs/quantize.log`):

```bash
./scripts/quantize.py -o /tmp/cosmos-reason2/checkpoints
```

To list available parameters:

```bash
./scripts/quantize.py --help
```

Arguments:

- `--model nvidia/Cosmos-Reason2-2B`: Model name or path.
- `--precision fp4`: Precision to use for quantization.
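For reference, a one-shot weight-only FP4 flow with `llmcompressor` looks roughly like the sketch below. This is a generic example rather than the contents of `scripts/quantize.py`; the `NVFP4A16` scheme name and the choice to skip `lm_head` are assumptions based on common `llmcompressor` recipes:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "nvidia/Cosmos-Reason2-2B"
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Weight-only FP4 on all Linear layers; the output head stays in full precision.
# NVFP4A16 is data-free; activation-quantized FP4 schemes need a calibration set.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

save_dir = "/tmp/cosmos-reason2/checkpoints"
model.save_pretrained(save_dir, save_compressed=True)
processor.save_pretrained(save_dir)
```

The saved checkpoint can then be served with `vllm serve` as shown above, pointing the model argument at `save_dir`.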