Bring your own checkpoint#

Bring your own checkpoint for cosmos-transfer2.5-2b#

You can fine-tune Cosmos Transfer 2.5 2B on your own dataset by following the instructions in the official repository.

Using your own checkpoint in the NIM#

Mount the directory that contains your checkpoint into the NIM container and set the corresponding environment variables.

Example:

# if you have a finetuned edge control checkpoint in /path/to/folder/with/checkpoints/edge.pt
# set the path to the folder with the checkpoints
export CUSTOM_WEIGHTS_PATH_DIR=/path/to/folder/with/checkpoints
# set the name of the checkpoint to be used for the edge control
export EDGE_CHECKPOINT_NAME=edge.pt

docker run --name=transfer2 \
   --runtime=nvidia \
   --shm-size=32GB \
   --gpus=all \
   -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
   -v "$CUSTOM_WEIGHTS_PATH_DIR:/opt/nim/checkpoint" \
   -e NGC_API_KEY=$NGC_API_KEY \
   -e NIM_PERF_PROFILE="latency" \
   -e NIM_EDGE_CHECKPOINT="/opt/nim/checkpoint/$EDGE_CHECKPOINT_NAME" \
   -p 8000:8000 \
   --ulimit nofile=65536:65536 \
   $IMG_NAME

To use more than one fine-tuned control, set the corresponding environment variable for each one: NIM_EDGE_CHECKPOINT, NIM_VIS_CHECKPOINT, NIM_DEPTH_CHECKPOINT, or NIM_SEG_CHECKPOINT. Any combination of the four is allowed.
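As a rough illustration of combining controls, the following sketch prints one -e flag per control for which a checkpoint file name is set. Only EDGE_CHECKPOINT_NAME appears in the example above; the variables VIS_CHECKPOINT_NAME, DEPTH_CHECKPOINT_NAME, and SEG_CHECKPOINT_NAME, the helper name checkpoint_env_flags, and the file name depth.pt are assumptions made for illustration.

```shell
# Sketch only: emit "-e NIM_<CONTROL>_CHECKPOINT=<container path>" for every
# control whose <CONTROL>_CHECKPOINT_NAME variable is set and non-empty.
# The *_CHECKPOINT_NAME variables other than EDGE are hypothetical.
checkpoint_env_flags() {
    mount_point="/opt/nim/checkpoint"   # container-side mount target used above
    for control in EDGE VIS DEPTH SEG; do
        # Indirect lookup of e.g. $EDGE_CHECKPOINT_NAME (POSIX sh compatible).
        eval "file_name=\${${control}_CHECKPOINT_NAME:-}"
        if [ -n "$file_name" ]; then
            printf '%s NIM_%s_CHECKPOINT=%s/%s\n' "-e" "$control" "$mount_point" "$file_name"
        fi
    done
}

# Example: fine-tuned edge and depth controls; vis and seg keep the defaults.
EDGE_CHECKPOINT_NAME=edge.pt
DEPTH_CHECKPOINT_NAME=depth.pt
checkpoint_env_flags
```

The printed flags can take the place of the single NIM_EDGE_CHECKPOINT line in the docker run command, provided every listed file lives in the mounted checkpoint folder.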

The checkpoints in use are shown in the logs:

Step 1/3  Quantizing checkpoint
  Variant(s):  edge vis depth seg
  Output dir:  /opt/nim/.cache/trt_build/f627a6c1e98a/trt/quantized
  edge: /opt/nim/checkpoint/model_ema_bf16.pt
  vis: (default)
  depth: (default)
  seg: (default)

Note

When running with FP8 precision, the first startup requires FP8 calibration and TRT engine compilation, which takes a couple of hours. There may be periods without new log output; this does not mean the process is stuck. Use nvidia-smi to verify that the process is still running. Subsequent startups use the cached engines and take the same amount of time as a startup with the default checkpoint.

Note

When starting the container with FP8 precision, you may see warnings like “FP8 metadata keys will be generated during calibration — the _extra_state”. This is normal and expected. It indicates that FP8 calibration is in progress; no action is required.

Note

Make sure you pass the -v cache mount flag (-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" in the example above). The FP8-calibrated model is stored there, which avoids re-running calibration and engine compilation on subsequent startups.

To use BF16 precision, please set the NIM_TAGS_SELECTOR="precision=bf16" environment variable.

Troubleshooting#

If the checkpoint fails to load, check that the checkpoint directory is mounted into the container correctly and that the path you provide points to the checkpoint inside the container, not on the host system.
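One way to check the mapping is to translate the host path into the path the container sees, then list that file inside the running container. The sketch below assumes the mount from the example above (the checkpoint folder mounted at /opt/nim/checkpoint) and the container name transfer2; the helper name to_container_path is hypothetical.

```shell
# Sketch: map a checkpoint path on the host to the path visible in the container.
# Usage: to_container_path <host_dir> <container_dir> <host_file>
to_container_path() {
    host_dir=${1%/}          # strip a trailing slash, if any
    container_dir=${2%/}
    host_file=$3
    case "$host_file" in
        "$host_dir"/*)
            # Replace the host prefix with the container mount target.
            relative=${host_file#"$host_dir"/}
            printf '%s/%s\n' "$container_dir" "$relative"
            ;;
        *)
            echo "error: $host_file is not under $host_dir" >&2
            return 1
            ;;
    esac
}

to_container_path /path/to/folder/with/checkpoints /opt/nim/checkpoint \
    /path/to/folder/with/checkpoints/edge.pt
# To confirm the file is visible inside the running container:
#   docker exec transfer2 ls -l /opt/nim/checkpoint/edge.pt
```

The printed path is the value that NIM_EDGE_CHECKPOINT (or the variable for another control) should carry. If docker exec cannot list the file, the volume mount is the first thing to re-check.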

Bring your own checkpoint for Cosmos3-Generator#

Cosmos3-Generator supports loading a fine-tuned or pre-quantized checkpoint in place of the NGC-bundled model weights. The rest of the selected profile (parallelism, attention backend, guardrails) is unchanged — only the model weights themselves are swapped.

Override#

Set NIM_FT_CHECKPOINT=/abs/path/inside/container and bind-mount the host directory at that exact path:

docker run --rm -it \
    --runtime=nvidia \
    --gpus all \
    --shm-size 32GB \
    --ulimit nofile=65536:65536 \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p 8000:8000 \
    -e NGC_API_KEY="${NGC_API_KEY}" \
    -e NIM_FT_CHECKPOINT=/byoc/cosmos3-finetuned \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -v /host/path/to/byoc:/byoc/cosmos3-finetuned:ro \
    $IMG_NAME

The host directory should be mounted read-only — the NIM never writes to the BYOC checkpoint. Profile-managed artifacts (the guardrail bundle, scheduler / VAE references, etc.) continue to live under /opt/nim/.cache, which must stay writable.

Expected directory layout#

/abs/path/inside/container/
├── transformer/
│   └── config.json     # quantization_config dictates the precision; shape dictates size
├── vae/
├── scheduler/
└── model_index.json

All four entries are required. transformer/ must hold config.json plus the serialized weight shards.
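Before starting the container, the layout can be checked on the host with a short script. This is a minimal sketch: the function name check_byoc_layout is hypothetical, and it only tests for the entries named above. It does not inspect the weight shards, because their file names are not specified here.

```shell
# Sketch: verify that a BYOC directory contains the four required entries.
# Returns 0 when the layout looks complete, 1 otherwise.
check_byoc_layout() {
    root=${1%/}
    status=0
    # Directories: transformer/, vae/, scheduler/
    for dir in transformer vae scheduler; do
        if [ ! -d "$root/$dir" ]; then
            echo "missing directory: $root/$dir" >&2
            status=1
        fi
    done
    # Files: transformer/config.json and model_index.json
    for file in transformer/config.json model_index.json; do
        if [ ! -f "$root/$file" ]; then
            echo "missing file: $root/$file" >&2
            status=1
        fi
    done
    return "$status"
}

# Example (host path taken from the docker run command above):
#   check_byoc_layout /host/path/to/byoc && echo "layout OK"
```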

Auto-discovery and cross-check#

When NIM_FT_CHECKPOINT is set, the NIM:

  1. Auto-discovers the model size (nano / super) from the checkpoint’s transformer/config.json shape parameters and the precision (bf16 / fp8 / nvfp4) from transformer/config.json:quantization_config.

  2. Cross-checks those discovered values against the active profile (selected via NIM_MODEL_SIZE / NIM_PRECISION / NIM_TAGS_SELECTOR and the supportability gates). If the BYOC checkpoint does not match the selected profile (for example, NIM_MODEL_SIZE=nano but the checkpoint is the 32B (super) shape, or NIM_PRECISION=bf16 but the checkpoint is pre-quantized to FP8), the NIM raises a clear error showing both the expected and the received values and refuses to start.

The cross-check exists to catch profile / weights mismatches at boot time, before the NIM has loaded any model weights or compiled any engines, so misconfigurations fail fast rather than deep inside the inference path. To run a checkpoint of a different size or precision, update the matching NIM_* variable so the selected profile lines up with the BYOC checkpoint.
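The fail-fast behaviour can be pictured with a small comparison routine. This only illustrates the idea and is not the NIM's actual implementation; the function name cross_check_profile and the error wording are made up for the example.

```shell
# Illustration only: compare the profile's expected size/precision with the
# values discovered from the checkpoint, and fail before doing any heavy work.
# Usage: cross_check_profile <expected_size> <expected_precision> <found_size> <found_precision>
cross_check_profile() {
    if [ "$1" != "$3" ]; then
        echo "model size mismatch: expected $1, received $3" >&2
        return 1
    fi
    if [ "$2" != "$4" ]; then
        echo "precision mismatch: expected $2, received $4" >&2
        return 1
    fi
    return 0
}

# A profile selected with NIM_MODEL_SIZE=nano and NIM_PRECISION=bf16,
# checked against a checkpoint discovered as super / fp8:
cross_check_profile nano bf16 super fp8 || echo "refusing to start"
```

For the mismatch in this example, the remedy described above is to set NIM_MODEL_SIZE=super and NIM_PRECISION=fp8 so that the selected profile matches the checkpoint.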

Operational notes#

  • First-start latency. A BYOC checkpoint with a different shape or quantization from the bundled artifact triggers a fresh TRT-LLM engine build. On a cold cache this can take several minutes (longer for the 32B (super) size). Treat GET /v1/health/ready returning 200 as the only correct signal to start sending inference traffic — do not poll /v1/infer while the engine is still building.

  • Engine-build ulimits. The TRT-LLM engine build that runs on the first launch under BYOC pins large memory regions and uses a deep call stack. Pass --ulimit memlock=-1 --ulimit stack=67108864 to docker run to avoid build-time Resource temporarily unavailable or stack-overflow failures during quantize / compile.

  • Cache permissions. The BYOC mount itself is read-only, but the container still needs a writable cache directory for the resolved TRT-LLM engine, intermediate artifacts, and the profile-managed guardrail bundle. Prepare the host cache the same way as in the quickstart (mkdir -p "$LOCAL_NIM_CACHE" && chmod -R 777 "$LOCAL_NIM_CACHE" 2>/dev/null || true) and mount it at /opt/nim/.cache.

  • NIM_FT_CHECKPOINT path rules. NIM_FT_CHECKPOINT must be an absolute path inside the container — no ~, no relative segments, no symlinks. The matching -v bind-mount target must use exactly the same path. GET /v1/metadata returns this path verbatim in the checkpoint field, so it is also what users will see when inspecting the running NIM.
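The readiness rule from the first-start latency note can be sketched as a polling loop. The helper names probe_status and wait_until_ready, the base URL http://localhost:8000 (from the -p 8000:8000 mapping), and the attempt count and delay are assumptions for illustration; only the /v1/health/ready endpoint and the 200 status come from the text above.

```shell
# Sketch: block until GET /v1/health/ready returns 200, or give up.
# probe_status prints the HTTP status code for a URL ("000" if unreachable).
probe_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1" || true
}

# Usage: wait_until_ready <base_url> <max_attempts> <delay_seconds>
wait_until_ready() {
    base_url=$1
    max_attempts=$2
    delay_seconds=$3
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if [ "$(probe_status "$base_url/v1/health/ready")" = "200" ]; then
            return 0
        fi
        sleep "$delay_seconds"
        attempt=$((attempt + 1))
    done
    echo "not ready after $max_attempts attempts" >&2
    return 1
}

# Example: poll every 30 seconds for up to 60 attempts (values are arbitrary).
#   wait_until_ready http://localhost:8000 60 30 && echo "ready for inference"
```

Keeping the probe in its own function makes it easy to swap in a different client or to stub it out when testing the loop.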

Verifying the override#

The GET /v1/metadata endpoint exposes a checkpoint field. With BYOC active, the value is the override path (whatever NIM_FT_CHECKPOINT was set to); otherwise it reports the bundled profile checkpoint reference. Use this endpoint to confirm at runtime which weights the NIM is serving.

curl -s http://0.0.0.0:8000/v1/metadata | jq '.checkpoint'