Bring your own checkpoint

You can fine-tune Cosmos Transfer 2.5 2B on your own dataset by following the instructions in the official repository.

Using your own checkpoint in the NIM

Mount the directory containing your checkpoint into the NIM container and set the corresponding environment variables.

Example:

# If you have a fine-tuned edge control checkpoint at /path/to/folder/with/checkpoints/edge.pt,
# set the path to the folder that contains the checkpoints
export CUSTOM_WEIGHTS_PATH_DIR=/path/to/folder/with/checkpoints
# set the name of the checkpoint to be used for the edge control
export EDGE_CHECKPOINT_NAME=edge.pt

docker run --name=transfer2 \
   --runtime=nvidia \
   --shm-size=32GB \
   --gpus=all \
   -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
   -v "$CUSTOM_WEIGHTS_PATH_DIR:/opt/nim/checkpoint" \
   -e NGC_API_KEY=$NGC_API_KEY \
   -e NIM_PERF_PROFILE="latency" \
   -e NIM_EDGE_CHECKPOINT="/opt/nim/checkpoint/$EDGE_CHECKPOINT_NAME" \
   -p 8000:8000 \
   --ulimit nofile=65536:65536 \
   $IMG_NAME

To use more than one fine-tuned control, set the corresponding environment variable for each one, pointing to the checkpoint path inside the container: NIM_EDGE_CHECKPOINT, NIM_VIS_CHECKPOINT, NIM_DEPTH_CHECKPOINT, NIM_SEG_CHECKPOINT. Any combination of the four is allowed.
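As a minimal sketch of combining controls, the snippet below writes one variable per fine-tuned control into an env file that can be handed to docker run through its --env-file option. The file names edge.pt and depth.pt, the *_CHECKPOINT_NAME variables for controls other than edge, and the env file location are illustrative assumptions. It also assumes all checkpoints sit in the single folder mounted at /opt/nim/checkpoint.

```shell
# Hypothetical file names: replace with the checkpoints present in
# $CUSTOM_WEIGHTS_PATH_DIR. Leave a name empty to keep the default weights.
EDGE_CHECKPOINT_NAME=edge.pt
VIS_CHECKPOINT_NAME=
DEPTH_CHECKPOINT_NAME=depth.pt
SEG_CHECKPOINT_NAME=

# Write one NIM_<CONTROL>_CHECKPOINT line per control that has a name set.
# The env file location is an arbitrary choice for this sketch.
CHECKPOINT_ENV_FILE="${TMPDIR:-/tmp}/nim_checkpoints.env"
: > "$CHECKPOINT_ENV_FILE"
for control in EDGE VIS DEPTH SEG; do
   # Look up <CONTROL>_CHECKPOINT_NAME by name (empty if unset).
   eval "name=\${${control}_CHECKPOINT_NAME:-}"
   if [ -n "$name" ]; then
      echo "NIM_${control}_CHECKPOINT=/opt/nim/checkpoint/$name" >> "$CHECKPOINT_ENV_FILE"
   fi
done

# Preview what the container will receive.
cat "$CHECKPOINT_ENV_FILE"
```

To use it, pass --env-file "$CHECKPOINT_ENV_FILE" to docker run in place of the single -e NIM_EDGE_CHECKPOINT=... line from the example above.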

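The checkpoints in use are shown in the logs. In this example, the edge control uses a custom checkpoint named model_ema_bf16.pt and the other controls use the defaults: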

Step 1/3  Quantizing checkpoint
  Variant(s):  edge vis depth seg
  Output dir:  /opt/nim/.cache/trt_build/f627a6c1e98a/trt/quantized
  edge: /opt/nim/checkpoint/model_ema_bf16.pt
  vis: (default)
  depth: (default)
  seg: (default)

Note

When running with FP8 precision, the first startup requires FP8 calibration and TRT engine compilation, which takes a couple of hours. There may be periods with no new log output; this does not mean that the process is stuck. Use nvidia-smi to verify that the process is running. Subsequent startups use the cached engines and take the same amount of time as a startup with the default checkpoint.
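One way to keep an eye on a long first startup is sketched below as two small helper functions. The function names and the 30-second polling interval are arbitrary choices for this sketch; the container name transfer2 comes from the docker run example above.

```shell
# Follow the container logs as they are written.
follow_nim_logs() {
   docker logs --follow transfer2
}

# Re-print the nvidia-smi summary every 30 seconds; the process table at
# the bottom shows whether the NIM process is still holding the GPU.
watch_gpu() {
   nvidia-smi -l 30
}
```

Run follow_nim_logs in one terminal and watch_gpu in another, and stop either with Ctrl+C.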

Note

When starting the container with FP8 precision, you may see warnings like “FP8 metadata keys will be generated during calibration — the _extra_state”. This is normal and expected: it indicates that FP8 calibration is in progress, and no action is required.

Note

Make sure you mount the cache directory with the -v flag (-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" in the example above). The FP8-calibrated model is stored there, which avoids re-running calibration and engine compilation on subsequent startups.

To use BF16 precision, set the NIM_TAGS_SELECTOR="precision=bf16" environment variable.
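For illustration, the sketch below repeats the docker run example from earlier with the BF16 selector added, wrapped in a hypothetical function named run_nim_bf16 so it can be reused. All other flags are copied unchanged from that example; whether any of them (for instance NIM_PERF_PROFILE) should be adjusted for BF16 is not covered here.

```shell
# Hypothetical wrapper: same command as the earlier example, plus the
# NIM_TAGS_SELECTOR variable that requests BF16 precision.
run_nim_bf16() {
   docker run --name=transfer2 \
      --runtime=nvidia \
      --shm-size=32GB \
      --gpus=all \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
      -v "$CUSTOM_WEIGHTS_PATH_DIR:/opt/nim/checkpoint" \
      -e NGC_API_KEY="$NGC_API_KEY" \
      -e NIM_PERF_PROFILE="latency" \
      -e NIM_TAGS_SELECTOR="precision=bf16" \
      -e NIM_EDGE_CHECKPOINT="/opt/nim/checkpoint/$EDGE_CHECKPOINT_NAME" \
      -p 8000:8000 \
      --ulimit nofile=65536:65536 \
      "$IMG_NAME"
}
```

Call run_nim_bf16 after exporting the same variables as in the earlier example.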

Troubleshooting

If the checkpoint fails to load, check that the checkpoint folder is mounted into the Docker container correctly, and that the path you provide points to the checkpoint inside the container, not on the host system.
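The two checks can be sketched as shell helpers, one for the host side and one for the container side. The function names are hypothetical, the container name transfer2 and the mount point /opt/nim/checkpoint come from the example above, and the container-side check assumes that ls and env are available inside the image.

```shell
# Host side: confirm the checkpoint file exists in the folder being mounted.
# Usage: check_host_checkpoint DIR FILE_NAME
check_host_checkpoint() {
   if [ -r "$1/$2" ]; then
      echo "ok: $1/$2 is readable on the host"
   else
      echo "missing: $1/$2 not found or not readable on the host" >&2
      return 1
   fi
}

# Container side: list the mount point and print the checkpoint variables
# that the running container actually received.
check_container_checkpoint() {
   docker exec transfer2 ls -l /opt/nim/checkpoint
   docker exec transfer2 env | grep '_CHECKPOINT='
}
```

For example, run check_host_checkpoint "$CUSTOM_WEIGHTS_PATH_DIR" "$EDGE_CHECKPOINT_NAME" before starting the container, and check_container_checkpoint while it is running. The paths printed by the second helper should start with /opt/nim/checkpoint, not with the host folder.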