Deployment Guide for NVIDIA Earth-2 Correction Diffusion NIM#

Use this documentation for details about how to deploy the NVIDIA Earth-2 Correction Diffusion (CorrDiff) NIM.

Important

Before you can use this documentation, you must satisfy all prerequisites.

View available NIM container information#

As with other container images on NGC, you can view the available image tags with the following NGC CLI command:

ngc registry image info nvcr.io/nim/nvidia/corrdiff:1.0.0

Pull the container image#

Pull the container image using one of the following commands:

Docker#

docker pull nvcr.io/nim/nvidia/corrdiff:1.0.0

NGC CLI#

ngc registry image pull nvcr.io/nim/nvidia/corrdiff:1.0.0

Run the Container#

As in the quickstart guide, you can run the following command to start the CorrDiff NIM:

export LOCAL_NIM_CACHE=~/.cache/nim
export NGC_API_KEY=<NGC API Key>

docker run --rm --name corrdiff --runtime=nvidia --gpus all --shm-size 1g \
    -p 8000:8000 \
    -e NGC_API_KEY \
    -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
    -u $(id -u) \
    nvcr.io/nim/nvidia/corrdiff:1.0.0

The following is an overview of the command, its options, and other options available at NIM startup:

  • docker run — This is the command to run a new container from a Docker image.

  • --rm — This flag tells Docker to automatically remove the container when it exits. This is useful for one-off runs or testing, as it prevents stopped containers from being left behind.

  • --name corrdiff — This flag gives the container the name “corrdiff”.

  • --runtime=nvidia — This flag specifies the runtime to use for the container. In this case, it is set to “nvidia”, which is used for GPU acceleration.

  • --gpus all — This flag lets the NIM use all GPUs available on the machine the container is deployed on.

  • --shm-size 1g — This flag allocates shared host memory for Triton to use. You might need to increase this depending on your deployment.

  • -p 8000:8000 — This flag maps port 8000 on the host machine to port 8000 in the container. This allows you to access the container’s services from the host machine.

  • -e NGC_API_KEY — This passes the NGC_API_KEY environment variable (and the value set in the parent terminal) to the container. This is used for authentication with NVIDIA’s NGC (NVIDIA GPU Cloud) service, including downloading the model data if it is not present in the NIM Cache.

  • -v <source>:<dest> — This flag mounts the host directory $LOCAL_NIM_CACHE (~/.cache/nim in this example) into the container at /opt/nim/.cache so that downloaded models can be stored and reused across runs.

  • -u $(id -u) — Use the same user as your system user inside the NIM container to avoid permission mismatches when downloading models in your local cache directory.
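These options can be adjusted to fit your deployment. For example, a minor variation of the command above (the port and device index here are illustrative) publishes the API on host port 9000 instead of 8000 and restricts the NIM to a single GPU:

```shell
# Variation on the startup command above: map host port 9000 to the
# container's port 8000 and expose only GPU 0 to the container.
docker run --rm --name corrdiff --runtime=nvidia --gpus '"device=0"' --shm-size 1g \
    -p 9000:8000 \
    -e NGC_API_KEY \
    -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
    -u $(id -u) \
    nvcr.io/nim/nvidia/corrdiff:1.0.0
```

With this mapping, the API is reachable at http://localhost:9000 on the host while the container still listens on port 8000 internally.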

Checking the status of the CorrDiff NIM#

You can view running docker containers on your system by using the following command:

docker ps

This returns output that looks like the following if the NIM is running:

CONTAINER ID   IMAGE                               COMMAND                  CREATED          STATUS          PORTS                                                                                                          NAMES
d114948j4f55   nvcr.io/nim/nvidia/corrdiff:1.0.0   "/opt/nvidia/nvidia_…"   46 minutes ago   Up 46 minutes   6006/tcp, 0.0.0.0:8000->8000/tcp, :::8000->8000/tcp, 8888/tcp, 0.0.0.0:50051->50051/tcp, :::50051->50051/tcp   corrdiff

The first column in the output is the Docker container ID, which is useful for interacting with the container. The remaining fields describe the image the container is running, the command (in this case, the NIM server software), when the container was created, its status (including how long it has been running), any open ports, and finally its name (given by the startup command).

To check that the NIM is fully started and healthy, use the health API, which returns a 200 response when the NIM is ready:

curl -X 'GET' \
    'http://localhost:8000/v1/health/ready' \
    -H 'accept: application/json'
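Because the first startup can take a while (the model checkpoint must be downloaded), it can be convenient to poll this endpoint in a script. The helper below is a sketch, not part of the NIM itself; the URL, retry count, and interval are illustrative defaults.

```shell
# Poll the readiness endpoint until it returns HTTP 200, or give up
# after a number of retries. Usage: wait_for_nim [url] [retries] [interval]
wait_for_nim() {
  url="${1:-http://localhost:8000/v1/health/ready}"
  retries="${2:-30}"
  interval="${3:-10}"
  i=0
  while [ "$i" -lt "$retries" ]; do
    # -w '%{http_code}' prints only the HTTP status code of the response
    status=$(curl -s -o /dev/null -w '%{http_code}' "$url" 2>/dev/null || true)
    if [ "$status" = "200" ]; then
      echo "NIM is ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "NIM not ready after $retries attempts" >&2
  return 1
}
```

For example, "wait_for_nim" with no arguments polls localhost:8000 every 10 seconds for up to 5 minutes before giving up.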

Stop a running container#

If your container has entered a state in which you cannot interact with it, or it otherwise must be stopped, you can use the kill command with the ID obtained from the docker ps command. This immediately terminates the running container.

Note

Any in-flight requests are canceled, and data might be lost.

The following example stops the running container using the ID from the previous example.

docker kill d114948j4f55
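If a graceful shutdown is preferred, docker stop sends SIGTERM first and only escalates to SIGKILL after a timeout (10 seconds by default), which gives in-flight requests a chance to finish. The 30-second timeout below is illustrative.

```shell
# Graceful alternative to docker kill: wait up to 30 seconds for the
# container to shut down cleanly before force-killing it.
docker stop -t 30 corrdiff
```

The container name or ID from docker ps can be used interchangeably here.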

Model Checkpoint Caching#

On initial startup, the container downloads the CorrDiff model parameters and supporting data from NGC. You can skip this download step on future runs by caching the model weights locally using a cache directory as in the example below.

# Create the cache directory on the host machine
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p $LOCAL_NIM_CACHE

# Run the container with the cache directory mounted in the appropriate location
docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
    -p 8000:8000 \
    -e NGC_API_KEY \
    -v $LOCAL_NIM_CACHE:/opt/nim/.cache \
    -u $(id -u) \
    nvcr.io/nim/nvidia/corrdiff:1.0.0

Note

Caching the model checkpoint can save a considerable amount of time on subsequent container runs.

Important

It is critical that the directory at LOCAL_NIM_CACHE is readable and writable by the user inside the Docker container. If the NIM reports a permissions error while downloading the model checkpoint, the permissions on the mounted folder are incorrect.
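Because the container runs as your own UID (via -u $(id -u)), making the cache directory writable by your user on the host is usually sufficient. A minimal sketch:

```shell
# Create the cache directory and make sure it is readable and writable
# by the current user, who is also the user inside the container.
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod -R u+rwX "$LOCAL_NIM_CACHE"
```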

Air Gap Deployment (offline cache)#

The CorrDiff NIM supports serving models on an air-gapped system (also known as an air wall or disconnected network). Upon startup, if the NIM detects that the model files have already been downloaded, it serves the model directly from the cache. You can pre-cache a NIM's model package by mounting a local cache folder, launching the NIM on hardware with access to the NGC servers, and then moving the cached model files to the air-gapped system.

Start with an online deployment with a mounted model folder:

export LOCAL_NIM_CACHE=<Local folder>
export NGC_API_KEY=<NGC API Key>

docker run --rm --runtime=nvidia --gpus all --shm-size 4g \
    -p 8000:8000 \
    -e NGC_API_KEY \
    -v ${LOCAL_NIM_CACHE}:/opt/nim/.cache \
    -u $(id -u) \
    -t nvcr.io/nim/nvidia/corrdiff:1.0.0

The model package should now be in LOCAL_NIM_CACHE. The NIM can now be launched using this cache without the NGC_API_KEY.
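The cached files can be moved with ordinary tools; the archive name and paths below are illustrative.

```shell
# On the connected machine: archive the populated cache directory.
# (The default below is only a fallback if LOCAL_NIM_CACHE is unset.)
LOCAL_NIM_CACHE="${LOCAL_NIM_CACHE:-$HOME/.cache/nim}"
mkdir -p "$LOCAL_NIM_CACHE"
tar -czf nim-cache.tgz -C "$LOCAL_NIM_CACHE" .

# Copy nim-cache.tgz to the air-gapped system, then extract it there
# into the directory that will be mounted into the container:
mkdir -p "$HOME/.cache/nim"
tar -xzf nim-cache.tgz -C "$HOME/.cache/nim"
```

On the air-gapped system, run the same docker run command as above with the extracted directory mounted at /opt/nim/.cache, omitting -e NGC_API_KEY.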