Configure NVIDIA NeMo Retriever Reranking NIM

NVIDIA NeMo Retriever Reranking NIM uses Docker containers under the hood. Each NIM is its own Docker container, and there are several ways to configure it. Use this documentation to learn how to configure a NeMo Retriever Reranking NIM container.

GPU Selection

The NIM container is GPU-accelerated and uses NVIDIA Container Toolkit for access to GPUs on the host.

You can specify the --gpus all command-line argument to the docker run command if the host has one or more GPUs of the same model. If the host has a combination of GPUs, such as an A6000 and a GeForce display GPU, run the container on compute-capable GPUs only.

Expose specific GPUs to the container by using either of the following methods:

Specify the --gpus argument, such as --gpus="device=1".
Set the NVIDIA_VISIBLE_DEVICES environment variable, such as -e NVIDIA_VISIBLE_DEVICES=1.

Run the nvidia-smi -L command to list the device IDs to specify in the argument or environment variable:

GPU 0: Tesla H100 (UUID: GPU-a1111111-aaaa-bbbb-cccc-dddddddddddd)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-e2222222-ffff-gggg-hhhh-iiiiiiiiiiii)

Refer to GPU Enumeration in the NVIDIA Container Toolkit documentation for more information.
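As a minimal sketch of picking device IDs from the nvidia-smi -L listing, the following script filters out GPUs by name. The helper name select_gpu_ids is hypothetical, and treating any GPU whose name contains "GeForce" as a display GPU is an assumption made only for illustration; adjust the filter to match your hosts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `nvidia-smi -L` style lines on stdin and prints
# the indices of GPUs whose name does not contain "GeForce", comma-separated.
# Filtering on the name is an assumption for illustration, not an official rule.
select_gpu_ids() {
  grep -v 'GeForce' \
    | sed -n 's/^GPU \([0-9][0-9]*\):.*/\1/p' \
    | paste -sd, -
}

# Example using the sample listing from this section (no GPU required).
sample='GPU 0: Tesla H100 (UUID: GPU-a1111111-aaaa-bbbb-cccc-dddddddddddd)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-e2222222-ffff-gggg-hhhh-iiiiiiiiiiii)'
ids="$(printf '%s\n' "$sample" | select_gpu_ids)"
printf '%s\n' "NVIDIA_VISIBLE_DEVICES=${ids}"
```

On a GPU host, you could pipe the real nvidia-smi -L output into the helper and pass the result to docker run with the -e NVIDIA_VISIBLE_DEVICES argument described above.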

Engine Count

The NIM_ENGINE_COUNT environment variable controls the number of reranking inference engines that run inside the NIM. The current runtime default is 1.

For the maximum compatibility profile across supported GPUs, keep or set NIM_ENGINE_COUNT=1. For maximum performance on GPU SKUs with at least 80 GB of VRAM, set NIM_ENGINE_COUNT=2. Increasing the engine count can improve throughput, but also increases VRAM usage.

Note

Use 1 to pin the maximum compatibility profile, or 2 for maximum performance on GPU SKUs with at least 80 GB of VRAM.

For Docker deployments, include -e NIM_ENGINE_COUNT=1 or -e NIM_ENGINE_COUNT=2 in the Docker run command. For details, refer to Get Started With NVIDIA NeMo Retriever Reranking NIM.

For Helm chart deployments, set envVars.NIM_ENGINE_COUNT in the Helm values. For details, refer to Reranking Engine Count.
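The guidance above can be sketched as a small selection step that runs before docker run. The helper name pick_engine_count and the 80000 MiB cutoff are assumptions for illustration: nvidia-smi reports memory in MiB, and nominal 80 GB GPUs report slightly different totals, so tune the threshold for your hardware.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose NIM_ENGINE_COUNT from a GPU's total memory in MiB.
# The 80000 MiB cutoff is an assumed stand-in for "at least 80 GB of VRAM".
pick_engine_count() {
  local mem_mib="$1"
  local threshold_mib="${2:-80000}"
  if [ "$mem_mib" -ge "$threshold_mib" ]; then
    echo 2
  else
    echo 1
  fi
}

# On a GPU host, the total could be read with a query such as:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i 0
# A fixed example value is used here so the sketch runs anywhere.
mem_mib=81920
printf '%s\n' "-e NIM_ENGINE_COUNT=$(pick_engine_count "$mem_mib")"
```

Falling back to 1 whenever the GPU is below the cutoff keeps the maximum compatibility profile as the default, which matches the current runtime default.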

Memory Footprint

The memory footprint of the NVIDIA NeMo Retriever Reranking NIM depends on the model backend and loaded profiles.

Memory is pre-allocated based on the maximum input shapes. Refer to the support matrix for the approximate memory footprint by compute capability.

PID Limit

In certain deployment or container runtime environments, default process and thread limits (PID limits) can interfere with NIM startup. These limits are set by Docker, Podman, Kubernetes, or the operating system.

If the PID limit is too low, you might see symptoms such as:

NIM starts up partially, but fails to reach ready state, and then stalls.
NIM starts up partially, but fails to reach ready state, and then crashes.
NIM serves a small number of requests, and then fails.

To verify that PID limits are impacting the NIM container, you can remove or adjust the PID limit at the container, node, and operating system level. Removing the PID limit and then checking for success is a useful diagnostic step.

To remove the PID limit in a docker run command, set --pids-limit=-1, which means unlimited. For details, see docker container run.
To remove the PID limit in a podman run command, set --pids-limit=-1, which means unlimited. For details, see Podman pids-limit.
To increase the PID limit in Kubernetes, set the PodPidsLimit on the kubelet on each node. For details, see your Kubernetes distribution-specific documentation.
To increase the PID limit at the operating system level, see your OS-specific documentation.
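Before changing any limit, it can help to see which limit is currently in effect. The following is a minimal diagnostic sketch that you could run inside the container; the helper name describe_pid_limit is hypothetical, and the two cgroup file locations are the usual cgroup v2 and v1 paths, which might differ depending on the runtime and host configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: interpret the contents of a cgroup pids.max file.
describe_pid_limit() {
  case "$1" in
    max) echo "unlimited" ;;
    ''|*[!0-9]*) echo "unknown" ;;
    *) echo "limited to $1" ;;
  esac
}

# Check the usual cgroup v2 and v1 locations (assumed paths; they can differ
# by runtime and host configuration). Files that are missing are skipped.
for f in /sys/fs/cgroup/pids.max /sys/fs/cgroup/pids/pids.max; do
  if [ -r "$f" ]; then
    echo "$f: $(describe_pid_limit "$(cat "$f")")"
  fi
done
```

A small numeric value here, combined with the symptoms listed earlier, suggests that the PID limit is worth raising or removing as a diagnostic step.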

Shared Memory Flag

Tokenization uses capabilities that scale with the number of available CPU cores. You might need to increase the shared memory that is available to the microservice container.

The following example provides 1 GiB of shared memory:

docker run ... --shm-size=1g ...

Volumes

The following entries identify the paths that are used in the container. Use this information to plan the local paths to bind mount into the container.

Container Path: /opt/cache

Description: Specifies the path, relative to the root of the container, for runtime artifacts such as compiled CUDA and cuDNN plan cache files.

The typical use for this path is to bind mount a directory on the host with this path inside the container. For example, to use ~/.cache/nim/cache on the host, run mkdir -p ~/.cache/nim/cache before you start the container. When you start the container, specify the -v ~/.cache/nim/cache:/opt/cache -u $(id -u) arguments to the docker run command.

If you do not specify a bind or volume mount, as shown in the -v argument in the preceding command, the container regenerates runtime artifacts when the container starts with an empty cache.

The -u $(id -u) argument runs the container with your user ID to avoid file system permission issues and errors.

Example: -v ~/.cache/nim/cache:/opt/cache -u $(id -u)

Container Path: /model

Description: Specifies the root path, relative to the root of the container, for model weights.

The default NIM_MODEL_PATH is /model/embed for the embedding NIM and /model/rerank for the reranking NIM. The typical use for this path is to bind mount a directory on the host with this path inside the container. For example, to use ~/.cache/nim/weights on the host, run mkdir -p ~/.cache/nim/weights before you start the container. When you start the container, specify the -v ~/.cache/nim/weights:/model -u $(id -u) arguments to the docker run command.

If you do not specify a bind or volume mount, as shown in the -v argument in the preceding command, the container downloads model weights into the container filesystem when the model is not already present.

Example: -v ~/.cache/nim/weights:/model -u $(id -u)
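The two mounts can be prepared together. The following is a minimal sketch, assuming a hypothetical helper named nim_volume_args that creates both host directories under a base directory and prints the matching docker run arguments. It does not start a container, and it does not handle base paths that contain spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the host cache and weights directories under a
# base directory and print the matching docker run arguments.
# With no argument, the base defaults to ~/.cache/nim as used in this section.
nim_volume_args() {
  local base="${1:-$HOME/.cache/nim}"
  mkdir -p "$base/cache" "$base/weights" || return 1
  printf '%s\n' "-v $base/cache:/opt/cache -v $base/weights:/model -u $(id -u)"
}

# Demonstration against a temporary directory so the sketch has no lasting
# side effects; call nim_volume_args without arguments for the default paths.
demo_base="$(mktemp -d)"
nim_volume_args "$demo_base"
```

Creating the directories before the container starts means they are owned by your user, which is what the -u $(id -u) argument relies on to avoid permission errors.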