Get Started With NVIDIA NeMo Retriever Embedding NIM

This documentation helps you get started with NVIDIA NeMo Retriever Embedding NIM.

Prerequisites

Before you get started, complete the following prerequisites:

  • Verify that you have supported hardware and software. For details, refer to the support matrix.

  • If you are running on an RTX AI PC or Workstation, install WSL2. For instructions, refer to NIM on WSL2 documentation.

  • Create an account on NVIDIA NGC and generate an API key to access the NIM container images and model assets on NGC. For instructions, refer to the NGC Authentication section that follows.

Note

Deploying on Kubernetes is not supported for WSL.

NGC Authentication

Generate your API key

To access the NIM container images and model assets on NGC, you must generate a personal API key. To create your key, go to https://org.ngc.nvidia.com/setup/api-keys.

When you create your key, for Services Included, select the following:

  • NGC Catalog

  • Private Registry (if you are an Early Access participant)

You can include more services if you are going to use this key for other purposes. For more information, refer to the NGC User Guide.

Export the API key

To conveniently use your API key in the commands in the following sections, you can export your key as an environment variable named NGC_API_KEY. For example, run the following code in your terminal.

export NGC_API_KEY=<your API key value>

Run one of the following commands to make your key available when you start a new terminal session.

# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc
# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc

Note

Other, more secure options include saving your API key value in a file (retrieve it by using cat $NGC_API_KEY_FILE), or saving your key in a password manager.
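
If you script against the NIM from Python, the file-based option in the preceding note might look like the following minimal sketch. The helper name load_ngc_api_key is hypothetical, and the lookup order (the NGC_API_KEY variable first, then the file named by NGC_API_KEY_FILE) is an assumption made for illustration, not documented NIM behavior.

```python
import os
from pathlib import Path


def load_ngc_api_key(env=os.environ):
    """Return the key from NGC_API_KEY, or from the file named by NGC_API_KEY_FILE."""
    # Assumption: an exported key takes precedence over the key file.
    key = env.get("NGC_API_KEY", "").strip()
    if key:
        return key
    key_file = env.get("NGC_API_KEY_FILE")
    if key_file:
        # Strip the trailing newline that most editors append to the file.
        return Path(key_file).expanduser().read_text(encoding="utf-8").strip()
    raise RuntimeError("Set NGC_API_KEY or NGC_API_KEY_FILE before continuing.")
```

Restricting the key file to your own user, for example with chmod 600, keeps the value out of your shell startup files and away from other local accounts.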

Docker Login to NGC

Before you can pull the NIM container image from NGC, first authenticate to NGC by using the following command.

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Use $oauthtoken as the username and the value of NGC_API_KEY as the password. The $oauthtoken username is a literal string, not a shell variable, which is why the command wraps it in single quotes. This special name indicates that you authenticate with an API key and not a user name and password.
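
If you drive the login from a Python script rather than a shell, the same command can be issued as in the following sketch. The docker_login helper and its injectable run parameter are hypothetical names used only for illustration. The Docker arguments are the ones from the command above.

```python
import subprocess

# The username is the literal string "$oauthtoken". Passing it in an argument
# list (no shell involved) means nothing tries to expand it as a variable.
LOGIN_COMMAND = [
    "docker", "login", "nvcr.io",
    "--username", "$oauthtoken",
    "--password-stdin",
]


def docker_login(api_key, run=subprocess.run):
    """Send the API key to `docker login` on stdin, like the shell pipeline above."""
    # `run` is injectable only so the call can be exercised without Docker installed.
    return run(LOGIN_COMMAND, input=api_key, text=True, check=True)
```

Passing the key on stdin keeps it out of the process argument list, which is the same reason the shell command uses --password-stdin.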

Accept the License Terms

Some NIM models require that you accept the license terms on NGC before you can pull the container image and model assets. To accept the license terms, browse to the model or container page in the NGC Catalog, read the terms, and then click Accept Terms.

Launch the NIM

The following commands launch a Docker container for the llama-nemotron-embed-vl-1b-v2 model. For Docker versions >= 19.03, the --runtime=nvidia option has the same effect as the --gpus all option.

Starting in version 2.0.0, the NIM supports two model download providers, which you select with the NIM_MODEL_DOWNLOAD_PROVIDER environment variable. The default is hf (Hugging Face); set the variable to ngc to download from NGC instead. Pick the flow that matches your environment.

# Choose a container name for bookkeeping
export NIM_MODEL_NAME=nvidia/llama-nemotron-embed-vl-1b-v2
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/$CONTAINER_NAME:2.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE/cache" "$LOCAL_NIM_CACHE/weights"

The runtime uses separate directories for cached artifacts and model weights. Mount the artifact cache at /opt/cache and the model weights cache at /model.

The NIM_MODEL_PATH environment variable controls where model weights are downloaded inside the container. For standard deployment, leave NIM_MODEL_PATH unset; the embedding NIM downloads weights to /model/embed, and the reranking NIM downloads weights to /model/rerank. For fine-tuned model deployment, mount the local model weights into /model and set NIM_MODEL_PATH to the mounted path.

Hugging Face provider (default)

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e HF_TOKEN \
  -v "$LOCAL_NIM_CACHE/cache:/opt/cache" \
  -v "$LOCAL_NIM_CACHE/weights:/model" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

NGC provider

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_DOWNLOAD_PROVIDER=ngc \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE/cache:/opt/cache" \
  -v "$LOCAL_NIM_CACHE/weights:/model" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

| Flags | Description |
| --- | --- |
| -it | --interactive + --tty (see Docker docs). |
| --rm | Delete the container after it stops. |
| --name=llama-nemotron-embed-vl-1b-v2 | Give a name to the NIM container for bookkeeping. |
| --runtime=nvidia | Ensure NVIDIA drivers are accessible in the container. |
| --gpus all | Expose all NVIDIA GPUs inside the container. |
| --shm-size=16GB | Allocate host memory for multi-GPU communication. Not required for single-GPU models or GPUs with NVLink. |
| -e HF_TOKEN | Hugging Face access token. Required when using the default hf download provider. |
| -e NGC_API_KEY | NGC API key. Required when using the ngc download provider. |
| -e NIM_MODEL_DOWNLOAD_PROVIDER=ngc | Force the NIM to download model artifacts from NGC instead of Hugging Face. |
| -v "$LOCAL_NIM_CACHE/cache:/opt/cache" | Mount a cache directory so downloaded artifacts can be reused by follow-up runs. |
| -v "$LOCAL_NIM_CACHE/weights:/model" | Mount a model weights cache directory. Leave NIM_MODEL_PATH unset to use the default model path for the NIM. |
| -u $(id -u) | Use the same user as your system user inside the container to avoid permission mismatches on the cache directory. |
| -p 8000:8000 | Publish the NIM HTTP port. |
| $IMG_NAME | NIM container image from NGC. |
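
The two launch commands differ only in their -e options. The following sketch assembles the argument list for either provider so that the difference is explicit. The build_docker_run helper is a hypothetical name, and the flags are exactly the ones used in the commands above. It is an illustration, not an official launcher.

```python
import os


def build_docker_run(image, container_name, cache_root, provider="hf", uid=None):
    """Return the `docker run` argument list for the chosen download provider."""
    if provider not in ("hf", "ngc"):
        raise ValueError("provider must be 'hf' or 'ngc'")
    # Default to the current user, like `-u $(id -u)` (os.getuid is Unix-only).
    uid = os.getuid() if uid is None else uid
    args = [
        "docker", "run", "-it", "--rm", f"--name={container_name}",
        "--runtime=nvidia", "--gpus", "all", "--shm-size=16GB",
    ]
    if provider == "ngc":
        # The NGC flow switches the provider and forwards the NGC API key.
        args += ["-e", "NIM_MODEL_DOWNLOAD_PROVIDER=ngc", "-e", "NGC_API_KEY"]
    else:
        # The default Hugging Face flow only forwards the access token.
        args += ["-e", "HF_TOKEN"]
    args += [
        "-v", f"{cache_root}/cache:/opt/cache",
        "-v", f"{cache_root}/weights:/model",
        "-u", str(uid),
        "-p", "8000:8000",
        image,
    ]
    return args
```

To inspect the resulting command before running it, print shlex.join(build_docker_run(...)) from the Python standard library.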

GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are not supported.

Verify the NIM is Running

After you launch the NIM, it might take a few seconds for the service to be ready to accept requests.

To verify that the service is ready, run the following code.

curl -X 'GET' 'http://localhost:8000/v1/health/ready'

If the service is ready, you should see a response similar to the following.

{"object":"health.response","message":"ready","ready":true}
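
In a script, you can poll the readiness endpoint instead of checking it by hand. The following sketch uses only the Python standard library. The function names, the 300-second timeout, and the 2-second polling interval are assumptions chosen for illustration. Adjust them for your environment, because the first start also downloads model weights.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_ready(url):
    """Return True if the readiness endpoint answers with "ready": true."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            body = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, error status, or a non-JSON body: not ready yet.
        return False
    return isinstance(body, dict) and body.get("ready") is True


def wait_until_ready(url="http://localhost:8000/v1/health/ready",
                     timeout_s=300.0, interval_s=2.0,
                     check=fetch_ready, sleep=time.sleep):
    """Poll `check(url)` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Call wait_until_ready() after you start the container, and send inference requests only when it returns True.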

Run Inference

To generate an embedding for a text query, run the following code.

curl -X "POST" \
  "http://localhost:8000/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is NVIDIA?"],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "query",
    "modality": "text",
    "embedding_type": "float",
    "encoding_format": "float"
}'

The response contains one embedding per input. The following response is shortened for readability.

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.012519836, -0.0126571655, 0.0101623535]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7
  }
}
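
The same request can be sent from Python. The following sketch mirrors the curl example with the standard library and returns the vectors ordered by the index field of the response. The helper names (build_payload, extract_embeddings, embed) are hypothetical. The endpoint, model name, and request fields are the ones shown above.

```python
import json
import urllib.request

EMBEDDINGS_URL = "http://localhost:8000/v1/embeddings"
MODEL = "nvidia/llama-nemotron-embed-vl-1b-v2"


def build_payload(texts, input_type="query"):
    """Build the request body used by the text example above."""
    return {
        "input": list(texts),
        "model": MODEL,
        "input_type": input_type,
        "modality": "text",
        "embedding_type": "float",
        "encoding_format": "float",
    }


def extract_embeddings(response):
    """Return the embedding vectors ordered by their "index" field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, input_type="query", url=EMBEDDINGS_URL):
    """POST the payload to a running NIM and return one vector per input."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(texts, input_type)).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return extract_embeddings(json.load(resp))
```

With the NIM running, embed(["What is NVIDIA?"]) returns a list that contains one vector, which matches the one-embedding-per-input shape of the response above.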

To generate an embedding for an image document, run the following code. The code creates a small RGB PNG locally, writes a request body, and sends it to the NIM. The request uses the same model as the text query example, but with input_type set to passage and modality set to image. The image is base64-encoded and included in a data URL in the input array.

python3 - <<'PY'
import base64
import json
import struct
import zlib

width = height = 224
rows = []
for y in range(height):
    row = bytearray([0])
    for x in range(width):
        row.extend((x % 256, y % 256, 128))
    rows.append(bytes(row))

def chunk(kind, data):
    return (
        struct.pack(">I", len(data))
        + kind
        + data
        + struct.pack(">I", zlib.crc32(kind + data) & 0xFFFFFFFF)
    )

png = (
    b"\x89PNG\r\n\x1a\n"
    + chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0))
    + chunk(b"IDAT", zlib.compress(b"".join(rows), 9))
    + chunk(b"IEND", b"")
)

payload = {
    "input": ["data:image/png;base64," + base64.b64encode(png).decode()],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "passage",
    "modality": "image",
    "embedding_type": "float",
    "encoding_format": "float",
}

with open("payload-image.json", "w", encoding="utf-8") as f:
    json.dump(payload, f)
PY

curl -X "POST" \
  "http://localhost:8000/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d @payload-image.json
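
To embed an image that already exists on disk instead of a generated one, you can build the data URL from the file. The following sketch is an illustration under stated assumptions: the helper names are hypothetical, and the example above demonstrates only PNG input, so treat support for any other image format as unverified.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode an image file as a base64 data URL, like the generated PNG above."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_payload(paths, model="nvidia/llama-nemotron-embed-vl-1b-v2"):
    """Build the request body for one or more image documents."""
    return {
        "input": [image_to_data_url(p) for p in paths],
        "model": model,
        "input_type": "passage",
        "modality": "image",
        "embedding_type": "float",
        "encoding_format": "float",
    }
```

Write the result of image_payload to a JSON file with json.dump, and then send it with the same curl command that is shown above.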

For more information, see the Reference.

Deploy on Multiple GPUs

Support for multi-GPU deployment is not included in the 2.0.0 release of the embedding and reranking NIMs.

Deploy Alongside Other NIMs on the Same GPU

You can deploy NeMo Retriever Embedding NIM alongside another NIM (for example, llama3-8b-instruct) on the same GPU (for example, A100 80GB, A100 40GB, or H100 80GB). For more information about deployment, see Get Started with NIM LLM, NIM Operator, GPU Operator with MIG, and Time-Slicing GPUs in Kubernetes.

Use the docker run --gpus command-line argument to specify the same GPU as shown in the following code.

docker run --gpus '"device=1"' ... $IMG_NAME
docker run --gpus '"device=1"' ... <your-llm-image, such as nvcr.io/nim/meta/llama-3.1-8b-instruct:2.0.5>

Download NIM Models to Cache

Starting in version 2.0.0, the NIM selects kernels automatically at startup based on the GPU’s compute capability, and model weights download on the first run. There is no list-model-profiles step.

To pre-fetch model weights, for example to stage them for an air-gapped deployment, start the container one time with separate directories mounted for model weights and runtime cache. The NIM downloads model weights to the default path under /model and writes runtime artifacts to /opt/cache. You can stop the container after the weights are present, or you can start the container with NIM_PRECOMPILE_ONLY=1 so that it exits after artifacts are compiled and before the server starts.

# Choose a container name for bookkeeping
export NIM_MODEL_NAME=nvidia/llama-nemotron-embed-vl-1b-v2
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/$CONTAINER_NAME:2.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE/cache" "$LOCAL_NIM_CACHE/weights"

# Start the NIM with the artifact and model weights caches mounted.
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia --gpus all --shm-size=16GB \
  -e HF_TOKEN \
  -v "$LOCAL_NIM_CACHE/cache:/opt/cache" \
  -v "$LOCAL_NIM_CACHE/weights:/model" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

# After the weights are present in $LOCAL_NIM_CACHE/weights/embed and artifacts are present in $LOCAL_NIM_CACHE/cache, stop the container.
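
Before you stop the container, you can check that both host directories are populated. The following sketch only tests whether any files exist under the two paths named in the comment above. The staged_artifacts helper is a hypothetical name, and the presence of files does not prove that a download is complete.

```python
from pathlib import Path


def staged_artifacts(cache_root):
    """Report which of the two mounted host directories contain files."""
    root = Path(cache_root).expanduser()

    def has_files(directory):
        # True if the directory exists and holds at least one regular file.
        return directory.is_dir() and any(p.is_file() for p in directory.rglob("*"))

    return {
        "weights": has_files(root / "weights" / "embed"),
        "cache": has_files(root / "cache"),
    }
```

For example, staged_artifacts("~/.cache/nim") returns a dictionary with weights and cache keys that you can inspect before you copy the directories to the target host.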

For a fully air-gapped deployment, download the model weights and runtime cache on a connected computer, stage them at paths on the target host, and mount those directories at runtime. Mount the staged model weights at the path specified by NIM_MODEL_PATH, and mount the runtime cache separately at /opt/cache:

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia --gpus all --shm-size=16GB \
  --network=none \
  -e NIM_MODEL_PATH=/model/embed \
  -v /path/to/staged/weights:/model/embed:ro \
  -v /path/to/runtime/cache:/opt/cache \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Stop the Container

To stop the Docker container, run the following code.

docker stop $CONTAINER_NAME

To remove the Docker container, run the following code. If you included the --rm flag when you started the container, you don’t need this step.

docker rm $CONTAINER_NAME