Get Started With NVIDIA NeMo Retriever Embedding NIM#

This documentation helps you get started with NVIDIA NeMo Retriever Embedding NIM.

Prerequisites#

Before you get started, complete the following prerequisites:

Verify that you have supported hardware and software. For details, refer to the support matrix.
If you are running on an RTX AI PC or Workstation, install WSL2. For instructions, refer to the NIM on WSL2 documentation.
Create an account on NVIDIA NGC and generate an API key to access the NIM container images and model assets on NGC. For instructions, refer to the NGC Authentication section that follows.

Note

Deploying on Kubernetes is not supported for WSL.

NGC Authentication#

Generate your API key#

To access the NIM container images and model assets on NGC, you must generate a personal API key. To create your key, go to https://org.ngc.nvidia.com/setup/api-keys.

When you create your key, for Services Included, select the following:

NGC Catalog
Private Registry (if you are an Early Access participant)

You can include more services if you are going to use this key for other purposes. For more information, refer to the NGC User Guide.

Export the API key#

To conveniently use your API key in the commands in the following sections, you can export your key as an environment variable named NGC_API_KEY. For example, run the following code in your terminal.

export NGC_API_KEY=<your API key value>

Run one of the following commands to make your key available when you start a new terminal session.

# If using bash
echo "export NGC_API_KEY=<value>" >> ~/.bashrc

# If using zsh
echo "export NGC_API_KEY=<value>" >> ~/.zshrc

Note

Other, more secure options include saving your API key value in a file (retrieve it by using cat $NGC_API_KEY_FILE), or saving your key in a password manager.
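
If a script needs the key, the file-based option in the note above can be sketched in Python as follows. This is a minimal illustration, not part of the NIM tooling: the helper name load_ngc_api_key is hypothetical, and the lookup order (NGC_API_KEY first, then the file named by NGC_API_KEY_FILE) is an assumption.

```python
import os
from pathlib import Path


def load_ngc_api_key(env=None):
    """Return the NGC API key from NGC_API_KEY, or from the file named by NGC_API_KEY_FILE."""
    env = os.environ if env is None else env

    # Prefer the environment variable when it is set and non-empty (assumed precedence).
    key = env.get("NGC_API_KEY", "").strip()
    if key:
        return key

    # Otherwise read the key from a file; the file is assumed to contain only the key.
    key_file = env.get("NGC_API_KEY_FILE", "").strip()
    if key_file:
        return Path(key_file).expanduser().read_text(encoding="utf-8").strip()

    raise RuntimeError("Set NGC_API_KEY or NGC_API_KEY_FILE before continuing")
```

Keeping the key in a file that only your user can read avoids writing the key itself into shell startup files.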

Docker Login to NGC#

Before you can pull the NIM container image from NGC, first authenticate to NGC by using the following command.

echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

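Use the literal string $oauthtoken as the username and the value of NGC_API_KEY as the password. The single quotes around $oauthtoken in the command keep the shell from expanding it as a variable. This special username indicates that you are authenticating with an API key rather than a user name and password.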
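
When the login step runs from a script rather than an interactive shell, the same command can be issued from Python. The following is a sketch under assumptions: the function names are hypothetical, and it relies only on the docker login options shown above.

```python
import subprocess


def docker_login_command(registry="nvcr.io"):
    """Build the argument list for the docker login command shown above."""
    # No shell is involved, so "$oauthtoken" reaches Docker as a literal string.
    return ["docker", "login", registry, "--username", "$oauthtoken", "--password-stdin"]


def docker_login(api_key):
    """Run docker login, sending the key on stdin so it stays out of the command line."""
    return subprocess.run(docker_login_command(), input=api_key, text=True, check=True)
```

Calling docker_login(key) raises subprocess.CalledProcessError if Docker rejects the credentials.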

Accept the License Terms#

Some NIM models require that you accept the license terms on NGC before you can pull the container image and model assets. To accept the license terms, browse to the model or container page on the NGC Catalog, read the terms, and then click Accept Terms.

Launch the NIM#

The following commands launch a Docker container for the llama-nemotron-embed-vl-1b-v2 model. For Docker versions >= 19.03, the --runtime=nvidia option has the same effect as the --gpus all option.

Starting in version 2.0.0, the NIM supports two model download providers, which you select with the NIM_MODEL_DOWNLOAD_PROVIDER environment variable. The default is hf (Hugging Face); set it to ngc to download from NGC instead. Pick the flow that matches your environment.

# Choose a container name for bookkeeping
export NIM_MODEL_NAME=nvidia/llama-nemotron-embed-vl-1b-v2
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/$CONTAINER_NAME:2.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE/cache" "$LOCAL_NIM_CACHE/weights"

The runtime uses separate directories for cached artifacts and model weights. Mount the artifact cache at /opt/cache and the model weights cache at /model. The NIM_MODEL_PATH environment variable controls where model weights are downloaded inside the container. For standard deployment, leave NIM_MODEL_PATH unset; the embedding NIM downloads weights to /model/embed, and the reranking NIM downloads weights to /model/rerank. For fine-tuned model deployment, mount the local model weights into /model and set NIM_MODEL_PATH to the mounted path.
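
The path rules in the preceding paragraph can be summarized in a short Python sketch. The function names and the "embedding"/"reranking" keys are hypothetical; the paths themselves come from the description above, combined with the weights mount used on this page.

```python
from pathlib import PurePosixPath

# Default subdirectory that each NIM uses under /model when NIM_MODEL_PATH is unset.
DEFAULT_SUBDIR = {"embedding": "embed", "reranking": "rerank"}


def container_weights_path(nim_kind, nim_model_path=None):
    """Return where weights are expected inside the container."""
    if nim_model_path:
        # Fine-tuned deployment: NIM_MODEL_PATH points at the mounted weights.
        return PurePosixPath(nim_model_path)
    return PurePosixPath("/model") / DEFAULT_SUBDIR[nim_kind]


def host_weights_path(local_nim_cache, nim_kind):
    """Return the matching host directory when <cache>/weights is mounted at /model."""
    return PurePosixPath(local_nim_cache) / "weights" / DEFAULT_SUBDIR[nim_kind]
```

For example, with LOCAL_NIM_CACHE set to /home/user/.cache/nim, the embedding NIM's default weights directory /model/embed corresponds to /home/user/.cache/nim/weights/embed on the host.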

Hugging Face provider (default)#

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e HF_TOKEN \
  -v "$LOCAL_NIM_CACHE/cache:/opt/cache" \
  -v "$LOCAL_NIM_CACHE/weights:/model" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

NGC provider#

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NIM_MODEL_DOWNLOAD_PROVIDER=ngc \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE/cache:/opt/cache" \
  -v "$LOCAL_NIM_CACHE/weights:/model" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Flags	Description
`-it`	`--interactive` + `--tty` (see Docker docs).
`--rm`	Delete the container after it stops.
`--name=llama-nemotron-embed-vl-1b-v2`	Give a name to the NIM container for bookkeeping.
`--runtime=nvidia`	Ensure NVIDIA drivers are accessible in the container.
`--gpus all`	Expose all NVIDIA GPUs inside the container.
`--shm-size=16GB`	Allocate host memory for multi-GPU communication. Not required for single-GPU models or GPUs with NVLink.
`-e HF_TOKEN`	Hugging Face access token. Required when using the default `hf` download provider.
`-e NGC_API_KEY`	NGC API key. Required when using the `ngc` download provider.
`-e NIM_MODEL_DOWNLOAD_PROVIDER=ngc`	Force the NIM to download model artifacts from NGC instead of Hugging Face.
`-v "$LOCAL_NIM_CACHE/cache:/opt/cache"`	Mount a cache directory so downloaded artifacts can be reused by follow-up runs.
`-v "$LOCAL_NIM_CACHE/weights:/model"`	Mount a model weights cache directory. Leave `NIM_MODEL_PATH` unset to use the default model path for the NIM.
`-u $(id -u)`	Use the same user as your system user inside the container to avoid permission mismatches on the cache directory.
`-p 8000:8000`	Publish the NIM HTTP port.
`$IMG_NAME`	NIM container image from NGC.
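
The two launch commands differ only in their -e options. As an illustration, the following Python sketch assembles the argument list for either provider from the flags in the table. The function nim_run_args and its parameters are hypothetical and not part of the NIM; the uid default uses os.getuid(), which assumes a Linux or WSL2 host.

```python
import os
import shlex


def nim_run_args(image, cache_dir, name, provider="hf", uid=None):
    """Build the docker run arguments for the chosen model download provider."""
    if provider == "hf":
        # Default provider: pass the Hugging Face token through from the host environment.
        env = ["-e", "HF_TOKEN"]
    elif provider == "ngc":
        # NGC provider: select it explicitly and pass the NGC API key through.
        env = ["-e", "NIM_MODEL_DOWNLOAD_PROVIDER=ngc", "-e", "NGC_API_KEY"]
    else:
        raise ValueError(f"unknown provider: {provider!r}")

    uid = os.getuid() if uid is None else uid
    return [
        "docker", "run", "-it", "--rm", f"--name={name}",
        "--runtime=nvidia", "--gpus", "all", "--shm-size=16GB",
        *env,
        "-v", f"{cache_dir}/cache:/opt/cache",
        "-v", f"{cache_dir}/weights:/model",
        "-u", str(uid),
        "-p", "8000:8000",
        image,
    ]


def as_shell_command(args):
    """Render the argument list as a single shell-quoted command line."""
    return shlex.join(args)
```

Passing the list to subprocess.run starts the container; as_shell_command is useful for printing the command before you run it.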

GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are not supported.

Verify the NIM is Running#

After you launch the NIM, it might take a few seconds for the service to be ready to accept requests.

To verify that the service is ready, run the following code.

curl -X 'GET' 'http://localhost:8000/v1/health/ready'

If the service is ready, you should see a response similar to the following.

{"object":"health.response","message":"ready","ready":true}
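
Because the service is not ready immediately, a script that depends on it can poll the readiness endpoint instead of checking once. The following Python sketch shows one way to do that with the standard library. The function names, the number of attempts, and the delay are assumptions; the URL and the response shape come from the example above.

```python
import json
import time
import urllib.error
import urllib.request

READY_URL = "http://localhost:8000/v1/health/ready"


def is_ready_body(body):
    """Return True only if the response body is JSON with "ready": true."""
    try:
        return json.loads(body).get("ready") is True
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def fetch_ready(url=READY_URL, timeout=5):
    """Query the readiness endpoint once; treat connection errors as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_ready_body(resp.read())
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(check=fetch_ready, attempts=60, delay=2.0):
    """Call check() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False
```

Call wait_until_ready() after you start the container and send inference requests only when it returns True.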

Run Inference#

To generate an embedding for a text query, run the following code.

curl -X "POST" \
  "http://localhost:8000/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is NVIDIA?"],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "query",
    "modality": "text",
    "embedding_type": "float",
    "encoding_format": "float"
}'

The response contains one embedding per input. The following response is shortened for readability.

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.012519836, -0.0126571655, 0.0101623535]
    }
  ],
  "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7
  }
}
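
After you parse the response, a typical next step is to compare embeddings. The following Python sketch extracts the vectors from a response shaped like the preceding example and computes cosine similarity. The helper names are hypothetical, and the choice of cosine similarity is an assumption about how you intend to use the vectors.

```python
import math


def embeddings_by_index(response):
    """Return the embedding vectors ordered by their "index" field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Compute the cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Return 0.0 for a zero vector instead of dividing by zero.
    return dot / norm if norm else 0.0
```

To rank passages for a query, embed the passages with input_type set to passage, embed the query with input_type set to query, and sort the passages by their similarity to the query vector.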

To generate an embedding for an image document, run the following code. The code creates a small RGB PNG locally, writes a request body, and sends it to the NIM. The request uses the same model as the text query example, but with input_type set to passage and modality set to image. The image is base64-encoded and included in a data URL in the input array.

python3 - <<'PY'
import base64
import json
import struct
import zlib

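# Build raw scanlines for a 224x224 RGB gradient. Each PNG scanline starts
# with a filter-type byte; 0 means no filter.
width = height = 224
rows = []
for y in range(height):
    row = bytearray([0])
    for x in range(width):
        row.extend((x % 256, y % 256, 128))
    rows.append(bytes(row))

def chunk(kind, data):
    # A PNG chunk is: 4-byte big-endian length, 4-byte type, data, CRC-32 of type + data.
    return (
        struct.pack(">I", len(data))
        + kind
        + data
        + struct.pack(">I", zlib.crc32(kind + data) & 0xFFFFFFFF)
    )

# PNG signature, IHDR (8-bit depth, color type 2 = RGB), compressed pixel data, IEND.
png = (
    b"\x89PNG\r\n\x1a\n"
    + chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0))
    + chunk(b"IDAT", zlib.compress(b"".join(rows), 9))
    + chunk(b"IEND", b"")
)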

payload = {
    "input": ["data:image/png;base64," + base64.b64encode(png).decode()],
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "input_type": "passage",
    "modality": "image",
    "embedding_type": "float",
    "encoding_format": "float",
}

with open("payload-image.json", "w", encoding="utf-8") as f:
    json.dump(payload, f)
PY

curl -X "POST" \
  "http://localhost:8000/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d @payload-image.json
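
To embed an image that already exists on disk, you can build the same request body from a file. The following Python sketch does this for a single image. The function names are hypothetical, and the example on this page only demonstrates PNG input, so support for other image types is an assumption that you should verify against the Reference.

```python
import base64
import json
import mimetypes


def image_data_url(path):
    """Encode an image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_payload(path, model="nvidia/llama-nemotron-embed-vl-1b-v2"):
    """Build a request body with the same fields as the image example above."""
    return {
        "input": [image_data_url(path)],
        "model": model,
        "input_type": "passage",
        "modality": "image",
        "embedding_type": "float",
        "encoding_format": "float",
    }


def write_payload(path, out_path="payload-image.json"):
    """Write the request body to a file that curl can send with -d @<file>."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(image_payload(path), f)
```

After calling write_payload with the path to your image, send the resulting file with the same curl command.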

For more information, see the Reference.

Deploy on Multiple GPUs#

Support for multi-GPU deployment is not included in the 2.0.0 release of the embedding and reranking NIMs.

Deploy Alongside Other NIMs on the Same GPU#

You can deploy NeMo Retriever Embedding NIM alongside another NIM (for example, llama-3.1-8b-instruct) on the same GPU (for example, A100 80GB, A100 40GB, or H100 80GB). For more information about deployment, see Get Started with NIM LLM, NIM Operator, GPU Operator with MIG, and Time-Slicing GPUs in Kubernetes.

Use the docker run --gpus command-line argument to specify the same GPU as shown in the following code.

docker run --gpus '"device=1"' ... $IMG_NAME
docker run --gpus '"device=1"' ... <your-llm-image, such as nvcr.io/nim/meta/llama-3.1-8b-instruct:2.0.5>
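
The nested quoting in the --gpus value is easy to get wrong when you move the command into a script. The following Python sketch shows what the shell passes to Docker and how to build the same argument without a shell. The function name same_gpu_args is hypothetical.

```python
import shlex


def same_gpu_args(device_index):
    """Return the --gpus arguments that pin a container to one GPU."""
    # The double quotes are part of the value that Docker receives.
    return ["--gpus", f'"device={device_index}"']


# The shell strips the outer single quotes and keeps the inner double quotes.
shell_form = """--gpus '"device=1"'"""
parsed = shlex.split(shell_form)
```

Here parsed and same_gpu_args(1) hold the same two arguments. Use the same device index for both containers so that they share the GPU.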

Download NIM Models to Cache#

Starting in version 2.0.0, the NIM selects kernels automatically at startup based on the GPU’s compute capability, and model weights download on the first run. There is no list-model-profiles step.

To pre-fetch model weights, for example to stage them for an air-gapped deployment, run the container with NIM_ENGINE_MODEL_DOWNLOAD_ONLY=1 and mount the host weights directory to /model. The NIM downloads the model weights before CUDA initialization, exits after the download completes, and does not require the --gpus Docker argument for this download-only run.

The default model download provider is Hugging Face. Set HF_TOKEN for the default provider. To download from NGC instead, replace -e HF_TOKEN with -e NIM_MODEL_DOWNLOAD_PROVIDER=ngc -e NGC_API_KEY.

# Choose a container name for bookkeeping
export NIM_MODEL_NAME=nvidia/llama-nemotron-embed-vl-1b-v2
export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

# Choose a NIM Image from NGC
export IMG_NAME="nvcr.io/nim/nvidia/$CONTAINER_NAME:2.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE/cache" "$LOCAL_NIM_CACHE/weights"

# Download model weights without starting the server.
docker run -it --rm --name=$CONTAINER_NAME \
  -e NIM_ENGINE_MODEL_DOWNLOAD_ONLY=1 \
  -e HF_TOKEN \
  -v "$LOCAL_NIM_CACHE/weights:/model" \
  -u $(id -u) \
  $IMG_NAME

# After the command exits, model weights are present under $LOCAL_NIM_CACHE/weights/embed for the embedding NIM or $LOCAL_NIM_CACHE/weights/rerank for the reranking NIM.

For a fully air-gapped deployment, download the model weights on a connected computer, stage the weights root directory on the target host, and mount that directory at /model at runtime. The staged directory must contain the embed or rerank subdirectory created by the download-only command. Mount a separate writable runtime cache at /opt/cache. Do not pass HF_TOKEN or NGC_API_KEY on the air-gapped host.

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia --gpus all --shm-size=16GB \
  --network=none \
  -v /path/to/staged/weights:/model:ro \
  -v /path/to/runtime/cache:/opt/cache \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

For staging custom or fine-tuned model artifacts instead of the default model weights, refer to Custom Model Artifact Support in NVIDIA NeMo Retriever Embedding NIM.
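
Before you start the container on the air-gapped host, it can help to confirm that the staged weights directory has the layout that the air-gapped instructions require. The following Python sketch is one such check. The function names are hypothetical, and treating an empty subdirectory as missing is an assumption rather than documented NIM behavior.

```python
from pathlib import Path


def staged_subdirs(weights_root):
    """Return which of the expected subdirectories exist and are non-empty."""
    root = Path(weights_root)
    found = []
    for name in ("embed", "rerank"):
        candidate = root / name
        # Assumption: a usable staging directory contains at least one entry.
        if candidate.is_dir() and any(candidate.iterdir()):
            found.append(name)
    return found


def check_staged_weights(weights_root):
    """Raise if the staged directory has neither an embed nor a rerank subdirectory."""
    found = staged_subdirs(weights_root)
    if not found:
        raise FileNotFoundError(
            f"{weights_root} must contain a non-empty embed or rerank subdirectory"
        )
    return found
```

Run check_staged_weights on the directory that you plan to mount at /model.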

Stop the Container#

To stop the Docker container, run the following code.

docker stop $CONTAINER_NAME

To remove the Docker container, run the following code. If you included the --rm flag when you started the container, you don’t need this step.

docker rm $CONTAINER_NAME