Configure NeMo Retriever Text Embedding NIM#
NeMo Retriever Text Embedding NIM uses Docker containers under the hood. Each NIM is its own Docker container, and there are several ways to configure it. The remainder of this section describes the ways you can configure a NIM container.
GPU Selection#
The NIM container is GPU-accelerated and uses the NVIDIA Container Toolkit for access to GPUs on the host.
You can specify the `--gpus all` command-line argument to the `docker run` command if the host has one or more GPUs of the same model.
If the host has a combination of GPUs, such as an A6000 and a GeForce display GPU, run the container on compute-capable GPUs only.
Expose specific GPUs to the container by using either of the following methods:

- Specify the `--gpus` argument, such as `--gpus="device=1"`.
- Set the `NVIDIA_VISIBLE_DEVICES` environment variable, such as `-e NVIDIA_VISIBLE_DEVICES=1`.
Run the `nvidia-smi -L` command to list the device IDs to specify in the argument or environment variable:
GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
Refer to GPU Enumeration in the NVIDIA Container Toolkit documentation for more information.
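For example, a `docker run` command that exposes only device 1 to the container might look like the following sketch. The image name is a placeholder, and forwarding `NGC_API_KEY` and publishing port 8000 are assumed typical launch options rather than details stated above.

```bash
# Sketch: run the NIM on a single GPU selected by device ID (from `nvidia-smi -L`).
# IMG is a placeholder; substitute the Text Embedding NIM image and tag that you use.
IMG=nvcr.io/nim/nvidia/model-name:latest

docker run -it --rm \
  --gpus "device=1" \
  -e NGC_API_KEY \
  -p 8000:8000 \
  "$IMG"
```

Passing `-e NVIDIA_VISIBLE_DEVICES=1` instead of the `--gpus` argument selects the same device.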
Performance Mode#
The NeMo Retriever Text Embedding NIM can run in modes optimized for low latency or high throughput. The performance mode is controlled by the `NIM_TRITON_PERFORMANCE_MODE` environment variable.
When set to `latency` (the default), the NIM is optimized for minimum latency on short-sequence, low-batch-size workloads. In this mode the NIM performs well on all workloads, but it is not as performant as `throughput` mode at the highest batch sizes and longest sequence lengths.
When set to `throughput`, the NIM is optimized for maximum throughput on long-sequence, high-batch-size workloads. This mode can provide upwards of 2x the performance with large batches that contain 64-1000 sequences of long and diverse lengths, but it has higher latency when processing short sequences in small batches.
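For example, the following sketch starts the NIM in throughput-optimized mode. The image name is a placeholder and the remaining flags are assumed typical launch options.

```bash
# Sketch: opt into throughput-optimized mode (the default is latency-optimized).
IMG=nvcr.io/nim/nvidia/model-name:latest   # placeholder image name

docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_TRITON_PERFORMANCE_MODE=throughput \
  -p 8000:8000 \
  "$IMG"
```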
Memory Footprint#
You can configure the memory footprint of the NeMo Retriever Text Embedding NIM by adjusting the model’s maximum allowed batch size and sequence length.
For ONNX model profiles, memory is allocated dynamically according to incoming requests. The maximum batch size and maximum sequence length limit memory usage. You can specify a value for maximum batch size and sequence length from 1 up to the maximum supported limit for the given model and GPU.
For TensorRT model profiles, memory is allocated statically based on the optimized static inference graph, which has a defined maximum input shape. You must specify a value from a discrete set of options. Refer to the support matrix for the valid values and their approximate memory footprints.
By default, the NIM uses the largest possible value (given the model and GPU constraints) for both the maximum batch size and the maximum sequence length. If you specify only one of these parameters, the NIM uses the largest possible value for the unspecified parameter. For example, if you only specify a limit for maximum batch size, the NIM uses the largest possible sequence length.
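As a sketch, capping both limits could look like the following. The variable names `NIM_TRITON_MAX_BATCH_SIZE` and `NIM_TRITON_MAX_SEQ_LENGTH` and the values shown are illustrative assumptions; use the names and the discrete values documented in the support matrix for your model and GPU.

```bash
# Sketch: reduce the memory footprint by capping batch size and sequence length.
# The variable names and values below are assumptions for illustration only.
IMG=nvcr.io/nim/nvidia/model-name:latest   # placeholder image name

docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_TRITON_MAX_BATCH_SIZE=16 \
  -e NIM_TRITON_MAX_SEQ_LENGTH=512 \
  -p 8000:8000 \
  "$IMG"
```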
Environment Variables#
The following table identifies the environment variables that are used in the container.
Set environment variables with the `-e` command-line argument to the `docker run` command.
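For example, `-e NAME=value` sets an explicit value, and `-e NAME` with no value forwards the variable from the host environment. The sketch below assumes `NGC_API_KEY` is the variable that holds your NGC API key and uses a placeholder image name.

```bash
# Sketch: two ways to pass environment variables with -e.
# Export the variable on the host so that `-e NGC_API_KEY` can forward it unchanged.
export NGC_API_KEY="<your-ngc-api-key>"
IMG=nvcr.io/nim/nvidia/model-name:latest   # placeholder image name

docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_TRITON_PERFORMANCE_MODE=latency \
  -p 8000:8000 \
  "$IMG"
```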
| Name | Description | Default Value |
|---|---|---|
| | Set this variable to the value of your personal NGC API key. | None |
| | Specifies the fully qualified path, in the container, for downloaded models. | |
| | Specifies the network port number, in the container, for gRPC access to the microservice. | |
| | Specifies the network port number, in the container, for HTTP access to the microservice. Refer to Publishing ports in the Docker documentation for more information about host and container network ports. | |
| | Specifies the number of worker threads to start for HTTP requests. | |
| | Specifies the network port number, in the container, for NVIDIA Triton Inference Server. | |
| | When set to | |
| | When set to | |
| | Specifies the logging level. The microservice supports the following values: DEBUG, INFO, WARNING, ERROR, and CRITICAL. | |
| | Set to | |
| | Specifies the fully qualified path, in the container, for the model manifest YAML file. | |
| | Specifies the model profile ID to use with the container. By default, the container attempts to automatically match the host GPU model and GPU count with the optimal model profile. | None |
| | The number of model instances to deploy. | Unset (when set, this value overrides the hardware-specific configuration value) |
| | The number of tokenizer instances to use. | |
| | Specifies the model names used in the API. Specify multiple names in a comma-separated list. If you specify multiple names, the server responds to any of them. The name in the `model` field of a response is the first name in this list. By default, the model is inferred from the | None |
| | For the NVIDIA Triton Inference Server, sets the maximum queue delay time to allow other requests to join the dynamic batch. For more information, refer to the Triton User Guide. | |
| | When set to | |
| | Specifies the batch size for the underlying Triton instance. The value must be less than or equal to the maximum batch size that was used to compile the engine. If the model uses the | None |
| | Specifies the maximum batch size that the Triton server can process. By default, the NIM uses the maximum possible batch size for a given model and GPU. To decrease the memory footprint of the server, choose a smaller maximum batch size. Only discrete values are supported. Refer to the NIM's support matrix for valid values and their estimated memory footprint. | None |
| | Specifies the maximum sequence length that the Triton server can process. By default, the NIM uses the maximum possible sequence length for a given model and GPU. To decrease the memory footprint of the server, choose a smaller maximum sequence length. Only discrete values are supported. Refer to the NIM's support matrix for valid values and their estimated memory footprint. | None |
| `NIM_TRITON_PERFORMANCE_MODE` | Controls the performance mode of the NIM. When set to `latency` (the default), the NIM is optimized for low latency; when set to `throughput`, the NIM is optimized for maximum throughput. | `latency` |
| | Specifies the gRPC port number, in the container, for NVIDIA Triton Inference Server. | |
Volumes#
The following table identifies the paths that are used in the container. Use this information to plan the local paths to bind mount into the container.
| Container Path | Description | Example |
|---|---|---|
| | Specifies the path, relative to the root of the container, for downloaded models. The typical use for this path is to bind mount a directory on the host to this path inside the container so that downloaded models are reused across container starts. | |
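As a hedged example, the following bind mounts a host directory into the container so that downloaded models persist across container runs. The image name is a placeholder, and the container-side cache path `/opt/nim/.cache` and the `-u "$(id -u)"` option (to avoid file-permission problems in the mounted directory) are assumptions rather than values confirmed by this table.

```bash
# Sketch: persist downloaded models in a host directory via a bind mount.
# The container cache path below is an assumption; match it to your NIM cache path setting.
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
IMG=nvcr.io/nim/nvidia/model-name:latest   # placeholder image name

docker run -it --rm \
  --gpus all \
  -e NGC_API_KEY \
  -u "$(id -u)" \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  "$IMG"
```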