Support Matrix for NeMo Retriever Text Embedding NIM#

This documentation describes the software and hardware that NeMo Retriever Text Embedding NIM supports.

CPU#

Text Embedding NIM requires the following:

Models#

NeMo Retriever Text Embedding NIM supports the following models:

| Model Name | Model ID | Max Tokens | Publisher | Parameters (millions) | Embedding Dimension | Dynamic Embeddings Supported | Model Card |
|---|---|---|---|---|---|---|---|
| bge-large-zh-v1.5 | baai/bge-large-zh-v1.5 | 512 | BAAI | 325 | 1024 | no | Link |
| Llama-3.2-NV-EmbedQA-1B-v2 | nvidia/llama-3.2-nv-embedqa-1b-v2 | 8192 | NVIDIA | 1236 | 2048 | yes | Link |
| NV-EmbedQA-E5-v5 | nvidia/nv-embedqa-e5-v5 | 512 | NVIDIA | 335 | 1024 | no | Link |
| NV-EmbedQA-Mistral7B-v2 | nvidia/nv-embedqa-mistral-7b-v2 | 512 | NVIDIA | 7110 | 4096 | no | Link |
| Snowflake’s Arctic-embed-l | snowflake/arctic-embed-l | 512 | Snowflake | 335 | 1024 | no | Link |
| bge-m3 | baai/bge-m3 | 8192 | BAAI | 560 | 1024 | no | Link |
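
For reference, the following sketch shows how a model ID from the table is used in a request to a running Text Embedding NIM. It assumes the container is already started on the default port 8000 and exposes the OpenAI-compatible /v1/embeddings endpoint described in the Getting Started guide; adjust the host, port, and model ID to match your deployment.

# Request an embedding for a short query (assumes a NIM listening on localhost:8000
# and serving the nvidia/llama-3.2-nv-embedqa-1b-v2 model from the table above)
curl -X POST "http://localhost:8000/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
        "input": ["What GPU memory does this model need?"],
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "input_type": "query"
      }'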

Supported Hardware#

bge-large-zh-v1.5#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| H20 | 96 | FP16 |
| L20 | 48 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values in the following table are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 10 | FP16 | 8.1 |
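
To compare your hardware against the tables in this section, you can list each GPU’s name and total memory with nvidia-smi. This is only a quick check; the optimized profiles still require one of the listed GPU models.

# List the name and total memory of each installed GPU
nvidia-smi --query-gpu=name,memory.total --format=csv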

bge-m3#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| L40s | 48 | FP16 |
| A10G | 24 | FP16 |
| L20 | 48 | FP16 |
| H20 | 96 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 33 | FP16 | 8.8 |
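
When no single GPU has enough memory for a non-optimized profile, the container can be given several homogeneous GPUs through Docker’s --gpus flag so that their aggregate memory is available. The following sketch shows only the GPU selection; <nim-image> is a placeholder for the Text Embedding NIM container image, and the remaining options (API key, port mapping, model cache) should follow the Getting Started guide.

# Expose two homogeneous GPUs (device indices 0 and 1) to the container
docker run --rm --gpus '"device=0,1"' <nim-image>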

Llama-3.2-NV-EmbedQA-1B-v2#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 40 & 80 | FP16 |
| H100 HBM3 | 80 | FP16 & FP8 |
| H100 NVL | 80 | FP16 & FP8 |
| L40s | 48 | FP16 & FP8 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 & FP8 |
| GeForce RTX 4090 (Beta) | 24 | FP16 |
| NVIDIA RTX 6000 Ada Generation (Beta) | 48 | FP16 |
| GeForce RTX 5080 (Beta) | 16 | FP16 |
| GeForce RTX 5090 (Beta) | 32 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space | Max Tokens |
|---|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 3.6 | FP16 | 9 | 4096 |

Warning

The maximum token length for the non-optimized configuration (4096) is smaller than for the other profiles (8192).

NV-EmbedQA-E5-v5#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 40 & 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| H100 NVL | 80 | FP16 |
| L40s | 48 | FP16 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 2 | FP16 | 8.5 |

NV-EmbedQA-Mistral7B-v2#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP8 |
| H100 HBM3 | 80 | FP16 |
| L40s | 48 | FP8 |
| L40s | 48 | FP16 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 16 | FP16 | 30 |

Snowflake’s Arctic-embed-l#

| GPU | GPU Memory (GB) | Precision |
|---|---|---|
| A100 SXM4 | 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| L40s | 48 | FP16 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |

Non-optimized configuration#

The GPU Memory and Disk Space values are in GB; Disk Space is for both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory | 2 | FP16 | 17 |

Memory Footprint#

You can control the NIM’s memory footprint by limiting the maximum batch size and the maximum sequence length. For more information, refer to Memory Footprint.

The following tables list the valid configurations and the associated approximate GPU memory footprints for the supported models. Note that running the container adds a small, fixed memory overhead.
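
To compare these approximations against an actual deployment, you can watch GPU memory use with nvidia-smi while the NIM serves requests. This is a minimal sketch; it reports usage for every process on the GPU, not only the NIM.

# Report per-GPU memory use every 5 seconds
nvidia-smi --query-gpu=memory.used --format=csv -l 5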

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 8192 | 0.7 |
| 8 | 8192 | 5.62 |
| 16 | 8192 | 11.25 |
| 30 | 1024 | 2.17 |
| 30 | 2048 | 4.81 |
| 30 | 4096 | 9.61 |
| 30 | 8192 | 21.09 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 8192 | 0.7 |
| 8 | 8192 | 5.62 |
| 16 | 1024 | 1.16 |
| 16 | 2048 | 2.56 |
| 16 | 4096 | 5.12 |
| 16 | 8192 | 11.25 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 8192 | 0.43 |
| 8 | 8192 | 3.53 |
| 16 | 8192 | 7.06 |
| 30 | 1024 | 1.19 |
| 30 | 2048 | 2.61 |
| 30 | 4096 | 5.68 |
| 30 | 8192 | 13.24 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 8192 | 0.68 |
| 8 | 8192 | 5.53 |
| 16 | 1024 | 1.13 |
| 16 | 2048 | 2.39 |
| 16 | 4096 | 5.03 |
| 16 | 8192 | 11.06 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 8192 | 0.43 |
| 8 | 8192 | 3.53 |
| 16 | 8192 | 7.06 |
| 30 | 1024 | 1.19 |
| 30 | 2048 | 3.55 |
| 30 | 4096 | 7.56 |
| 30 | 8192 | 16.99 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 512 | 0.01 |
| 64 | 512 | 0.41 |
| 128 | 512 | 0.81 |
| 192 | 512 | 1.22 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 512 | 0.01 |
| 64 | 512 | 0.41 |
| 128 | 512 | 0.81 |
| 384 | 512 | 2.44 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 512 | 0.01 |
| 16 | 512 | 0.1 |
| 32 | 512 | 0.2 |
| 64 | 512 | 0.41 |
| 80 | 512 | 0.51 |

| Max Batch Size | Max Sequence Length | Approximate GPU Memory Size (GB) |
|---|---|---|
| 1 | 512 | 0.01 |
| 64 | 512 | 0.41 |
| 128 | 512 | 0.81 |
| 256 | 512 | 1.63 |

Software#

NVIDIA Driver#

Releases prior to 1.4.0-rtx use Triton Inference Server 24.08, and release 1.4.0-rtx uses Triton Inference Server 25.01. For NVIDIA driver support, refer to the Triton Inference Server Release Notes for the corresponding version.
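
You can check the driver version installed on the host and compare it against the requirements in the corresponding Triton release notes.

# Print the installed NVIDIA driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader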

If issues arise when you start the NIM containers, run the following commands to ensure that the NVIDIA Container Toolkit is installed and that Docker is configured to use it.

# Add the NVIDIA Container Toolkit repository and its signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
 && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
   sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
   sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit, configure the Docker runtime to use it, and restart Docker
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

NVIDIA Container Toolkit#

Your Docker environment must support NVIDIA GPUs. For more information, refer to NVIDIA Container Toolkit.
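
As a quick check that the toolkit is working, you can run a throwaway container and confirm that it sees the GPUs. This sketch assumes the NVIDIA Container Toolkit setup from the previous section and uses the stock ubuntu image, into which the toolkit injects nvidia-smi.

# Confirm that containers launched with --gpus can access the GPUs;
# nvidia-smi inside the container should list the same GPUs as on the host
sudo docker run --rm --gpus all ubuntu nvidia-smi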