Support Matrix#

Software#

  • CUDA 12.9 is configured in the microservice container.

  • NVIDIA GPU Driver version 525 or higher. NVIDIA verified this documentation using driver version 570.133.20.

  • A container runtime environment such as Docker or Kubernetes. For Docker, refer to the installation instructions from Docker.

  • NVIDIA Container Toolkit installed and configured. Refer to the installation instructions in the toolkit documentation.

  • An active subscription to an NVIDIA AI Enterprise product or membership in the NVIDIA Developer Program. Access to the containers and models is restricted.

  • An NGC API key. The key is required to download the container from NVIDIA NGC, and the container uses the key to download models from NVIDIA NGC. Refer to Generating Your NGC API Key in the NVIDIA NGC User Guide for more information.
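With the prerequisites above in place, registry authentication typically looks like the following sketch. The container name and tag are placeholders, not the actual image path; substitute the values published on NVIDIA NGC.

```shell
# Log in to the NVIDIA NGC registry. The username is the literal string
# "$oauthtoken"; the password is your NGC API key.
export NGC_API_KEY=<your-ngc-api-key>
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

# Placeholder image path -- substitute the actual container name and tag from NGC.
docker pull nvcr.io/nim/nvidia/<container-name>:<tag>
```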

About Model Profiles#

The models for NVIDIA NIM microservices use model engines that are tuned for a specific NVIDIA GPU model, GPU count, precision, and so on. NVIDIA produces model engines for several popular combinations, and these are referred to as model profiles. Each model profile is identified by a unique 64-character string of hexadecimal digits that is referred to as a profile ID.

The NIM microservices support automatic profile selection: the microservice detects the GPU model and count on the node and attempts to match the optimal model profile. Alternatively, NIM microservices support running a specified model profile, but this requires that you review the available profiles and know the profile ID.
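As a sketch, assuming the conventions documented in NIM for LLMs (the `list-model-profiles` utility and the `NIM_MODEL_PROFILE` environment variable), you can inspect and override profile selection as follows; the image path is a placeholder:

```shell
# List the model profiles bundled in the container, including which ones
# are compatible with the GPUs detected on this node.
docker run --rm --runtime=nvidia --gpus=all \
  nvcr.io/nim/nvidia/<container-name>:<tag> list-model-profiles

# Pin a specific profile by its ID instead of relying on automatic selection.
docker run --rm --runtime=nvidia --gpus=all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<profile-id> \
  nvcr.io/nim/nvidia/<container-name>:<tag>
```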

The available model profiles are stored in a file in the NIM container file system, referred to as the model manifest file. The default path in the container is /opt/nim/etc/default/model_manifest.yaml.
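Because every profile ID is a 64-character hexadecimal string, the IDs can be pulled out of the manifest text without knowing its exact schema. The manifest excerpt below is synthetic (the real schema varies by NIM version); this is a minimal sketch, not NIM tooling:

```python
# Illustrative snippet: extract profile IDs (64-hex-char strings) from
# model manifest text. The excerpt below is synthetic; the real file is
# at /opt/nim/etc/default/model_manifest.yaml inside the container.
import re

MANIFEST_EXCERPT = """\
profiles:
  8258db301f5fb0a5b5a5b9c263af9334ffc0714c6ae82d2a0687b95c11f5834e:
    backend: tensorrt_llm
  47804570e8db67fdc60d9cb518b0f60316ac4eb0f3a55af81aafefe124505a8f:
    backend: vllm
"""

def profile_ids(manifest_text: str) -> list[str]:
    """Return every 64-character hexadecimal profile ID found in the text."""
    return re.findall(r"\b[0-9a-f]{64}\b", manifest_text)

print(profile_ids(MANIFEST_EXCERPT))
```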

NVIDIA Llama 3.1 Nemotron Safety Guard 8B NIM Model Profiles#

The model requires 48 GB of GPU memory. NVIDIA developed and tested the microservice using the following GPUs:

  • B200 SXM

  • H200 SXM

  • H100 SXM

  • A100 SXM 40GB and 80GB

  • A100 PCIe 40GB and 80GB (supported with generic model profiles only)

  • A10G

  • L40S PCIe

  • H100 NVL

  • H200 NVL

  • GB200 NVL72

  • GH200 NVL2

  • GH200 96GB

  • RTX PRO 6000 Blackwell Server Edition

You can use a single GPU with that capacity or two GPUs whose combined memory meets it.
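The sizing rule above can be sketched as a simple aggregate-memory check. This is a back-of-the-envelope heuristic, not NIM's actual profile-selection logic; real placement also depends on the engine, precision, and GPU model:

```python
# Rough check of the 48 GB requirement: one sufficiently large GPU, or
# two GPUs whose combined memory meets it. Inputs are illustrative.
REQUIRED_GB = 48

def total_memory_ok(per_gpu_memory_gb: float, gpu_count: int) -> bool:
    """True if the aggregate GPU memory meets the 48 GB requirement."""
    return per_gpu_memory_gb * gpu_count >= REQUIRED_GB

print(total_memory_ok(80, 1))  # one 80 GB GPU -> True
print(total_memory_ok(40, 2))  # two 40 GB GPUs -> True
print(total_memory_ok(24, 1))  # one 24 GB GPU -> False
```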

For information about locally-buildable and generic model profiles, refer to Model Profiles in NVIDIA NIM for LLMs in the NIM for LLMs documentation.

Optimized Model Profiles#

| GPU | Precision | # of GPUs | LoRA | LLM Engine | Profile | Disk Space | Profile ID |
|-----|-----------|-----------|------|------------|---------|------------|------------|
| B200 | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.53 GB | 8258db301f5fb0a5b5a5b9c263af9334ffc0714c6ae82d2a0687b95c11f5834e |
| B200 | FP8 | 2 | False | TensorRT-LLM | Latency | 23.67 GB | 1c67491281ac66f32ca917bc566808bf4657ad20dec137f2b40c78d95b3a40dd |
| B200 | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.03 GB | d9bab56236ea000a0e0760e76d2530d7e460dde85cb6487b496009cbbb9ea7b2 |
| B200 | BF16 | 2 | False | TensorRT-LLM | Latency | 31.1 GB | 7b2460a744467de931be74543f3b1be6a4540edd8f5d3523c31aca253d3ee714 |
| GB200 | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.53 GB | 45999aa05eca849f11beb20bbf70dd48f2970a01dcf4c2617302b8038fca1662 |
| GB200 | FP8 | 2 | False | TensorRT-LLM | Latency | 23.67 GB | cfca3a90be399e2fc6b91dfe19aa141fe7db0ad114df365cf43d77c675711049 |
| GB200 | BF16 | 1 | False | TensorRT-LLM | Throughput | 37.05 GB | 20ac8ad6d3cb6b083facc320669292ab06972015f432839c64c841494beb4177 |
| GB200 | BF16 | 2 | False | TensorRT-LLM | Latency | 31.1 GB | 3b86e1c4eafac6dd59612ed5cea6878177867773d567fcc0e0127ad5b2b1221b |
| GH200 144GB HBM3e | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | c53b2566c86bf5ca53d25883b3f305f7fcfa1b8a854a52bf21d6901fec27fd3f |
| GH200 144GB HBM3e | FP8 | 2 | False | TensorRT-LLM | Latency | 23.72 GB | 052a14156d375521d64c584a0197a00ab3c54ae742b55145f1df091072656de7 |
| GH200 144GB HBM3e | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | 8f0e0665f4f53a02e2e68ab690ba990d5d9a3a6b49cdcd4ff679031757de30ed |
| GH200 144GB HBM3e | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | 542edd6068b3fee7bb9431ba988f167dfc9f6e5b6dbf005b2e208d42bd17d705 |
| GH200 120GB | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 8534f70d957abfcbb4353276dcb754f0a0820e90f555afe233be835f8d5c9743 |
| GH200 120GB | FP8 | 1 | False | TensorRT-LLM | Latency | 23.55 GB | 9784974c7d8bd2e2081fc751d02a70bd619b1b16d535d3ff764622c1b7e73ff2 |
| GH200 120GB | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | 076e5a3eec39670d90ef75725966fabc895074660b42a0ba3edce65c0eb15756 |
| GH200 120GB | BF16 | 1 | False | TensorRT-LLM | Latency | 30.04 GB | f41c136a67469ae0bda89f96d78cb8c2b9c01c27d0ac618112248025320817c3 |
| H200 NVL | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 7a4c46a6911fff7e11e99df87161eeae73b08c48cc75f1928e8289fbacc188fc |
| H200 NVL | FP8 | 2 | False | TensorRT-LLM | Latency | 23.71 GB | 373b366ba1d0bf15e9c7087967fadcc00db30c079943feee2ae0e2489aee9d66 |
| H200 NVL | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | 58041a8daba4de24b75c86456d256fbc34e881de58ebdab6135aad18466d8130 |
| H200 NVL | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | dbb457d9b5a45d0a6976c0ba1a8ee6072deb8fe64c49a12e47ba9c71863618d2 |
| H200 | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 908a7d5dbabe5e1a7956b66d90747b34d44edb357546ab098474efc28a1f95f2 |
| H200 | FP8 | 2 | False | TensorRT-LLM | Latency | 23.72 GB | 5a708fe91514e2aa44438158af955b35d51fab4ca1fb7268e35930e67fce6e08 |
| H200 | BF16 | 1 | False | TensorRT-LLM | Throughput | 37.07 GB | 1b953829736de1f2ffc00c8d25316a24d43ed398b5f8599957c9547007fd5471 |
| H200 | BF16 | 2 | False | TensorRT-LLM | Latency | 31.2 GB | 158a13eff79873eb73689daf87c365fa06946f74856646e54edc69728ef59a8e |
| H100 NVL | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | f3fd46462fe31ea44e5207d9f97ea190d8928b5688f5752445bb7dfd8afc999c |
| H100 NVL | FP8 | 2 | False | TensorRT-LLM | Latency | 23.66 GB | 5e8a78e4d0c9e2e513466ec23ac181ae8d75ce05bda5c4653eddf8f3a99f2d58 |
| H100 NVL | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | dbf1e0e5f43bcabfdd0791b0ffa83036fd7cf6be28fd5266990cf94bb8ffc338 |
| H100 NVL | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | a91cc87e8e98c7e88967319c273392e447fab72dd22aa8231630b573284525b2 |
| H100 80GB SXM | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.56 GB | 21879fb81cb016c598cb4324a7cfa93acac1904ebc8cbfd9fff42d79a929b19d |
| H100 80GB SXM | FP8 | 2 | False | TensorRT-LLM | Latency | 23.72 GB | 3a8d53337b395483663f3d96e4528a759d1474c5afa8b68d74a39e99fad7aebf |
| H100 80GB SXM | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | dfe2b2518ce82cb404a639f9039d956ec1b8f35703e7594d066ef080d1172156 |
| H100 80GB SXM | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | 2f69689bf8fef4118bb018bb07869fc2d4b6eb3185115b2117ad62150f5d0006 |
| A100 80GB SXM | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.16 GB | 8125bb2032b5649b5152192498543b77543de174d39f18dfb7473a6332c9b3fd |
| A100 80GB SXM | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | f5a41523d4dd3de5340276b18dfcbce17bb047b2457a36c1310e61589cc28935 |
| A100 40GB SXM | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.02 GB | 3f24e5c0ba016486c057ff9997caa9341f1f974f5dbf53e96fdbf9a0558bc9ab |
| A100 40GB SXM | BF16 | 2 | False | TensorRT-LLM | Latency | 31.45 GB | bd225e66799ed1e826ae5e2d7db017e1ba9a73855807e4f362967f22b2d7fecb |
| A10G | BF16 | 4 | False | TensorRT-LLM | Throughput | 33.73 GB | 105f6654f47c50206ac6dc9662aa774e1d8bad4b7268e1db3358ac0e446d9abc |
| A10G | BF16 | 8 | False | TensorRT-LLM | Latency | 37.65 GB | 80b75e35feecbcc4f3cc33ea634318bff503afcea1924a140f9fe9c3a848b99c |
| L40S | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 032ae037e7cddf2e43f37341027d8060e5eb5082797ed3e39d803f1c6af4eb84 |
| L40S | FP8 | 2 | False | TensorRT-LLM | Latency | 23.7 GB | f282d4039fc42e3ab8a69854daf1a3a9e0fdce7974d06c3924969e3196e4ac08 |
| L40S | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.03 GB | 7dedab87edfe691a1eb95d7bdc3fa74cc3e7fb6eeb6b56ab933f2ea3561d4c8d |
| L40S | BF16 | 2 | False | TensorRT-LLM | Latency | 31.11 GB | 260a963986f4b26d55e2af38ef19fd45341314564cc55f6b3f968e0323a8daea |
| RTX PRO 6000 Blackwell Server Edition | FP8 | 1 | False | TensorRT-LLM | Latency | 23.56 GB | 9325fdaeebf47da133de2bffcbfdcfb7537de5f0cf72adb091fb5bb8429047ad |
| RTX PRO 6000 Blackwell Server Edition | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.56 GB | b5938999f25208fbf5ef64ad1ce0d3978ac52d5b18d32f9b7e1551cbc97737db |
| RTX PRO 6000 Blackwell Server Edition | BF16 | 1 | False | TensorRT-LLM | Throughput | 37.16 GB | 86edc198bd7a53981bd95f612e07e204727e116bdf779f1405e0c8ff92e0f067 |
| RTX PRO 6000 Blackwell Server Edition | BF16 | 2 | False | TensorRT-LLM | Latency | 31.36 GB | 5f2aea268787b1aea54e98d370f99685f2d6066310615ff946cf4f76242dcbfc |

Generic Model Profiles#

| Precision | # of GPUs | LoRA | LLM Engine | Disk Space | Profile ID |
|-----------|-----------|------|------------|------------|------------|
| BF16 | 1 | False | vLLM | 14.97 GB | 47804570e8db67fdc60d9cb518b0f60316ac4eb0f3a55af81aafefe124505a8f |
| BF16 | 2 | False | vLLM | 14.97 GB | 60cc64256f69a6242aa2b7320a1070ae84e88fe5416674cbbe9a2cb6ecb8e992 |
| BF16 | 4 | False | vLLM | 14.97 GB | 511022fb9a93b2c8b1eb84f4c472f05833e6904e51de7fd676be168a0aa9d63d |
| BF16 | 8 | False | vLLM | 14.97 GB | a98cb8c244d2fd95e65a102758e4b039c94864f322c5b64bc9ee5749ddac59ab |