Support Matrix#
Software#
- CUDA 12.9 is configured in the microservice container.
- NVIDIA GPU Driver version 525 or higher. NVIDIA verified the documentation using driver 570.133.20.
- A container runtime environment such as Docker or Kubernetes. For Docker, refer to the installation instructions from Docker.
- NVIDIA Container Toolkit installed and configured. Refer to the installation instructions in the toolkit documentation.
- An active subscription to an NVIDIA AI Enterprise product or membership in the NVIDIA Developer Program. Access to the containers and models is restricted.
- An NGC API key. You need the key to download the container from NVIDIA NGC, and the container uses the key to download models from NVIDIA NGC. Refer to Generating Your NGC API Key in the NVIDIA NGC User Guide for more information.
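The same key serves both purposes: it authenticates Docker against nvcr.io (with the literal username `$oauthtoken`) and, through the `NGC_API_KEY` environment variable, lets the container download model weights at startup. The following is a minimal sketch using the Docker SDK for Python; the image path is a placeholder, so substitute the repository and tag for the container you intend to deploy.

```python
import os

import docker  # assumption: the Docker SDK for Python (pip install docker) is installed

# Read the NGC API key from the environment instead of hard-coding it.
ngc_api_key = os.environ["NGC_API_KEY"]

client = docker.from_env()

# NGC container registries authenticate with the literal username "$oauthtoken"
# and the NGC API key as the password.
client.login(username="$oauthtoken", password=ngc_api_key, registry="nvcr.io")

# Placeholder image path -- replace with the repository and tag for the
# NIM container you intend to deploy.
image = "nvcr.io/nim/<org>/<container>:<tag>"
client.images.pull(image)
```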
About Model Profiles#
The models for NVIDIA NIM microservices use model engines that are tuned for specific NVIDIA GPU models, number of GPUs, precision, and so on. NVIDIA produces model engines for several popular combinations and these are referred to as model profiles. Each model profile is identified by a unique 64-character string of hexadecimal digits that is referred to as a profile ID.
The NIM microservices support automatic profile selection by determining the GPU model and count on the node and attempting to match the optimal model profile. Alternatively, NIM microservices support running a specified model profile, but this requires that you review the profiles and know the profile ID.
The available model profiles are stored in a file in the NIM container file system. This file is referred to as the model manifest file; its default path in the container is /opt/nim/etc/default/model_manifest.yaml.
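If you want to review the available profile IDs without starting the microservice, one option is to copy the manifest out of the container (for example, with `docker cp`) and scan it. The following is a minimal sketch, assuming PyYAML is installed and that profile IDs appear in the manifest as 64-character hexadecimal strings; the exact manifest schema can vary between NIM releases, so the script walks the document generically rather than relying on specific keys.

```python
import re

import yaml  # assumption: PyYAML is installed

# Path to a copy of the manifest extracted from the container, for example with:
#   docker cp <container>:/opt/nim/etc/default/model_manifest.yaml .
MANIFEST_PATH = "model_manifest.yaml"

HEX64 = re.compile(r"^[0-9a-f]{64}$")


def find_profile_ids(node, found=None):
    """Recursively collect keys or values that match the 64-character hex profile ID format."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(key, str) and HEX64.match(key):
                found.add(key)
            find_profile_ids(value, found)
    elif isinstance(node, list):
        for item in node:
            find_profile_ids(item, found)
    elif isinstance(node, str) and HEX64.match(node):
        found.add(node)
    return found


with open(MANIFEST_PATH) as f:
    manifest = yaml.safe_load(f)

for profile_id in sorted(find_profile_ids(manifest)):
    print(profile_id)
```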
NVIDIA Llama 3.1 Nemotron Safety Guard 8B NIM Model Profiles#
The model requires 48 GB of GPU memory. NVIDIA developed and tested the microservice using the following GPUs:
- B200 SXM
- H200 SXM
- H100 SXM
- A100 SXM 40GB and 80GB
- A100 PCIe 40GB and 80GB (supported with generic model profiles only)
- A10G
- L40S PCIe
- H100 NVL
- H200 NVL
- GB200 NVL72
- GH200 NVL2
- GH200 96GB
- RTX PRO 6000 Blackwell Server Edition
You can use a single GPU with that capacity or two GPUs that together provide that capacity.
For information about locally-buildable and generic model profiles, refer to Model Profiles in NVIDIA NIM for LLMs in the NIM for LLMs documentation.
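To confirm that a node satisfies the 48 GB requirement before you deploy, you can query the installed GPUs through NVML. The following is a minimal sketch, assuming the nvidia-ml-py (pynvml) package is installed; it reports per-GPU and total memory so you can compare them against the requirement.

```python
import pynvml  # assumption: the nvidia-ml-py package is installed

REQUIRED_GB = 48  # GPU memory requirement stated above

pynvml.nvmlInit()
try:
    total_bytes = 0
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older bindings return bytes
            name = name.decode()
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        total_bytes += mem.total
        print(f"GPU {i}: {name}, {mem.total / 2**30:.1f} GiB")
    total_gib = total_bytes / 2**30
    status = "meets" if total_gib >= REQUIRED_GB else "is below"
    print(f"Total GPU memory: {total_gib:.1f} GiB ({status} the {REQUIRED_GB} GB requirement)")
finally:
    pynvml.nvmlShutdown()
```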
Optimized Model Profiles#
| GPU | Precision | # of GPUs | LoRA | LLM Engine | Profile | Disk Space | Profile ID |
|---|---|---|---|---|---|---|---|
| B200 | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.53 GB | 8258db301f5fb0a5b5a5b9c263af9334ffc0714c6ae82d2a0687b95c11f5834e |
| B200 | FP8 | 2 | False | TensorRT-LLM | Latency | 23.67 GB | 1c67491281ac66f32ca917bc566808bf4657ad20dec137f2b40c78d95b3a40dd |
| B200 | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.03 GB | d9bab56236ea000a0e0760e76d2530d7e460dde85cb6487b496009cbbb9ea7b2 |
| B200 | BF16 | 2 | False | TensorRT-LLM | Latency | 31.1 GB | 7b2460a744467de931be74543f3b1be6a4540edd8f5d3523c31aca253d3ee714 |
| GB200 | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.53 GB | 45999aa05eca849f11beb20bbf70dd48f2970a01dcf4c2617302b8038fca1662 |
| GB200 | FP8 | 2 | False | TensorRT-LLM | Latency | 23.67 GB | cfca3a90be399e2fc6b91dfe19aa141fe7db0ad114df365cf43d77c675711049 |
| GB200 | BF16 | 1 | False | TensorRT-LLM | Throughput | 37.05 GB | 20ac8ad6d3cb6b083facc320669292ab06972015f432839c64c841494beb4177 |
| GB200 | BF16 | 2 | False | TensorRT-LLM | Latency | 31.1 GB | 3b86e1c4eafac6dd59612ed5cea6878177867773d567fcc0e0127ad5b2b1221b |
| GH200 144GB HBM3e | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | c53b2566c86bf5ca53d25883b3f305f7fcfa1b8a854a52bf21d6901fec27fd3f |
| GH200 144GB HBM3e | FP8 | 2 | False | TensorRT-LLM | Latency | 23.72 GB | 052a14156d375521d64c584a0197a00ab3c54ae742b55145f1df091072656de7 |
| GH200 144GB HBM3e | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | 8f0e0665f4f53a02e2e68ab690ba990d5d9a3a6b49cdcd4ff679031757de30ed |
| GH200 144GB HBM3e | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | 542edd6068b3fee7bb9431ba988f167dfc9f6e5b6dbf005b2e208d42bd17d705 |
| GH200 120GB | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 8534f70d957abfcbb4353276dcb754f0a0820e90f555afe233be835f8d5c9743 |
| GH200 120GB | FP8 | 1 | False | TensorRT-LLM | Latency | 23.55 GB | 9784974c7d8bd2e2081fc751d02a70bd619b1b16d535d3ff764622c1b7e73ff2 |
| GH200 120GB | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | 076e5a3eec39670d90ef75725966fabc895074660b42a0ba3edce65c0eb15756 |
| GH200 120GB | BF16 | 1 | False | TensorRT-LLM | Latency | 30.04 GB | f41c136a67469ae0bda89f96d78cb8c2b9c01c27d0ac618112248025320817c3 |
| H200 NVL | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 7a4c46a6911fff7e11e99df87161eeae73b08c48cc75f1928e8289fbacc188fc |
| H200 NVL | FP8 | 2 | False | TensorRT-LLM | Latency | 23.71 GB | 373b366ba1d0bf15e9c7087967fadcc00db30c079943feee2ae0e2489aee9d66 |
| H200 NVL | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | 58041a8daba4de24b75c86456d256fbc34e881de58ebdab6135aad18466d8130 |
| H200 NVL | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | dbb457d9b5a45d0a6976c0ba1a8ee6072deb8fe64c49a12e47ba9c71863618d2 |
| H200 | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 908a7d5dbabe5e1a7956b66d90747b34d44edb357546ab098474efc28a1f95f2 |
| H200 | FP8 | 2 | False | TensorRT-LLM | Latency | 23.72 GB | 5a708fe91514e2aa44438158af955b35d51fab4ca1fb7268e35930e67fce6e08 |
| H200 | BF16 | 1 | False | TensorRT-LLM | Throughput | 37.07 GB | 1b953829736de1f2ffc00c8d25316a24d43ed398b5f8599957c9547007fd5471 |
| H200 | BF16 | 2 | False | TensorRT-LLM | Latency | 31.2 GB | 158a13eff79873eb73689daf87c365fa06946f74856646e54edc69728ef59a8e |
| H100 NVL | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | f3fd46462fe31ea44e5207d9f97ea190d8928b5688f5752445bb7dfd8afc999c |
| H100 NVL | FP8 | 2 | False | TensorRT-LLM | Latency | 23.66 GB | 5e8a78e4d0c9e2e513466ec23ac181ae8d75ce05bda5c4653eddf8f3a99f2d58 |
| H100 NVL | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | dbf1e0e5f43bcabfdd0791b0ffa83036fd7cf6be28fd5266990cf94bb8ffc338 |
| H100 NVL | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | a91cc87e8e98c7e88967319c273392e447fab72dd22aa8231630b573284525b2 |
| H100 80GB SXM | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.56 GB | 21879fb81cb016c598cb4324a7cfa93acac1904ebc8cbfd9fff42d79a929b19d |
| H100 80GB SXM | FP8 | 2 | False | TensorRT-LLM | Latency | 23.72 GB | 3a8d53337b395483663f3d96e4528a759d1474c5afa8b68d74a39e99fad7aebf |
| H100 80GB SXM | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | dfe2b2518ce82cb404a639f9039d956ec1b8f35703e7594d066ef080d1172156 |
| H100 80GB SXM | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | 2f69689bf8fef4118bb018bb07869fc2d4b6eb3185115b2117ad62150f5d0006 |
| A100 80GB SXM | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.16 GB | 8125bb2032b5649b5152192498543b77543de174d39f18dfb7473a6332c9b3fd |
| A100 80GB SXM | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | f5a41523d4dd3de5340276b18dfcbce17bb047b2457a36c1310e61589cc28935 |
| A100 40GB SXM | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.02 GB | 3f24e5c0ba016486c057ff9997caa9341f1f974f5dbf53e96fdbf9a0558bc9ab |
| A100 40GB SXM | BF16 | 2 | False | TensorRT-LLM | Latency | 31.45 GB | bd225e66799ed1e826ae5e2d7db017e1ba9a73855807e4f362967f22b2d7fecb |
| A10G | BF16 | 4 | False | TensorRT-LLM | Throughput | 33.73 GB | 105f6654f47c50206ac6dc9662aa774e1d8bad4b7268e1db3358ac0e446d9abc |
| A10G | BF16 | 8 | False | TensorRT-LLM | Latency | 37.65 GB | 80b75e35feecbcc4f3cc33ea634318bff503afcea1924a140f9fe9c3a848b99c |
| L40S | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 032ae037e7cddf2e43f37341027d8060e5eb5082797ed3e39d803f1c6af4eb84 |
| L40S | FP8 | 2 | False | TensorRT-LLM | Latency | 23.7 GB | f282d4039fc42e3ab8a69854daf1a3a9e0fdce7974d06c3924969e3196e4ac08 |
| L40S | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.03 GB | 7dedab87edfe691a1eb95d7bdc3fa74cc3e7fb6eeb6b56ab933f2ea3561d4c8d |
| L40S | BF16 | 2 | False | TensorRT-LLM | Latency | 31.11 GB | 260a963986f4b26d55e2af38ef19fd45341314564cc55f6b3f968e0323a8daea |
| RTX PRO 6000 Blackwell Server Edition | FP8 | 1 | False | TensorRT-LLM | Latency | 23.56 GB | 9325fdaeebf47da133de2bffcbfdcfb7537de5f0cf72adb091fb5bb8429047ad |
| RTX PRO 6000 Blackwell Server Edition | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.56 GB | b5938999f25208fbf5ef64ad1ce0d3978ac52d5b18d32f9b7e1551cbc97737db |
| RTX PRO 6000 Blackwell Server Edition | BF16 | 1 | False | TensorRT-LLM | Throughput | 37.16 GB | 86edc198bd7a53981bd95f612e07e204727e116bdf779f1405e0c8ff92e0f067 |
| RTX PRO 6000 Blackwell Server Edition | BF16 | 2 | False | TensorRT-LLM | Latency | 31.36 GB | 5f2aea268787b1aea54e98d370f99685f2d6066310615ff946cf4f76242dcbfc |
Generic Model Profiles#
| Precision | # of GPUs | LoRA | LLM Engine | Disk Space | Profile ID |
|---|---|---|---|---|---|
| BF16 | 1 | False | vLLM | 14.97 GB | 47804570e8db67fdc60d9cb518b0f60316ac4eb0f3a55af81aafefe124505a8f |
| BF16 | 2 | False | vLLM | 14.97 GB | 60cc64256f69a6242aa2b7320a1070ae84e88fe5416674cbbe9a2cb6ecb8e992 |
| BF16 | 4 | False | vLLM | 14.97 GB | 511022fb9a93b2c8b1eb84f4c472f05833e6904e51de7fd676be168a0aa9d63d |
| BF16 | 8 | False | vLLM | 14.97 GB | a98cb8c244d2fd95e65a102758e4b039c94864f322c5b64bc9ee5749ddac59ab |
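To pin the microservice to one of the profile IDs listed above rather than rely on automatic selection, you can pass the ID through the NIM_MODEL_PROFILE environment variable when you start the container. The following is a minimal sketch using the Docker SDK for Python; the image path and profile ID are placeholders, and the port, shared-memory size, and use of NIM_MODEL_PROFILE follow general NIM for LLMs conventions rather than settings confirmed in this document.

```python
import os

import docker  # assumption: the Docker SDK for Python is installed
from docker.types import DeviceRequest

# Placeholders -- substitute the image you pulled and a profile ID from the
# tables above that matches your GPU model and count.
IMAGE = "nvcr.io/nim/<org>/<container>:<tag>"
PROFILE_ID = "<profile-id-from-the-tables-above>"

client = docker.from_env()
container = client.containers.run(
    IMAGE,
    detach=True,
    device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],  # expose all GPUs
    environment={
        "NGC_API_KEY": os.environ["NGC_API_KEY"],  # lets the container download the model
        "NIM_MODEL_PROFILE": PROFILE_ID,           # bypass automatic profile selection
    },
    ports={"8000/tcp": 8000},  # assumption: the service listens on port 8000
    shm_size="16g",
)
print(f"Started container {container.short_id}")
```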