Support Matrix#

Software#

  • CUDA 12.9 is configured in the microservice container.

  • NVIDIA GPU Driver version 525 or higher. NVIDIA verified this documentation using driver version 570.133.20.

  • A container runtime environment such as Docker or Kubernetes. For Docker, refer to the installation instructions from Docker.

  • NVIDIA Container Toolkit installed and configured. Refer to the installation instructions in the toolkit documentation.

  • An active subscription to an NVIDIA AI Enterprise product or membership in the NVIDIA Developer Program. Access to the containers and models is restricted.

  • An NGC API key. The key is required to download the container from NVIDIA NGC, and the container uses the key to download models from NVIDIA NGC. Refer to Generating Your NGC API Key in the NVIDIA NGC User Guide for more information.
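With the prerequisites above in place, registry authentication typically looks like the following sketch. The container name and tag are placeholders, not the actual image path; substitute the values published on NVIDIA NGC.

```shell
# Log in to the NVIDIA NGC registry. The username is the literal string
# "$oauthtoken"; the password is your NGC API key.
export NGC_API_KEY=<your-ngc-api-key>
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

# Placeholder image path -- substitute the actual container name and tag from NGC.
docker pull nvcr.io/nim/nvidia/<container-name>:<tag>
```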

About Model Profiles#

The models for NVIDIA NIM microservices use model engines that are tuned for a specific NVIDIA GPU model, GPU count, precision, and so on. NVIDIA produces model engines for several popular combinations, and these are referred to as model profiles. Each model profile is identified by a unique 64-character string of hexadecimal digits that is referred to as a profile ID.

The NIM microservices support automatic profile selection: the microservice detects the GPU model and count on the node and attempts to match the optimal model profile. Alternatively, NIM microservices support running a specified model profile, but this requires that you review the available profiles and know the profile ID.
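As a sketch, assuming the conventions documented in NIM for LLMs (the `list-model-profiles` utility and the `NIM_MODEL_PROFILE` environment variable), you can inspect and override profile selection as follows; the image path is a placeholder:

```shell
# List the model profiles bundled in the container, including which ones
# are compatible with the GPUs detected on this node.
docker run --rm --runtime=nvidia --gpus=all \
  nvcr.io/nim/nvidia/<container-name>:<tag> list-model-profiles

# Pin a specific profile by its ID instead of relying on automatic selection.
docker run --rm --runtime=nvidia --gpus=all \
  -e NGC_API_KEY \
  -e NIM_MODEL_PROFILE=<profile-id> \
  nvcr.io/nim/nvidia/<container-name>:<tag>
```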

The available model profiles are stored in a file in the NIM container file system, referred to as the model manifest file. The default path in the container is /opt/nim/etc/default/model_manifest.yaml.
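Because every profile ID is a 64-character hexadecimal string, the IDs can be pulled out of the manifest text without knowing its exact schema. The manifest excerpt below is synthetic (the real schema varies by NIM version); this is a minimal sketch, not NIM tooling:

```python
# Illustrative snippet: extract profile IDs (64-hex-char strings) from
# model manifest text. The excerpt below is synthetic; the real file is
# at /opt/nim/etc/default/model_manifest.yaml inside the container.
import re

MANIFEST_EXCERPT = """\
profiles:
  8258db301f5fb0a5b5a5b9c263af9334ffc0714c6ae82d2a0687b95c11f5834e:
    backend: tensorrt_llm
  47804570e8db67fdc60d9cb518b0f60316ac4eb0f3a55af81aafefe124505a8f:
    backend: vllm
"""

def profile_ids(manifest_text: str) -> list[str]:
    """Return every 64-character hexadecimal profile ID found in the text."""
    return re.findall(r"\b[0-9a-f]{64}\b", manifest_text)

print(profile_ids(MANIFEST_EXCERPT))
```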

NVIDIA Llama 3.1 Nemotron Safety Guard 8B NIM Model Profiles#

The model requires 48 GB of GPU memory. NVIDIA developed and tested the microservice using the following GPUs:

  • B200 SXM

  • H200 SXM

  • H100 SXM

  • A100 SXM 40GB and 80GB

  • A100 PCIe 40GB and 80GB (supported with generic model profiles only)

  • A10G

  • L40S PCIe

  • H100 NVL

  • H200 NVL

  • GB200 NVL72

  • GH200 NVL2

  • GH200 96GB

  • RTX PRO 6000 Blackwell Server Edition

You can use a single GPU with that capacity or two GPUs whose combined memory meets it.
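The sizing rule above can be sketched as a simple aggregate-memory check. This is a back-of-the-envelope heuristic, not NIM's actual profile-selection logic; real placement also depends on the engine, precision, and GPU model:

```python
# Rough check of the 48 GB requirement: one sufficiently large GPU, or
# two GPUs whose combined memory meets it. Inputs are illustrative.
REQUIRED_GB = 48

def total_memory_ok(per_gpu_memory_gb: float, gpu_count: int) -> bool:
    """True if the aggregate GPU memory meets the 48 GB requirement."""
    return per_gpu_memory_gb * gpu_count >= REQUIRED_GB

print(total_memory_ok(80, 1))  # one 80 GB GPU -> True
print(total_memory_ok(40, 2))  # two 40 GB GPUs -> True
print(total_memory_ok(24, 1))  # one 24 GB GPU -> False
```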

For information about locally-buildable and generic model profiles, refer to Model Profiles in NVIDIA NIM for LLMs in the NIM for LLMs documentation.

Optimized Model Profiles#

| GPU | Precision | # of GPUs | LoRA | LLM Engine | Profile | Disk Space | Profile ID |
|-----|-----------|-----------|------|------------|---------|------------|------------|
| B200 | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.53 GB | 8258db301f5fb0a5b5a5b9c263af9334ffc0714c6ae82d2a0687b95c11f5834e |
| B200 | FP8 | 2 | False | TensorRT-LLM | Latency | 23.67 GB | 1c67491281ac66f32ca917bc566808bf4657ad20dec137f2b40c78d95b3a40dd |
| B200 | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.03 GB | d9bab56236ea000a0e0760e76d2530d7e460dde85cb6487b496009cbbb9ea7b2 |
| B200 | BF16 | 2 | False | TensorRT-LLM | Latency | 31.1 GB | 7b2460a744467de931be74543f3b1be6a4540edd8f5d3523c31aca253d3ee714 |
| GB200 | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.53 GB | 45999aa05eca849f11beb20bbf70dd48f2970a01dcf4c2617302b8038fca1662 |
| GB200 | FP8 | 2 | False | TensorRT-LLM | Latency | 23.67 GB | cfca3a90be399e2fc6b91dfe19aa141fe7db0ad114df365cf43d77c675711049 |
| GB200 | BF16 | 1 | False | TensorRT-LLM | Throughput | 37.05 GB | 20ac8ad6d3cb6b083facc320669292ab06972015f432839c64c841494beb4177 |
| GB200 | BF16 | 2 | False | TensorRT-LLM | Latency | 31.1 GB | 3b86e1c4eafac6dd59612ed5cea6878177867773d567fcc0e0127ad5b2b1221b |
| GH200 144GB HBM3e | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | c53b2566c86bf5ca53d25883b3f305f7fcfa1b8a854a52bf21d6901fec27fd3f |
| GH200 144GB HBM3e | FP8 | 2 | False | TensorRT-LLM | Latency | 23.72 GB | 052a14156d375521d64c584a0197a00ab3c54ae742b55145f1df091072656de7 |
| GH200 144GB HBM3e | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | 8f0e0665f4f53a02e2e68ab690ba990d5d9a3a6b49cdcd4ff679031757de30ed |
| GH200 144GB HBM3e | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | 542edd6068b3fee7bb9431ba988f167dfc9f6e5b6dbf005b2e208d42bd17d705 |
| GH200 120GB | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 8534f70d957abfcbb4353276dcb754f0a0820e90f555afe233be835f8d5c9743 |
| GH200 120GB | FP8 | 1 | False | TensorRT-LLM | Latency | 23.55 GB | 9784974c7d8bd2e2081fc751d02a70bd619b1b16d535d3ff764622c1b7e73ff2 |
| GH200 120GB | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | 076e5a3eec39670d90ef75725966fabc895074660b42a0ba3edce65c0eb15756 |
| GH200 120GB | BF16 | 1 | False | TensorRT-LLM | Latency | 30.04 GB | f41c136a67469ae0bda89f96d78cb8c2b9c01c27d0ac618112248025320817c3 |
| H200 NVL | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 7a4c46a6911fff7e11e99df87161eeae73b08c48cc75f1928e8289fbacc188fc |
| H200 NVL | FP8 | 2 | False | TensorRT-LLM | Latency | 23.71 GB | 373b366ba1d0bf15e9c7087967fadcc00db30c079943feee2ae0e2489aee9d66 |
| H200 NVL | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | 58041a8daba4de24b75c86456d256fbc34e881de58ebdab6135aad18466d8130 |
| H200 NVL | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | dbb457d9b5a45d0a6976c0ba1a8ee6072deb8fe64c49a12e47ba9c71863618d2 |
| H200 | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 908a7d5dbabe5e1a7956b66d90747b34d44edb357546ab098474efc28a1f95f2 |
| H200 | FP8 | 2 | False | TensorRT-LLM | Latency | 23.72 GB | 5a708fe91514e2aa44438158af955b35d51fab4ca1fb7268e35930e67fce6e08 |
| H200 | BF16 | 1 | False | TensorRT-LLM | Throughput | 37.07 GB | 1b953829736de1f2ffc00c8d25316a24d43ed398b5f8599957c9547007fd5471 |
| H200 | BF16 | 2 | False | TensorRT-LLM | Latency | 31.2 GB | 158a13eff79873eb73689daf87c365fa06946f74856646e54edc69728ef59a8e |
| H100 NVL | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | f3fd46462fe31ea44e5207d9f97ea190d8928b5688f5752445bb7dfd8afc999c |
| H100 NVL | FP8 | 2 | False | TensorRT-LLM | Latency | 23.66 GB | 5e8a78e4d0c9e2e513466ec23ac181ae8d75ce05bda5c4653eddf8f3a99f2d58 |
| H100 NVL | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | dbf1e0e5f43bcabfdd0791b0ffa83036fd7cf6be28fd5266990cf94bb8ffc338 |
| H100 NVL | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | a91cc87e8e98c7e88967319c273392e447fab72dd22aa8231630b573284525b2 |
| H100 80GB SXM | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.56 GB | 21879fb81cb016c598cb4324a7cfa93acac1904ebc8cbfd9fff42d79a929b19d |
| H100 80GB SXM | FP8 | 2 | False | TensorRT-LLM | Latency | 23.72 GB | 3a8d53337b395483663f3d96e4528a759d1474c5afa8b68d74a39e99fad7aebf |
| H100 80GB SXM | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.04 GB | dfe2b2518ce82cb404a639f9039d956ec1b8f35703e7594d066ef080d1172156 |
| H100 80GB SXM | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | 2f69689bf8fef4118bb018bb07869fc2d4b6eb3185115b2117ad62150f5d0006 |
| A100 80GB SXM | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.16 GB | 8125bb2032b5649b5152192498543b77543de174d39f18dfb7473a6332c9b3fd |
| A100 80GB SXM | BF16 | 2 | False | TensorRT-LLM | Latency | 31.13 GB | f5a41523d4dd3de5340276b18dfcbce17bb047b2457a36c1310e61589cc28935 |
| A100 40GB SXM | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.02 GB | 3f24e5c0ba016486c057ff9997caa9341f1f974f5dbf53e96fdbf9a0558bc9ab |
| A100 40GB SXM | BF16 | 2 | False | TensorRT-LLM | Latency | 31.45 GB | bd225e66799ed1e826ae5e2d7db017e1ba9a73855807e4f362967f22b2d7fecb |
| A10G | BF16 | 4 | False | TensorRT-LLM | Throughput | 33.73 GB | 105f6654f47c50206ac6dc9662aa774e1d8bad4b7268e1db3358ac0e446d9abc |
| A10G | BF16 | 8 | False | TensorRT-LLM | Latency | 37.65 GB | 80b75e35feecbcc4f3cc33ea634318bff503afcea1924a140f9fe9c3a848b99c |
| L40S | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.55 GB | 032ae037e7cddf2e43f37341027d8060e5eb5082797ed3e39d803f1c6af4eb84 |
| L40S | FP8 | 2 | False | TensorRT-LLM | Latency | 23.7 GB | f282d4039fc42e3ab8a69854daf1a3a9e0fdce7974d06c3924969e3196e4ac08 |
| L40S | BF16 | 1 | False | TensorRT-LLM | Throughput | 30.03 GB | 7dedab87edfe691a1eb95d7bdc3fa74cc3e7fb6eeb6b56ab933f2ea3561d4c8d |
| L40S | BF16 | 2 | False | TensorRT-LLM | Latency | 31.11 GB | 260a963986f4b26d55e2af38ef19fd45341314564cc55f6b3f968e0323a8daea |
| RTX PRO 6000 Blackwell Server Edition | FP8 | 1 | False | TensorRT-LLM | Latency | 23.56 GB | 9325fdaeebf47da133de2bffcbfdcfb7537de5f0cf72adb091fb5bb8429047ad |
| RTX PRO 6000 Blackwell Server Edition | FP8 | 1 | False | TensorRT-LLM | Throughput | 23.56 GB | b5938999f25208fbf5ef64ad1ce0d3978ac52d5b18d32f9b7e1551cbc97737db |
| RTX PRO 6000 Blackwell Server Edition | BF16 | 1 | False | TensorRT-LLM | Throughput | 37.16 GB | 86edc198bd7a53981bd95f612e07e204727e116bdf779f1405e0c8ff92e0f067 |
| RTX PRO 6000 Blackwell Server Edition | BF16 | 2 | False | TensorRT-LLM | Latency | 31.36 GB | 5f2aea268787b1aea54e98d370f99685f2d6066310615ff946cf4f76242dcbfc |

Generic Model Profiles#

| Precision | # of GPUs | LoRA | LLM Engine | Disk Space | Profile ID |
|-----------|-----------|------|------------|------------|------------|
| BF16 | 1 | False | vLLM | 14.97 GB | 47804570e8db67fdc60d9cb518b0f60316ac4eb0f3a55af81aafefe124505a8f |
| BF16 | 2 | False | vLLM | 14.97 GB | 60cc64256f69a6242aa2b7320a1070ae84e88fe5416674cbbe9a2cb6ecb8e992 |
| BF16 | 4 | False | vLLM | 14.97 GB | 511022fb9a93b2c8b1eb84f4c472f05833e6904e51de7fd676be168a0aa9d63d |
| BF16 | 8 | False | vLLM | 14.97 GB | a98cb8c244d2fd95e65a102758e4b039c94864f322c5b64bc9ee5749ddac59ab |