Supported Models#

GPUs#

The GPUs listed in the following sections have the following specifications.

| GPU | Family | Memory |
|---|---|---|
| H200 | SXM/NVLink | 141 GB |
| H100 | SXM/NVLink | 80 GB |
| A100 | SXM/NVLink | 80 GB |
| L40S | PCIe | 48 GB |
| A10G | PCIe | 24 GB |
| NVIDIA RTX 6000 Ada Generation | | 32 GB |
| GeForce RTX 5090 | | 32 GB |
| GeForce RTX 5080 | | 16 GB |
| GeForce RTX 4090 | | 24 GB |
| GeForce RTX 4080 | | 16 GB |

Optimized Models#

The following models are optimized with TRT-LLM and are available as pre-built, optimized engines on NGC; use them with the Chat Completions endpoint. In vGPU environments, the GPU memory values in the following sections refer to the total GPU memory, including the memory reserved for the vGPU setup.

NVIDIA also provides generic model profiles that operate with any NVIDIA GPU (or set of GPUs) with sufficient memory capacity. Generic model profiles can be identified by the presence of local_build or vllm in the profile name. On systems where there are no compatible optimized profiles, generic profiles are chosen automatically. Optimized profiles are preferred over generic profiles when available, but you can choose to deploy a generic profile on any system by following the steps at Profile Selection.
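As described above, a generic profile can be recognized purely by its name. A minimal sketch of that convention follows; the profile names in the example are hypothetical, for illustration only:

```python
def is_generic_profile(profile_name: str) -> bool:
    """A profile is generic (non-optimized) when its name contains
    'local_build' or 'vllm', per the naming convention described above."""
    return "local_build" in profile_name or "vllm" in profile_name

# Hypothetical profile names, for illustration only.
profiles = [
    "tensorrt_llm-h100-fp8-tp2-throughput",
    "vllm-bf16-tp1",
    "local_build-bf16-tp2",
]

# Keep only the generic profiles.
generic = [p for p in profiles if is_generic_profile(p)]
```

On a system with no compatible optimized profile, only names like the last two would remain candidates.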

You can also find additional information in Models about the features, such as LoRA, that these models support.
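Since the optimized engines are served through an OpenAI-compatible Chat Completions endpoint, a request can be sketched as follows; the base URL, port, and model identifier are assumptions for illustration, not values from this document:

```python
import json
from urllib import request

# Hypothetical local endpoint and model name; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta/llama-3.1-8b-instruct",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "Write a haiku about GPUs."}
    ],
    "max_tokens": 64,
}

def build_request(url: str, body: dict) -> request.Request:
    """Build a POST request carrying the chat-completions payload as JSON."""
    data = json.dumps(body).encode("utf-8")
    return request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )

req = build_request(BASE_URL, payload)
# Sending it would be: request.urlopen(req) against a running server.
```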

Code Llama 13B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Throughput | 2 | 24.63 |
| H100 SXM | FP16 | Latency | 4 | 25.32 |
| A100 SXM | FP16 | Throughput | 2 | 24.63 |
| A100 SXM | FP16 | Latency | 4 | 25.31 |
| L40S | FP16 | Throughput | 2 | 25.32 |
| L40S | FP16 | Latency | 2 | 24.63 |
| A10G | FP16 | Throughput | 4 | 25.32 |
| A10G | FP16 | Latency | 8 | 26.69 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
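The generic-configuration criteria above can be expressed as a small eligibility check. This is an illustrative sketch only; the function and its input fields are assumptions, not part of any NVIDIA API:

```python
def meets_generic_config(
    gpus: list,
    required_memory_gb: float,
    precision: str = "fp16",
) -> bool:
    """Sketch of the generic-profile rules described above.

    Each GPU dict is assumed to carry: name, memory_gb, free_fraction,
    and compute_capability. All GPUs must be the same model (homogeneous).
    """
    if not gpus:
        return False
    # GPUs must be homogeneous (all the same model).
    if len({g["name"] for g in gpus}) != 1:
        return False
    # Compute capability >= 7.0, or >= 8.0 when running bfloat16.
    min_cc = 8.0 if precision == "bf16" else 7.0
    if any(g["compute_capability"] < min_cc for g in gpus):
        return False
    # Aggregate memory must cover the model.
    if sum(g["memory_gb"] for g in gpus) < required_memory_gb:
        return False
    # At least one GPU must have 95% or more of its memory free.
    return any(g["free_fraction"] >= 0.95 for g in gpus)
```

For example, two hypothetical 24 GB GPUs with compute capability 8.6 would pass for a 40 GB model in BF16, while a compute-capability-7.0 GPU would pass only for non-bfloat16 precisions.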

Code Llama 34B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 32.17 |
| H100 SXM | FP8 | Latency | 4 | 32.42 |
| H100 SXM | FP16 | Throughput | 2 | 63.48 |
| H100 SXM | FP16 | Latency | 4 | 64.59 |
| A100 SXM | FP16 | Throughput | 2 | 63.48 |
| A100 SXM | FP16 | Latency | 4 | 64.59 |
| L40S | FP8 | Throughput | 4 | 32.42 |
| L40S | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Latency | 8 | 66.8 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Code Llama 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 4 | 65.47 |
| H100 SXM | FP8 | Latency | 8 | 66.37 |
| H100 SXM | FP16 | Throughput | 4 | 130.35 |
| H100 SXM | FP16 | Latency | 8 | 66.37 |
| A100 SXM | FP16 | Throughput | 4 | 130.35 |
| A100 SXM | FP16 | Latency | 8 | 132.71 |
| A10G | FP16 | Throughput | 8 | 132.69 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

DeepSeek R1#

Supported Configurations#

The following configurations support this model:

  • 8 x H200 (1)

  • 2 nodes of [8 x H100], for 16 H100 GPUs total

Refer to the NGC catalog entry for further information.

DeepSeek R1 Distill Llama 8B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 SXM | FP8 | Throughput | 1 | 8.58 |
| H200 SXM | FP8 | Latency | 2 | 8.72 |
| H200 SXM | BF16 | Throughput | 1 | 15.05 |
| H200 SXM | BF16 | Latency | 2 | 16.12 |
| H100 SXM | FP8 | Throughput | 1 | 8.58 |
| H100 SXM | FP8 | Latency | 2 | 8.74 |
| H100 SXM | BF16 | Throughput | 1 | 15.05 |
| H100 SXM | BF16 | Latency | 2 | 16.12 |
| H100 NVL | FP8 | Throughput | 1 | 8.58 |
| H100 NVL | FP8 | Latency | 2 | 8.73 |
| H100 NVL | BF16 | Latency | 2 | 16.12 |
| H100 NVL | BF16 | Throughput | 1 | 15.0 |
| A100 SXM | BF16 | Throughput | 1 | 15.16 |
| A100 SXM | BF16 | Latency | 2 | 16.36 |
| L40S | FP8 | Throughput | 1 | 8.58 |
| L40S | FP8 | Latency | 2 | 8.71 |
| L40S | BF16 | Throughput | 1 | 15.14 |
| L40S | BF16 | Latency | 2 | 16.32 |
| A10G | BF16 | Throughput | 2 | 16.12 |
| A10G | BF16 | Latency | 4 | 18.25 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

DeepSeek R1 Distill Llama 70B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 | FP8 | Latency | 4 | 68.66 |
| H200 | FP8 | Throughput | 2 | 68.12 |
| H200 | BF16 | Latency | 8 | 146.18 |
| H200 | BF16 | Throughput | 4 | 137.77 |
| H100 | FP8 | Latency | 4 | 68.65 |
| H100 | FP8 | Throughput | 2 | 68.18 |
| H100 | FP8 | Latency | 8 | 69.6 |
| H100 | BF16 | Latency | 8 | 146.18 |
| H100 | BF16 | Throughput | 4 | 137.77 |
| A100 | BF16 | Latency | 8 | 146.19 |
| A100 | BF16 | Throughput | 4 | 137.82 |
| L40S | FP8 | Throughput | 4 | 68.57 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported Releases#

  • 1.5

DeepSeek R1 Distill Llama 8B RTX#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 5.42 |

DeepSeek-R1-Distill-Qwen-32B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 | BF16 | Throughput | 1 | 61.19 |
| H100 | BF16 | Throughput | 1 | 61.19 |
| H200 | BF16 | Throughput | 2 | 62.77 |
| H20 | BF16 | Throughput | 1 | 61.19 |
| A100 | BF16 | Throughput | 1 | 61.18 |
| L40S | BF16 | Throughput | 2 | 62.79 |
| L20 | BF16 | Throughput | 2 | 62.8 |
| L40S | FP8 | Throughput | 2 | 32.49 |
| H200 | FP8 | Throughput | 1 | 32.15 |
| H200 | FP8 | Throughput | 2 | 32.45 |
| H100 | FP8 | Throughput | 1 | 32.14 |
| H20 | FP8 | Throughput | 1 | 32.12 |
| L20 | FP8 | Throughput | 1 | 32.16 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported Releases#

  • 1.8

Additional Information#

| Organization | Catalog Page | LoRA Support | Tool Calling Support | Parallel Tool Calling Support |
|---|---|---|---|---|
| DeepSeek | DeepSeek-R1-Distill-Qwen-32B | No | No | No |

DeepSeek-R1-Distill-Qwen-7B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 | BF16 | Throughput | 1 | 21.93 |
| H200 | FP8 | Throughput | 1 | 15.85 |
| H100 | BF16 | Throughput | 1 | 21.94 |
| H100 | FP8 | Throughput | 1 | 15.84 |
| H20 | BF16 | Throughput | 1 | 22.00 |
| H20 | FP8 | Throughput | 1 | 15.83 |
| L20 | BF16 | Throughput | 1 | 21.97 |
| L20 | FP8 | Throughput | 1 | 15.84 |
| A100 | BF16 | Throughput | 1 | 21.98 |
| A10G | BF16 | Throughput | 1 | 21.97 |

Supported Releases#

  • 1.5

Additional Information#

| Organization | Catalog Page | LoRA Support | Tool Calling Support | Parallel Tool Calling Support |
|---|---|---|---|---|
| DeepSeek | DeepSeek-R1-Distill-Qwen-7B | No | No | No |

DeepSeek-R1-Distill-Qwen-14B#

Optimized configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H20 | FP8 | Throughput | 1 | 22.52 |
| H20 | BF16 | Throughput | 1 | 34.98 |
| L20 | FP8 | Throughput | 1 | 22.54 |
| L20 | BF16 | Throughput | 1 | 34.96 |
| H100 | FP8 | Throughput | 1 | 22.54 |
| H200 | FP8 | Throughput | 1 | 22.54 |
| H200 | BF16 | Throughput | 1 | 34.87 |
| L40S | FP8 | Throughput | 1 | 22.55 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported Releases#

This model is supported in the following releases:

  • 1

  • 1.0

  • 1.0.0

  • 1.0.1

  • 1.0.3

  • 1.1

  • 1.1.0

  • 1.1.1

  • 1.1.2

  • 1.2

  • 1.2.0

  • 1.2.1

  • 1.2.3

  • 1.3

  • 1.4

  • 1.5

  • 1.6

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1

Qwen2.5 72B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H20 | FP8 | Throughput | 4 | 77.71 |
| H20 | FP8 | Throughput | 8 | 77.96 |
| H20 | FP8 | Latency | 4 | 78.22 |
| H20 | FP8 | Latency | 8 | 78.98 |
| L20 | FP8 | Throughput | 4 | 78.14 |
| L20 | FP8 | Throughput | 8 | 79.15 |
| L20 | FP8 | Latency | 4 | 78.14 |
| L20 | FP8 | Latency | 8 | 78.89 |
| A100 SXM | BF16 | Throughput | 4 | 150.35 |
| A100 SXM | BF16 | Latency | 8 | 160.18 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported Releases#

This model is supported in the following releases:

  • 1

  • 1.0

  • 1.0.0

  • 1.0.1

  • 1.0.3

  • 1.1

  • 1.1.0

  • 1.1.1

  • 1.1.2

  • 1.2

  • 1.2.0

  • 1.2.1

  • 1.2.3

  • 1.3

  • 1.4

Qwen2.5 7B Instruct#

Optimized configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| L20 | FP16 | Throughput | 1 | 21.66 |
| A100 PCIe 40GB | FP16 | Latency | 1 | 21.66 |
| A100 PCIe 40GB | BF16 | Throughput | 1 | 21.66 |
| A100 PCIe 40GB | FP16 | Balanced | 1 | 21.66 |
| A100 SXM/NVLink | FP16 | Latency | 1 | 21.66 |
| A100 SXM/NVLink | BF16 | Throughput | 1 | 21.66 |
| A100 SXM/NVLink | BF16 | Balanced | 1 | 21.66 |

Generic configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported Releases#

This model is supported in the following releases:

  • 1

  • 1.0

  • 1.0.0

  • 1.0.1

  • 1.0.3

  • 1.1

  • 1.1.0

  • 1.1.1

  • 1.1.2

  • 1.2

  • 1.2.0

  • 1.2.1

  • 1.2.3

  • 1.3

  • 1.4

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16, FP16

  • # of GPUs: 1

Gemma 2 2B#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2

Gemma 2 9B#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2, or 4

(Meta) Llama 2 7B Chat#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 6.57 |
| H100 SXM | FP8 | Latency | 2 | 6.66 |
| H100 SXM | FP16 | Throughput | 1 | 12.62 |
| H100 SXM | FP16 | Throughput LoRA | 1 | 12.63 |
| H100 SXM | FP16 | Latency | 2 | 12.93 |
| A100 SXM | FP16 | Throughput | 1 | 15.54 |
| A100 SXM | FP16 | Throughput LoRA | 1 | 12.63 |
| A100 SXM | FP16 | Latency | 2 | 12.92 |
| L40S | FP8 | Throughput | 1 | 6.57 |
| L40S | FP8 | Latency | 2 | 6.64 |
| L40S | FP16 | Throughput | 1 | 12.64 |
| L40S | FP16 | Throughput LoRA | 1 | 12.65 |
| L40S | FP16 | Latency | 2 | 12.95 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

(Meta) Llama 2 13B Chat#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 2 | 12.6 |
| H100 SXM | FP16 | Throughput | 1 | 24.33 |
| H100 SXM | FP16 | Throughput LoRA | 1 | 24.35 |
| H100 SXM | FP16 | Latency | 2 | 24.71 |
| A100 SXM | FP16 | Throughput | 1 | 24.34 |
| A100 SXM | FP16 | Throughput LoRA | 1 | 24.37 |
| A100 SXM | FP16 | Latency | 2 | 24.74 |
| L40S | FP8 | Throughput | 1 | 12.49 |
| L40S | FP8 | Latency | 2 | 12.59 |
| L40S | FP16 | Throughput | 1 | 24.33 |
| L40S | FP16 | Throughput LoRA | 1 | 24.37 |
| L40S | FP16 | Latency | 2 | 24.7 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

(Meta) Llama 2 70B Chat#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 65.08 |
| H100 SXM | FP8 | Latency | 4 | 65.36 |
| H100 SXM | FP16 | Throughput | 4 | 130.52 |
| H100 SXM | FP16 | Throughput LoRA | 4 | 130.6 |
| H100 SXM | FP16 | Latency | 8 | 133.18 |
| A100 SXM | FP16 | Throughput | 4 | 130.52 |
| A100 SXM | FP16 | Throughput LoRA | 4 | 130.5 |
| A100 SXM | FP16 | Latency | 8 | 133.12 |
| L40S | FP8 | Throughput | 4 | 63.35 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3 SQLCoder 8B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 8.52 |
| H100 SXM | FP8 | Latency | 2 | 8.61 |
| H100 SXM | FP16 | Throughput | 1 | 15 |
| H100 SXM | FP16 | Latency | 2 | 16.02 |
| L40S | FP8 | Throughput | 1 | 8.53 |
| L40S | FP8 | Latency | 2 | 8.61 |
| L40S | FP16 | Throughput | 1 | 15 |
| L40S | FP16 | Latency | 2 | 16.02 |
| A10G | FP16 | Throughput | 1 | 15 |
| A10G | FP16 | Throughput | 2 | 16.02 |
| A10G | FP16 | Latency | 2 | 16.02 |
| A10G | FP16 | Latency | 4 | 18.06 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3 Swallow 70B Instruct V0.1#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.42 |
| H100 SXM | FP8 | Latency | 4 | 69.3 |
| H100 SXM | FP16 | Throughput | 2 | 137.7 |
| H100 SXM | FP16 | Latency | 4 | 145.94 |
| A100 SXM | FP16 | Throughput | 2 | 137.7 |
| A100 SXM | FP16 | Latency | 2 | 137.7 |
| L40S | FP8 | Throughput | 2 | 68.48 |
| A10G | FP16 | Throughput | 4 | 145.93 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3 Taiwan 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.42 |
| H100 SXM | FP8 | Latency | 4 | 145.94 |
| H100 SXM | FP16 | Throughput | 2 | 137.7 |
| H100 SXM | FP16 | Latency | 4 | 137.7 |
| A100 SXM | FP16 | Throughput | 2 | 137.7 |
| A100 SXM | FP16 | Latency | 2 | 145.94 |
| L40S | FP8 | Throughput | 2 | 68.48 |
| A10G | FP16 | Throughput | 4 | 145.93 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3.1 8B Base#

Optimized Configurations#

The Profile column indicates what the model is optimized for.

| GPU | Precision | Profile | # of GPUs |
|---|---|---|---|
| H100 SXM | BF16 | Latency | 2 |
| H100 SXM | FP8 | Latency | 2 |
| H100 SXM | BF16 | Throughput | 1 |
| H100 SXM | FP8 | Throughput | 1 |
| A100 SXM | BF16 | Latency | 2 |
| A100 SXM | BF16 | Throughput | 1 |
| L40S | BF16 | Latency | 2 |
| L40S | BF16 | Throughput | 2 |
| A10G | BF16 | Latency | 4 |
| A10G | BF16 | Throughput | 2 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 15 |

Llama 3.1 8B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for.

| GPU | Profile | # of GPUs |
|---|---|---|
| H100 SXM | Throughput | 1 |
| H100 SXM | Latency | 2 |
| H100 NVL | Throughput | 1 |
| H100 NVL | Latency | 2 |
| A100 SXM | Throughput | 1 |
| A100 SXM | Latency | 2 |
| L40S | Throughput | 2 |
| L40S | Latency | 4 |
| A10G | Throughput | 2 |
| A10G | Latency | 4 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 15 |

Llama 3.1 8B Instruct RTX#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 5.41 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 5.42 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 15 |

Llama 3.2 1B Instruct#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: One H100, A100, L40S, or A10G

Llama 3.2 3B Instruct#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: One H100, A100, or L40S

Llama 3.1 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 SXM | FP8 | Throughput | 1 | 67.87 |
| H200 SXM | FP8 | Latency | 2 | 68.2 |
| H200 SXM | BF16 | Throughput | 2 | 133.72 |
| H200 SXM | BF16 | Latency | 4 | 137.99 |
| H100 SXM | FP8 | Throughput | 2 | 68.2 |
| H100 SXM | FP8 | Throughput | 4 | 68.72 |
| H100 SXM | FP8 | Latency | 8 | 69.71 |
| H100 SXM | BF16 | Throughput | 4 | 138.39 |
| H100 SXM | BF16 | Latency | 8 | 147.66 |
| H100 NVL | FP8 | Throughput | 2 | 68.2 |
| H100 NVL | FP8 | Latency | 4 | 68.72 |
| H100 NVL | BF16 | Throughput | 2 | 133.95 |
| H100 NVL | BF16 | Throughput | 4 | 138.4 |
| H100 NVL | BF16 | Latency | 8 | 147.37 |
| A100 SXM | BF16 | Throughput | 4 | 138.53 |
| A100 SXM | BF16 | Latency | 8 | 147.44 |
| L40S | BF16 | Throughput | 4 | 138.49 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3.1 405B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 8 | 388.75 |
| H100 SXM | FP16 | Latency | 16 | 794.9 |
| A100 SXM | FP16 | Latency | 16 | 798.2 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 240 | FP16 | 100 |

Llama 3.1 Nemotron 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.18 |
| H100 SXM | FP8 | Throughput | 4 | 68.64 |
| H100 SXM | FP8 | Latency | 8 | 69.77 |
| H100 SXM | FP16 | Throughput | 4 | 137.94 |
| H100 SXM | FP16 | Latency | 8 | 146.41 |
| A100 SXM | FP16 | Throughput | 4 | 137.93 |
| A100 SXM | FP16 | Latency | 8 | 146.41 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Llama 3.1 Swallow 8B Instruct v0.1#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 1, 2, 4

Llama 3.1 Swallow 70B Instruct v0.1#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 2, 4, 8

Llama 3.3 70B Instruct#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Supported TRT-LLM Buildable Profiles#

  • Precision: BF16

  • # of GPUs: 4, 8

Meta Llama 3 8B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Throughput | 1 | 28 |
| H100 SXM | FP16 | Latency | 2 | 28 |
| A100 SXM | FP16 | Throughput | 1 | 28 |
| A100 SXM | FP16 | Latency | 2 | 28 |
| L40S | FP8 | Throughput | 1 | 20.5 |
| L40S | FP8 | Latency | 2 | 20.5 |
| L40S | FP16 | Throughput | 1 | 28 |
| A10G | FP16 | Throughput | 1 | 28 |
| A10G | FP16 | Latency | 2 | 28 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 16 |

Llama 3.3 Nemotron Super 49B V1#

| GPU | Precision | # of GPUs |
|---|---|---|
| H200 SXM | BF16 | 2 |
| H100 SXM | BF16 | 2 |
| H100 NVL | BF16 | 2 |
| A100 SXM | BF16 | 2 |
| L40S | BF16 | 4 |
| A10G | BF16 | 8 |

Meta Llama 3 70B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 4 | 82 |
| H100 SXM | FP8 | Latency | 8 | 82 |
| H100 SXM | FP16 | Throughput | 4 | 158 |
| H100 SXM | FP16 | Latency | 8 | 158 |
| A100 SXM | FP16 | Throughput | 4 | 158 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 240 | FP16 | 100 |

Mistral 7B Instruct V0.3#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 7.08 |
| H100 SXM | FP8 | Latency | 2 | 7.19 |
| H100 SXM | BF16 | Throughput | 1 | 13.56 |
| H100 SXM | BF16 | Latency | 2 | 7.19 |
| A100 SXM | BF16 | Throughput | 1 | 13.56 |
| A100 SXM | BF16 | Latency | 2 | 13.87 |
| L40S | FP8 | Throughput | 1 | 7.08 |
| L40S | FP8 | Latency | 2 | 7.16 |
| L40S | BF16 | Throughput | 1 | 13.55 |
| L40S | BF16 | Latency | 2 | 13.85 |
| A10G | BF16 | Throughput | 2 | 13.87 |
| A10G | BF16 | Latency | 4 | 14.48 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 16 |

Mistral NeMo Minitron 8B 8K Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 8.91 |
| H100 SXM | FP8 | Latency | 2 | 9.03 |
| H100 SXM | FP16 | Throughput | 1 | 15.72 |
| H100 SXM | FP16 | Latency | 2 | 16.78 |
| A100 SXM | FP16 | Throughput | 1 | 15.72 |
| A100 SXM | FP16 | Latency | 2 | 16.78 |
| L40S | FP8 | Throughput | 1 | 8.92 |
| L40S | FP8 | Latency | 2 | 9.02 |
| L40S | FP16 | Throughput | 1 | 15.72 |
| L40S | FP16 | Latency | 2 | 16.77 |
| A10G | FP16 | Throughput | 2 | 16.81 |
| A10G | FP16 | Latency | 4 | 15.72 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Mistral NeMo 12B Instruct RTX#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 31 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Mistral NeMo 12B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 2 | 13.82 |
| H100 SXM | FP16 | Throughput | 1 | 23.35 |
| H100 SXM | FP16 | Latency | 2 | 25.14 |
| A100 SXM | FP16 | Throughput | 1 | 23.35 |
| A100 SXM | FP16 | Latency | 2 | 25.14 |
| L40S | FP8 | Throughput | 2 | 13.83 |
| L40S | FP8 | Latency | 4 | 15.01 |
| L40S | FP16 | Throughput | 2 | 25.14 |
| L40S | FP16 | Latency | 4 | 28.71 |
| A10G | FP16 | Throughput | 4 | 28.71 |
| A10G | FP16 | Latency | 8 | 35.87 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, though this is not guaranteed. GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.

Mixtral 8x7B Instruct V0.1#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 4 | 100 |
| H100 SXM | INT8WO | Throughput | 2 | 100 |
| H100 SXM | INT8WO | Latency | 4 | 100 |
| H100 SXM | FP16 | Throughput | 2 | 100 |
| H100 SXM | FP16 | Latency | 4 | 100 |
| A100 SXM | FP16 | Throughput | 2 | 100 |
| A100 SXM | FP16 | Latency | 4 | 100 |
| L40S | FP8 | Throughput | 4 | 100 |
| L40S | FP16 | Throughput | 4 | 100 |
| A10G | FP16 | Throughput | 8 | 100 |

Generic Configuration#

The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU (or multiple homogeneous NVIDIA GPUs) with sufficient aggregate memory, compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU with 95% or more free memory; not guaranteed | 24 | FP16 | 16 |

Mixtral 8x22B Instruct V0.1#

Optimized Configurations#

Profile is for what the model is optimized; the Disk Space is for both the container and the model and the values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 8 | 132.61 |
| H100 SXM | FP8 | Latency | 8 | 132.56 |
| H100 SXM | INT8WO | Throughput | 8 | 134.82 |
| H100 SXM | INT8WO | Latency | 8 | 132.31 |
| H100 SXM | FP16 | Throughput | 8 | 265.59 |
| A100 SXM | FP16 | Throughput | 8 | 265.7 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

StarCoder2 7B#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 | BF16 | Throughput | 1 | 13.89 |
| H100 | BF16 | Latency | 2 | 14.44 |
| H100 | FP8 | Throughput | 1 | 7.56 |
| H100 | FP8 | Latency | 2 | 7.41 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Nemotron 4 340B Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Latency | 16 | 636.45 |
| A100 SXM | FP16 | Latency | 16 | 636.45 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Nemotron 4 340B Instruct 128K#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | BF16 | Latency | 16 | 637.26 |
| A100 SXM | BF16 | Latency | 16 | 637.22 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Nemotron 4 340B Reward#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Latency | 16 | 636.45 |
| A100 SXM | FP16 | Latency | 16 | 636.45 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Phi 3 Mini 4K Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 3.8 |
| H100 SXM | FP16 | Throughput | 1 | 7.14 |
| A100 SXM | FP16 | Throughput | 1 | 7.14 |
| L40S | FP8 | Throughput | 1 | 3.8 |
| L40S | FP16 | Throughput | 1 | 7.14 |
| A10G | FP16 | Throughput | 1 | 7.14 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Phind Codellama 34B V2 Instruct#

Optimized Configurations#

The Profile column indicates what the model is optimized for. Disk Space covers both the container and the model; values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 32.17 |
| H100 SXM | FP8 | Latency | 4 | 32.41 |
| H100 SXM | FP16 | Throughput | 2 | 63.48 |
| H100 SXM | FP16 | Latency | 4 | 64.59 |
| A100 SXM | FP16 | Throughput | 2 | 63.48 |
| A100 SXM | FP16 | Latency | 4 | 64.59 |
| L40S | FP8 | Throughput | 4 | 32.43 |
| L40S | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Latency | 8 | 66.8 |

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

StarCoderBase 15.5B#

Generic Configuration#

Any NVIDIA GPU with sufficient GPU memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.

Supported TRT-LLM buildable profiles#

  • Precision: FP32

  • # of GPUs: 2, 4, 8
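Given the supported GPU counts above, a deployment script might select the largest buildable configuration that fits the available hardware. The sketch below is illustrative only; `pick_gpu_count` is a hypothetical helper, not a NIM command or API.

```python
# Hypothetical helper: choose the largest supported GPU count from the
# buildable-profile list above (2, 4, 8) that does not exceed what is available.

SUPPORTED_GPU_COUNTS = (2, 4, 8)  # from "Supported TRT-LLM buildable profiles"

def pick_gpu_count(available, supported=SUPPORTED_GPU_COUNTS):
    """Return the largest supported GPU count <= available, or raise."""
    candidates = [n for n in supported if n <= available]
    if not candidates:
        raise ValueError(f"need at least {min(supported)} GPUs, have {available}")
    return max(candidates)

print(pick_gpu_count(8))  # 8
print(pick_gpu_count(6))  # 4 (6 is not a supported count; fall back to 4)
```

Choosing the largest fitting count maximizes aggregate GPU memory for the build; a different policy (for example, the smallest count that satisfies the model's memory requirement) may be preferable when GPUs are shared.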