Supported Models#
GPUs#
The GPUs listed in the following sections have the following specifications.

| GPU | Family | Memory |
|---|---|---|
| H200 | SXM/NVLink | 141 GB |
| H100 | SXM/NVLink | 80 GB |
| A100 | SXM/NVLink | 80 GB |
| L40S | PCIe | 48 GB |
| A10G | PCIe | 24 GB |
| NVIDIA RTX 6000 Ada Generation | | 32 GB |
| GeForce RTX 5090 | | 32 GB |
| GeForce RTX 5080 | | 16 GB |
| GeForce RTX 4090 | | 24 GB |
| GeForce RTX 4080 | | 16 GB |
Optimized Models#
The following models are optimized with TRT-LLM and are available as pre-built, optimized engines on NGC; they should be used with the Chat Completions Endpoint. In vGPU environments, the GPU memory values in the following sections refer to the total GPU memory, including the memory reserved for the vGPU setup.
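As a minimal illustration of the Chat Completions Endpoint mentioned above, the request body follows the familiar OpenAI-compatible shape. The endpoint URL and model name below are illustrative assumptions (a typical local NIM deployment), not values fixed by this page:

```python
import json

# Hypothetical local endpoint and model name, for illustration only.
endpoint = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "meta/codellama-13b-instruct",  # assumed model id
    "messages": [
        {"role": "user", "content": "Write a function that sorts a list."}
    ],
    "max_tokens": 256,
}

# Serialized request body to POST to the endpoint.
body = json.dumps(payload)
```

The same body works for any of the optimized models in this section; only the `model` field changes.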
NVIDIA also provides generic model profiles that operate with any NVIDIA GPU (or set of GPUs) with sufficient memory capacity. Generic model profiles can be identified by the presence of `local_build` or `vllm` in the profile name. On systems with no compatible optimized profile, a generic profile is chosen automatically. Optimized profiles are preferred over generic profiles when available, but you can deploy a generic profile on any system by following the steps at Profile Selection. Additional information about the features these models support, such as LoRA, is available in Models.
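The naming rule above (generic profiles contain `local_build` or `vllm`) can be sketched as a simple substring check; the profile names below are hypothetical, for illustration only:

```python
def is_generic_profile(profile_name: str) -> bool:
    """A generic (non-optimized) profile is identified by the presence of
    'local_build' or 'vllm' in its name; anything else is treated here as
    an optimized, pre-built TRT-LLM profile."""
    return "local_build" in profile_name or "vllm" in profile_name

# Hypothetical profile names, for illustration only:
profiles = [
    "tensorrt_llm-h100-fp8-tp2-throughput",
    "vllm-bf16-tp1",
    "tensorrt_llm-trtllm_buildable-local_build-bf16-tp1",
]
generic = [p for p in profiles if is_generic_profile(p)]
```

A deployment script could use such a filter to prefer optimized profiles and fall back to generic ones only when no optimized profile is compatible.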
Code Llama 13B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Throughput | 2 | 24.63 |
| H100 SXM | FP16 | Latency | 4 | 25.32 |
| A100 SXM | FP16 | Throughput | 2 | 24.63 |
| A100 SXM | FP16 | Latency | 4 | 25.31 |
| L40S | FP16 | Throughput | 2 | 25.32 |
| L40S | FP16 | Latency | 2 | 24.63 |
| A10G | FP16 | Throughput | 4 | 25.32 |
| A10G | FP16 | Latency | 8 | 26.69 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
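The generic requirement above can be sketched as a small check. This is a simplification: it checks one GPU at a time and ignores aggregate-memory pooling across multiple GPUs:

```python
def meets_generic_requirements(compute_capability: float,
                               free_fraction: float,
                               dtype: str = "float16") -> bool:
    """Check the generic-profile GPU requirements described above:
    compute capability >= 7.0 (8.0 for bfloat16), and at least 95%
    free memory on the GPU being considered."""
    min_cc = 8.0 if dtype == "bfloat16" else 7.0
    return compute_capability >= min_cc and free_fraction >= 0.95
```

For example, a compute-capability 7.0 GPU passes for FP16 but not for bfloat16, which requires 8.0 or higher.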
Code Llama 34B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 32.17 |
| H100 SXM | FP8 | Latency | 4 | 32.42 |
| H100 SXM | FP16 | Throughput | 2 | 63.48 |
| H100 SXM | FP16 | Latency | 4 | 64.59 |
| A100 SXM | FP16 | Throughput | 2 | 63.48 |
| A100 SXM | FP16 | Latency | 4 | 64.59 |
| L40S | FP8 | Throughput | 4 | 32.42 |
| L40S | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Latency | 8 | 66.8 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Code Llama 70B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 4 | 65.47 |
| H100 SXM | FP8 | Latency | 8 | 66.37 |
| H100 SXM | FP16 | Throughput | 4 | 130.35 |
| H100 SXM | FP16 | Latency | 8 | 66.37 |
| A100 SXM | FP16 | Throughput | 4 | 130.35 |
| A100 SXM | FP16 | Latency | 8 | 132.71 |
| A10G | FP16 | Throughput | 8 | 132.69 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
DeepSeek R1#
Supported Configurations#
The following configurations support this model:
8 x H200 (1)
2 nodes of [8 x H100], for 16 H100 GPUs in total
Refer to the NGC catalog entry for further information.
DeepSeek R1 Distill Llama 8B#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 SXM | FP8 | Throughput | 1 | 8.58 |
| H200 SXM | FP8 | Latency | 2 | 8.72 |
| H200 SXM | BF16 | Throughput | 1 | 15.05 |
| H200 SXM | BF16 | Latency | 2 | 16.12 |
| H100 SXM | FP8 | Throughput | 1 | 8.58 |
| H100 SXM | FP8 | Latency | 2 | 8.74 |
| H100 SXM | BF16 | Throughput | 1 | 15.05 |
| H100 SXM | BF16 | Latency | 2 | 16.12 |
| H100 NVL | FP8 | Throughput | 1 | 8.58 |
| H100 NVL | FP8 | Latency | 2 | 8.73 |
| H100 NVL | BF16 | Latency | 2 | 16.12 |
| H100 NVL | BF16 | Throughput | 1 | 15.0 |
| A100 SXM | BF16 | Throughput | 1 | 15.16 |
| A100 SXM | BF16 | Latency | 2 | 16.36 |
| L40S | FP8 | Throughput | 1 | 8.58 |
| L40S | FP8 | Latency | 2 | 8.71 |
| L40S | BF16 | Throughput | 1 | 15.14 |
| L40S | BF16 | Latency | 2 | 16.32 |
| A10G | BF16 | Throughput | 2 | 16.12 |
| A10G | BF16 | Latency | 4 | 18.25 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
DeepSeek R1 Distill Llama 70B#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 | FP8 | Latency | 4 | 68.66 |
| H200 | FP8 | Throughput | 2 | 68.12 |
| H200 | BF16 | Latency | 8 | 146.18 |
| H200 | BF16 | Throughput | 4 | 137.77 |
| H100 | FP8 | Latency | 4 | 68.65 |
| H100 | FP8 | Throughput | 2 | 68.18 |
| H100 | FP8 | Latency | 8 | 69.6 |
| H100 | BF16 | Latency | 8 | 146.18 |
| H100 | BF16 | Throughput | 4 | 137.77 |
| A100 | BF16 | Latency | 8 | 146.19 |
| A100 | BF16 | Throughput | 4 | 137.82 |
| L40S | FP8 | Throughput | 4 | 68.57 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported Releases#
1.5
DeepSeek R1 Distill Llama 8B RTX#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 5.42 |
DeepSeek-R1-Distill-Qwen-32B#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 | BF16 | Throughput | 1 | 61.19 |
| H100 | BF16 | Throughput | 1 | 61.19 |
| H200 | BF16 | Throughput | 2 | 62.77 |
| H20 | BF16 | Throughput | 1 | 61.19 |
| A100 | BF16 | Throughput | 1 | 61.18 |
| L40S | BF16 | Throughput | 2 | 62.79 |
| L20 | BF16 | Throughput | 2 | 62.8 |
| L40S | FP8 | Throughput | 2 | 32.49 |
| H200 | FP8 | Throughput | 1 | 32.15 |
| H200 | FP8 | Throughput | 2 | 32.45 |
| H100 | FP8 | Throughput | 1 | 32.14 |
| H20 | FP8 | Throughput | 1 | 32.12 |
| L20 | FP8 | Throughput | 1 | 32.16 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported Releases#
1.8
Additional Information#
| Organization | Catalog Page | LoRA Support | Tool Calling Support | Parallel Tool Calling Support |
|---|---|---|---|---|
| DeepSeek | | No | No | No |
DeepSeek-R1-Distill-Qwen-7B#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 | BF16 | Throughput | 1 | 21.93 |
| H200 | FP8 | Throughput | 1 | 15.85 |
| H100 | BF16 | Throughput | 1 | 21.94 |
| H100 | FP8 | Throughput | 1 | 15.84 |
| H20 | BF16 | Throughput | 1 | 22.00 |
| H20 | FP8 | Throughput | 1 | 15.83 |
| L20 | BF16 | Throughput | 1 | 21.97 |
| L20 | FP8 | Throughput | 1 | 15.84 |
| A100 | BF16 | Throughput | 1 | 21.98 |
| A10G | BF16 | Throughput | 1 | 21.97 |
Supported Releases#
1.5
Additional Information#
| Organization | Catalog Page | LoRA Support | Tool Calling Support | Parallel Tool Calling Support |
|---|---|---|---|---|
| DeepSeek | | No | No | No |
DeepSeek-R1-Distill-Qwen-14B#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H20 | FP8 | Throughput | 1 | 22.52 |
| H20 | BF16 | Throughput | 1 | 34.98 |
| L20 | FP8 | Throughput | 1 | 22.54 |
| L20 | BF16 | Throughput | 1 | 34.96 |
| H100 | FP8 | Throughput | 1 | 22.54 |
| H200 | FP8 | Throughput | 1 | 22.54 |
| H200 | BF16 | Throughput | 1 | 34.87 |
| L40S | FP8 | Throughput | 1 | 22.55 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported Releases#
This model is supported in the following releases.
| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 | 1.5 | 1.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | | | | | | | | | | | | | | | | |
Supported TRT-LLM Buildable Profiles#
Precision: BF16
# of GPUs: 1
Qwen2.5 72B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H20 | FP8 | Throughput | 4 | 77.71 |
| H20 | FP8 | Throughput | 8 | 77.96 |
| H20 | FP8 | Latency | 4 | 78.22 |
| H20 | FP8 | Latency | 8 | 78.98 |
| L20 | FP8 | Throughput | 4 | 78.14 |
| L20 | FP8 | Throughput | 8 | 79.15 |
| L20 | FP8 | Latency | 4 | 78.14 |
| L20 | FP8 | Latency | 8 | 78.89 |
| A100 SXM | BF16 | Throughput | 4 | 150.35 |
| A100 SXM | BF16 | Latency | 8 | 160.18 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported Releases#
This model is supported in the following releases.
| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | | | | | | | | | | | | | | |
Qwen2.5 7B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| L20 | FP16 | Throughput | 1 | 21.66 |
| A100 PCIe 40GB | FP16 | Latency | 1 | 21.66 |
| A100 PCIe 40GB | BF16 | Throughput | 1 | 21.66 |
| A100 PCIe 40GB | FP16 | Balanced | 1 | 21.66 |
| A100 SXM/NVLink | FP16 | Latency | 1 | 21.66 |
| A100 SXM/NVLink | BF16 | Throughput | 1 | 21.66 |
| A100 SXM/NVLink | BF16 | Balanced | 1 | 21.66 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported Releases#
This model is supported in the following releases.
| 1 | 1.0 | 1.0.0 | 1.0.1 | 1.0.3 | 1.1 | 1.1.0 | 1.1.1 | 1.1.2 | 1.2 | 1.2.0 | 1.2.1 | 1.2.3 | 1.3 | 1.4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x | | | | | | | | | | | | | | |
Supported TRT-LLM Buildable Profiles#
Precision: BF16, FP16
# of GPUs: 1
Gemma 2 2B#
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported TRT-LLM Buildable Profiles#
Precision: BF16
# of GPUs: 1, 2
Gemma 2 9B#
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported TRT-LLM Buildable Profiles#
Precision: BF16
# of GPUs: 1, 2, or 4
(Meta) Llama 2 7B Chat#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 6.57 |
| H100 SXM | FP8 | Latency | 2 | 6.66 |
| H100 SXM | FP16 | Throughput | 1 | 12.62 |
| H100 SXM | FP16 | Throughput LoRA | 1 | 12.63 |
| H100 SXM | FP16 | Latency | 2 | 12.93 |
| A100 SXM | FP16 | Throughput | 1 | 15.54 |
| A100 SXM | FP16 | Throughput LoRA | 1 | 12.63 |
| A100 SXM | FP16 | Latency | 2 | 12.92 |
| L40S | FP8 | Throughput | 1 | 6.57 |
| L40S | FP8 | Latency | 2 | 6.64 |
| L40S | FP16 | Throughput | 1 | 12.64 |
| L40S | FP16 | Throughput LoRA | 1 | 12.65 |
| L40S | FP16 | Latency | 2 | 12.95 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
(Meta) Llama 2 13B Chat#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 2 | 12.6 |
| H100 SXM | FP16 | Throughput | 1 | 24.33 |
| H100 SXM | FP16 | Throughput LoRA | 1 | 24.35 |
| H100 SXM | FP16 | Latency | 2 | 24.71 |
| A100 SXM | FP16 | Throughput | 1 | 24.34 |
| A100 SXM | FP16 | Throughput LoRA | 1 | 24.37 |
| A100 SXM | FP16 | Latency | 2 | 24.74 |
| L40S | FP8 | Throughput | 1 | 12.49 |
| L40S | FP8 | Latency | 2 | 12.59 |
| L40S | FP16 | Throughput | 1 | 24.33 |
| L40S | FP16 | Throughput LoRA | 1 | 24.37 |
| L40S | FP16 | Latency | 2 | 24.7 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
(Meta) Llama 2 70B Chat#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 65.08 |
| H100 SXM | FP8 | Latency | 4 | 65.36 |
| H100 SXM | FP16 | Throughput | 4 | 130.52 |
| H100 SXM | FP16 | Throughput LoRA | 4 | 130.6 |
| H100 SXM | FP16 | Latency | 8 | 133.18 |
| A100 SXM | FP16 | Throughput | 4 | 130.52 |
| A100 SXM | FP16 | Throughput LoRA | 4 | 130.5 |
| A100 SXM | FP16 | Latency | 8 | 133.12 |
| L40S | FP8 | Throughput | 4 | 63.35 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Llama 3 SQLCoder 8B#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 8.52 |
| H100 SXM | FP8 | Latency | 2 | 8.61 |
| H100 SXM | FP16 | Throughput | 1 | 15 |
| H100 SXM | FP16 | Latency | 2 | 16.02 |
| L40S | FP8 | Throughput | 1 | 8.53 |
| L40S | FP8 | Latency | 2 | 8.61 |
| L40S | FP16 | Throughput | 1 | 15 |
| L40S | FP16 | Latency | 2 | 16.02 |
| A10G | FP16 | Throughput | 1 | 15 |
| A10G | FP16 | Throughput | 2 | 16.02 |
| A10G | FP16 | Latency | 2 | 16.02 |
| A10G | FP16 | Latency | 4 | 18.06 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Llama 3 Swallow 70B Instruct V0.1#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.42 |
| H100 SXM | FP8 | Latency | 4 | 69.3 |
| H100 SXM | FP16 | Throughput | 2 | 137.7 |
| H100 SXM | FP16 | Latency | 4 | 145.94 |
| A100 SXM | FP16 | Throughput | 2 | 137.7 |
| A100 SXM | FP16 | Latency | 2 | 137.7 |
| L40S | FP8 | Throughput | 2 | 68.48 |
| A10G | FP16 | Throughput | 4 | 145.93 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Llama 3 Taiwan 70B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.42 |
| H100 SXM | FP8 | Latency | 4 | 145.94 |
| H100 SXM | FP16 | Throughput | 2 | 137.7 |
| H100 SXM | FP16 | Latency | 4 | 137.7 |
| A100 SXM | FP16 | Throughput | 2 | 137.7 |
| A100 SXM | FP16 | Latency | 2 | 145.94 |
| L40S | FP8 | Throughput | 2 | 68.48 |
| A10G | FP16 | Throughput | 4 | 145.93 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Llama 3.1 8B Base#
Optimized Configurations#
The Profile column indicates what the model is optimized for.

| GPU | Precision | Profile | # of GPUs |
|---|---|---|---|
| H100 SXM | BF16 | Latency | 2 |
| H100 SXM | FP8 | Latency | 2 |
| H100 SXM | BF16 | Throughput | 1 |
| H100 SXM | FP8 | Throughput | 1 |
| A100 SXM | BF16 | Latency | 2 |
| A100 SXM | BF16 | Throughput | 1 |
| L40S | BF16 | Latency | 2 |
| L40S | BF16 | Throughput | 2 |
| A10G | BF16 | Latency | 4 |
| A10G | BF16 | Throughput | 2 |
Generic Configuration#
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model (not guaranteed), with compute capability >= 7.0 (8.0 for bfloat16) and at least one GPU with 95% or more free memory. | 24 | FP16 | 15 |
Llama 3.1 8B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for.

| GPU | Profile | # of GPUs |
|---|---|---|
| H100 SXM | Throughput | 1 |
| H100 SXM | Latency | 2 |
| H100 NVL | Throughput | 1 |
| H100 NVL | Latency | 2 |
| A100 SXM | Throughput | 1 |
| A100 SXM | Latency | 2 |
| L40S | Throughput | 2 |
| L40S | Latency | 4 |
| A10G | Throughput | 2 |
| A10G | Latency | 4 |
Generic Configuration#
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model (not guaranteed), with compute capability >= 7.0 (8.0 for bfloat16) and at least one GPU with 95% or more free memory. | 24 | FP16 | 15 |
Llama 3.1 8B Instruct RTX#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 5.41 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 5.42 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 5.42 |
Generic Configuration#
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model (not guaranteed), with compute capability >= 7.0 (8.0 for bfloat16) and at least one GPU with 95% or more free memory. | 24 | FP16 | 15 |
Llama 3.2 1B Instruct#
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported TRT-LLM Buildable Profiles#
Precision: BF16
# of GPUs: One H100, A100, L40S, or A10G
Llama 3.2 3B Instruct#
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported TRT-LLM Buildable Profiles#
Precision: BF16
# of GPUs: One H100, A100, or L40S
Llama 3.1 70B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H200 SXM | FP8 | Throughput | 1 | 67.87 |
| H200 SXM | FP8 | Latency | 2 | 68.2 |
| H200 SXM | BF16 | Throughput | 2 | 133.72 |
| H200 SXM | BF16 | Latency | 4 | 137.99 |
| H100 SXM | FP8 | Throughput | 2 | 68.2 |
| H100 SXM | FP8 | Throughput | 4 | 68.72 |
| H100 SXM | FP8 | Latency | 8 | 69.71 |
| H100 SXM | BF16 | Throughput | 4 | 138.39 |
| H100 SXM | BF16 | Latency | 8 | 147.66 |
| H100 NVL | FP8 | Throughput | 2 | 68.2 |
| H100 NVL | FP8 | Latency | 4 | 68.72 |
| H100 NVL | BF16 | Throughput | 2 | 133.95 |
| H100 NVL | BF16 | Throughput | 4 | 138.4 |
| H100 NVL | BF16 | Latency | 8 | 147.37 |
| A100 SXM | BF16 | Throughput | 4 | 138.53 |
| A100 SXM | BF16 | Latency | 8 | 147.44 |
| L40S | BF16 | Throughput | 4 | 138.49 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Llama 3.1 405B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 8 | 388.75 |
| H100 SXM | FP16 | Latency | 16 | 794.9 |
| A100 SXM | FP16 | Latency | 16 | 798.2 |
Generic Configuration#
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model (not guaranteed), with compute capability >= 7.0 (8.0 for bfloat16) and at least one GPU with 95% or more free memory. | 240 | FP16 | 100 |
Llama 3.1 Nemotron 70B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 68.18 |
| H100 SXM | FP8 | Throughput | 4 | 68.64 |
| H100 SXM | FP8 | Latency | 8 | 69.77 |
| H100 SXM | FP16 | Throughput | 4 | 137.94 |
| H100 SXM | FP16 | Latency | 8 | 146.41 |
| A100 SXM | FP16 | Throughput | 4 | 137.93 |
| A100 SXM | FP16 | Latency | 8 | 146.41 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Llama 3.1 Swallow 8B Instruct v0.1#
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported TRT-LLM Buildable Profiles#
Precision: BF16
# of GPUs: 1, 2, 4
Llama 3.1 Swallow 70B Instruct v0.1#
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported TRT-LLM Buildable Profiles#
Precision: BF16
# of GPUs: 2, 4, 8
Llama 3.3 70B Instruct#
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Supported TRT-LLM Buildable Profiles#
Precision: BF16
# of GPUs: 4, 8
Meta Llama 3 8B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Throughput | 1 | 28 |
| H100 SXM | FP16 | Latency | 2 | 28 |
| A100 SXM | FP16 | Throughput | 1 | 28 |
| A100 SXM | FP16 | Latency | 2 | 28 |
| L40S | FP8 | Throughput | 1 | 20.5 |
| L40S | FP8 | Latency | 2 | 20.5 |
| L40S | FP16 | Throughput | 1 | 28 |
| A10G | FP16 | Throughput | 1 | 28 |
| A10G | FP16 | Latency | 2 | 28 |
Generic Configuration#
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model (not guaranteed), with compute capability >= 7.0 (8.0 for bfloat16) and at least one GPU with 95% or more free memory. | 24 | FP16 | 16 |
Llama 3.3 Nemotron Super 49B V1#
| GPU | Precision | # of GPUs |
|---|---|---|
| H200 SXM | BF16 | 2 |
| H100 SXM | BF16 | 2 |
| H100 NVL | BF16 | 2 |
| A100 SXM | BF16 | 2 |
| L40S | BF16 | 4 |
| A10G | BF16 | 8 |
Meta Llama 3 70B Instruct#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 4 | 82 |
| H100 SXM | FP8 | Latency | 8 | 82 |
| H100 SXM | FP16 | Throughput | 4 | 158 |
| H100 SXM | FP16 | Latency | 8 | 158 |
| A100 SXM | FP16 | Throughput | 4 | 158 |
Generic Configuration#
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model (not guaranteed), with compute capability >= 7.0 (8.0 for bfloat16) and at least one GPU with 95% or more free memory. | 240 | FP16 | 100 |
Mistral 7B Instruct V0.3#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 7.08 |
| H100 SXM | FP8 | Latency | 2 | 7.19 |
| H100 SXM | BF16 | Throughput | 1 | 13.56 |
| H100 SXM | BF16 | Latency | 2 | 7.19 |
| A100 SXM | BF16 | Throughput | 1 | 13.56 |
| A100 SXM | BF16 | Latency | 2 | 13.87 |
| L40S | FP8 | Throughput | 1 | 7.08 |
| L40S | FP8 | Latency | 2 | 7.16 |
| L40S | BF16 | Throughput | 1 | 13.55 |
| L40S | BF16 | Latency | 2 | 13.85 |
| A10G | BF16 | Throughput | 2 | 13.87 |
| A10G | BF16 | Latency | 4 | 14.48 |
Generic Configuration#
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model (not guaranteed), with compute capability >= 7.0 (8.0 for bfloat16) and at least one GPU with 95% or more free memory. | 24 | FP16 | 16 |
Mistral NeMo Minitron 8B 8K Instruct#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 8.91 |
| H100 SXM | FP8 | Latency | 2 | 9.03 |
| H100 SXM | FP16 | Throughput | 1 | 15.72 |
| H100 SXM | FP16 | Latency | 2 | 16.78 |
| A100 SXM | FP16 | Throughput | 1 | 15.72 |
| A100 SXM | FP16 | Latency | 2 | 16.78 |
| L40S | FP8 | Throughput | 1 | 8.92 |
| L40S | FP8 | Latency | 2 | 9.02 |
| L40S | FP16 | Throughput | 1 | 15.72 |
| L40S | FP16 | Latency | 2 | 16.77 |
| A10G | FP16 | Throughput | 2 | 16.81 |
| A10G | FP16 | Latency | 4 | 15.72 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Mistral NeMo 12B Instruct RTX#
Optimized Configurations#
The Profile column indicates what the model is optimized for; the Disk Space values, in GB, cover both the container and the model.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| NVIDIA RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 31 |
| GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 31 |
Generic Configuration#
Any single NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or more of its memory free.
Mistral NeMo 12B Instruct#
Optimized Configurations#
Profile indicates what the model is optimized for; Disk Space covers both the container and the model, and all values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 2 | 13.82 |
| H100 SXM | FP16 | Throughput | 1 | 23.35 |
| H100 SXM | FP16 | Latency | 2 | 25.14 |
| A100 SXM | FP16 | Throughput | 1 | 23.35 |
| A100 SXM | FP16 | Latency | 2 | 25.14 |
| L40S | FP8 | Throughput | 2 | 13.83 |
| L40S | FP8 | Latency | 4 | 15.01 |
| L40S | FP16 | Throughput | 2 | 25.14 |
| L40S | FP16 | Latency | 4 | 28.71 |
| A10G | FP16 | Throughput | 4 | 28.71 |
| A10G | FP16 | Latency | 8 | 35.87 |
Generic Configuration#
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.
Mixtral 8x7B Instruct V0.1#
Optimized Configurations#
Profile indicates what the model is optimized for; Disk Space covers both the container and the model, and all values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Latency | 4 | 100 |
| H100 SXM | INT8WO | Throughput | 2 | 100 |
| H100 SXM | INT8WO | Latency | 4 | 100 |
| H100 SXM | FP16 | Throughput | 2 | 100 |
| H100 SXM | FP16 | Latency | 4 | 100 |
| A100 SXM | FP16 | Throughput | 2 | 100 |
| A100 SXM | FP16 | Latency | 4 | 100 |
| L40S | FP8 | Throughput | 4 | 100 |
| L40S | FP16 | Throughput | 4 | 100 |
| A10G | FP16 | Throughput | 8 | 100 |
Generic Configuration#
The GPU Memory and Disk Space values are in GB; Disk Space covers both the container and the model.

| GPUs | GPU Memory | Precision | Disk Space |
|---|---|---|---|
| Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model (not guaranteed); compute capability >= 7.0 (8.0 for bfloat16); at least one GPU with 95% or greater free memory. | 24 | FP16 | 16 |
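For multi-GPU generic deployments, the aggregate-memory requirement above can be checked with simple arithmetic. A minimal sketch; the function name is hypothetical and the required amount would come from the model's memory figures, not from an official formula:

```python
def aggregate_memory_ok(gpu_memory_gb: float, num_gpus: int, required_gb: float) -> bool:
    """True if `num_gpus` homogeneous GPUs with `gpu_memory_gb` each
    provide at least `required_gb` of aggregate GPU memory."""
    return gpu_memory_gb * num_gpus >= required_gb

# Example: two 24 GB GPUs against a hypothetical 40 GB requirement (48 GB aggregate).
aggregate_memory_ok(24, 2, 40)
```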
Mixtral 8x22B Instruct V0.1#
Optimized Configurations#
Profile indicates what the model is optimized for; Disk Space covers both the container and the model, and all values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 8 | 132.61 |
| H100 SXM | FP8 | Latency | 8 | 132.56 |
| H100 SXM | INT8WO | Throughput | 8 | 134.82 |
| H100 SXM | INT8WO | Latency | 8 | 132.31 |
| H100 SXM | FP16 | Throughput | 8 | 265.59 |
| A100 SXM | FP16 | Throughput | 8 | 265.7 |
Generic Configuration#
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.
StarCoder2 7B#
Optimized Configurations#
Profile indicates what the model is optimized for; Disk Space covers both the container and the model, and all values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 | BF16 | Throughput | 1 | 13.89 |
| H100 | BF16 | Latency | 2 | 14.44 |
| H100 | FP8 | Throughput | 1 | 7.56 |
| H100 | FP8 | Latency | 2 | 7.41 |
Generic Configuration#
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.
Nemotron 4 340B Instruct#
Optimized Configurations#
Profile indicates what the model is optimized for; Disk Space covers both the container and the model, and all values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Latency | 16 | 636.45 |
| A100 SXM | FP16 | Latency | 16 | 636.45 |
Generic Configuration#
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.
Nemotron 4 340B Instruct 128K#
Optimized Configurations#
Profile indicates what the model is optimized for; Disk Space covers both the container and the model, and all values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | BF16 | Latency | 16 | 637.26 |
| A100 SXM | BF16 | Latency | 16 | 637.22 |
Generic Configuration#
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.
Nemotron 4 340B Reward#
Optimized Configurations#
Profile indicates what the model is optimized for; Disk Space covers both the container and the model, and all values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP16 | Latency | 16 | 636.45 |
| A100 SXM | FP16 | Latency | 16 | 636.45 |
Generic Configuration#
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.
Phi 3 Mini 4K Instruct#
Optimized Configurations#
Profile indicates what the model is optimized for; Disk Space covers both the container and the model, and all values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 1 | 3.8 |
| H100 SXM | FP16 | Throughput | 1 | 7.14 |
| A100 SXM | FP16 | Throughput | 1 | 7.14 |
| L40S | FP8 | Throughput | 1 | 3.8 |
| L40S | FP16 | Throughput | 1 | 7.14 |
| A10G | FP16 | Throughput | 1 | 7.14 |
Generic Configuration#
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.
Phind Codellama 34B V2 Instruct#
Optimized Configurations#
Profile indicates what the model is optimized for; Disk Space covers both the container and the model, and all values are in GB.

| GPU | Precision | Profile | # of GPUs | Disk Space |
|---|---|---|---|---|
| H100 SXM | FP8 | Throughput | 2 | 32.17 |
| H100 SXM | FP8 | Latency | 4 | 32.41 |
| H100 SXM | FP16 | Throughput | 2 | 63.48 |
| H100 SXM | FP16 | Latency | 4 | 64.59 |
| A100 SXM | FP16 | Throughput | 2 | 63.48 |
| A100 SXM | FP16 | Latency | 4 | 64.59 |
| L40S | FP8 | Throughput | 4 | 32.43 |
| L40S | FP16 | Throughput | 4 | 64.58 |
| A10G | FP16 | Latency | 8 | 66.8 |
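When scripting capacity planning, the optimized-configuration rows above can be treated as a lookup table. A sketch using the Phind CodeLlama 34B data; the dict literal simply transcribes the table, and the names are illustrative, not an official API:

```python
# (GPU, precision, profile) -> (gpu_count, disk_space_gb), transcribed from the table.
PHIND_CODELLAMA_34B = {
    ("H100 SXM", "FP8",  "Throughput"): (2, 32.17),
    ("H100 SXM", "FP8",  "Latency"):    (4, 32.41),
    ("H100 SXM", "FP16", "Throughput"): (2, 63.48),
    ("H100 SXM", "FP16", "Latency"):    (4, 64.59),
    ("A100 SXM", "FP16", "Throughput"): (2, 63.48),
    ("A100 SXM", "FP16", "Latency"):    (4, 64.59),
    ("L40S",     "FP8",  "Throughput"): (4, 32.43),
    ("L40S",     "FP16", "Throughput"): (4, 64.58),
    ("A10G",     "FP16", "Latency"):    (8, 66.8),
}

def required_resources(gpu: str, precision: str, profile: str):
    """Return (gpu_count, disk_space_gb) for a supported combination, or None."""
    return PHIND_CODELLAMA_34B.get((gpu, precision, profile))
```

Unsupported combinations (for example A10G with a Throughput profile) simply return `None`.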
Generic Configuration#
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.
StarCoderBase 15.5B#
Generic Configuration#
Any NVIDIA GPU with sufficient memory, or multiple homogeneous NVIDIA GPUs with sufficient aggregate memory, should be able to run this model, although this is not guaranteed. The GPUs must have compute capability >= 7.0 (8.0 for bfloat16), and at least one GPU must have 95% or greater free memory.
Supported TRT-LLM buildable profiles#
Precision: FP32
# of GPUs: 2, 4, 8
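A deployment script might pick the largest buildable GPU count (from the list above) that fits the hardware on hand. A hypothetical helper; the name and the "largest that fits" policy are illustrative assumptions, while the supported counts come from the buildable-profile list:

```python
from typing import Optional

SUPPORTED_GPU_COUNTS = (2, 4, 8)  # from the buildable-profile list above

def pick_gpu_count(available_gpus: int) -> Optional[int]:
    """Largest supported GPU count that does not exceed what is available."""
    candidates = [n for n in SUPPORTED_GPU_COUNTS if n <= available_gpus]
    return max(candidates) if candidates else None
```

For example, a 6-GPU node would build the 4-GPU profile, and a single-GPU node has no buildable option.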