# Model Configurations Matrix
This page lists supported models and their recommended configurations for L40, A100, and H100 GPUs.
## Llama
The following table lists recommended GPU configurations for Llama models.
| Model Name | Fine-tuning Type | GPUs | Nodes | Tensor Parallel | Pipeline Parallel | Max Seq Len | Micro Batch Size |
|---|---|---|---|---|---|---|---|
| llama-3.2-1b@v1.0.0+L40 | lora | 1 | 1 | 1 | - | 4096 | 1 |
| llama-3.2-1b@v1.0.0+L40 | all_weights | 1 | 1 | 1 | - | 4096 | 1 |
| llama-3.2-1b-instruct@v1.0.0+L40 | lora | 1 | 1 | 1 | - | 4096 | 1 |
| llama-3.2-1b-instruct@v1.0.0+L40 | all_weights | 1 | 1 | 1 | - | 4096 | 1 |
| llama-3.2-1b-embedding@0.0.1+L40 | all_weights | 1 | 1 | 1 | - | 2048 | 4 |
| llama-3.2-3b-instruct@v1.0.0+L40 | lora | 1 | 1 | 1 | - | 4096 | 1 |
| llama-3.1-8b-instruct@v1.0.0+L40 | lora | 2 | 1 | 2 | - | 4096 | 1 |
| llama-3.1-8b-instruct@v1.0.0+L40 | all_weights | 4 | 1 | 4 | - | 4096 | 1 |
| llama3-70b-instruct@v1.0.0+L40 | lora | 16 | 4 | 4 | 2 | 1400 | 1 |
| llama-3.1-70b-instruct@v1.0.0+L40 | lora | 16 | 4 | 4 | 2 | 1400 | 1 |
| llama-3.3-70b-instruct@v1.0.0+L40 | lora | 16 | 4 | 4 | 2 | 1400 | 1 |
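The GPUs, Nodes, Tensor Parallel, and Pipeline Parallel columns are related: in Megatron-style distributed training, each model replica spans Tensor Parallel × Pipeline Parallel GPUs, and any remaining factor of the total GPU count becomes the data-parallel size. The sketch below uses a hypothetical helper (not part of any product API) to check a couple of rows from the table above.

```python
# Minimal sketch (hypothetical helper, not a product API) showing how the GPU-related
# columns fit together: each model replica occupies tensor_parallel * pipeline_parallel
# GPUs, and the remaining factor is the data-parallel size.

def data_parallel_size(total_gpus: int, tensor_parallel: int, pipeline_parallel: int = 1) -> int:
    """Return the data-parallel size implied by a table row ("-" in the table means 1)."""
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if total_gpus % gpus_per_replica != 0:
        raise ValueError("Total GPUs must be a multiple of tensor_parallel * pipeline_parallel")
    return total_gpus // gpus_per_replica

# Example: llama-3.1-70b-instruct (lora) row: 16 GPUs over 4 nodes, TP=4, PP=2.
assert data_parallel_size(total_gpus=16, tensor_parallel=4, pipeline_parallel=2) == 2

# Example: llama-3.1-8b-instruct (all_weights) row: 4 GPUs on 1 node, TP=4, PP not set.
assert data_parallel_size(total_gpus=4, tensor_parallel=4) == 1
```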
## Llama Nemotron
The following table lists recommended GPU configurations for Llama Nemotron models.
| Model Name | Fine-tuning Type | GPUs | Nodes | Tensor Parallel | Pipeline Parallel | Max Seq Len | Micro Batch Size |
|---|---|---|---|---|---|---|---|
| nemotron-nano-llama-3.1-8b@v1.0.0+L40 | lora | 1 | 1 | 1 | - | 4096 | 1 |
| nemotron-nano-llama-3.1-8b@v1.0.0+L40 | all_weights | 1 | 1 | 1 | - | 4096 | 1 |
| nemotron-super-llama-3.3-49b@v1.0.0+L40 | lora | 4 | 4 | 4 | 2 | 4096 | 1 |
## Phi
The following table lists recommended GPU configurations for Phi models.
| Model Name | Fine-tuning Type | GPUs | Nodes | Tensor Parallel | Pipeline Parallel | Max Seq Len | Micro Batch Size |
|---|---|---|---|---|---|---|---|
| phi-4@v1.0.0+L40 | lora | 1 | 1 | 1 | - | 4096 | 1 |
Note: For 70B models, the tested configuration was 4 nodes × 4 GPUs (TP=4, PP=2) with a max sequence length of 1400. A max sequence length of 4096 causes out-of-memory (OOM) errors even with 4 nodes × 4 GPUs; for a 4096 sequence length, use 5 nodes × 4 GPUs (TP=4, PP=5) instead. Adjust resources as needed for your workload.
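As a quick arithmetic check of the note above, the two 70B layouts work out as follows (reusing the hypothetical `data_parallel_size` helper sketched after the Llama table):

```python
# Tested configuration: 4 nodes x 4 GPUs = 16 GPUs, TP=4, PP=2 -> 2 data-parallel replicas.
assert 4 * 4 == 16
assert data_parallel_size(total_gpus=16, tensor_parallel=4, pipeline_parallel=2) == 2

# Recommended for max sequence length 4096: 5 nodes x 4 GPUs = 20 GPUs, TP=4, PP=5
# -> a single replica spread over more pipeline stages, so each GPU holds fewer layers
#    and has more memory headroom for the longer sequences.
assert 5 * 4 == 20
assert data_parallel_size(total_gpus=20, tensor_parallel=4, pipeline_parallel=5) == 1
```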
For the latest and most detailed configuration options, refer to the `values.yaml` file in the Helm chart.
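For example, one way to browse those options locally is to dump the chart's default values (for instance with `helm show values <repo>/<chart> > values.yaml`, substituting your actual chart reference) and then list the top-level sections with PyYAML. This is a sketch only; the key layout varies by chart version.

```python
# Sketch only: inspect the chart's default values locally. The file is assumed to have
# been produced beforehand, e.g. with `helm show values <repo>/<chart> > values.yaml`
# (substitute your actual Helm repository and chart name).
import yaml  # PyYAML

with open("values.yaml") as f:
    values = yaml.safe_load(f)

# Print the top-level keys so you can locate the model/training configuration sections.
for key in sorted(values):
    print(key)
```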