Model Configurations Matrix
This page lists supported models and their recommended configurations for L40, A100, and H100 GPUs.
Llama
The following table lists recommended GPU configurations for Llama models.
| Model Name | Fine-tuning Type | GPUs | Nodes | Tensor Parallel | Pipeline Parallel | Max Seq Len | Micro Batch Size | 
|---|---|---|---|---|---|---|---|
| llama-3.2-1b@v1.0.0+L40 | lora | 1 | 1 | 1 | - | 4096 | 1 | 
| llama-3.2-1b@v1.0.0+L40 | all_weights | 1 | 1 | 1 | - | 4096 | 1 | 
| llama-3.2-1b-instruct@v1.0.0+L40 | lora | 1 | 1 | 1 | - | 4096 | 1 | 
| llama-3.2-1b-instruct@v1.0.0+L40 | all_weights | 1 | 1 | 1 | - | 4096 | 1 | 
| llama-3.2-1b-embedding@0.0.1+L40 | all_weights | 1 | 1 | 1 | - | 2048 | 4 | 
| llama-3.2-3b-instruct@v1.0.0+L40 | lora | 1 | 1 | 1 | - | 4096 | 1 | 
| llama-3.1-8b-instruct@v1.0.0+L40 | lora | 2 | 1 | 2 | - | 4096 | 1 | 
| llama-3.1-8b-instruct@v1.0.0+L40 | all_weights | 4 | 1 | 4 | - | 4096 | 1 | 
| llama3-70b-instruct@v1.0.0+L40 | lora | 16 | 4 | 4 | 2 | 1400 | 1 | 
| llama-3.1-70b-instruct@v1.0.0+L40 | lora | 16 | 4 | 4 | 2 | 1400 | 1 | 
| llama-3.3-70b-instruct@v1.0.0+L40 | lora | 16 | 4 | 4 | 2 | 1400 | 1 | 
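The entries in the Model Name column appear to follow the pattern `<model>@<version>+<gpu>`, for example `llama-3.2-1b-embedding@0.0.1+L40`. The following is a minimal sketch for splitting such a name into its parts. The pattern is inferred from the table rows, and `ConfigTarget` and `parse_target` are hypothetical helper names, not part of any product API.

```python
from typing import NamedTuple


class ConfigTarget(NamedTuple):
    """Hypothetical container for the three parts of a Model Name entry."""
    model: str    # e.g. "llama-3.2-1b-instruct"
    version: str  # e.g. "v1.0.0"
    gpu: str      # e.g. "L40"


def parse_target(name: str) -> ConfigTarget:
    """Split '<model>@<version>+<gpu>' into its parts.

    Assumes the naming pattern seen in the table; raises ValueError
    when either separator is missing.
    """
    model, sep, rest = name.partition("@")
    if not sep or not model:
        raise ValueError(f"missing '@<version>' in {name!r}")
    # Split on the last '+' so the GPU suffix is always the final part.
    version, sep, gpu = rest.rpartition("+")
    if not sep or not version or not gpu:
        raise ValueError(f"missing '+<gpu>' in {name!r}")
    return ConfigTarget(model, version, gpu)


target = parse_target("llama-3.1-8b-instruct@v1.0.0+L40")
print(target.model, target.version, target.gpu)
```

A helper like this can be handy for filtering the matrix by GPU type or model family before choosing a configuration.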
Llama Nemotron
The following table lists recommended GPU configurations for Llama Nemotron models.
| Model Name | Fine-tuning Type | GPUs | Nodes | Tensor Parallel | Pipeline Parallel | Max Seq Len | Micro Batch Size | 
|---|---|---|---|---|---|---|---|
| nemotron-nano-llama-3.1-8b@v1.0.0+L40 | lora | 1 | 1 | 1 | - | 4096 | 1 | 
| nemotron-nano-llama-3.1-8b@v1.0.0+L40 | all_weights | 1 | 1 | 1 | - | 4096 | 1 | 
| nemotron-super-llama-3.3-49b@v1.0.0+L40 | lora | 4 | 4 | 4 | 2 | 4096 | 1 | 
Phi
The following table lists recommended GPU configurations for Phi models.
| Model Name | Fine-tuning Type | GPUs | Nodes | Tensor Parallel | Pipeline Parallel | Max Seq Len | Micro Batch Size | 
|---|---|---|---|---|---|---|---|
| phi-4@v1.0.0+L40 | lora | 1 | 1 | 1 | - | 4096 | 1 | 
Note: For 70B models, the tested configuration was 4 nodes × 4 GPUs (16 GPUs total, TP=4, PP=2) with a maximum sequence length of 1400. A maximum sequence length of 4096 causes out-of-memory (OOM) errors on that configuration. For a sequence length of 4096, use 5 nodes × 4 GPUs (TP=4, PP=5). Adjust resources as needed for your workload.
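The arithmetic behind these layouts can be checked with a short sketch. It assumes the common convention that the total GPU count equals tensor parallel × pipeline parallel × data parallel, and that a `-` in the Pipeline Parallel column means a pipeline size of 1. Both are assumptions, not statements from this page, and `data_parallel_size` is a hypothetical helper name.

```python
def data_parallel_size(nodes: int, gpus_per_node: int, tp: int, pp: int = 1) -> int:
    """Return the number of data-parallel replicas for a given layout.

    Assumes total GPUs = tp * pp * data_parallel (a common convention;
    other parallelism dimensions are ignored in this sketch).
    """
    total_gpus = nodes * gpus_per_node
    model_parallel = tp * pp  # GPUs needed to hold one model replica
    if total_gpus % model_parallel != 0:
        raise ValueError(
            f"{total_gpus} GPUs cannot be divided evenly into "
            f"replicas of {model_parallel} GPUs (TP={tp}, PP={pp})"
        )
    return total_gpus // model_parallel


# Tested 70B layout: 4 nodes x 4 GPUs, TP=4, PP=2
print(data_parallel_size(nodes=4, gpus_per_node=4, tp=4, pp=2))  # 2

# Suggested layout for sequence length 4096: 5 nodes x 4 GPUs, TP=4, PP=5
print(data_parallel_size(nodes=5, gpus_per_node=4, tp=4, pp=5))  # 1
```

Under these assumptions, the 5-node layout spends all 20 GPUs on a single model replica, which would spread the model over more pipeline stages and leave more memory per GPU for the longer sequences.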
For the latest and most detailed configuration options, refer to the values.yaml file in the Helm chart.