# Llama Nemotron Models
This page provides detailed technical specifications for the Nemotron model family supported by NeMo Customizer. For information about supported features and capabilities, refer to Tested Models.
## Llama 3.1 Nemotron Nano 8B v1
| Property | Value |
|---|---|
| Creator | NVIDIA |
| Architecture | transformer |
| Description | Llama 3.1 Nemotron Nano 8B v1 is a compact, instruction-tuned model for efficient customization and deployment. |
| Max I/O Tokens | 4096 |
| Parameters | 8 billion |
| Training Data | Not specified |
| Recommended GPUs for Customization | 1 (LoRA), 8 (All Weights) |
| Default Name | `nvidia/nemotron-nano-llama-3.1-8b@1.0` |
| Version | 1.0 |
| NIM | |
### Training Options
- LoRA: 1 GPU, tensor parallel size 1
- All Weights: 8 GPUs, tensor parallel size 4
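
These training options correspond to the hyperparameters of a customization job. The sketch below creates a LoRA job for this model through the NeMo Customizer REST API; the base URL, dataset reference, and hyperparameter values are illustrative placeholders (the config identifier is the Default Name from the table above), so check the API reference for your deployment's exact schema.

```python
# Minimal sketch: create a LoRA customization job for Llama 3.1 Nemotron Nano 8B.
# CUSTOMIZER_URL, the dataset reference, and the hyperparameter values are
# placeholders; adjust them for your environment.
import requests

CUSTOMIZER_URL = "http://nemo.example.com"  # placeholder base URL

job = {
    "config": "nvidia/nemotron-nano-llama-3.1-8b@1.0",
    "dataset": {"namespace": "default", "name": "my-sft-dataset"},  # placeholder
    "hyperparameters": {
        "training_type": "sft",
        "finetuning_type": "lora",  # runs on 1 GPU with tensor parallel size 1
        "epochs": 2,
        "batch_size": 8,
        "learning_rate": 1e-4,
    },
}

resp = requests.post(f"{CUSTOMIZER_URL}/v1/customization/jobs", json=job, timeout=60)
resp.raise_for_status()
print(resp.json())  # the response includes the job ID to poll for status
```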
## Llama 3.3 Nemotron Super 49B v1
| Property | Value |
|---|---|
| Creator | NVIDIA |
| Architecture | transformer |
| Description | Llama 3.3 Nemotron Super 49B v1 is a large, instruction-tuned model for advanced dialogue and reasoning tasks. |
| Max I/O Tokens | 4096 |
| Parameters | 49 billion |
| Training Data | Not specified |
| Recommended GPUs for Customization | 4 (LoRA) |
| Default Name | `nvidia/nemotron-super-llama-3.3-49b@1.0` |
| Version | 1.0 |
| NIM | |
### Training Options
- LoRA: 4 GPUs, tensor parallel size 4
## Llama 3.3 Nemotron Super 49B v1.5
| Property | Value |
|---|---|
| Creator | NVIDIA |
| Architecture | transformer |
| Description | Llama 3.3 Nemotron Super 49B v1.5 is a large, instruction-tuned model for advanced dialogue and reasoning tasks. |
| Max I/O Tokens | 4096 |
| Parameters | 49 billion |
| Training Data | Not specified |
| Recommended GPUs for Customization | 4 (LoRA) |
| Default Name | `nvidia/nemotron-super-llama-3.3-49b@1.5` |
| Version | 1.5 |
| NIM | |
### Training Options
- LoRA: 4 GPUs, tensor parallel size 4
## NVIDIA Nemotron 3 Nano 30B A3B
| Property | Value |
|---|---|
| Creator | NVIDIA |
| Architecture | Hybrid Mixture of Experts (MoE): Mamba-2 + Transformer |
| Description | Nemotron-3-Nano-30B-A3B-BF16 is a large language model (LLM) trained from scratch by NVIDIA and designed as a unified model for both reasoning and non-reasoning tasks. It uses configurable reasoning via the chat template. |
| Max I/O Tokens | 2048 |
| Parameters | 30B total (3.5B active) |
| MoE Configuration | 128 experts + 1 shared expert, 6 experts activated per token |
| Supported Languages | English, German, Spanish, French, Italian, Japanese |
| Recommended GPUs for Customization | 1 (LoRA), 8 (All Weights) |
| Default Name | `nvidia/nemotron3-nano-30b@v3` |
| HF Model URI | `hf://nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` |
| NIM | Only supported with All Weights |
### Training Options
- LoRA: 1 GPU, tensor parallel size 1, expert model parallel size 1, pipeline parallel size 1
- All Weights: 8 GPUs, tensor parallel size 1, pipeline parallel size 1, expert model parallel size 1
### Known Limitations
> **Warning: Full SFT (All Weights) Training Limitation**
>
> For All Weights fine-tuning, the batch size is limited to:
>
> `batch_size = data_parallel_size × micro_batch_size`
>
> If `data_parallel_size` is not explicitly set, it is automatically calculated as:
>
> `data_parallel_size = total_gpus / (tensor_parallel_size × pipeline_parallel_size)`
>
> For example, with the default configuration (8 GPUs, `tensor_parallel_size=1`, `pipeline_parallel_size=1`, `micro_batch_size=1`), the calculated `data_parallel_size` is 8, so set the batch size to 8.
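
To make the constraint concrete, the following sketch reproduces the arithmetic above for the default 8-GPU All Weights configuration:

```python
# Reproduces the batch-size constraint from the warning above for the
# default All Weights configuration (8 GPUs, TP=1, PP=1, micro batch size 1).
total_gpus = 8
tensor_parallel_size = 1
pipeline_parallel_size = 1
micro_batch_size = 1

# Derived automatically when data_parallel_size is not set explicitly.
data_parallel_size = total_gpus // (tensor_parallel_size * pipeline_parallel_size)

# The batch size the job must use under this constraint.
batch_size = data_parallel_size * micro_batch_size

assert (data_parallel_size, batch_size) == (8, 8)
```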
> **Warning: NIM Deployment on A100 GPUs**
>
> When deploying the fine-tuned model using the Deployment Management Service, the NIM by default pulls a profile compatible with FP8 precision (`precision=fp8`). FP8 requires GPU compute capability 89 or higher (such as H100).
>
> A100 GPUs have compute capability 80 and do not support FP8, resulting in the following error:
>
> ```
> Value error, The quantization method modelopt is not supported for the current GPU.
> Minimum capability: 89. Current capability: 80.
> ```
>
> When deploying to A100 GPUs, add the following environment variable to select the BF16 precision profile:
>
> ```json
> "additional_envs": {
>   "NIM_TAGS_SELECTOR": "precision=bf16"
> }
> ```
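
For context, a deployment request that carries this override might look like the sketch below. The endpoint path, deployment name, and surrounding fields are assumptions for illustration; only the `additional_envs` block with `NIM_TAGS_SELECTOR` comes from the warning above. Consult the Deployment Management Service API reference for the actual schema.

```python
# Hypothetical sketch: create a model deployment with the BF16 profile override.
# The endpoint path and every field except "additional_envs" are illustrative
# assumptions, not the documented API.
import requests

DEPLOYMENT_URL = "http://nemo.example.com"  # placeholder base URL

deployment = {
    "name": "nemotron3-nano-30b-bf16",  # hypothetical deployment name
    "namespace": "default",
    "config": {
        "model": "nvidia/nemotron3-nano-30b",  # hypothetical model reference
        "additional_envs": {
            # Select the BF16 profile so the NIM runs on A100 (compute capability 80)
            "NIM_TAGS_SELECTOR": "precision=bf16"
        },
    },
}

resp = requests.post(
    f"{DEPLOYMENT_URL}/v1/deployment/model-deployments",  # illustrative path
    json=deployment,
    timeout=60,
)
resp.raise_for_status()
```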
### Helm Configuration
To enable this model for fine-tuning, add the following to your Helm overrides file:
```yaml
customizer:
  customizationTargets:
    hfTargetDownload:
      # -- set this to true to allow model downloads from Hugging Face
      enabled: true
      # -- List of allowed organizations for model downloads from Hugging Face
      allowedHfOrgs:
        - "nvidia"
      trustedModelURIs:
        - hf://nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    overrideExistingTargets: true
    targets:
      nvidia/nemotron3-nano-30b@v3:
        name: nemotron3-nano-30b@v3
        namespace: nvidia
        enabled: true
        model_uri: hf://nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
        hf_endpoint: https://huggingface.co
        model_path: nemotron_nano_30b
        base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
        num_parameters: 30000000000
        precision: bf16-mixed
  customizationConfigTemplates:
    overrideExistingTemplates: true
    templates:
      nvidia/nemotron3-nano-30b@v3.0.0+80GB:
        name: nemotron3-nano-30b@v3.0.0+80GB
        namespace: nvidia
        target: nvidia/nemotron3-nano-30b@v3
        training_options:
          - training_type: sft
            finetuning_type: lora
            num_gpus: 1
            num_nodes: 1
            tensor_parallel_size: 1
            micro_batch_size: 1
            expert_model_parallel_size: 1
            pipeline_parallel_size: 1
          - training_type: sft
            finetuning_type: all_weights
            num_gpus: 8
            num_nodes: 1
            tensor_parallel_size: 1
            micro_batch_size: 1
            expert_model_parallel_size: 1
            pipeline_parallel_size: 1
        max_seq_length: 2048
        prompt_template: "{prompt} {completion}"
```