Llama Nemotron Models#

This page provides detailed technical specifications for the Nemotron model family supported by NeMo Customizer. For information about supported features and capabilities, refer to Tested Models.

Llama 3.1 Nemotron Nano 8B v1#

| Property | Value |
| --- | --- |
| Creator | NVIDIA |
| Architecture | Transformer |
| Description | Llama 3.1 Nemotron Nano 8B v1 is a compact, instruction-tuned model for efficient customization and deployment. |
| Max I/O Tokens | 4096 |
| Parameters | 8 billion |
| Training Data | Not specified |
| Default Name | nvidia/Llama-3.1-Nemotron-Nano-8B-v1 |
| HuggingFace | nvidia/Llama-3.1-Nemotron-Nano-8B-v1 |
| NIM | nvidia/llama-3.1-nemotron-nano-8b-v1 |

Training Options#

  • LoRA: 1x 80GB GPU, tensor parallel size 1, pipeline parallel size 1

  • Full SFT: 4x 80GB GPU, tensor parallel size 2, pipeline parallel size 1
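The two training options above can be expressed as a small configuration sketch. The snippet below is illustrative only: the dictionary keys (`gpus`, `tensor_parallel_size`, `pipeline_parallel_size`) are assumptions, not the actual NeMo Customizer API schema. It also makes explicit the usual invariant that the tensor-parallel times pipeline-parallel size must evenly divide the GPU count, with the remainder used for data parallelism.

```python
# Illustrative sketch only: key names are assumptions, not the real
# NeMo Customizer schema. Values mirror the training options listed above.
TRAINING_OPTIONS = {
    "lora": {"gpus": 1, "tensor_parallel_size": 1, "pipeline_parallel_size": 1},
    "full_sft": {"gpus": 4, "tensor_parallel_size": 2, "pipeline_parallel_size": 1},
}

def data_parallel_size(opt: dict) -> int:
    """GPUs left for data parallelism after tensor/pipeline sharding."""
    model_parallel = opt["tensor_parallel_size"] * opt["pipeline_parallel_size"]
    assert opt["gpus"] % model_parallel == 0, "GPU count must be divisible by TP x PP"
    return opt["gpus"] // model_parallel

for name, opt in TRAINING_OPTIONS.items():
    print(name, data_parallel_size(opt))
```

For example, the Full SFT option shards the model across 2 GPUs (tensor parallel) and replicates that shard twice across the remaining GPUs for data parallelism.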

Deployment Configuration#

  • LoRA:

    • NIM Image: nvcr.io/nim/nvidia/llm-nim:1.15.5

    • GPU Count: 1x 80GB

  • Full SFT:

    • NIM Image: nvcr.io/nim/nvidia/llm-nim:1.15.5

    • GPU Count: 1x 80GB

    • Additional Environment Variables:

      • NIM_MODEL_PROFILE: vllm

NVIDIA Nemotron Nano 9B v2#

| Property | Value |
| --- | --- |
| Creator | NVIDIA |
| Architecture | Transformer |
| Description | NVIDIA Nemotron Nano 9B v2 is a compact, instruction-tuned model optimized for efficient customization and deployment. |
| Max I/O Tokens | 4096 |
| Parameters | 9 billion |
| Default Name | nvidia/NVIDIA-Nemotron-Nano-9B-v2 |
| HuggingFace | nvidia/NVIDIA-Nemotron-Nano-9B-v2 |
| NIM | NVIDIA-Nemotron-Nano-9B-v2 |

Training Options#

  • LoRA: 4x 80GB GPU, tensor parallel size 1, pipeline parallel size 1

  • Full SFT: 4x 80GB GPU, tensor parallel size 2, pipeline parallel size 1

Deployment Configuration#

  • LoRA:

    • NIM Image: nvcr.io/nim/nvidia/llm-nim:1.15.5

    • GPU Count: 1x 80GB

  • Full SFT:

    • NIM Image: nvcr.io/nim/nvidia/llm-nim:1.15.5

    • GPU Count: 1x 80GB

    • Additional Environment Variables:

      • NIM_MODEL_PROFILE: vllm

NVIDIA Nemotron 3 Nano 30B A3B#

| Property | Value |
| --- | --- |
| Creator | NVIDIA |
| Architecture | Hybrid Mixture of Experts (MoE) - Mamba-2 + Transformer |
| Description | Nemotron-3-Nano-30B-A3B-BF16 is a large language model (LLM) trained from scratch by NVIDIA, designed as a unified model for both reasoning and non-reasoning tasks, with reasoning configurable via the chat template. |
| Max I/O Tokens | 2048 |
| Parameters | 30B total (3.5B active) |
| MoE Configuration | 128 experts + 1 shared expert, 6 experts activated per token |
| Supported Languages | English, German, Spanish, French, Italian, Japanese |
| Default Name | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| HuggingFace | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| NIM | Nemotron-3-Nano-30B-A3B |

Training Options#

  • LoRA: 2x 80GB GPU, tensor parallel size 1, expert parallel size 2, pipeline parallel size 1

  • Full SFT: 8x 80GB GPU, tensor parallel size 1, expert parallel size 8, pipeline parallel size 1

Note

MoE Parallelism Constraints

MoE models support only expert parallelism for distributing experts across GPUs. When expert_parallel_size > 1, tensor_parallel_size must be set to 1, and expert_parallel_size must evenly divide the number of GPUs. These constraints apply only to training parallelism; NIM deployment may use a different GPU count optimized for inference.
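The constraints in the note above are mechanical enough to check before submitting a job. The helper below is a minimal sketch, not part of NeMo Customizer; the function and parameter names are made up for illustration.

```python
def check_moe_parallelism(num_gpus: int, tensor_parallel_size: int,
                          expert_parallel_size: int) -> None:
    """Validate the MoE training-parallelism constraints described above.

    Illustrative helper only; the names here are not a real API.
    """
    # Expert parallelism and tensor parallelism cannot be combined.
    if expert_parallel_size > 1 and tensor_parallel_size != 1:
        raise ValueError("tensor_parallel_size must be 1 when expert_parallel_size > 1")
    # Experts are distributed evenly, so EP must divide the GPU count.
    if num_gpus % expert_parallel_size != 0:
        raise ValueError("expert_parallel_size must evenly divide the number of GPUs")

# Both training options listed for this model satisfy the constraints:
check_moe_parallelism(num_gpus=2, tensor_parallel_size=1, expert_parallel_size=2)  # LoRA
check_moe_parallelism(num_gpus=8, tensor_parallel_size=1, expert_parallel_size=8)  # Full SFT
```

A configuration such as `tensor_parallel_size=2, expert_parallel_size=2` would be rejected, matching the first constraint in the note.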

Deployment Configuration#

  • Full SFT:

    • NIM Image: nvcr.io/nim/nvidia/nemotron-3-nano:1.7.0-variant

    • GPU Count: 2x 80GB

Note

Deployment for LoRA using NIM is not supported for this model.

NVIDIA Nemotron 3 Super 120B A12B#

| Property | Value |
| --- | --- |
| Creator | NVIDIA |
| Architecture | Mixture of Experts (MoE) |
| Description | Nemotron-3-Super-120B-A12B-BF16 is a large MoE language model from NVIDIA designed for high-capacity reasoning and instruction-following tasks. |
| Max I/O Tokens | 4096 |
| Parameters | 120B total (12B active) |
| Default Name | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 |
| HuggingFace | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 |

Training Options#

  • LoRA: 8x 80GB GPU, tensor parallel size 1, expert parallel size 8, pipeline parallel size 1

Note

MoE Parallelism Constraints

MoE models support only expert parallelism for distributing experts across GPUs. When expert_parallel_size > 1, tensor_parallel_size must be set to 1, and expert_parallel_size must evenly divide the number of GPUs. These constraints apply only to training parallelism; NIM deployment may use a different GPU count optimized for inference.

Deployment Configuration#

  • LoRA:

    • NIM Image: nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:1.8.1-variant

    • GPU Count: 8x 80GB

    • Additional Environment Variables:

      • NIM_WORKSPACE: /model-store

      • NIM_PIPELINE_PARALLEL_SIZE: 8

      • NIM_MAX_MODEL_LEN: 4096
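The deployment values above can be collected into a single record for sanity-checking. The structure below is illustrative only, not a real NeMo Customizer or NIM schema; it simply mirrors the listed values and makes explicit that, in this configuration, the pipeline-parallel size equals the GPU count, so each pipeline stage maps to one 80GB GPU.

```python
# Illustrative only: mirrors the deployment values listed above; the
# surrounding dictionary structure is an assumption, not a real schema.
DEPLOYMENT = {
    "nim_image": "nvcr.io/nim/nvidia/nemotron-3-super-120b-a12b:1.8.1-variant",
    "gpu_count": 8,
    "env": {
        "NIM_WORKSPACE": "/model-store",
        "NIM_PIPELINE_PARALLEL_SIZE": "8",
        "NIM_MAX_MODEL_LEN": "4096",
    },
}

# One pipeline stage per GPU in this configuration.
assert int(DEPLOYMENT["env"]["NIM_PIPELINE_PARALLEL_SIZE"]) == DEPLOYMENT["gpu_count"]
```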