Key Features and Concepts#

NeMo AutoModel provides GPU-accelerated, transformers-compatible training for LLMs and VLMs. It combines Hugging Face’s model ecosystem with NVIDIA’s optimized training stack, delivering high throughput without sacrificing ease of use.

Why NeMo AutoModel?#

  • Hugging Face native: Train any model from the Hub with no checkpoint conversion – day-0 support for new releases.

  • High performance: Custom CUDA kernels (TransformerEngine, DeepEP, FlexAttn) deliver up to 279 TFLOPs/sec/GPU.

  • Any scale: The same recipe runs on 1 GPU or across hundreds of nodes – parallelism is configuration, not code.

  • Hackable: Linear training scripts with YAML config. No hidden trainer abstractions.

  • Open source: Apache 2.0 licensed, NVIDIA-supported, and actively maintained.

Performance Highlights#

| Model | GPUs | TFLOPs/sec/GPU | Tokens/sec/GPU | Optimizations |
|---|---|---|---|---|
| DeepSeek V3 671B | 256 | 250 | 1,002 | TE + DeepEP |
| GPT-OSS 20B | 8 | 279 | 13,058 | TE + DeepEP + FlexAttn |
| Qwen3 MoE 30B | 8 | 212 | 11,842 | TE + DeepEP |

See the full benchmark results for configuration details and more models.


Training Workflows#

NeMo AutoModel supports a range of training tasks across LLM and VLM modalities.

Supervised Fine-Tuning (SFT)

Full-parameter fine-tuning for task-specific adaptation. Guide: Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT) with NeMo Automodel.

PEFT (LoRA / QLoRA)

Memory-efficient fine-tuning by updating only low-rank adapter weights (see the toy LoRA sketch below). Guide: Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT) with NeMo Automodel.

Pre-Training

Train models from scratch on large-scale datasets. Guide: Pretraining using Megatron Core Datasets with NeMo Automodel.

Knowledge Distillation

Transfer knowledge from a large teacher to a smaller student model. Guide: Knowledge Distillation with NeMo-AutoModel.

Tool Calling

Fine-tune models for structured function calling with tool schemas. Guide: Function Calling with NeMo Automodel using FunctionGemma.

Quantization-Aware Training

Train with quantization in the loop to produce deployment-ready models. Guide: Quantization-Aware Training (QAT) in NeMo Automodel.
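
To make the PEFT entry above concrete, here is a toy illustration of the LoRA idea (a frozen base weight plus a trainable low-rank update). It is a sketch of the concept only, not NeMo AutoModel's _peft/ implementation.

# Toy LoRA layer: the pretrained weights are frozen and only a low-rank update is trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained layer
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at step 0
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)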

Parallelism and Scaling#

NeMo AutoModel leverages PyTorch-native parallelism strategies to scale training from a single GPU to multi-node clusters.

FSDP2

Fully sharded data parallelism built on DTensor for memory-efficient distributed training. Supports hybrid sharding (HSDP) for multi-node training.
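
Under the hood this is the PyTorch-native fully_shard API. The snippet below is a minimal, illustrative sketch of that pattern (the toy model, mesh shape, and launch command are assumptions, not NeMo AutoModel code); in practice you set the equivalent options in YAML.

# Minimal sketch of PyTorch FSDP2 with a 2-D device mesh (HSDP); torch >= 2.6 assumed.
# Launch with: torchrun --nproc-per-node=8 fsdp2_sketch.py
import os

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
torch.distributed.init_process_group("nccl")

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).cuda()

# 2-D mesh = HSDP: replicate across nodes, shard within a node.
# On one 8-GPU node the replicate dim is 1; across N nodes it becomes (N, 8).
mesh = init_device_mesh("cuda", (1, 8), mesh_dim_names=("replicate", "shard"))

for block in model:                 # shard each block, then the root module
    fully_shard(block, mesh=mesh)
fully_shard(model, mesh=mesh)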

Pipeline Parallelism

Torch-native pipelining composable with FSDP2 and DTensor for 3D parallelism.
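
For reference, the sketch below shows the underlying torch.distributed.pipelining flow on a toy model with an illustrative split point; NeMo AutoModel drives the same machinery from YAML rather than hand-written code.

# Rough sketch of torch.distributed.pipelining with a toy model split into two stages.
# Launch with: torchrun --nproc-per-node=2 pp_sketch.py
import os

import torch
import torch.nn as nn
from torch.distributed.pipelining import ScheduleGPipe, SplitPoint, pipeline

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
torch.distributed.init_process_group("nccl")
rank = torch.distributed.get_rank()
device = torch.device(f"cuda:{rank}")

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(512, 512) for _ in range(8))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

microbatch = torch.randn(8, 512)                     # example micro-batch for tracing
pipe = pipeline(Net(), mb_args=(microbatch,),
                split_spec={"layers.4": SplitPoint.BEGINNING})  # stage 0: layers 0-3
stage = pipe.build_stage(rank, device)
schedule = ScheduleGPipe(stage, n_microbatches=4)

if rank == 0:
    schedule.step(torch.randn(32, 512, device=device))  # full batch, split into 4 micro-batches
else:
    out = schedule.step()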

FP8 Mixed Precision

FP8 training via torchao for reduced memory and higher throughput on supported models.
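
The sketch below shows the underlying torchao conversion call on a toy model; it is illustrative only, since NeMo AutoModel enables FP8 through configuration.

# Minimal sketch of the torchao float8 conversion (needs a recent GPU, e.g. Hopper).
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()

# Swap nn.Linear modules for float8 training variants; typically paired with torch.compile.
convert_to_float8_training(model)
model = torch.compile(model)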

Multi-Node with SLURM

Add a slurm: section to any YAML config and launch with the automodel CLI. See the Cluster guide.


Core Concepts#

Recipes#

Recipes are executable Python scripts paired with YAML configuration files. Each recipe defines a complete training workflow:

  1. Load a model and tokenizer from Hugging Face (via _target_ in YAML)

  2. Prepare a dataset with the appropriate collator and chat template

  3. Train with a configurable loop (gradient accumulation, validation, logging)

  4. Checkpoint using Distributed Checkpoint (DCP) with SafeTensors output

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: train
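
Conceptually, a _target_ entry names a callable by dotted path and passes the remaining keys as keyword arguments. The snippet below is an illustrative re-implementation of that idea, not NeMo AutoModel's actual config loader.

# What a `_target_` entry boils down to: resolve the dotted path, call it with the rest.
import importlib

def resolve(path: str):
    """Import the longest importable prefix, then walk attributes for the rest."""
    parts = path.split(".")
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
            break
        except ModuleNotFoundError:
            continue
    else:
        raise ImportError(f"cannot import any prefix of {path!r}")
    for name in parts[i:]:
        obj = getattr(obj, name)
    return obj

def instantiate(cfg: dict):
    cfg = dict(cfg)                       # copy so the caller's config is not mutated
    target = resolve(cfg.pop("_target_"))
    return target(**cfg)

model = instantiate({
    "_target_": "nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained",
    "pretrained_model_name_or_path": "meta-llama/Llama-3.2-1B",
})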

Override any field from the CLI:

uv run torchrun --nproc-per-node=8 examples/llm_finetune/finetune.py \
  --config examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
  --step_scheduler.local_batch_size 16

Components#

Components are modular, self-contained building blocks that recipes assemble:

| Component | Purpose |
|---|---|
| datasets/ | LLM and VLM datasets with collators, tokenization, and chat templates |
| distributed/ | FSDP2, MegatronFSDP, tensor/sequence/pipeline parallelism |
| _peft/ | LoRA and QLoRA implementations |
| attention/ | Fused attention, rotary embeddings, FlexAttn |
| checkpoint/ | DCP save/load with SafeTensors output |
| moe/ | Mixture of Experts routing and DeepEP integration |
| optim/ | Optimizers and LR schedulers |
| loss/ | Cross-entropy, linear cross-entropy, KD loss |
| launcher/ | SLURM and interactive job launch |

Each component can be used independently and has no cross-module imports.
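
For example, the entry points referenced in the YAML above can be imported directly. The calls below mirror those YAML fields; any further arguments the real functions take (such as a tokenizer) are normally supplied by the recipe.

# Direct component use, mirroring the YAML shown earlier in this page.
from nemo_automodel import NeMoAutoModelForCausalLM
from nemo_automodel.components.datasets.llm.squad import make_squad_dataset

model = NeMoAutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
dataset = make_squad_dataset(dataset_name="rajpurkar/squad", split="train")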

The automodel CLI#

The CLI simplifies job launch across environments:

# Single-node interactive
automodel finetune llm -c config.yaml

# Multi-node SLURM batch
automodel finetune llm -c config.yaml  # with slurm: section in YAML

See the Local Workstation and Cluster guides.


Checkpointing#

NeMo AutoModel writes distributed checkpoints with PyTorch Distributed Checkpoint (DCP), stored as SafeTensors shards. Checkpoints carry partition metadata, so you can:

  • Merge into a single Hugging Face-compatible checkpoint for inference or sharing.

  • Reshard when loading onto a different mesh or topology.

  • Resume training from any checkpoint without manual intervention.
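
For orientation, the generic PyTorch DCP calls underneath look roughly like this; the paths and state-dict layout are illustrative, not the checkpoint component's exact API.

# Generic PyTorch DCP save/load: each rank writes/reads its own shards, and loading
# reshards automatically onto the current mesh.
import torch
import torch.distributed.checkpoint as dcp

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters())
state = {"model": model.state_dict(), "optim": optimizer.state_dict()}

dcp.save(state, checkpoint_id="checkpoints/step_1000")
dcp.load(state, checkpoint_id="checkpoints/step_1000")   # loads in place into `state`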

See the Checkpointing guide for details.

Experiment Tracking#

NeMo AutoModel integrates with MLflow and Weights & Biases for experiment tracking, metric logging, and artifact management. See the Experiment Tracking guide.
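
Conceptually, the integration amounts to per-step metric and parameter logging. The snippet below uses MLflow's public API directly and is illustrative only; NeMo AutoModel wires this up from the training config, and the hyperparameter names are examples.

# Illustrative MLflow logging with example run and parameter names.
import mlflow

with mlflow.start_run(run_name="llama3_2_1b_squad"):
    mlflow.log_params({"local_batch_size": 16, "learning_rate": 2e-5})
    mlflow.log_metric("train_loss", 1.234, step=100)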