Key Features and Concepts

NeMo AutoModel provides GPU-accelerated, transformers-compatible training for LLMs and VLMs. It combines Hugging Face’s model ecosystem with NVIDIA’s optimized training stack, delivering high throughput without sacrificing ease of use.

Why NeMo AutoModel?

Hugging Face native: Train any model from the Hub with no checkpoint conversion — day-0 support for new releases.
High performance: Custom CUDA kernels (TransformerEngine, DeepEP, FlexAttn) deliver up to 279 TFLOPs/sec/GPU, measured on GPT-OSS 20B with 8 GPUs (see Performance Highlights).
Any scale: The same recipe runs on 1 GPU or across hundreds of nodes — parallelism is configuration, not code.
Hackable: Linear training scripts with YAML config. No hidden trainer abstractions.
Open source: Apache 2.0 licensed, NVIDIA-supported, and actively maintained.

Performance Highlights

Model	GPUs	TFLOPs/sec/GPU	Tokens/sec/GPU	Optimizations
DeepSeek V3 671B	256	250	1,002	TE + DeepEP
GPT-OSS 20B	8	279	13,058	TE + DeepEP + FlexAttn
Qwen3 MoE 30B	8	212	11,842	TE + DeepEP

See the full benchmark results for configuration details and more models.
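The per-GPU figures in the table can be turned into cluster-level throughput with simple arithmetic. The sketch below is illustrative only: it assumes the tokens/sec/GPU column is an average over all GPUs in the job, and the helper names (`cluster_tokens_per_sec`, `hours_to_process`) are hypothetical, not part of NeMo AutoModel.

```python
# Rows copied from the table above: (model, GPUs, TFLOPs/sec/GPU, tokens/sec/GPU).
ROWS = [
    ("DeepSeek V3 671B", 256, 250, 1_002),
    ("GPT-OSS 20B", 8, 279, 13_058),
    ("Qwen3 MoE 30B", 8, 212, 11_842),
]


def cluster_tokens_per_sec(gpus: int, tokens_per_sec_per_gpu: int) -> int:
    """Aggregate throughput, assuming the per-GPU figure is a per-GPU average."""
    return gpus * tokens_per_sec_per_gpu


def hours_to_process(num_tokens: int, gpus: int, tokens_per_sec_per_gpu: int) -> float:
    """Wall-clock hours to process num_tokens at the aggregate rate."""
    seconds = num_tokens / cluster_tokens_per_sec(gpus, tokens_per_sec_per_gpu)
    return seconds / 3600


for model, gpus, _tflops, tokens in ROWS:
    total = cluster_tokens_per_sec(gpus, tokens)
    print(f"{model}: {total:,} tokens/sec across {gpus} GPUs")
```

Under that assumption, the DeepSeek V3 row corresponds to 256 × 1,002 = 256,512 tokens/sec for the whole job.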

Training Workflows

NeMo AutoModel supports a range of training tasks across LLM and VLM modalities.

Supervised Fine-Tuning (SFT)

Full-parameter fine-tuning for task-specific adaptation.

PEFT (LoRA / QLoRA)

Memory-efficient fine-tuning by updating only low-rank adapter weights.
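To make "low-rank adapter weights" concrete, here is a minimal pure-Python sketch of the standard LoRA formulation: the frozen weight W is augmented with a trainable product B·A of rank r, scaled by alpha / r. The layer size, rank, and function names are hypothetical and chosen for illustration; this is not NeMo AutoModel's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * d for y, d in zip(base, delta)]


def lora_params(d_out, d_in, rank):
    """Trainable parameters: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical 4096 x 4096 projection with rank-8 adapters.
full = 4096 * 4096
adapter = lora_params(4096, 4096, rank=8)
print(f"trainable fraction: {adapter / full:.4%}")

# With B initialized to zero, the adapted layer matches the frozen layer.
w = [[1.0, 2.0], [3.0, 4.0]]
a = [[1.0, 1.0]]          # rank 1: A is 1 x 2
b_zero = [[0.0], [0.0]]   # B is 2 x 1
assert lora_forward(w, a, b_zero, [1.0, 1.0], alpha=2, rank=1) == matvec(w, [1.0, 1.0])
```

For this hypothetical layer the adapter holds 65,536 parameters against 16,777,216 in the full matrix, which is where the memory saving comes from.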

Pre-Training

Train models from scratch on large-scale datasets.

Knowledge Distillation

Transfer knowledge from a large teacher to a smaller student model.
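One common way to express this transfer is a temperature-softened KL divergence between teacher and student output distributions. The sketch below shows that widely used formulation in plain Python; it is an assumption that a KD loss takes this shape, and the actual loss in NeMo AutoModel may differ in weighting or reduction.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [2.0, 1.0, 0.1]   # hypothetical logits for one token position
student = [1.5, 1.2, 0.3]
print(kd_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge; in practice it is usually combined with the ordinary cross-entropy on ground-truth labels.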

Tool Calling

Fine-tune models for structured function calling with tool schemas.

Quantization-Aware Training

Train with quantization for deployment-ready models.


Parallelism and Scaling

NeMo AutoModel leverages PyTorch-native parallelism strategies to scale training from a single GPU to multi-node clusters.

FSDP2

Fully Sharded Data Parallelism with DTensor for memory-efficient distributed training. Supports Hybrid Sharding (HSDP) for multi-node training.
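The trade-off between full sharding and hybrid sharding can be estimated with back-of-the-envelope arithmetic. The sketch below assumes the usual HSDP layout (parameters sharded within a group of GPUs, typically one node, and replicated across groups) and counts parameter memory only; gradients, optimizer state, and activations are ignored. The model size and helper names are hypothetical.

```python
def hsdp_mesh(world_size: int, shard_group_size: int) -> dict:
    """Split world_size GPUs into replica groups that each shard the model."""
    if world_size % shard_group_size != 0:
        raise ValueError("world_size must be a multiple of shard_group_size")
    return {"replicate": world_size // shard_group_size, "shard": shard_group_size}


def param_bytes_per_gpu(num_params: int, bytes_per_param: int, shard_group_size: int) -> float:
    """Parameter memory per GPU when weights are split evenly across the shard group."""
    return num_params * bytes_per_param / shard_group_size


NUM_PARAMS = 8_000_000_000  # hypothetical 8B-parameter model
BF16_BYTES = 2

# Full sharding across 2 nodes x 8 GPUs: every GPU holds 1/16 of the weights.
full_shard = param_bytes_per_gpu(NUM_PARAMS, BF16_BYTES, 16)

# Hybrid sharding: shard within each 8-GPU node, replicate across the 2 nodes.
mesh = hsdp_mesh(world_size=16, shard_group_size=8)
hybrid = param_bytes_per_gpu(NUM_PARAMS, BF16_BYTES, mesh["shard"])

print(f"full shard: {full_shard / 1e9:.1f} GB/GPU, hybrid: {hybrid / 1e9:.1f} GB/GPU")
```

Hybrid sharding uses more memory per GPU but keeps the sharding collectives inside a node, which is the usual reason to prefer it on multi-node jobs.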

Pipeline Parallelism

Torch-native pipelining composable with FSDP2 and DTensor for 3D parallelism.

FP8 Mixed Precision

FP8 training via torchao for reduced memory and higher throughput on supported models.

Multi-Node with SLURM

Add a slurm: section to any YAML config and launch with the automodel CLI. See the Cluster guide.

Core Concepts

Recipes

Recipes are executable Python scripts paired with YAML configuration files. Each recipe defines a complete training workflow:

Load a model and tokenizer from Hugging Face (via _target_ in YAML)
Prepare a dataset with the appropriate collator and chat template
Train with a configurable loop (gradient accumulation, validation, logging)
Checkpoint using Distributed Checkpoint (DCP) with SafeTensors output

recipe:
  _target_: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: train
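The `_target_` keys in the config above name a Python callable by dotted path, with the sibling keys passed as keyword arguments. The following is a minimal sketch of how such a key could be resolved, in the style popularized by Hydra. `resolve_target` and `instantiate` are hypothetical helpers written for illustration, not NeMo AutoModel's loader, and the demo uses a standard-library target so it runs without the framework installed.

```python
import importlib


def resolve_target(path: str):
    """Import the object named by a dotted path such as 'pkg.module.Class.method'."""
    parts = path.split(".")
    # Try the longest importable module prefix, then walk the remaining attributes.
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ImportError:
            continue
        for attr in parts[i:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"cannot resolve {path!r}")


def instantiate(node: dict):
    """Call the node's _target_ with every other key as a keyword argument."""
    kwargs = {key: value for key, value in node.items() if key != "_target_"}
    return resolve_target(node["_target_"])(**kwargs)


# Stand-in for a 'Class.from_pretrained'-style target: a classmethod in the stdlib.
node = {"_target_": "fractions.Fraction.from_float", "f": 0.5}
print(instantiate(node))
```

Because the target is just a string, swapping the model or dataset is a config change rather than a code change.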

Override any field from the CLI:

$ automodel --nproc-per-node=8 examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
>   --step_scheduler.local_batch_size 16
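A dotted override such as `--step_scheduler.local_batch_size 16` addresses a nested key in the YAML tree. The sketch below shows one way that mapping could work on a plain dictionary; `apply_override` and `parse_value` are hypothetical helpers, the starting value of 8 is invented for the demo, and the real CLI's parsing rules may differ.

```python
import ast


def parse_value(text: str):
    """Interpret '16' as int, '1e-4' as float, 'True' as bool; fall back to str."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_override(config: dict, dotted_key: str, raw_value: str) -> dict:
    """Set config[a][b][c] = value for a key written as '--a.b.c'."""
    keys = dotted_key.lstrip("-").split(".")
    node = config
    for key in keys[:-1]:
        # Create intermediate sections that the YAML file did not define.
        node = node.setdefault(key, {})
    node[keys[-1]] = parse_value(raw_value)
    return config


config = {"step_scheduler": {"local_batch_size": 8}}  # hypothetical loaded YAML
apply_override(config, "--step_scheduler.local_batch_size", "16")
print(config)
```

Values that are not Python literals, such as a model name, are kept as strings.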

Components

Components are modular, self-contained building blocks that recipes assemble:

Component	Purpose
`datasets/`	LLM and VLM datasets with collators, tokenization, and chat templates
`distributed/`	FSDP2, MegatronFSDP, tensor/sequence/pipeline parallelism
`_peft/`	LoRA and QLoRA implementations
`attention/`	Fused attention, rotary embeddings, FlexAttn
`checkpoint/`	DCP save/load with SafeTensors output
`moe/`	Mixture of Experts routing and DeepEP integration
`optim/`	Optimizers and LR schedulers
`loss/`	Cross-entropy, linear cross-entropy, KD loss
`launcher/`	SLURM and interactive job launch

Each component can be used independently and has no cross-module imports.

The `automodel` CLI

The CLI simplifies job launch across environments:

# Single-node interactive
$ automodel config.yaml

# Multi-node SLURM batch
$ sbatch my_cluster.sub  # copy slurm.sub, edit CONFIG & SBATCH directives, then submit

See the Run on Your Local Workstation and Cluster guides.

Checkpointing

NeMo AutoModel writes Distributed Checkpoints (DCP) with SafeTensors shards. Each checkpoint carries partition metadata, which makes it possible to:

Merge into a single Hugging Face-compatible checkpoint for inference or sharing.
Reshard when loading onto a different mesh or topology.
Resume training from any checkpoint without manual intervention.

See the Checkpointing guide for details.
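The role of partition metadata in merging and resharding can be illustrated with a toy model. In the sketch below a "checkpoint" is a flat list of values split into contiguous shards, each tagged with its offset. This is a simplification for intuition only: it is not the DCP on-disk format, and all function names are hypothetical.

```python
def shard(values, world_size):
    """Split a flat list into contiguous, near-equal shards with offset metadata."""
    base, extra = divmod(len(values), world_size)
    shards, offset = [], 0
    for rank in range(world_size):
        length = base + (1 if rank < extra else 0)
        shards.append({"offset": offset, "data": values[offset:offset + length]})
        offset += length
    return shards


def merge(shards):
    """Rebuild the full list; the offsets make shard order on disk irrelevant."""
    ordered = sorted(shards, key=lambda s: s["offset"])
    return [value for s in ordered for value in s["data"]]


def reshard(shards, new_world_size):
    """Load shards written by one topology onto a different number of ranks."""
    return shard(merge(shards), new_world_size)


weights = list(range(10))          # stand-in for a flattened parameter tensor
saved = shard(weights, world_size=4)
loaded = reshard(saved, new_world_size=3)

assert merge(saved) == weights     # merge into a single checkpoint
assert merge(loaded) == weights    # same content on a different mesh
print([len(s["data"]) for s in saved], [len(s["data"]) for s in loaded])
```

Without the offsets, a loader would have to assume the saving topology; with them, the same files serve merging, resharding, and resuming.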

Experiment Tracking

NeMo AutoModel integrates with MLflow and Weights & Biases for experiment tracking, metric logging, and artifact management. See the Experiment Tracking guide.