Key Features and Concepts

NeMo AutoModel provides GPU-accelerated, transformers-compatible training for LLMs and VLMs. It combines Hugging Face’s model ecosystem with NVIDIA’s optimized training stack, delivering high throughput without sacrificing ease of use.

Why NeMo AutoModel?

  • Hugging Face native: Train any model from the Hub with no checkpoint conversion — day-0 support for new releases.
  • High performance: Custom CUDA kernels (TransformerEngine, DeepEP, FlexAttn) deliver up to 279 TFLOPs/sec/GPU.
  • Any scale: The same recipe runs on 1 GPU or across hundreds of nodes — parallelism is configuration, not code.
  • Hackable: Linear training scripts with YAML config. No hidden trainer abstractions.
  • Open source: Apache 2.0 licensed, NVIDIA-supported, and actively maintained.

Performance Highlights

Model              GPUs   TFLOPs/sec/GPU   Tokens/sec/GPU   Optimizations
DeepSeek V3 671B   256    250              1,002            TE + DeepEP
GPT-OSS 20B        8      279              13,058           TE + DeepEP + FlexAttn
Qwen3 MoE 30B      8      212              11,842           TE + DeepEP

See the full benchmark results for configuration details and more models.
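
The table reports per-GPU figures, so an approximate cluster-wide throughput follows from multiplying by the GPU count. The sketch below does that arithmetic using only the numbers in the table; the function name is illustrative, and the product assumes every GPU sustains the reported per-GPU rate.

```python
# Per-GPU figures copied from the Performance Highlights table.
BENCHMARKS = {
    "DeepSeek V3 671B": {"gpus": 256, "tokens_per_sec_per_gpu": 1_002},
    "GPT-OSS 20B": {"gpus": 8, "tokens_per_sec_per_gpu": 13_058},
    "Qwen3 MoE 30B": {"gpus": 8, "tokens_per_sec_per_gpu": 11_842},
}


def aggregate_tokens_per_sec(gpus: int, tokens_per_sec_per_gpu: int) -> int:
    """Cluster-wide tokens/sec, assuming the per-GPU rate holds on every GPU."""
    return gpus * tokens_per_sec_per_gpu


for name, row in BENCHMARKS.items():
    total = aggregate_tokens_per_sec(**row)
    print(f"{name}: {total:,} tokens/sec across {row['gpus']} GPUs")
```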


Training Workflows

NeMo AutoModel supports a range of training tasks across LLM and VLM modalities.


Parallelism and Scaling

NeMo AutoModel leverages PyTorch-native parallelism strategies to scale training from a single GPU to multi-node clusters.

FSDP2

Fully Sharded Data Parallelism with DTensor for memory-efficient distributed training. Supports hybrid sharding (HSDP) for multi-node training.
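
Hybrid sharding arranges ranks in a 2D grid: parameters are sharded along one axis and replicated along the other. The sketch below shows one common layout, sharding within a node and replicating across nodes. That layout and the `hsdp_mesh` helper are assumptions made for illustration, not necessarily how NeMo AutoModel builds its mesh.

```python
def hsdp_mesh(num_nodes: int, gpus_per_node: int) -> list[list[int]]:
    """Return a 2D grid of global ranks: one row per node.

    Assumes ranks are numbered node by node, so row i holds the ranks of node i.
    """
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return [
        [node * gpus_per_node + local for local in range(gpus_per_node)]
        for node in range(num_nodes)
    ]


mesh = hsdp_mesh(num_nodes=2, gpus_per_node=4)

# Each row is a shard group: parameters are split across the ranks in it.
shard_groups = mesh
# Each column is a replicate group: its ranks hold the same parameter shard.
replicate_groups = [list(column) for column in zip(*mesh)]

print("shard groups:    ", shard_groups)
print("replicate groups:", replicate_groups)
```

With this layout the heavier shard-gather traffic stays inside a node, and only the cross-replica gradient reduction crosses node boundaries.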

Pipeline Parallelism

Torch-native pipelining composable with FSDP2 and DTensor for 3D parallelism.
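
In a 3D layout the world size factors into pipeline, data-parallel, and tensor-parallel dimensions, and every global rank has one coordinate along each. The following sketch maps a rank to its coordinates. The axis order (tensor-parallel varying fastest) and all names are assumptions for illustration; the actual ordering is a framework convention.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    pp: int  # pipeline stage index
    dp: int  # data-parallel (FSDP) index
    tp: int  # tensor-parallel index


def rank_to_coord(rank: int, pp_size: int, dp_size: int, tp_size: int) -> Coord:
    """Map a global rank to mesh coordinates, with tp varying fastest."""
    world_size = pp_size * dp_size * tp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside a world of size {world_size}")
    tp = rank % tp_size
    dp = (rank // tp_size) % dp_size
    pp = rank // (tp_size * dp_size)
    return Coord(pp=pp, dp=dp, tp=tp)


# 16 GPUs split as 2 pipeline stages x 4 data-parallel x 2 tensor-parallel.
for rank in (0, 1, 2, 8, 15):
    print(rank, rank_to_coord(rank, pp_size=2, dp_size=4, tp_size=2))
```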

FP8 Mixed Precision

FP8 training via torchao for reduced memory and higher throughput on supported models.

Multi-Node with SLURM

Add a slurm: section to any YAML config and launch with the automodel CLI. See the Cluster guide.


Core Concepts

Recipes

Recipes are executable Python scripts paired with YAML configuration files. Each recipe defines a complete training workflow:

  1. Load a model and tokenizer from Hugging Face (via _target_ in YAML)
  2. Prepare a dataset with the appropriate collator and chat template
  3. Train with a configurable loop (gradient accumulation, validation, logging)
  4. Checkpoint using Distributed Checkpoint (DCP) with SafeTensors output

recipe:
  _target_: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: train
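
To make the `_target_` convention concrete, here is a minimal sketch of how a config section can be turned into an object: import the dotted path, then call it with the remaining keys as keyword arguments. The helpers `resolve_target` and `instantiate` are hypothetical and are not NeMo AutoModel's own loader. A standard-library class stands in for a real model target so the example runs anywhere.

```python
import importlib
from typing import Any


def resolve_target(path: str) -> Any:
    """Import the object named by a dotted path such as 'pkg.mod.Class.method'."""
    parts = path.split(".")
    # Find the longest importable module prefix, then walk the remaining attributes.
    for cut in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:cut]))
        except ModuleNotFoundError:
            continue
        for attr in parts[cut:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"cannot resolve {path!r}")


def instantiate(section: dict) -> Any:
    """Call the section's _target_ with the remaining keys as keyword arguments."""
    kwargs = {key: value for key, value in section.items() if key != "_target_"}
    return resolve_target(section["_target_"])(**kwargs)


# Stand-in section: the target is a standard-library class, not a model loader.
section = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
print(instantiate(section))
```

Under this reading, the `model:` section of a recipe amounts to calling the function named by its `_target_` with `pretrained_model_name_or_path` as a keyword argument.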

Override any field from the CLI:

$ automodel --nproc-per-node=8 examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
    --step_scheduler.local_batch_size 16
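
A dotted flag such as `--step_scheduler.local_batch_size 16` addresses a nested key in the YAML config. The sketch below shows one way such overrides could be applied to a loaded config dictionary. The function names, the value-parsing rule, and the starting batch size of 8 are assumptions for illustration, not the CLI's actual implementation.

```python
import ast
from typing import Any


def parse_value(text: str) -> Any:
    """Read numbers and other Python literals; fall back to the raw string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_override(config: dict, dotted_key: str, raw_value: str) -> None:
    """Set config[a][b][c] for a key written as 'a.b.c', creating parents as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = parse_value(raw_value)


def apply_cli_overrides(config: dict, argv: list[str]) -> dict:
    """Consume '--a.b.c value' pairs from argv and write them into config."""
    if len(argv) % 2:
        raise ValueError("expected '--key value' pairs")
    for flag, raw_value in zip(argv[::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected a '--key' flag, got {flag!r}")
        apply_override(config, flag[2:], raw_value)
    return config


# Placeholder config; the initial batch size is made up for the example.
config = {"step_scheduler": {"local_batch_size": 8}}
apply_cli_overrides(config, ["--step_scheduler.local_batch_size", "16"])
print(config)
```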

Components

Components are modular, self-contained building blocks that recipes assemble:

Component      Purpose
datasets/      LLM and VLM datasets with collators, tokenization, and chat templates
distributed/   FSDP2, MegatronFSDP, tensor/sequence/pipeline parallelism
_peft/         LoRA and QLoRA implementations
attention/     Fused attention, rotary embeddings, FlexAttn
checkpoint/    DCP save/load with SafeTensors output
moe/           Mixture of Experts routing and DeepEP integration
optim/         Optimizers and LR schedulers
loss/          Cross-entropy, linear cross-entropy, KD loss
launcher/      SLURM and interactive job launch

Each component can be used independently and has no cross-module imports.

The automodel CLI

The CLI simplifies job launch across environments:

# Single-node interactive
$ automodel config.yaml

# Multi-node SLURM batch
$ sbatch my_cluster.sub  # copy slurm.sub, edit CONFIG & SBATCH directives, then submit

See the Run on Your Local Workstation and Cluster guides.


Checkpointing

NeMo AutoModel writes Distributed Checkpoints (DCP) with SafeTensors shards. Each checkpoint carries partition metadata, which makes it possible to:

  • Merge into a single Hugging Face-compatible checkpoint for inference or sharing.
  • Reshard when loading onto a different mesh or topology.
  • Resume training from any checkpoint without manual intervention.

See the Checkpointing guide for details.
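
As a toy illustration of why partition metadata enables merging and resharding, the sketch below splits a flat list of weights into shards that each record their offset, rebuilds the full list from those offsets, and re-splits it for a different number of ranks. This is a conceptual model only. The `Shard` class and helper functions are hypothetical and do not reflect DCP's real on-disk format or API.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    offset: int          # start index of this shard within the full tensor
    values: list[float]  # this shard's slice of the data


def shard(full: list[float], world_size: int) -> list[Shard]:
    """Split a flat tensor into near-equal contiguous shards, one per rank."""
    base, extra = divmod(len(full), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        shards.append(Shard(offset=start, values=full[start:start + size]))
        start += size
    return shards


def merge(shards: list[Shard]) -> list[float]:
    """Rebuild the full tensor by placing each shard at its recorded offset."""
    full = [0.0] * sum(len(s.values) for s in shards)
    for s in shards:
        full[s.offset:s.offset + len(s.values)] = s.values
    return full


def reshard(shards: list[Shard], new_world_size: int) -> list[Shard]:
    """Re-partition saved shards for a different number of ranks."""
    return shard(merge(shards), new_world_size)


weights = [float(i) for i in range(10)]
saved = shard(weights, world_size=4)         # shard sizes 3, 3, 2, 2
restored = reshard(saved, new_world_size=3)  # shard sizes 4, 3, 3
print([len(s.values) for s in saved], [len(s.values) for s in restored])
```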

Experiment Tracking

NeMo AutoModel integrates with MLflow and Weights & Biases for experiment tracking, metric logging, and artifact management. See the Experiment Tracking guide.