# Key Features and Concepts
NeMo AutoModel provides GPU-accelerated, transformers-compatible training for LLMs and VLMs. It combines Hugging Face’s model ecosystem with NVIDIA’s optimized training stack, delivering high throughput without sacrificing ease of use.
## Why NeMo AutoModel?
- **Hugging Face native**: Train any model from the Hub with no checkpoint conversion, with day-0 support for new releases.
- **High performance**: Custom CUDA kernels (TransformerEngine, DeepEP, FlexAttn) deliver up to 279 TFLOPs/sec/GPU.
- **Any scale**: The same recipe runs on 1 GPU or across hundreds of nodes; parallelism is configuration, not code.
- **Hackable**: Linear training scripts with YAML config. No hidden trainer abstractions.
- **Open source**: Apache 2.0 licensed, NVIDIA-supported, and actively maintained.
## Performance Highlights
| Model | GPUs | TFLOPs/sec/GPU | Tokens/sec/GPU | Optimizations |
|---|---|---|---|---|
| DeepSeek V3 671B | 256 | 250 | 1,002 | TE + DeepEP |
| GPT-OSS 20B | 8 | 279 | 13,058 | TE + DeepEP + FlexAttn |
| Qwen3 MoE 30B | 8 | 212 | 11,842 | TE + DeepEP |
See the full benchmark results for configuration details and more models.
## Training Workflows
NeMo AutoModel supports a range of training tasks across LLM and VLM modalities.
- Full-parameter fine-tuning for task-specific adaptation.
- Memory-efficient fine-tuning by updating only low-rank adapter weights.
- Training models from scratch on large-scale datasets.
- Knowledge distillation from a large teacher to a smaller student model.
- Fine-tuning for structured function calling with tool schemas.
- Training with quantization for deployment-ready models.
## Parallelism and Scaling
NeMo AutoModel leverages PyTorch-native parallelism strategies to scale training from a single GPU to multi-node clusters.
- Fully Sharded Data Parallelism (FSDP2) with DTensor for memory-efficient distributed training; Hybrid Sharding (HSDP) is supported for multi-node training.
- Torch-native pipelining, composable with FSDP2 and DTensor for 3D parallelism.
- FP8 training via torchao for reduced memory and higher throughput on supported models.
- SLURM launch: add a `slurm:` section to any YAML config and launch with the `automodel` CLI. See the Cluster guide.
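As a loose illustration, such a section might look like the sketch below. The field names here are hypothetical, chosen only to convey the idea; the actual schema is documented in the Cluster guide:

```yaml
# Hypothetical field names for illustration only; consult the
# Cluster guide for the real slurm: schema.
slurm:
  nodes: 2
  time_limit: "04:00:00"
  account: my_slurm_account
```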
## Core Concepts
### Recipes
Recipes are executable Python scripts paired with YAML configuration files. Each recipe defines a complete training workflow:
1. Load a model and tokenizer from Hugging Face (via `_target_` in YAML)
2. Prepare a dataset with the appropriate collator and chat template
3. Train with a configurable loop (gradient accumulation, validation, logging)
4. Checkpoint using Distributed Checkpoint (DCP) with SafeTensors output
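The gradient-accumulation part of the configurable training loop can be illustrated with a short, generic PyTorch sketch. This is plain PyTorch, not NeMo AutoModel's actual loop; the toy model, batch shapes, and `accum_steps` value are illustrative:

```python
import torch

# Toy model and optimizer; in a real recipe these come from the YAML config.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # number of micro-batches accumulated per optimizer step

opt.zero_grad()
for step in range(8):
    x = torch.randn(2, 4)
    # Scale the loss so accumulated gradients average over micro-batches.
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()
    # Take an optimizer step only every accum_steps micro-batches.
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
```

The effective batch size is the micro-batch size times `accum_steps`, which is how a recipe trades memory for throughput without changing the optimizer's view of the data.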
```yaml
model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: train
```
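A `_target_` entry resolves a dotted import path to a Python callable at runtime. The following is a minimal, framework-agnostic sketch of that idea, not NeMo AutoModel's actual implementation (which also handles nested attributes like `Class.method` and forwards the remaining YAML keys as keyword arguments):

```python
from importlib import import_module

def resolve_target(dotted_path: str):
    """Resolve a dotted path such as 'collections.OrderedDict' to the object it names.

    Note: this simplified version assumes the final segment is a top-level
    attribute of a module; targets like 'pkg.Class.method' need an extra
    attribute walk.
    """
    module_path, _, attr = dotted_path.rpartition(".")
    return getattr(import_module(module_path), attr)

# Hypothetical usage: instantiate a stdlib class from a dotted path,
# as a stand-in for a model or dataset factory named in the YAML.
factory = resolve_target("collections.OrderedDict")
obj = factory(answer=42)
```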
Override any field from the CLI:
```bash
uv run torchrun --nproc-per-node=8 examples/llm_finetune/finetune.py \
  --config examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
  --step_scheduler.local_batch_size 16
```
### Components
Components are modular, self-contained building blocks that recipes assemble:
- LLM and VLM datasets with collators, tokenization, and chat templates
- FSDP2, MegatronFSDP, and tensor/sequence/pipeline parallelism
- LoRA and QLoRA implementations
- Fused attention, rotary embeddings, and FlexAttn kernels
- DCP save/load with SafeTensors output
- Mixture-of-Experts routing and DeepEP integration
- Optimizers and LR schedulers
- Cross-entropy, linear cross-entropy, and KD losses
- SLURM and interactive job launch
Each component can be used independently and has no cross-module imports.
### The `automodel` CLI
The CLI simplifies job launch across environments:
```bash
# Single-node interactive
automodel finetune llm -c config.yaml

# Multi-node SLURM batch (with a slurm: section in the YAML)
automodel finetune llm -c config.yaml
```
See the Local Workstation and Cluster guides.
### Checkpointing
NeMo AutoModel writes Distributed Checkpoints (DCP) with SafeTensors shards. Checkpoints carry partition metadata, so you can:

- Merge them into a single Hugging Face-compatible checkpoint for inference or sharing.
- Reshard them when loading onto a different mesh or topology.
- Resume training from any checkpoint without manual intervention.
See the Checkpointing guide for details.
### Experiment Tracking
NeMo AutoModel integrates with MLflow and Weights & Biases for experiment tracking, metric logging, and artifact management. See the Experiment Tracking guide.