Key Features and Concepts
NeMo AutoModel provides GPU-accelerated, transformers-compatible training for LLMs and VLMs. It combines Hugging Face’s model ecosystem with NVIDIA’s optimized training stack, delivering high throughput without sacrificing ease of use.
Why NeMo AutoModel?
- Hugging Face native: Train any model from the Hub with no checkpoint conversion — day-0 support for new releases.
- High performance: Custom CUDA kernels (TransformerEngine, DeepEP, FlexAttn) deliver up to 279 TFLOPs/sec/GPU.
- Any scale: The same recipe runs on 1 GPU or across hundreds of nodes — parallelism is configuration, not code.
- Hackable: Linear training scripts with YAML config. No hidden trainer abstractions.
- Open source: Apache 2.0 licensed, NVIDIA-supported, and actively maintained.
Performance Highlights
See the full benchmark results for configuration details and more models.
Training Workflows
NeMo AutoModel supports a range of training tasks across LLM and VLM modalities.
- Full-parameter fine-tuning for task-specific adaptation.
- Memory-efficient fine-tuning by updating only low-rank adapter weights.
- Train models from scratch on large-scale datasets.
- Transfer knowledge from a large teacher to a smaller student model.
- Fine-tune models for structured function calling with tool schemas.
- Train with quantization for deployment-ready models.
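To make the memory argument for low-rank adapters concrete, the sketch below counts trainable parameters for a single linear layer under full fine-tuning and under a rank-r adapter. The layer sizes and the rank are illustrative assumptions, not NeMo AutoModel defaults.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def low_rank_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a rank-`rank` adapter.

    The adapter factors the update as B @ A, where A is rank x d_in
    and B is d_out x rank; the original weight stays frozen.
    """
    return rank * d_in + d_out * rank


# Hypothetical sizes chosen for illustration only.
d_in, d_out, rank = 4096, 4096, 8
full = full_finetune_params(d_in, d_out)
adapter = low_rank_adapter_params(d_in, d_out, rank)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.4%}")
```

With these sizes the adapter trains well under one percent of the weights that full fine-tuning would update, which is where the savings in optimizer state and gradient memory come from.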
Parallelism and Scaling
NeMo AutoModel leverages PyTorch-native parallelism strategies to scale training from a single GPU to multi-node clusters.
- Fully Sharded Data Parallelism with DTensor for memory-efficient distributed training. Supports Hybrid Sharding (HSDP) for multi-node.
- Torch-native pipelining composable with FSDP2 and DTensor for 3D parallelism.
- FP8 training via torchao for reduced memory and higher throughput on supported models.
- Add a slurm: section to any YAML config and launch with the automodel CLI. See the Cluster guide.
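As a rough illustration of hybrid sharding, the sketch below arranges ranks in a two-dimensional grid: parameters are sharded across the GPUs of one node and replicated across nodes. The helper names and the one-shard-group-per-node layout are assumptions made for this example; in practice the mesh is set through the recipe configuration.

```python
from __future__ import annotations


def hsdp_shard_groups(world_size: int, gpus_per_node: int) -> list[list[int]]:
    """Return one shard group per node; ranks in a group hold different shards."""
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")
    num_nodes = world_size // gpus_per_node
    return [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]


def hsdp_replica_groups(shard_groups: list[list[int]]) -> list[list[int]]:
    """Return groups of ranks that hold the same shard on different nodes."""
    return [list(column) for column in zip(*shard_groups)]


# Hypothetical cluster: 2 nodes with 8 GPUs each.
shard_groups = hsdp_shard_groups(world_size=16, gpus_per_node=8)
replica_groups = hsdp_replica_groups(shard_groups)
print("shard groups:  ", shard_groups)
print("replica groups:", replica_groups)
```

In this layout the bandwidth-heavy gather and scatter of shards stays inside a node, while only gradient synchronization between replicas crosses the node boundary.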
Core Concepts
Recipes
Recipes are executable Python scripts paired with YAML configuration files. Each recipe defines a complete training workflow:
- Load a model and tokenizer from Hugging Face (via _target_ in YAML)
- Prepare a dataset with the appropriate collator and chat template
- Train with a configurable loop (gradient accumulation, validation, logging)
- Checkpoint using Distributed Checkpoint (DCP) with SafeTensors output
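The _target_ key resembles the convention popularized by Hydra, in which a dotted import path names the callable to construct and the remaining keys become its keyword arguments. The sketch below shows that resolution step under this assumption. The helper name instantiate and the example config are hypothetical, and a standard-library class stands in for a model or dataset factory; this is not the library's own loader.

```python
from __future__ import annotations

import importlib
from typing import Any


def instantiate(config: dict[str, Any]) -> Any:
    """Resolve config["_target_"] to a callable and call it with the other keys."""
    kwargs = dict(config)  # copy so the caller's config is left untouched
    module_path, _, attr_name = kwargs.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**kwargs)


# Stand-in config: a real recipe would point _target_ at a model,
# tokenizer, or dataset factory instead of datetime.timedelta.
config = {"_target_": "datetime.timedelta", "hours": 2}
obj = instantiate(config)
print(type(obj).__name__, obj.total_seconds())
```

Because the target is only a string in the config, swapping one model or dataset for another becomes a YAML change rather than a code change.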
Override any field from the CLI:
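The exact override syntax belongs to the automodel CLI and is not reproduced here. As a sketch of the general idea, assuming dotted key.path=value arguments (an assumption, as are the config keys below), an override can be merged into a nested config like this:

```python
from __future__ import annotations

import ast
from typing import Any


def apply_override(config: dict[str, Any], override: str) -> None:
    """Apply one 'a.b.c=value' override to a nested dict in place."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})  # create missing sections on the way down
    try:
        value = ast.literal_eval(raw_value)  # numbers, booleans, lists, ...
    except (ValueError, SyntaxError):
        value = raw_value  # anything else is kept as a plain string
    node[leaf] = value


# Hypothetical config and overrides, for illustration only.
config = {"optimizer": {"lr": 1e-4}, "model": {"name": "placeholder"}}
for arg in ["optimizer.lr=3e-5", "trainer.max_steps=100", "model.name=tiny"]:
    apply_override(config, arg)
print(config)
```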
Components
Components are modular, self-contained building blocks that recipes assemble.
Each component can be used independently and has no cross-module imports.
The automodel CLI
The CLI simplifies job launch across environments.
See the Run on Your Local Workstation and Cluster guides.
Checkpointing
NeMo AutoModel writes Distributed Checkpoints (DCP) with SafeTensors shards. Checkpoints carry partition metadata to:
- Merge into a single Hugging Face-compatible checkpoint for inference or sharing.
- Reshard when loading onto a different mesh or topology.
- Resume training from any checkpoint without manual intervention.
See the Checkpointing guide for details.
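The role of partition metadata can be illustrated with a toy example: when each shard records the offset of its slice in the full tensor, the shards can be merged into one array or re-split for a different number of ranks. This is a conceptual sketch on a flat list with hypothetical helper names; it does not reflect the DCP on-disk format.

```python
def shard(values, num_ranks):
    """Split values into contiguous shards, each tagged with its global offset."""
    size = -(-len(values) // num_ranks)  # ceiling division
    return [
        {"offset": rank * size, "data": values[rank * size:(rank + 1) * size]}
        for rank in range(num_ranks)
    ]


def merge(shards):
    """Rebuild the full list by ordering shards on their recorded offsets."""
    merged = []
    for piece in sorted(shards, key=lambda s: s["offset"]):
        merged.extend(piece["data"])
    return merged


def reshard(shards, new_num_ranks):
    """Repartition saved shards for a different number of ranks."""
    return shard(merge(shards), new_num_ranks)


weights = list(range(10))
saved = shard(weights, num_ranks=4)        # written by a 4-rank job
loaded = reshard(saved, new_num_ranks=2)   # read back by a 2-rank job
print([piece["data"] for piece in loaded])
```

The offsets are what make the shards self-describing: neither merging nor resharding needs to know how many ranks wrote the checkpoint.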
Experiment Tracking
NeMo AutoModel integrates with MLflow and Weights & Biases for experiment tracking, metric logging, and artifact management. See the Experiment Tracking guide.