NeMo AutoModel Documentation#
PyTorch SPMD (Single Program, Multiple Data) training for LLMs and VLMs with day-0 Hugging Face model support.
Introduction to NeMo AutoModel#
Learn about NeMo AutoModel, how it works at a high level, and its key features.
Overview of NeMo AutoModel and its capabilities.
Supported workflows, parallelism, recipes, components, and benchmarks.
A transformers-compatible library with accelerated model implementations.
Built on transformers for day-0 model support and out-of-the-box compatibility.
Quickstart#
Select a modality and task to find the right guide.
Performance#
Training throughput on NVIDIA GPUs with optimized kernels for Hugging Face models.
| Model | GPUs | TFLOPs/sec/GPU | Tokens/sec/GPU | Optimizations |
|---|---|---|---|---|
| DeepSeek V3 671B | 256 | 250 | 1,002 | TE + DeepEP |
| GPT-OSS 20B | 8 | 279 | 13,058 | TE + DeepEP + FlexAttn |
| Qwen3 MoE 30B | 8 | 277 | 12,040 | TE + DeepEP |
See the full benchmark results for configuration details and more models.
Get Started#
Install NeMo AutoModel and launch your first training job.
Install via PyPI, Docker, or from source.
YAML-driven recipes with CLI overrides.
Run on a single GPU or multi-GPU with torchrun.
Multi-node training with SLURM and the automodel CLI.
Bring your own dataset for LLM, VLM, or retrieval training.
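As an illustration of the YAML-driven, CLI-override workflow mentioned above, here is a minimal sketch of how dotted overrides (e.g. `optim.lr=1e-4`) can be merged into a nested recipe. The recipe keys and the merge helper are hypothetical illustrations, not NeMo AutoModel's actual schema or implementation:

```python
# Illustrative sketch: merging dotted CLI overrides into a nested
# recipe dict. The keys below are hypothetical examples, not NeMo
# AutoModel's actual configuration schema.

def apply_override(config: dict, override: str) -> None:
    """Apply a single 'a.b.c=value' override in place."""
    path, raw = override.split("=", 1)
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Best-effort literal parsing (int, then float, else string).
    try:
        value: object = int(raw)
    except ValueError:
        try:
            value = float(raw)
        except ValueError:
            value = raw
    node[keys[-1]] = value

recipe = {"model": {"name": "meta-llama/Llama-3.2-1B"}, "optim": {"lr": 2e-5}}
for ov in ["optim.lr=1e-4", "trainer.max_steps=100"]:
    apply_override(recipe, ov)

print(recipe["optim"]["lr"])           # 0.0001
print(recipe["trainer"]["max_steps"])  # 100
```

The dotted-path convention keeps command lines short while leaving the full recipe in version-controlled YAML.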
Advanced Topics#
Parallelism, precision, checkpointing strategies, and experiment tracking.
Torch-native pipelining composable with FSDP2 and DTensor.
Mixed-precision FP8 training with torchao.
Distributed checkpoints with SafeTensors output.
Trade compute for memory with activation checkpointing.
Train with quantization for deployment-ready models.
Track experiments and metrics with MLflow and Wandb.
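The compute-for-memory trade behind activation checkpointing, noted in the list above, can be sketched in plain Python: store only every N-th intermediate activation and recompute the rest from the nearest stored one during the backward pass. This is a toy illustration only; real training would use `torch.utils.checkpoint`:

```python
# Toy illustration of the activation-checkpointing trade-off:
# storing every activation (memory-heavy) vs. storing a subset and
# recomputing the rest (memory-light, extra compute). Not NeMo
# AutoModel's implementation; real training uses torch.utils.checkpoint.

def forward(x, layers):
    """Run all layers, storing every intermediate activation."""
    acts = [x]
    for f in layers:
        acts.append(f(acts[-1]))
    return acts  # memory cost: one activation per layer

def forward_checkpointed(x, layers, segment=2):
    """Store only every `segment`-th activation; anything between two
    stored points can be recomputed forward from the earlier one."""
    stored = {0: x}
    cur = x
    for i, f in enumerate(layers, start=1):
        cur = f(cur)
        if i % segment == 0:
            stored[i] = cur
    return cur, stored  # memory cost: roughly 1/segment of the activations

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * 5]
full = forward(0, layers)
out, stored = forward_checkpointed(0, layers)
print(full[-1], out, len(full), len(stored))  # same output, fewer stored
```

The output is identical either way; only the number of activations kept in memory changes, which is exactly the lever the activation-checkpointing guide describes.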
For Developers#
Components, recipes, and CLI architecture.
Auto-generated Python API documentation.