NeMo AutoModel Documentation
PyTorch-native training that scales from 1 GPU to thousands with a single config change. Load any Hugging Face model, point at your data, and start training; no checkpoint conversion and no boilerplate.
Quick Links
Overview of NeMo AutoModel and its capabilities.
Supported workflows, parallelism, recipes, and benchmarks.
A transformers-compatible library with accelerated model implementations.
Built on Hugging Face transformers for day-0 model support and out-of-the-box compatibility.
Get Started
See the installation guide for Docker, source builds, and multi-node setup. See the configuration guide for YAML recipes and CLI overrides. Launch jobs on a local workstation or on a SLURM cluster.
Latest Model Support
New models are added regularly. Pick a model below to start fine-tuning, or see the full release log.
Recipes and Guides
Find the right guide for your task: fine-tuning, pretraining, distillation, diffusion, and more.
Performance
Training throughput on NVIDIA GPUs with optimized kernels for Hugging Face models.
See the full benchmark results for configuration details and more models.
Advanced Topics
Parallelism, precision, checkpointing strategies, and experiment tracking.
Torch-native pipeline parallelism, composable with FSDP2 and DTensor for 3D parallelism.
FP8 mixed-precision training with torchao.
BF16 mixed precision: fp32 master weights, bf16 compute, and the precision traps to avoid.
Distributed checkpointing (DCP) with SafeTensors output.
Activation checkpointing to trade compute for memory.
Quantization-aware training (QAT) for deployment-ready models.
Experiment and metric tracking with MLflow and Weights & Biases (W&B).
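One of the precision traps behind the advice to keep fp32 master weights with bf16 compute can be demonstrated without a GPU. The standard-library sketch below simulates bfloat16 by rounding float32 bit patterns to their top 16 bits (round-to-nearest-even) and applies 100 small SGD updates to a single weight. It illustrates the numerics only and is not AutoModel's mixed-precision implementation. `to_fp32`, `to_bf16`, and `train` are hypothetical helpers, and NaN/Inf handling is deliberately omitted.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (binary64) to the nearest IEEE-754 float32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to bfloat16: keep the top 16 bits of float32, round-to-nearest-even.

    Hypothetical helper; NaN/Inf handling omitted for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def train(steps: int, lr: float, grad: float) -> tuple[float, float]:
    """Apply `steps` SGD updates (w -= lr * grad) to one weight starting at 1.0.

    Returns (bf16_only_weight, fp32_master_weight).
    """
    update = lr * grad

    # Path A: the weight is stored and updated in bf16 only.
    w_bf16 = to_bf16(1.0)
    for _ in range(steps):
        w_bf16 = to_bf16(w_bf16 - to_bf16(update))

    # Path B: an fp32 master weight accumulates updates; a bf16 copy
    # (to_bf16(w_master)) would be what the forward/backward pass sees.
    w_master = to_fp32(1.0)
    for _ in range(steps):
        w_master = to_fp32(w_master - to_fp32(update))

    return w_bf16, w_master


bf16_only, master = train(steps=100, lr=1e-3, grad=1.0)
print(bf16_only)         # 1.0: every update rounds away
print(master)            # close to 0.9
print(to_bf16(master))   # bf16 compute copy of the master weight
```

Just below 1.0, adjacent bf16 values are 2^-8 (about 0.0039) apart, so a 0.001 step is less than half that spacing and rounds back to 1.0 on every iteration. The fp32 master copy has enough mantissa bits to accumulate the steps, and only the compute copy is rounded to bf16.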
For Developers
Components, recipes, and CLI architecture.
Auto-generated Python API documentation.
Drop-in accelerated backend for TRL, lm-eval-harness, OpenRLHF, or any code that loads Hugging Face models.