About NeMo AutoModel#
NeMo AutoModel is an open-source, PyTorch DTensor-native SPMD (Single Program, Multiple Data) library within the NVIDIA NeMo Framework. It provides optimized model implementations with a Hugging Face-compatible API, so any model on the Hub works out of the box with no checkpoint conversion. It also ships ready-made recipes for training and fine-tuning LLMs and VLMs at scale.
Because AutoModel exposes the same AutoClass interface as transformers, it can also be used as a drop-in accelerated backend for other libraries – reinforcement learning frameworks, evaluation harnesses, or any codebase that loads Hugging Face models.
Target Users#
Machine learning engineers: Fine-tune and pre-train LLMs and VLMs at scale with minimal boilerplate.
Researchers: Rapidly prototype with hackable, linear training scripts and YAML-driven configuration.
Library and framework authors: Use AutoModel’s optimized model implementations as a drop-in replacement for transformers to accelerate RL, alignment, evaluation, or any downstream workflow.
How It Works#
NeMo AutoModel is built around two core ideas: recipes and components.
Recipes are executable Python scripts paired with YAML configs. Each recipe defines an end-to-end workflow – model loading, data preparation, training loop, and checkpointing – and can be launched with a single command.
Components are modular, self-contained building blocks (datasets, optimizers, loss functions, distribution strategies) that recipes compose together. Swap any component by changing a _target_ field in your YAML (see the sketch below).
This design means the training loop is always visible and hackable – no hidden abstractions. You configure parallelism, precision, and scaling through config, not code changes.
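As a rough illustration of how a _target_ field maps to code, a dotted path in the YAML config can be resolved to a Python object by dynamic import. The resolver below is a minimal, hypothetical sketch for explanation only, not AutoModel’s internal loader:
import importlib

def resolve_target(target: str):
    """Resolve a dotted _target_ path such as "torch.optim.AdamW" to the object it names."""
    module_name, _, attr = target.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

# Changing the YAML field (e.g. _target_: torch.optim.SGD instead of torch.optim.AdamW)
# changes which class this lookup returns; the recipe code itself stays the same.
optimizer_cls = resolve_target("torch.optim.AdamW")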
SPMD and DTensor#
NeMo AutoModel uses PyTorch’s native SPMD (Single Program, Multiple Data) model with DTensor and DeviceMesh:
One program, any scale: The same training script runs on 1 GPU or 1000+ by changing the mesh configuration.
Parallelism is configuration: Mix tensor, sequence, pipeline, and data parallelism by editing placements – no model rewrites.
Decoupled concerns: Model code stays pure PyTorch; the parallel strategy lives in config.
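A minimal sketch of the underlying PyTorch primitives (plain torch.distributed APIs, not AutoModel-specific code; the mesh shape is illustrative and assumes the script is launched with torchrun on 8 GPUs):
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

# A 2D mesh: 2-way data parallelism by 4-way tensor parallelism.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

# Shard a weight along dim 0 across the tensor-parallel axis.
# Scaling to more GPUs changes only the mesh shape, not this code.
weight = distribute_tensor(torch.randn(4096, 4096), mesh["tp"], placements=[Shard(0)])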
Key Technologies#
FSDP2 and MegatronFSDP: Memory-efficient sharded data parallelism for large-scale training, including Hybrid Sharding (HSDP).
Pipeline Parallelism: Torch-native pipelining composable with FSDP2 and DTensor for 3D parallelism.
Custom CUDA Kernels: Fused attention, TransformerEngine, DeepEP, and FlexAttn for optimized throughput.
FP8 Mixed Precision: FP8 training via torchao for supported models.
Distributed Checkpointing (DCP): Sharded SafeTensors checkpoints with merge and reshard utilities, interoperable with Hugging Face.
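For orientation, FSDP2 refers to PyTorch’s fully_shard API. The sketch below is plain PyTorch (torch 2.6 or newer, launched under torchrun on 8 GPUs), not an AutoModel-specific call, and the module and mesh size are illustrative:
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard

mesh = init_device_mesh("cuda", (8,))  # 8-way sharded data parallelism
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16)
fully_shard(layer, mesh=mesh)  # parameters become DTensors sharded over the mesh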
Hugging Face Integration#
NeMo AutoModel builds on top of transformers rather than replacing it:
Load any AutoModelForCausalLM or AutoModelForImageTextToText model directly from the Hub.
Use Hugging Face tokenizers, datasets, and chat templates as-is.
Checkpoints stay in the native Hugging Face format – no conversion step before or after training.
New models released on the Hub get day-0 support because AutoModel tracks the latest transformers version.
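For example, stock Hugging Face tokenizers and datasets can be used unchanged alongside an AutoModel-loaded model. This is a minimal sketch; the model and dataset names are placeholders, and gated checkpoints require Hub authentication:
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Tokenize a sample exactly as you would with transformers alone.
inputs = tokenizer(dataset[0]["instruction"], return_tensors="pt")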
See the Hugging Face API Compatibility guide and Model Coverage for details.
Optimized Model Implementations#
AutoModel ships optimized implementations for supported architectures (fused attention, TransformerEngine layers, DeepEP for MoE routing, FlexAttn) while keeping the standard transformers API surface. This means:
Faster training and inference with no code changes – load a model the same way you would with transformers and get accelerated kernels automatically.
No checkpoint conversion – weights are loaded from and saved to the native Hugging Face format.
Day-0 model support – because AutoModel builds on transformers, newly released models on the Hub work immediately. Optimized kernels are added incrementally for popular architectures.
Use as a Library#
NeMo AutoModel is not limited to its built-in training recipes. Because it implements the Hugging Face AutoModel API, any library or framework that loads models through transformers can swap in AutoModel to get optimized performance:
Reinforcement learning (e.g., TRL, OpenRLHF) – replace the policy or reference model with an AutoModel instance for faster rollouts and gradient steps.
Evaluation and benchmarking – plug into lm-evaluation-harness or custom eval loops with no API changes.
Custom training loops – import individual components (optimizers, loss functions, distributed strategies) without using recipes at all.
from nemo_automodel import NeMoAutoModelForCausalLM

# Drop-in for transformers' AutoModelForCausalLM; loads the Hub checkpoint directly.
model = NeMoAutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
The returned model is a standard nn.Module with the same forward signature as the transformers equivalent, so it works anywhere a Hugging Face model is expected.
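Continuing the example above, the model can be called like its transformers counterpart. A minimal sketch, assuming the matching tokenizer and an output object that exposes logits as Hugging Face causal LMs do:
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
inputs = tokenizer("Hello, world", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # same forward signature as the transformers model

print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)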
What’s Next#
Explore the main features, supported workflows, and core concepts.
Jump to the quickstart table to find the right guide for your task.