NeMo AutoModel Documentation#
PyTorch SPMD (Single Program, Multiple Data) training for LLMs and VLMs with day-0 Hugging Face model support.
Introduction to NeMo AutoModel#
Learn about NeMo AutoModel, how it works at a high level, and its key features.
Overview of NeMo AutoModel and its capabilities.
Supported workflows, parallelism, recipes, components, and benchmarks.
A transformers-compatible library with accelerated model implementations.
Built on transformers for day-0 model support and out-of-the-box compatibility.
I Want To…#
Find the right guide for your task.
| I want to… | Choose this when… | Input Data | Model | Guide |
|---|---|---|---|---|
| SFT (full fine-tune) | You need maximum accuracy and have the GPU budget to update all weights | Instruction / chat dataset | LLM | |
| PEFT (LoRA) | You want to fine-tune on limited GPU memory; updates <1% of parameters | Instruction / chat dataset | LLM | |
| Tool / function calling | Your model needs to call APIs or tools with structured arguments | Function-calling dataset (queries + tool schemas) | LLM | |
| Fine-tune VLM | Your task involves both images and text (e.g., visual QA, captioning) | Image + text dataset | VLM | |
| Fine-tune Diffusion | You want to fine-tune a diffusion model for image or video generation | Video / image dataset | Diffusion | |
| Fine-tune VLM-MoE | You need large-scale vision-language training with sparse MoE efficiency | Image + text dataset | VLM (MoE) | |
| Embedding fine-tune | You want to improve text similarity for search, retrieval, or RAG | Text pairs / retrieval corpus | LLM | Coming Soon |
| Fine-tune a large MoE | You are adapting a large sparse MoE model (DeepSeek-V3, GLM-5, etc.) to your domain | Text dataset (e.g., HellaSwag) | LLM (MoE) | |
| Sequence classification | You need to classify text into categories (sentiment, topic, NLI) | Text + labels (e.g., GLUE MRPC) | LLM | |
| QAT fine-tune | You want a quantized model that keeps accuracy for efficient deployment | Text dataset | LLM | |
| Knowledge distillation | You want a smaller, faster model that retains most of the teacher’s quality | Instruction dataset + teacher model | LLM | |
| Pretrain an LLM | You are building a base model from scratch on your own corpus | Large unlabeled text corpus (e.g., FineWeb-Edu) | LLM | |
| Pretrain (NanoGPT) | You want quick pretraining experiments on a single node | FineWeb / text corpus | LLM | |
Performance#
Training throughput on NVIDIA GPUs with optimized kernels for Hugging Face models.
| Model | GPUs | TFLOPs/sec/GPU | Tokens/sec/GPU | Optimizations |
|---|---|---|---|---|
| DeepSeek V3 671B | 256 | 250 | 1,002 | TE + DeepEP |
| GPT-OSS 20B | 8 | 279 | 13,058 | TE + DeepEP + FlexAttn |
| Qwen3 MoE 30B | 8 | 277 | 12,040 | TE + DeepEP |
See the full benchmark results for configuration details and more models.
Get Started#
Install NeMo AutoModel and launch your first training job.
Install via PyPI, Docker, or from source. Use `nemo-automodel[cli]` for lightweight login-node installs.
YAML-driven recipes with CLI overrides.
Run on a single GPU, or scale to multiple GPUs with torchrun.
Multi-node training with SLURM and the automodel CLI.
Bring your own dataset for LLM, VLM, or retrieval training.
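As a concrete starting point, the install options above can be run as a minimal sketch; the package name and the `[cli]` extra come from this page, while everything about your Python environment is assumed.

```shell
# Full install from PyPI (pulls in the training stack)
pip install nemo-automodel

# Or: lightweight CLI-only install for a login node;
# quoting keeps the extras bracket intact under zsh
pip install "nemo-automodel[cli]"
```

From there, single- and multi-GPU runs are launched with torchrun, and multi-node jobs use the automodel CLI with SLURM, as covered in the guides above.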
Advanced Topics#
Parallelism, precision, checkpointing strategies, and experiment tracking.
Torch-native pipelining composable with FSDP2 and DTensor.
Mixed-precision FP8 training with torchao.
Distributed checkpoints with SafeTensors output.
Trade compute for memory with activation checkpointing.
Train with quantization for deployment-ready models.
Track experiments and metrics with MLflow and Weights & Biases (wandb).
For Developers#
Components, recipes, and CLI architecture.
Auto-generated Python API documentation.