NeMo AutoModel Documentation
PyTorch-native training that scales from 1 GPU to thousands with a single config change. Load any Hugging Face model, point at your data, and start training; no checkpoint conversion and no boilerplate.
Quick links: 🤗 HF Compatible | 🚀 Performance | 📐 Scalability | 🎯 SFT & PEFT | 🎨 Diffusion | 👁️ VLM | 🌐 Omni | 🌊 dLLM
Overview of NeMo AutoModel and its capabilities.
Supported workflows, parallelism, recipes, and benchmarks.
A transformers-compatible library with accelerated model implementations.
Built on transformers for day-0 model support and out-of-the-box compatibility.
Get Started
uv pip install nemo-automodel
automodel --nproc-per-node=2 llama3_2_1b_squad.yaml
See the installation guide for Docker, source builds, and multi-node setup. See the configuration guide for YAML recipes and CLI overrides. Launch on a local workstation or SLURM cluster.
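As a rough illustration of what a dotted CLI override does to a nested YAML-style config, the sketch below applies a `section.key=value` string to a plain Python dictionary. This is a generic pattern, not NeMo AutoModel's actual parser; the helper names (`apply_override`, `parse_value`) and the `optimizer.lr` key are hypothetical and chosen only for the example.

```python
import copy


def parse_value(raw: str):
    """Best-effort conversion of an override string to int, float, bool, or str."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    return raw


def apply_override(config: dict, override: str) -> dict:
    """Return a copy of `config` with one 'a.b.c=value' override applied."""
    key_path, _, raw_value = override.partition("=")
    result = copy.deepcopy(config)  # leave the caller's config untouched
    node = result
    keys = key_path.split(".")
    for key in keys[:-1]:
        # Walk (or create) intermediate sections along the dotted path.
        node = node.setdefault(key, {})
    node[keys[-1]] = parse_value(raw_value)
    return result


# Hypothetical config section and key, for illustration only.
base_config = {"optimizer": {"lr": 1e-5, "weight_decay": 0.0}}
updated_config = apply_override(base_config, "optimizer.lr=2e-4")
print(updated_config)
```

Returning a copy rather than mutating in place keeps the base recipe reusable across several runs with different overrides.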
Latest Model Support
New models are added regularly. Pick a model below to start fine-tuning, or see the full release log.
Recipes & Guides
Find the right guide for your task: fine-tuning, pretraining, distillation, diffusion, and more.
| I want to… | Choose this when… | Input Data | Model | Guide |
|---|---|---|---|---|
| SFT (full fine-tune) | You need maximum accuracy and have the GPU budget to update all weights | Instruction / chat dataset | LLM | |
| PEFT (LoRA) | You want to fine-tune on limited GPU memory; updates <1% of parameters | Instruction / chat dataset | LLM | |
| Tool / function calling | Your model needs to call APIs or tools with structured arguments | Function-calling dataset (queries + tool schemas) | LLM | |
| Fine-tune VLM | Your task involves both images and text (e.g., visual QA, captioning) | Image + text dataset | VLM | |
| Fine-tune Gemma 4 | You want to fine-tune Gemma 4 for structured extraction from images (e.g., receipts) | Image + text dataset | VLM | |
| Fine-tune dLLM | You want to fine-tune a diffusion language model (e.g., LLaDA) using masked denoising | Instruction / chat dataset | dLLM | |
| Fine-tune Diffusion | You want to fine-tune a diffusion model for image or video generation | Video / Image dataset | Diffusion | |
| Fine-tune VLM-MoE | You need large-scale vision-language training with sparse MoE efficiency | Image + text dataset | VLM (MoE) | |
| Fine-tune agentic VLM-MoE | You need image/video context for agentic developer workflows | Image / video + text dataset | VLM (MoE) | |
| Fine-tune Audio ASR | Adapt Qwen3-Omni for speech recognition on HF audio datasets | Audio + transcript dataset | Qwen3-Omni | |
| Embedding fine-tune | You want to improve text similarity for search, retrieval, or RAG | Text pairs / retrieval corpus | LLM | Coming Soon |
| Fine-tune a large MoE | You are adapting a large sparse MoE model (DeepSeek-V3, GLM-5, etc.) to your domain | Text dataset (e.g., HellaSwag) | LLM (MoE) | |
| Fine-tune DeepSeek V4 Flash | You want to fine-tune the DeepSeek V4 Flash hybrid-attention MoE (SWA / CSA / HCA + hash-routing) | Text dataset (e.g., HellaSwag) | LLM (MoE) | |
| Fine-tune Hy3-preview | You want to fine-tune Tencent’s 295B MoE with sigmoid routing and per-head QK RMSNorm | Text dataset (e.g., HellaSwag) | LLM (MoE) | |
| Sequence classification | You need to classify text into categories (sentiment, topic, NLI) | Text + labels (e.g., GLUE MRPC) | LLM | |
| QAT fine-tune | You want a quantized model that keeps accuracy for efficient deployment | Text dataset | LLM | |
| Knowledge distillation | You want a smaller, faster model that retains most of the teacher’s quality | Instruction dataset + teacher model | LLM | |
| Pretrain an LLM | You are building a base model from scratch on your own corpus | Large unlabeled text corpus (e.g., FineWeb-Edu) | LLM | |
| Pretrain (NanoGPT) | You want quick pretraining experiments on a single node | FineWeb / text corpus | LLM | |
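The PEFT (LoRA) entry in the table says LoRA updates less than 1% of parameters. A back-of-the-envelope sketch shows where a figure of that size comes from: for a weight matrix of shape `d_out x d_in`, a rank-`r` adapter adds two small matrices with `r * (d_in + d_out)` trainable values in total. The 4096 x 4096 layer size and rank 8 below are illustrative assumptions, not values taken from any NeMo AutoModel recipe.

```python
def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters as a fraction of one full weight matrix."""
    full_params = d_in * d_out           # frozen base weight W
    lora_params = rank * (d_in + d_out)  # adapter factors A (rank x d_in) and B (d_out x rank)
    return lora_params / full_params


# Assumed example: a square 4096 x 4096 projection with a rank-8 adapter.
fraction = lora_param_fraction(d_in=4096, d_out=4096, rank=8)
print(f"Trainable fraction for this layer: {fraction:.4%}")
```

This counts a single adapted matrix. Adapters are often attached to only a subset of layers, so the fraction over the whole model is typically smaller still.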
Performance
Training throughput on NVIDIA GPUs with optimized kernels for Hugging Face models.
| Model | GPUs | TFLOPs/sec/GPU | Tokens/sec/GPU | Optimizations |
|---|---|---|---|---|
| DeepSeek V3 671B | 256 | 250 | 1,002 | TE + DeepEP |
| GPT-OSS 20B | 8 | 279 | 13,058 | TE + DeepEP + FlexAttn |
| Qwen3 MoE 30B | 8 | 277 | 12,040 | TE + DeepEP |
See the full benchmark results for configuration details and more models.
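The table reports per-GPU figures. To estimate cluster-wide throughput, multiply by the GPU count. The sketch below does that for the three rows above, assuming the per-GPU numbers are averages across all GPUs in each run (the page does not state this explicitly).

```python
# (model, GPUs, TFLOPs/sec/GPU, tokens/sec/GPU), copied from the table above.
benchmarks = [
    ("DeepSeek V3 671B", 256, 250, 1_002),
    ("GPT-OSS 20B", 8, 279, 13_058),
    ("Qwen3 MoE 30B", 8, 277, 12_040),
]


def cluster_totals(gpus: int, tflops_per_gpu: int, tokens_per_gpu: int) -> tuple[int, int]:
    """Scale per-GPU throughput to the whole job, assuming uniform per-GPU rates."""
    return gpus * tflops_per_gpu, gpus * tokens_per_gpu


totals = {
    name: cluster_totals(gpus, tflops, tokens)
    for name, gpus, tflops, tokens in benchmarks
}

for name, (total_tflops, total_tokens) in totals.items():
    print(f"{name}: {total_tflops:,} TFLOPs/sec, {total_tokens:,} tokens/sec in aggregate")
```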
Advanced Topics
Parallelism, precision, checkpointing strategies, and experiment tracking.
Torch-native pipelining composable with FSDP2 and DTensor.
Mixed-precision FP8 training with torchao.
fp32 master weights, bf16 compute, and the precision traps to avoid.
Distributed checkpoints with SafeTensors output.
Trade compute for memory with activation checkpointing.
Train with quantization for deployment-ready models.
Track experiments and metrics with MLflow and Weights & Biases (wandb).
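One precision trap behind the "fp32 master weights, bf16 compute" pattern is that bf16 keeps only 8 bits of significand precision, so a small update added to a bf16 weight can round away entirely. The sketch below emulates bf16 and fp32 rounding in pure Python with `struct`. It is an illustration of the numeric effect only, not how PyTorch or NeMo AutoModel implement mixed precision, and the weight value, update size, and step count are assumptions picked for the demo.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to bf16 (top 16 bits of fp32), round-to-nearest-even.

    Assumes finite, normal inputs; NaN/inf handling is omitted for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


update = 1e-3  # assumed per-step weight update
steps = 100

# Trap: the weight itself is stored in bf16, so each tiny update is rounded away.
bf16_only_weight = 1.0
for _ in range(steps):
    bf16_only_weight = to_bf16(bf16_only_weight + update)

# Remedy: accumulate in an fp32 master copy, cast to bf16 only for compute.
master_weight = 1.0
for _ in range(steps):
    master_weight = to_fp32(master_weight + update)
compute_weight = to_bf16(master_weight)

print(f"bf16-only weight after {steps} steps: {bf16_only_weight}")
print(f"bf16 view of fp32 master weight:     {compute_weight}")
```

Near 1.0 the spacing between adjacent bf16 values is 2^-7 (about 0.0078), so an update of 0.001 never moves the bf16-only weight, while the fp32 master copy accumulates all 100 updates before being rounded once.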
For Developers
Components, recipes, and CLI architecture.
Auto-generated Python API documentation.
Drop-in accelerated backend for TRL, lm-eval-harness, OpenRLHF, or any code that loads Hugging Face models.