NeMo AutoModel Documentation

PyTorch-native training that scales from 1 GPU to thousands with a single config change. Load any Hugging Face model, point at your data, and start training – no checkpoint conversion, no boilerplate. Quick links: 🤗 HF Compatible | 🚀 Performance | 📐 Scalability | 🎯 SFT & PEFT | 🎨 Diffusion | 👁️ VLM

About

Overview of NeMo AutoModel and its capabilities.

About NeMo AutoModel
Key Features

Supported workflows, parallelism, recipes, and benchmarks.

Key Features and Concepts
🤗 HF Integration

A transformers-compatible library with accelerated model implementations.

🤗 Transformers API Compatibility
Model Coverage

Built on transformers for day-0 model support and out-of-the-box compatibility.

Model Coverage Overview

Get Started

```bash
uv pip install nemo-automodel

automodel --nproc-per-node=2 llama3_2_1b_squad.yaml
```

See the installation guide for Docker, source builds, and multi-node setup, and the configuration guide for YAML recipes and CLI overrides. Launch on a local workstation or a SLURM cluster.
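The YAML recipe referenced above bundles the model, data, and training settings into one file. A minimal sketch of what such a recipe might contain follows; the key names here are illustrative assumptions, not the authoritative schema, so consult the configuration guide for the real options:

```yaml
# Illustrative recipe sketch only — key names are assumptions, not the
# real schema; see the configuration guide for actual field names.
model:
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B   # any HF model id
dataset:
  name: squad                                              # point at your data
training:
  learning_rate: 2.0e-5
  global_batch_size: 32
  max_steps: 1000
```

Per the configuration guide, any recipe field can also be overridden from the CLI at launch time instead of editing the file.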

Latest Model Support

New models are added regularly. Pick a model below to start fine-tuning, or see the full release log.

| Date | Modality | Model |
|---|---|---|
| 2026-04-07 | LLM | GLM-5.1 (recipe) |
| 2026-04-02 | VLM | Gemma 4 (recipe) |
| 2026-03-16 | VLM | Mistral Small 4 (recipe) |
| 2026-03-11 | LLM | Nemotron Super v3 (recipe) |
| 2026-03-03 | Diffusion | FLUX.1-dev (recipe) |

Recipes & Guides

Find the right guide for your task – fine-tuning, pretraining, distillation, diffusion, and more.

| I want to… | Choose this when… | Input Data | Model | Guide |
|---|---|---|---|---|
| SFT (full fine-tune) | You need maximum accuracy and have the GPU budget to update all weights | Instruction / chat dataset | LLM | Start fine-tuning |
| PEFT (LoRA) | You want to fine-tune on limited GPU memory; updates <1% of parameters | Instruction / chat dataset | LLM | Start LoRA |
| Tool / function calling | Your model needs to call APIs or tools with structured arguments | Function-calling dataset (queries + tool schemas) | LLM | Add tool calling |
| Fine-tune VLM | Your task involves both images and text (e.g., visual QA, captioning) | Image + text dataset | VLM | Fine-tune VLM |
| Fine-tune Gemma 4 | You want to fine-tune Gemma 4 for structured extraction from images (e.g., receipts) | Image + text dataset | VLM | Fine-tune Gemma 4 |
| Fine-tune dLLM | You want to fine-tune a diffusion language model (e.g., LLaDA) using masked denoising | Instruction / chat dataset | dLLM | Fine-tune dLLM |
| Fine-tune Diffusion | You want to fine-tune a diffusion model for image or video generation | Video / image dataset | Diffusion | Fine-tune Diffusion |
| Fine-tune VLM-MoE | You need large-scale vision-language training with sparse MoE efficiency | Image + text dataset | VLM (MoE) | Fine-tune VLM-MoE |
| Embedding fine-tune | You want to improve text similarity for search, retrieval, or RAG | Text pairs / retrieval corpus | LLM | Coming soon |
| Fine-tune a large MoE | You are adapting a large sparse MoE model (DeepSeek-V3, GLM-5, etc.) to your domain | Text dataset (e.g., HellaSwag) | LLM (MoE) | Fine-tune MoE |
| Sequence classification | You need to classify text into categories (sentiment, topic, NLI) | Text + labels (e.g., GLUE MRPC) | LLM | Train classifier |
| QAT fine-tune | You want a quantized model that keeps accuracy for efficient deployment | Text dataset | LLM | Enable QAT |
| Knowledge distillation | You want a smaller, faster model that retains most of the teacher's quality | Instruction dataset + teacher model | LLM | Distill a model |
| Pretrain an LLM | You are building a base model from scratch on your own corpus | Large unlabeled text corpus (e.g., FineWeb-Edu) | LLM | Start pretraining |
| Pretrain (NanoGPT) | You want quick pretraining experiments on a single node | FineWeb / text corpus | LLM | Try NanoGPT |
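The "<1% of parameters" figure quoted for LoRA follows directly from the adapter shapes: a rank-r adapter on a d_out × d_in weight matrix adds two small matrices totaling r·(d_in + d_out) trainable parameters. A back-of-the-envelope check, using illustrative dimensions roughly matching a 1B-class model's hidden size (not taken from any specific recipe):

```python
# Back-of-the-envelope LoRA parameter count (illustrative dimensions).
# A rank-r adapter on a (d_out x d_in) linear layer adds matrices
# A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) parameters.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

d = 2048               # hidden size of a ~1B-parameter model (assumed)
r = 8                  # a commonly used LoRA rank
full = d * d           # params in one square projection matrix
lora = lora_params(d, d, r)

print(f"full: {full}, lora: {lora}, ratio: {lora / full:.2%}")
```

The same ratio holds layer by layer, which is why the trainable fraction of the whole model stays well under 1% at small ranks.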

Performance

Training throughput on NVIDIA GPUs with optimized kernels for Hugging Face models.

| Model | GPUs | TFLOPs/sec/GPU | Tokens/sec/GPU | Optimizations |
|---|---|---|---|---|
| DeepSeek V3 671B | 256 | 250 | 1,002 | TE + DeepEP |
| GPT-OSS 20B | 8 | 279 | 13,058 | TE + DeepEP + FlexAttn |
| Qwen3 MoE 30B | 8 | 277 | 12,040 | TE + DeepEP |

See the full benchmark results for configuration details and more models.
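The two throughput columns are linked: TFLOPs/sec/GPU divided by tokens/sec/GPU gives the implied training FLOPs per token, which can be sanity-checked against the common 6·N estimate (6 FLOPs per active parameter per token). For the DeepSeek V3 row, assuming roughly 37B active MoE parameters (an assumption about the model configuration, not a number from the table):

```python
# Sanity-check the DeepSeek V3 row: implied FLOPs per token vs. the
# common 6 * N_active training-FLOPs estimate.
# N_active ~= 37e9 (active MoE params) is an assumption, not from the table.

tflops_per_gpu = 250e12      # 250 TFLOPs/sec/GPU, from the table
tokens_per_gpu = 1_002       # tokens/sec/GPU, from the table

implied_flops_per_token = tflops_per_gpu / tokens_per_gpu   # ~2.5e11
estimate = 6 * 37e9                                         # ~2.2e11

print(f"implied: {implied_flops_per_token:.2e}, 6N estimate: {estimate:.2e}")
```

The two values agree to within about 15%; attention FLOPs, which the 6·N rule ignores, plausibly account for the remainder.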

Advanced Topics

Parallelism, precision, checkpointing strategies, and experiment tracking.

Pipeline Parallelism

Torch-native pipelining composable with FSDP2 and DTensor.

Pipeline Parallelism with AutoPipeline
FP8 Training

Mixed-precision FP8 training with torchao.

FP8 Training
Checkpointing

Distributed checkpoints with SafeTensors output.

Checkpointing
Gradient Checkpointing

Trade compute for memory with activation checkpointing.

Gradient (Activation) Checkpointing
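The compute-for-memory trade behind activation checkpointing can be quantified with a simple memory model: storing activations for all L layers costs O(L) memory, while checkpointing every k-th layer stores only L/k boundary activations plus at most k layers re-materialized during backward, a total of roughly L/k + k, minimized at k = √L. A small illustration with made-up sizes (not measured numbers):

```python
# Simple memory model for activation checkpointing (illustrative numbers).
# Without checkpointing: keep activations of all L layers.
# Checkpointing every k layers: keep L/k boundary activations, plus
# re-materialize at most k layers during backward => total ~ L/k + k,
# which is minimized at k = sqrt(L).
import math

L = 64          # transformer layers (illustrative)
act_gb = 0.5    # GB of activations per layer (made-up constant)

no_ckpt = L * act_gb
k = round(math.sqrt(L))               # optimal segment length, k = 8 here
with_ckpt = (L / k + k) * act_gb

print(f"no checkpointing: {no_ckpt:.0f} GB, "
      f"sqrt(L) checkpointing: {with_ckpt:.0f} GB")
```

The saved memory is paid for with one extra forward pass over each checkpointed segment during backward, which is the trade-off the guide above discusses.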
Quantization-Aware Training

Train with quantization for deployment-ready models.

Quantization-Aware Training (QAT)
Experiment Tracking

Track experiments and metrics with MLflow and Weights & Biases (wandb).

MLflow Logging

For Developers

Repo Internals

Components, recipes, and CLI architecture.

Repository Structure
API Reference

Auto-generated Python API documentation.

API Reference
Use as a Library

Drop-in accelerated backend for TRL, lm-eval-harness, OpenRLHF, or any code that loads Hugging Face models.

About NeMo AutoModel