NeMo AutoModel Documentation#

PyTorch SPMD (Single Program, Multiple Data) training for LLMs and VLMs with day-0 Hugging Face model support.

Introduction to NeMo AutoModel#

Learn about NeMo AutoModel, how it works at a high level, and its key features.

About NeMo AutoModel

Overview of NeMo AutoModel and its capabilities.

About NeMo AutoModel
Key Features and Concepts

Supported workflows, parallelism, recipes, components, and benchmarks.

Key Features and Concepts
🤗 Hugging Face Integration

A transformers-compatible library with accelerated model implementations.

🤗 Transformers API Compatibility
Model Coverage

Built on transformers for day-0 model support and out-of-the-box (OOTB) compatibility.

Model Coverage Overview

I Want To…#

Find the right guide for your task.

| I want to… | Choose this when… | Input Data | Model | Guide |
|---|---|---|---|---|
| SFT (full fine-tune) | You need maximum accuracy and have the GPU budget to update all weights | Instruction / chat dataset | LLM | Start fine-tuning |
| PEFT (LoRA) | You want to fine-tune on limited GPU memory; updates <1% of parameters | Instruction / chat dataset | LLM | Start LoRA |
| Tool / function calling | Your model needs to call APIs or tools with structured arguments | Function-calling dataset (queries + tool schemas) | LLM | Add tool calling |
| Fine-tune VLM | Your task involves both images and text (e.g., visual QA, captioning) | Image + text dataset | VLM | Fine-tune VLM |
| Fine-tune Diffusion | You want to fine-tune a diffusion model for image or video generation | Video / image dataset | Diffusion | Fine-tune Diffusion |
| Fine-tune VLM-MoE | You need large-scale vision-language training with sparse MoE efficiency | Image + text dataset | VLM (MoE) | Fine-tune VLM-MoE |
| Embedding fine-tune | You want to improve text similarity for search, retrieval, or RAG | Text pairs / retrieval corpus | LLM | Coming Soon |
| Fine-tune a large MoE | You are adapting a large sparse MoE model (DeepSeek-V3, GLM-5, etc.) to your domain | Text dataset (e.g., HellaSwag) | LLM (MoE) | Fine-tune MoE |
| Sequence classification | You need to classify text into categories (sentiment, topic, NLI) | Text + labels (e.g., GLUE MRPC) | LLM | Train classifier |
| QAT fine-tune | You want a quantized model that keeps accuracy for efficient deployment | Text dataset | LLM | Enable QAT |
| Knowledge distillation | You want a smaller, faster model that retains most of the teacher's quality | Instruction dataset + teacher model | LLM | Distill a model |
| Pretrain an LLM | You are building a base model from scratch on your own corpus | Large unlabeled text corpus (e.g., FineWeb-Edu) | LLM | Start pretraining |
| Pretrain (NanoGPT) | You want quick pretraining experiments on a single node | FineWeb / text corpus | LLM | Try NanoGPT |
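The "<1%" figure in the PEFT (LoRA) row can be sanity-checked with simple arithmetic: a rank-r LoRA adapter on a d×d weight matrix trains 2·d·r parameters instead of the full d². A minimal sketch with illustrative numbers (not tied to any specific model):

```python
# Illustrative: fraction of trainable parameters for a rank-r LoRA
# adapter on a single d x d weight matrix. W stays frozen; only
# A (d x r) and B (r x d) are trained, so the update is W + A @ B.

def lora_fraction(d: int, r: int) -> float:
    full = d * d          # parameters updated by full fine-tuning
    lora = 2 * d * r      # parameters in the A and B adapter matrices
    return lora / full

# e.g. a 4096-wide layer with rank 16:
frac = lora_fraction(4096, 16)
print(f"{frac:.2%} of the layer's parameters are trainable")  # 0.78%
```

Higher ranks trade memory for capacity, but even rank 64 on this layer stays near 3% of full fine-tuning.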

Performance#

Training throughput on NVIDIA GPUs with optimized kernels for Hugging Face models.

| Model | GPUs | TFLOPs/sec/GPU | Tokens/sec/GPU | Optimizations |
|---|---:|---:|---:|---|
| DeepSeek V3 671B | 256 | 250 | 1,002 | TE + DeepEP |
| GPT-OSS 20B | 8 | 279 | 13,058 | TE + DeepEP + FlexAttn |
| Qwen3 MoE 30B | 8 | 277 | 12,040 | TE + DeepEP |

See the full benchmark results for configuration details and more models.
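The table reports per-GPU figures; multiplying by the GPU count gives aggregate throughput (plain arithmetic on the table's own numbers, using the GPT-OSS 20B row):

```python
# Aggregate throughput = tokens/sec/GPU * number of GPUs,
# using the GPT-OSS 20B row from the benchmark table above.
gpus = 8
tokens_per_sec_per_gpu = 13_058
aggregate = gpus * tokens_per_sec_per_gpu
print(f"{aggregate:,} tokens/sec across the node")  # 104,464 tokens/sec
```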

Get Started#

Install NeMo AutoModel and launch your first training job.

Installation

Install via PyPI, Docker, or from source. Use nemo-automodel[cli] for lightweight login-node installs.

Install NeMo AutoModel
Configuration

YAML-driven recipes with CLI overrides.

YAML Configuration
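YAML-driven recipes with CLI overrides typically work by merging dotted `section.key=value` pairs into the nested config. A minimal sketch of that mechanism (illustrative only; the function and key names are hypothetical, not NeMo AutoModel's actual parser):

```python
# Minimal sketch: apply a dotted override like "model.name=llama" to a
# nested config dictionary, as YAML-recipe CLIs commonly do.
def apply_override(config: dict, override: str) -> None:
    dotted_key, value = override.split("=", 1)
    *path, leaf = dotted_key.split(".")
    node = config
    for part in path:
        node = node.setdefault(part, {})  # descend, creating sections as needed
    node[leaf] = value

# Hypothetical config shape, for illustration only:
config = {"model": {"name": "base"}, "trainer": {"lr": "1e-4"}}
apply_override(config, "model.name=llama")
apply_override(config, "trainer.max_steps=1000")
print(config["model"]["name"])         # llama
print(config["trainer"]["max_steps"])  # 1000
```

Values here stay strings; a real parser would also coerce types (int, float, bool) before merging.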
Local Workstation

Run on a single GPU or multi-GPU with torchrun.

Run on Your Local Workstation
Cluster (SLURM)

Multi-node training with SLURM and the automodel CLI.

Run on a Cluster
Datasets

Bring your own dataset for LLM, VLM, or retrieval training.

Dataset Overview: LLM, VLM, and Retrieval Datasets in NeMo Automodel

Advanced Topics#

Parallelism, precision, and checkpointing strategies, plus experiment tracking.

Pipeline Parallelism

Torch-native pipelining composable with FSDP2 and DTensor.

Pipeline Parallelism with AutoPipeline
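Pipeline parallelism assigns contiguous blocks of transformer layers to stages that run on different devices. A toy partitioner showing the idea (a sketch only, not AutoPipeline's API):

```python
# Toy even partition of n_layers transformer blocks across n_stages
# pipeline stages; each stage holds a contiguous slice of layers.
def partition_layers(n_layers: int, n_stages: int) -> list[range]:
    base, extra = divmod(n_layers, n_stages)
    stages, start = [], 0
    for s in range(n_stages):
        size = base + (1 if s < extra else 0)  # spread any remainder over early stages
        stages.append(range(start, start + size))
        start += size
    return stages

# 32 layers over 4 stages -> 8 contiguous layers per stage
stages = partition_layers(32, 4)
print([len(r) for r in stages])  # [8, 8, 8, 8]
```

Real schedulers additionally interleave microbatches across stages to keep every device busy; the partition above only decides layer placement.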
FP8 Training

Mixed-precision FP8 training with torchao.

FP8 Training in NeMo AutoModel
Checkpointing

Distributed checkpoints with SafeTensors output.

Checkpointing in NeMo Automodel
Gradient Checkpointing

Trade compute for memory with activation checkpointing.

Gradient (Activation) Checkpointing in NeMo AutoModel
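The compute-for-memory trade can be seen with a back-of-the-envelope count: storing one activation per layer costs O(L) memory, while the classic scheme of checkpointing every √L-th layer and recomputing within a segment costs O(√L) stored activations for roughly one extra forward pass (numbers illustrative):

```python
import math

# Activations resident in memory with and without checkpointing,
# for L layers, using the classic sqrt(L)-segment scheme.
def stored_activations(n_layers: int, checkpointing: bool) -> int:
    if not checkpointing:
        return n_layers                     # keep every layer's activation
    seg = math.isqrt(n_layers) or 1         # segment length ~ sqrt(L)
    boundaries = math.ceil(n_layers / seg)  # checkpoints kept at segment starts
    return boundaries + seg                 # checkpoints + one recomputed segment

L = 64
print(stored_activations(L, False))  # 64
print(stored_activations(L, True))   # 16  (8 checkpoints + 8 in-flight)
```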
Quantization-Aware Training

Train with quantization for deployment-ready models.

Quantization-Aware Training (QAT) in NeMo Automodel
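QAT simulates deployment-time quantization during training by inserting a quantize-dequantize ("fake quant") step into the forward pass, so the model learns weights robust to rounding error. A minimal symmetric-int8 sketch (illustrative, not the library's implementation):

```python
# Fake quantization: snap a float to the nearest point on a symmetric
# int8 grid, then map it back to float. During QAT this runs in the
# forward pass; at deployment the integer value is used directly.
def fake_quant_int8(x: float, scale: float) -> float:
    q = round(x / scale)        # quantize to an integer level
    q = max(-128, min(127, q))  # clamp to the int8 range
    return q * scale            # dequantize back to float

scale = 0.5
print(fake_quant_int8(1.3, scale))    # 1.5  (nearest point on the 0.5 grid)
print(fake_quant_int8(100.0, scale))  # 63.5 (clamped at 127 * 0.5)
```

In practice the scale is calibrated per tensor or per channel, and the rounding step needs a straight-through gradient estimator to stay trainable.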
Experiment Tracking

Track experiments and metrics with MLflow and Weights & Biases (wandb).

MLflow Logging in NeMo AutoModel

For Developers#

Repo Internals

Components, recipes, and CLI architecture.

Repository Structure
API Reference

Auto-generated Python API documentation.

API Reference