# NeMo AutoModel Documentation

PyTorch SPMD (Single Program, Multiple Data) training for LLMs and VLMs with day-0 Hugging Face model support.

## Introduction to NeMo AutoModel

Learn about NeMo AutoModel, how it works at a high level, and its key features.

- **About NeMo AutoModel**: Overview of NeMo AutoModel and its capabilities.
- **Key Features and Concepts**: Supported workflows, parallelism, recipes, components, and benchmarks.
- **🤗 Hugging Face Integration**: A transformers-compatible library with accelerated model implementations. See 🤗 Transformers API Compatibility.
- **Model Coverage**: Built on transformers for day-0 model support and out-of-the-box compatibility. See Model Coverage Overview.

## Quickstart

Select a modality and task to find the right guide.

| Modality | SFT | PEFT (LoRA) | Tool Calling | QAT | Knowledge Distillation | Pretrain |
|----------|-----|-------------|--------------|-----|------------------------|----------|
| LLM | Guide | Guide | Guide | Guide | Guide | Guide |
| VLM | Guide | Guide | – | – | – | – |

## Performance

Training throughput on NVIDIA GPUs with optimized kernels for Hugging Face models.

| Model | GPUs | TFLOPs/sec/GPU | Tokens/sec/GPU | Optimizations |
|-------|------|----------------|----------------|---------------|
| DeepSeek V3 671B | 256 | 250 | 1,002 | TE + DeepEP |
| GPT-OSS 20B | 8 | 279 | 13,058 | TE + DeepEP + FlexAttn |
| Qwen3 MoE 30B | 8 | 277 | 12,040 | TE + DeepEP |

See the full benchmark results for configuration details and more models.

## Get Started

Install NeMo AutoModel and launch your first training job.

- **Installation**: Install via PyPI, Docker, or from source. See Install NeMo AutoModel.
- **Configuration**: YAML-driven recipes with CLI overrides. See YAML Configuration.
- **Local Workstation**: Run on a single GPU, or on multiple GPUs with torchrun. See Run on Your Local Workstation.
- **Cluster (SLURM)**: Multi-node training with SLURM and the automodel CLI. See Run on a Cluster.
- **Datasets**: Bring your own dataset for LLM, VLM, or retrieval training. See Dataset Overview: LLM, VLM, and Retrieval Datasets in NeMo Automodel.
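To make the YAML-driven recipe idea concrete, here is a hypothetical sketch of what such a file could look like. The keys and values below are illustrative assumptions, not the exact NeMo AutoModel recipe schema; consult the YAML Configuration guide for the real field names and the supported CLI override syntax.

```yaml
# Hypothetical recipe sketch: one YAML file drives a training run, and
# individual keys can be overridden from the command line at launch time.
# Key names below are assumptions for illustration only.
model:
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B  # any Hugging Face model id
optimizer:
  lr: 2.0e-5
  weight_decay: 0.01
training:
  global_batch_size: 32
  max_steps: 1000
```

The appeal of this pattern is that a run is fully described by one versionable file, while quick experiments (e.g. shortening `max_steps`) need only a CLI override rather than a new file.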

## Advanced Topics

Parallelism, precision, checkpointing strategies, and experiment tracking.

- **Pipeline Parallelism**: Torch-native pipelining composable with FSDP2 and DTensor. See Pipeline Parallelism with AutoPipeline.
- **FP8 Training**: Mixed-precision FP8 training with torchao. See FP8 Training in NeMo Automodel.
- **Checkpointing**: Distributed checkpoints with SafeTensors output. See Checkpointing in NeMo Automodel.
- **Gradient Checkpointing**: Trade compute for memory with activation checkpointing. See 🚀 Gradient (Activation) Checkpointing in NeMo-AutoModel.
- **Quantization-Aware Training**: Train with quantization for deployment-ready models. See Quantization-Aware Training (QAT) in NeMo Automodel.
- **Experiment Tracking**: Track experiments and metrics with MLflow and Weights & Biases (wandb). See MLflow Logging in NeMo Automodel.
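The compute-for-memory trade of activation checkpointing can be sketched with plain PyTorch, independent of NeMo AutoModel: instead of storing every intermediate activation for backward, each block is recomputed during the backward pass. The module names here are illustrative, not NeMo AutoModel APIs.

```python
# Minimal sketch of activation (gradient) checkpointing using plain PyTorch.
# Block/TinyModel are hypothetical modules for illustration only.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """A small residual feed-forward block."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.ff(x)

class TinyModel(nn.Module):
    def __init__(self, dim: int = 32, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # Drop the block's intermediate activations after forward and
            # recompute them during backward, saving memory at the cost of
            # an extra forward pass per block.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = TinyModel()
x = torch.randn(2, 32, requires_grad=True)
model(x).sum().backward()  # gradients flow through the recomputed blocks
```

In a real training setup the same idea is usually applied per transformer layer, where activations dominate memory; the guide above covers how NeMo AutoModel exposes this.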

## For Developers

- **Repo Internals**: Components, recipes, and CLI architecture. See Repository Structure.
- **API Reference**: Auto-generated Python API documentation.