> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# NeMo AutoModel Documentation

> NeMo AutoModel is an open-source, PyTorch DTensor-native SPMD training library for scalable LLM and VLM training and fine-tuning, with day-0 Hugging Face model support.

PyTorch-native training that scales from 1 GPU to thousands with a single config change. Load any Hugging Face model, point at your data, and start training; no checkpoint conversion and no boilerplate.

## Quick Links

🤗 HF Compatible · 🚀 Performance · 📐 Scalability · 🎯 SFT & PEFT · 🎨 Diffusion · 👁️ VLM · 🌐 Omni · 🌊 dLLM · 🔊 Audio · ⚡ Speculative

Overview of NeMo AutoModel and its capabilities.

Supported workflows, parallelism, recipes, and benchmarks.

A `transformers`-compatible library with accelerated model implementations.

Built on `transformers` for day-0 model support and out-of-the-box compatibility.

## Get Started

```bash
uv pip install nemo-automodel

automodel --nproc-per-node=2 llama3_2_1b_squad.yaml
```

See the [installation guide](/get-started/installation) for Docker, source builds, and multi-node setup.
See the [configuration guide](/get-started/configuration) for YAML recipes and CLI overrides.
Launch on a [local workstation](/job-launchers/local-workstation) or [SLURM cluster](/job-launchers/slurm-cluster).

## Latest Model Support

New models are added regularly. Pick a model below to start fine-tuning, or see the [full release log](/model-coverage/release-log).

| Date       | Modality  | Model                                                                                                                                                                                                        |
| ---------- | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| 2026-05-19 | LLM       | Ling 2.0 ([recipes](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/ling))                                                                                                          |
| 2026-05-18 | Audio     | Qwen3-Omni ASR ([recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/audio_finetune/qwen3_omni_asr/ami_sft.yaml))                                                                            |
| 2026-05-17 | LLM       | ERNIE 4.5 ([recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/ernie4_5/ernie4_5_21b_a3b_hellaswag.yaml))                                                                      |
| 2026-05-17 | LLM       | MiMo-V2-Flash ([recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/mimo_v2_flash/mimo_v2_flash_hellaswag.yaml))                                                                |
| 2026-04-07 | LLM       | [GLM-5.1](https://github.com/NVIDIA-NeMo/Automodel/discussions/1719) ([recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/glm/glm_5.1_hellaswag_pp.yaml))                      |
| 2026-04-02 | VLM       | Gemma 4 ([recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/gemma4/gemma4_4b.yaml))                                                                                           |
| 2026-03-16 | VLM       | [Mistral Small 4](https://github.com/NVIDIA-NeMo/Automodel/discussions/1558) ([recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/mistral4/mistral4_medpix.yaml))              |
| 2026-03-11 | LLM       | [Nemotron Super v3](https://github.com/NVIDIA-NeMo/Automodel/discussions/976) ([recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/nemotron/nemotron_super_v3_hellaswag.yaml)) |
| 2026-03-03 | Diffusion | FLUX.1-dev ([recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/diffusion/finetune/flux_t2i_flow.yaml))                                                                                     |

## Recipes and Guides

Find the right guide for your task: fine-tuning, pretraining, distillation, diffusion, and more.

| I want to...                    | Choose this when...                                                                               | Input Data                                        | Model      | Guide                                                                  |
| ------------------------------- | ------------------------------------------------------------------------------------------------- | ------------------------------------------------- | ---------- | ---------------------------------------------------------------------- |
| **SFT (full fine-tune)**        | You need maximum accuracy and have the GPU budget to update all weights                           | Instruction / chat dataset                        | LLM        | [Start fine-tuning](/recipes-e2e-examples/sft-peft)                    |
| **PEFT (LoRA)**                 | You want to fine-tune on limited GPU memory; updates less than 1% of parameters                   | Instruction / chat dataset                        | LLM        | [Start LoRA](/recipes-e2e-examples/sft-peft)                           |
| **Tool / function calling**     | Your model needs to call APIs or tools with structured arguments                                  | Function-calling dataset (queries + tool schemas) | LLM        | [Add tool calling](/recipes-e2e-examples/function-calling)             |
| **Fine-tune VLM**               | Your task involves both images and text (e.g., visual QA, captioning)                             | Image + text dataset                              | VLM        | [Fine-tune VLM](/recipes-e2e-examples/gemma-3-3n)                      |
| **Fine-tune Gemma 4**           | You want to fine-tune Gemma 4 for structured extraction from images (e.g., receipts)              | Image + text dataset                              | VLM        | [Fine-tune Gemma 4](/recipes-e2e-examples/gemma-4)                     |
| **Fine-tune dLLM**              | You want to fine-tune a diffusion language model (e.g., LLaDA) using masked denoising             | Instruction / chat dataset                        | dLLM       | [Fine-tune dLLM](/recipes-e2e-examples/dllm-fine-tuning)               |
| **Fine-tune Diffusion**         | You want to fine-tune a diffusion model for image or video generation                             | Video / Image dataset                             | Diffusion  | [Fine-tune Diffusion](/recipes-e2e-examples/diffusion-fine-tuning)     |
| **Fine-tune VLM-MoE**           | You need large-scale vision-language training with sparse MoE efficiency                          | Image + text dataset                              | VLM (MoE)  | [Fine-tune VLM-MoE](/recipes-e2e-examples/qwen3-5-vl)                  |
| **Fine-tune agentic VLM-MoE**   | You need image/video context for agentic developer workflows                                      | Image / video + text dataset                      | VLM (MoE)  | [Fine-tune Step-3.7-Flash](/recipes-e2e-examples/step-3-7)             |
| **Fine-tune Audio ASR**         | You want to adapt Qwen3-Omni for speech recognition on Hugging Face audio datasets                | Audio + transcript dataset                        | Qwen3-Omni | [Fine-tune Qwen3-Omni ASR](/recipes-e2e-examples/qwen3-omni-asr)       |
| **Embedding fine-tune**         | You want to improve text similarity for search, retrieval, or RAG                                 | Text pairs / retrieval corpus                     | LLM        | Coming Soon                                                            |
| **Fine-tune a large MoE**       | You are adapting a large sparse MoE model (DeepSeek-V3, GLM-5, etc.) to your domain               | Text dataset (e.g., HellaSwag)                    | LLM (MoE)  | [Fine-tune MoE](/recipes-e2e-examples/large-moe-fine-tuning)           |
| **Fine-tune DeepSeek V4 Flash** | You want to fine-tune the DeepSeek V4 Flash hybrid-attention MoE (SWA / CSA / HCA + hash-routing) | Text dataset (e.g., HellaSwag)                    | LLM (MoE)  | [Fine-tune DeepSeek V4 Flash](/recipes-e2e-examples/deepseek-v4-flash) |
| **Fine-tune Hy3-preview**       | You want to fine-tune Tencent's 295B MoE with sigmoid routing and per-head QK RMSNorm             | Text dataset (e.g., HellaSwag)                    | LLM (MoE)  | [Fine-tune Hy3-preview](/recipes-e2e-examples/hy3-preview)             |
| **Fine-tune Nemotron-3 Ultra**  | You want to fine-tune NVIDIA's 550B-A55B hybrid Mamba-2 / LatentMoE model with MTP                | Text dataset (e.g., HellaSwag)                    | LLM (MoE)  | [Fine-tune Nemotron-3 Ultra](/recipes-e2e-examples/nemotron-3-ultra)   |
| **Sequence classification**     | You need to classify text into categories (sentiment, topic, NLI)                                 | Text + labels (e.g., GLUE MRPC)                   | LLM        | [Train classifier](/recipes-e2e-examples/sequence-classification)      |
| **QAT fine-tune**               | You want a quantized model that keeps accuracy for efficient deployment                           | Text dataset                                      | LLM        | [Enable QAT](/recipes-e2e-examples/qat)                                |
| **Knowledge distillation**      | You want a smaller, faster model that retains most of the teacher's quality                       | Instruction dataset + teacher model               | LLM        | [Distill a model](/recipes-e2e-examples/knowledge-distillation)        |
| **Pretrain an LLM**             | You are building a base model from scratch on your own corpus                                     | Large unlabeled text corpus (e.g., FineWeb-Edu)   | LLM        | [Start pretraining](/recipes-e2e-examples/pretraining)                 |
| **Pretrain (NanoGPT)**          | You want quick pretraining experiments on a single node                                           | FineWeb / text corpus                             | LLM        | [Try NanoGPT](/recipes-e2e-examples/nanogpt-pretraining)               |
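
The PEFT (LoRA) row above says that LoRA updates under 1% of parameters. A back-of-the-envelope sketch of where that figure comes from, for a single weight matrix: LoRA freezes the matrix and trains two low-rank adapter matrices instead. The matrix size and rank below are hypothetical values chosen for illustration, not settings from any AutoModel recipe, and the real fraction depends on the rank and on which modules receive adapters.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Trainable adapter parameters relative to one frozen d_out x d_in matrix.

    LoRA keeps the original matrix W frozen and learns a low-rank update
    B @ A, where B is d_out x rank and A is rank x d_in.
    """
    frozen = d_out * d_in              # parameters in the frozen matrix W
    trainable = rank * (d_out + d_in)  # parameters in the adapters A and B
    return trainable / frozen


# Hypothetical sizes, chosen only for illustration.
fraction = lora_trainable_fraction(d_out=4096, d_in=4096, rank=8)
print(f"{fraction:.4%}")  # 0.3906%
```

Doubling the rank doubles the trainable fraction, so the ratio stays below 1% for this matrix size until the rank exceeds 20.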

## Performance

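Training throughput on NVIDIA GPUs with optimized kernels for Hugging Face models. In the Optimizations column, TE is Transformer Engine, DeepEP is DeepSeek's expert-parallel communication library, and FlexAttn is PyTorch FlexAttention.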

| Model            | GPUs | TFLOPs/sec/GPU | Tokens/sec/GPU | Optimizations          |
| ---------------- | ---- | -------------- | -------------- | ---------------------- |
| DeepSeek V3 671B | 256  | 250            | 1,002          | TE + DeepEP            |
| GPT-OSS 20B      | 8    | 279            | 13,058         | TE + DeepEP + FlexAttn |
| Qwen3 MoE 30B    | 8    | 277            | 12,040         | TE + DeepEP            |

See the [full benchmark results](/performance/performance-summary) for configuration details and more models.
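
The table above reports per-GPU rates. As a rough reading aid, the sketch below multiplies them out to cluster-wide throughput and derives the compute spent per token that the two per-GPU columns imply. The function names are illustrative only, and the derived numbers are plain arithmetic on the table, not additional measurements.

```python
# (model, GPUs, TFLOPs/sec/GPU, tokens/sec/GPU), copied from the table above.
ROWS = [
    ("DeepSeek V3 671B", 256, 250, 1_002),
    ("GPT-OSS 20B", 8, 279, 13_058),
    ("Qwen3 MoE 30B", 8, 277, 12_040),
]


def aggregate_tokens_per_sec(gpus: int, tokens_per_sec_per_gpu: int) -> int:
    """Cluster-wide throughput: per-GPU rate times GPU count."""
    return gpus * tokens_per_sec_per_gpu


def implied_gflops_per_token(tflops_per_gpu: float, tokens_per_sec_per_gpu: float) -> float:
    """Compute per token, in GFLOPs, implied by the two per-GPU rates."""
    return tflops_per_gpu * 1_000 / tokens_per_sec_per_gpu


for model, gpus, tflops, tps in ROWS:
    total = aggregate_tokens_per_sec(gpus, tps)
    per_token = implied_gflops_per_token(tflops, tps)
    print(f"{model}: {total:,} tokens/sec total, ~{per_token:.1f} GFLOPs/token")
```

The aggregate figures come to 256,512 tokens/sec for DeepSeek V3 671B on 256 GPUs, 104,464 for GPT-OSS 20B, and 96,320 for Qwen3 MoE 30B on 8 GPUs each.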

## Advanced Topics

Parallelism, precision, checkpointing strategies, and experiment tracking.

Torch-native pipelining composable with FSDP2 and DTensor. (Tags: 3d-parallelism)

Mixed-precision FP8 training with torchao. (Tags: FP8 mixed-precision)

FP32 master weights, BF16 compute, and the precision traps to avoid. (Tags: bf16 mixed-precision)

Distributed checkpoints with SafeTensors output. (Tags: DCP safetensors)

Trade compute for memory with activation checkpointing. (Tags: memory-efficiency)

Train with quantization for deployment-ready models. (Tags: QAT)

Track experiments and metrics with MLflow and Weights & Biases. (Tags: MLflow Wandb)
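
The mixed-precision entry above mentions FP32 master weights, BF16 compute, and precision traps. One such trap can be sketched without any GPU: BF16 keeps only 8 significant bits, so an update much smaller than the weight is rounded away entirely. The `to_bf16` helper below is a hypothetical pure-Python emulation written for this illustration (it is not an AutoModel or PyTorch function), and a Python float stands in for the higher-precision master copy.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even).

    Emulation only: intended for ordinary finite values, not NaN or overflow.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)          # ties go to even
    bits = ((bits + rounding_bias) >> 16) << 16          # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


UPDATE = 1e-3  # hypothetical per-step weight update
STEPS = 1000

bf16_weight = to_bf16(1.0)  # weight stored and updated only in bf16
master_weight = 1.0         # higher-precision master copy

for _ in range(STEPS):
    bf16_weight = to_bf16(bf16_weight + UPDATE)  # rounded back after each step
    master_weight += UPDATE

print(bf16_weight)              # 1.0: every update was rounded away
print(round(master_weight, 3))  # 2.0
```

Near 1.0, adjacent BF16 values are 2^-7 (about 0.0078) apart, so an update of 0.001 falls below half a step and vanishes on every iteration. Keeping a higher-precision master copy and casting to BF16 only for compute avoids this.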

## For Developers

Components, recipes, and CLI architecture.

Auto-generated Python API documentation.

Drop-in accelerated backend for TRL, lm-eval-harness, OpenRLHF, or any code that loads Hugging Face models.
