# NeMo AutoModel
NeMo Framework is NVIDIA's GPU-accelerated, end-to-end training framework for large language models (LLMs), multi-modal models, and speech models. It enables seamless scaling of training (both pretraining and post-training) workloads from a single GPU to thousand-node clusters for both 🤗 Hugging Face/PyTorch and Megatron models, and it includes a suite of libraries and recipe collections to help users train models from end to end.

The AutoModel library ("NeMo AutoModel") provides GPU-accelerated PyTorch training for 🤗 Hugging Face models on day 0. Users can start training and fine-tuning models instantly, without conversion delays, and scale effortlessly with PyTorch-native parallelisms, optimized custom kernels, and memory-efficient recipes, all while preserving the original checkpoint format for seamless use across the Hugging Face ecosystem.
⚠️ Note: NeMo AutoModel is under active development. New features, improvements, and documentation updates are released regularly. We are working toward a stable release, so expect the interface to solidify over time. Your feedback and contributions are welcome, and we encourage you to follow along as new updates roll out.
## Features

✅ Available now | 🔜 Coming in 25.09

- ✅ Hugging Face Integration - Works with 1B-70B models (Qwen, Llama).
- ✅ Distributed Training - Fully Sharded Data Parallel (FSDP2) support.
- ✅ Environment Support - Support for SLURM and interactive training.
- ✅ Learning Algorithms - SFT (Supervised Fine-Tuning) and PEFT (Parameter-Efficient Fine-Tuning).
- ✅ Large Model Support - Native PyTorch support for models up to 70B parameters.
- ✅ Advanced Parallelism - PyTorch-native FSDP2, TP, CP, and SP for efficient training.
- ✅ Sequence Packing - Sequence packing in both DTensor and MCore paths for large training performance gains.
- ✅ DCP - Distributed Checkpoint support with SafeTensors output.
- ✅ HSDP - Hybrid Sharded Data Parallelism based on FSDP2.
- 🔜 Pipeline Support - Torch-native pipeline parallelism composable with FSDP2 and DTensor (3D parallelism).
- 🔜 Pre-training - Support for model pre-training, including DeepSeek-V3, GPT-OSS, and Qwen3 (e.g., Coder-480B-A35B).
- 🔜 Knowledge Distillation - Support for knowledge distillation with LLMs; VLM support will be added after 25.09.
## Supported Models
NeMo AutoModel provides native support for a wide range of models available on the Hugging Face Hub, enabling efficient fine-tuning for various domains.
### Large Language Models

- Llama Family: Llama 3, Llama 3.1, Llama 3.2, Code Llama
- Qwen Family: Qwen3, Qwen2.5, Qwen2
- Gemma Family: Gemma 2, Gemma 3
- Phi Family: Phi-2, Phi-3, Phi-4
- And more: Any causal LM on the Hugging Face Hub!
### Vision-Language Models

- Qwen2.5-VL: All variants (3B, 7B, 72B)
- Gemma-3-VL: 3B and other variants
## Ready-to-Use Recipes
To get started quickly, NeMo AutoModel provides a collection of ready-to-use recipes for common LLM and VLM fine-tuning tasks. Simply select the recipe that matches your model and training setup (e.g., single-GPU, multi-GPU, or multi-node).
| Domain | Model ID | Single-GPU | Single-Node | Multi-Node |
|---|---|---|---|---|
| *Coming soon* | | | | |
### Run a Recipe

To run a NeMo AutoModel recipe, you need a recipe script (LLM or VLM) and a matching YAML config file (both live under `recipes/`):
```bash
# Command invocation format:
uv run <recipe_script_path> --config <yaml_config_path>

# LLM example: multi-GPU fine-tuning with FSDP2
uv run torchrun --nproc-per-node=8 recipes/llm/finetune.py --config recipes/llm/llama_3_2_1b_hellaswag.yaml

# VLM example: single-GPU fine-tuning (Gemma-3-VL) with LoRA
uv run recipes/vlm/finetune.py --config recipes/vlm/gemma_3_vl_3b_cord_v2_peft.yaml
```
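The `--nproc-per-node` value sets how many worker processes `torchrun` launches on the node (one per GPU), so match it to the number of local GPUs you want the recipe to shard across; single-GPU runs can invoke the recipe script directly, as in the VLM example above.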
## Key Features

- Day-0 Hugging Face Support: Instantly fine-tune any model from the Hugging Face Hub
- Lightning-Fast Performance: Custom CUDA kernels and memory optimizations deliver 2–5× speedups
- Large-Scale Distributed Training: Built-in FSDP2 and nvFSDP for seamless multi-node scaling
- Vision-Language Model Ready: Native support for VLMs (Qwen2-VL, Gemma-3-VL, etc.)
- Advanced PEFT Methods: LoRA and an extensible PEFT system out of the box
- Seamless HF Ecosystem: Fine-tuned models work directly with the Transformers pipeline, vLLM, and other ecosystem tools
- Robust Infrastructure: Distributed checkpointing with integrated logging and monitoring
- Optimized Recipes: Pre-built configurations for common models and datasets
- Flexible Configuration: YAML-based configuration system for reproducible experiments
- FP8 Precision: Native FP8 training and inference for higher throughput and lower memory use
- INT4/INT8 Quantization: Turnkey quantization workflows for ultra-compact, low-memory training
## Install NeMo AutoModel
NeMo AutoModel is offered both as a standard Python package installable via pip and as a ready-to-run NeMo Framework Docker container.
### Prerequisites

```bash
# We use `uv` for package management and environment isolation.
pip3 install uv

# If you cannot install at the system level, install for your user with:
# pip3 install --user uv
```
Run every command with `uv run`. It auto-installs the virtual environment from the lock file and keeps it up to date, so you never need to activate a venv manually, for example: `uv run recipes/llm/finetune.py`. If you prefer to install NeMo AutoModel explicitly, follow the instructions below.
### Install from a Wheel Package

```bash
# Install the latest stable release from PyPI.
# First, initialize the virtual environment using uv:
uv venv
uv pip install nemo_automodel  # or: uv pip install --upgrade nemo_automodel
```
### Install from Source

```bash
# Install the latest NeMo AutoModel from the GitHub repo (best for development).
# First, initialize the virtual environment using uv:
uv venv
# Now install from source:
uv pip install git+https://github.com/NVIDIA-NeMo/Automodel.git
```
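To install a specific branch, tag, or commit instead of the default branch, append `@<ref>` to the Git URL (standard pip/uv VCS syntax).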
### Verify the Installation

```bash
uv run python -c "import nemo_automodel; print('✅ NeMo AutoModel ready')"
```
## YAML Configuration Examples

### 1. Distributed Training Configuration

```yaml
distributed:
  _target_: nemo_automodel.distributed.nvfsdp.NVFSDPManager
  dp_size: 8
  tp_size: 1
  cp_size: 1
```
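The product of `dp_size`, `tp_size`, and `cp_size` should normally equal the number of GPUs you launch (8 × 1 × 1 = 8 here, matching the 8-GPU `torchrun` example above). As an illustrative sketch using the same documented keys, the split can be shifted toward tensor parallelism when a model's weights do not fit comfortably under pure data parallelism:

```yaml
# Illustrative only: 2-way data parallel x 4-way tensor parallel on the same 8 GPUs.
distributed:
  _target_: nemo_automodel.distributed.nvfsdp.NVFSDPManager
  dp_size: 2
  tp_size: 4
  cp_size: 1
```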
### 2. LoRA Configuration

```yaml
peft:
  peft_fn: nemo_automodel._peft.lora.apply_lora_to_linear_modules
  match_all_linear: True
  dim: 8
  alpha: 32
  use_triton: True
```
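As the key names suggest, `match_all_linear: True` attaches adapters to every linear layer and `use_triton` enables the Triton-based kernel path; in standard LoRA terms, `dim` is the adapter rank and `alpha` the scaling factor, giving an effective scale of `alpha / dim` (32 / 8 = 4 here). Lowering `dim` reduces trainable parameters and memory at some cost in adapter capacity.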
### 3. Vision-Language Model Fine-Tuning

```yaml
model:
  _target_: nemo_automodel._transformers.NeMoAutoModelForImageTextToText.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-VL-3B-Instruct

processor:
  _target_: transformers.AutoProcessor.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-VL-3B-Instruct
  min_pixels: 200704
  max_pixels: 1003520
```
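`min_pixels` and `max_pixels` are forwarded to the Qwen2.5-VL processor loaded above to bound per-image resolution; since that processor represents each visual token as a 28×28-pixel patch, the values shown correspond to roughly 256 to 1280 visual tokens per image (256 × 28 × 28 = 200704, 1280 × 28 × 28 = 1003520).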
### 4. Checkpointing and Resume

```yaml
checkpoint:
  enabled: true
  checkpoint_dir: ./checkpoints
  save_consolidated: true       # HF-compatible safetensors
  model_save_format: safetensors
```
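These blocks compose into a single recipe config. The sketch below assembles only the sections documented above; the actual recipe files under `recipes/llm/` and `recipes/vlm/` also define dataset, optimizer, and training-loop sections whose exact keys vary per recipe, so treat this as an outline rather than a complete, runnable config.

```yaml
# Outline of a PEFT fine-tuning recipe config, assembled from the sections above.
model:
  _target_: nemo_automodel._transformers.NeMoAutoModelForImageTextToText.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-VL-3B-Instruct

peft:
  peft_fn: nemo_automodel._peft.lora.apply_lora_to_linear_modules
  match_all_linear: True
  dim: 8
  alpha: 32

distributed:
  _target_: nemo_automodel.distributed.nvfsdp.NVFSDPManager
  dp_size: 8
  tp_size: 1
  cp_size: 1

checkpoint:
  enabled: true
  checkpoint_dir: ./checkpoints
  save_consolidated: true
  model_save_format: safetensors

# dataset, optimizer, and scheduler sections omitted; see the shipped recipe YAMLs.
```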
## Project Structure

```
NeMo-Automodel/
├── nemo_automodel/        # Core library
│   ├── _peft/             # PEFT implementations (LoRA)
│   ├── _transformers/     # HF model integrations
│   ├── checkpoint/        # Distributed checkpointing
│   ├── datasets/          # Dataset loaders
│   │   ├── llm/           # LLM datasets (HellaSwag, SQuAD, etc.)
│   │   └── vlm/           # VLM datasets (CORD-v2, rdr, etc.)
│   ├── distributed/       # FSDP2, nvFSDP, parallelization
│   ├── loss/              # Optimized loss functions
│   └── training/          # Training recipes and utilities
├── recipes/               # Ready-to-use training recipes
│   ├── llm/               # LLM fine-tuning recipes
│   └── vlm/               # VLM fine-tuning recipes
└── tests/                 # Comprehensive test suite
```
## Contributing
We welcome contributions! Please see our Contributing Guide for details.
## License
NVIDIA NeMo AutoModel is licensed under the Apache License 2.0.
## Links

- Documentation: https://docs.nvidia.com/nemo-framework/user-guide/latest/automodel/index.html
- Hugging Face Hub: https://huggingface.co/models
- Issues: https://github.com/NVIDIA-NeMo/Automodel/issues
- Discussions: https://github.com/NVIDIA-NeMo/Automodel/discussions
Made with ❤️ by NVIDIA
Accelerating AI for everyone