πŸš€ NeMo AutoModel#


NeMo Framework is NVIDIA’s GPU-accelerated, end-to-end training framework for large language models (LLMs), multimodal models, and speech models. It scales training workloads (both pretraining and post-training) seamlessly from a single GPU to thousand-node clusters for both πŸ€—Hugging Face/PyTorch and Megatron models, and includes a suite of libraries and recipe collections to help users train models end to end. The AutoModel library (β€œNeMo AutoModel”) provides GPU-accelerated PyTorch training for πŸ€—Hugging Face models on Day-0. Users can start training and fine-tuning models instantly without conversion delays, and scale effortlessly with PyTorch-native parallelisms, optimized custom kernels, and memory-efficient recipes, all while preserving the original checkpoint format for seamless use across the Hugging Face ecosystem.

⚠️ Note: NeMo AutoModel is under active development. New features, improvements, and documentation updates are released regularly. We are working toward a stable release, so expect the interface to solidify over time. Your feedback and contributions are welcome, and we encourage you to follow along as new updates roll out.

πŸŽ›οΈ Supported Models#

NeMo AutoModel provides native support for a wide range of models available on the Hugging Face Hub, enabling efficient fine-tuning for various domains.

Large Language Models#

  • LLaMA Family: LLaMA 3, LLaMA 3.1, LLaMA 3.2, Code Llama

  • QWen Family: QWen3, QWen2.5, Qwen2

  • Gemma Family: Gemma2, Gemma3

  • Phi Family: Phi2, Phi3, Phi4

  • And more: Any causal LM on Hugging Face Hub!

Vision-Language Models#

  • Qwen2.5-VL: All variants (3B, 7B, 72B)

  • Gemma-3-VL: 3B and other variants

πŸ“‹ Ready-to-Use Recipes#

To get started quickly, NeMo AutoModel provides a collection of ready-to-use recipes for common LLM and VLM fine-tuning tasks. Simply select the recipe that matches your model and training setup (e.g., single-GPU, multi-GPU, or multi-node).

| Domain | Model ID                | Single-GPU       | Single-Node       | Multi-Node         |
|--------|-------------------------|------------------|-------------------|--------------------|
| LLM    | meta-llama/Llama-3.2-1B | HellaSwag + LoRA | HellaSwag, SQuAD  | HellaSwag + nvFSDP |
| VLM    | google/gemma-3-4b-it    | CORD-v2 + LoRA   | CORD-v2           | Coming Soon        |

Run a Recipe#

To run a NeMo AutoModel recipe, you need a recipe script (e.g., the LLM or VLM fine-tuning script under recipes/) and a matching YAML config file (e.g., an LLM or VLM config):

# Command invocation format:
uv run <recipe_script_path> --config <yaml_config_path>

# LLM example: multi-GPU with FSDP2
uv run torchrun --nproc-per-node=8 recipes/llm/finetune.py --config recipes/llm/llama_3_2_1b_hellaswag.yaml

# VLM example: single GPU fine-tuning (Gemma-3-VL) with LoRA
uv run recipes/vlm/finetune.py --config recipes/vlm/gemma_3_vl_3b_cord_v2_peft.yaml

πŸš€ Key Features#

  • Day-0 Hugging Face Support: Instantly fine-tune any model from the Hugging Face Hub

  • Lightning Fast Performance: Custom CUDA kernels and memory optimizations deliver 2–5Γ— speedups

  • Large-Scale Distributed Training: Built-in FSDP2 and nvFSDP for seamless multi-node scaling

  • Vision-Language Model Ready: Native support for VLMs (Qwen2-VL, Gemma-3-VL, etc)

  • Advanced PEFT Methods: LoRA and extensible PEFT system out of the box

  • Seamless HF Ecosystem: Fine-tuned models work perfectly with Transformers pipeline, VLM, etc.

  • Robust Infrastructure: Distributed checkpointing with integrated logging and monitoring

  • Optimized Recipes: Pre-built configurations for common models and datasets

  • Flexible Configuration: YAML-based configuration system for reproducible experiments

  • FP8 Precision: Native FP8 training & inference for higher throughput and lower memory use

  • INT4 / INT8 Quantization: Turn-key quantization workflows for ultra-compact, low-memory training


✨ Install NeMo AutoModel#

NeMo AutoModel is offered both as a standard Python package installable via pip and as a ready-to-run NeMo Framework Docker container.

Prerequisites#

# We use `uv` for package management and environment isolation.
pip3 install uv

# If you cannot install at the system level, you can install for your user with
# pip3 install --user uv

Run every command with uv run. It auto-installs the virtual environment from the lock file and keeps it up to date, so you never need to activate a venv manually. Example: uv run recipes/llm/finetune.py. If you prefer to install NeMo AutoModel explicitly, follow the instructions below.

πŸ“¦ Install from a Wheel Package#

# Install the latest stable release from PyPI
# We first need to initialize the virtual environment using uv
uv venv

uv pip install nemo_automodel   # or: uv pip install --upgrade nemo_automodel

πŸ”§ Install from Source#

# Install the latest NeMo Automodel from the GitHub repo (best for development).
# We first need to initialize the virtual environment using uv
uv venv

# We can now install from source
uv pip install git+https://github.com/NVIDIA-NeMo/Automodel.git

Verify the Installation#

uv run python -c "import nemo_automodel; print('βœ… NeMo AutoModel ready')"

πŸ“‹ YAML Configuration Examples#

1. Distributed Training Configuration#

distributed:
  _target_: nemo_automodel.distributed.nvfsdp.NVFSDPManager
  dp_size: 8
  tp_size: 1
  cp_size: 1
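
The _target_ keys follow the common Hydra-style dotted-path convention: the value names a Python class or factory function, and the remaining keys become its keyword arguments. Below is a minimal sketch of that pattern; it only illustrates the convention and is not NeMo AutoModel's actual config loader.

import importlib

import yaml


def instantiate(cfg: dict):
    """Resolve the dotted path in `_target_` and call it with the remaining keys."""
    cfg = dict(cfg)  # copy so the caller's config is not mutated
    module_path, _, attr_name = cfg.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**cfg)


cfg = yaml.safe_load("""
distributed:
  _target_: nemo_automodel.distributed.nvfsdp.NVFSDPManager
  dp_size: 8
  tp_size: 1
  cp_size: 1
""")
# Instantiating NVFSDPManager assumes nemo_automodel is installed and may
# expect a distributed launch (e.g., via torchrun).
manager = instantiate(cfg["distributed"])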

2. LoRA Configuration#

peft:
  peft_fn: nemo_automodel._peft.lora.apply_lora_to_linear_modules
  match_all_linear: True
  dim: 8
  alpha: 32
  use_triton: True
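
Here dim is the LoRA rank and alpha scales the low-rank update (by alpha / dim in the usual formulation). The following is a conceptual plain-PyTorch sketch of that idea, not NeMo AutoModel's apply_lora_to_linear_modules implementation:

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update scaled by alpha / dim."""

    def __init__(self, base: nn.Linear, dim: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, dim, bias=False)
        self.lora_b = nn.Linear(dim, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a no-op update
        self.scaling = alpha / dim

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))


layer = LoRALinear(nn.Linear(64, 64), dim=8, alpha=32)
print(layer(torch.randn(2, 64)).shape)       # torch.Size([2, 64])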

3. Vision-Language Model Fine-Tuning#

model:
  _target_: nemo_automodel._transformers.NeMoAutoModelForImageTextToText.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-VL-3B-Instruct

processor:
  _target_: transformers.AutoProcessor.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-VL-3B-Instruct
  min_pixels: 200704
  max_pixels: 1003520
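
For orientation, the processor block maps directly onto the standard Transformers API. A rough plain-Transformers equivalent is sketched below; it assumes a recent Transformers release with Qwen2.5-VL support and downloads the checkpoint from the Hub on first run (the NeMoAutoModel* wrapper adds NeMo-specific optimizations on top).

from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(
    model_id,
    min_pixels=200704,   # image-resolution bounds matching the YAML above
    max_pixels=1003520,
)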

4. Checkpointing and Resume#

checkpoint:
  enabled: true
  checkpoint_dir: ./checkpoints
  save_consolidated: true      # HF-compatible safetensors
  model_save_format: safetensors
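
Because save_consolidated: true writes HF-compatible safetensors, a finished causal-LM run can typically be reloaded with plain Transformers. The snippet below is only a sketch; the exact subdirectory layout under checkpoint_dir depends on the recipe, so the path shown is hypothetical.

from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_dir = "./checkpoints/<step_or_epoch_dir>"   # hypothetical path; adjust to your run
model = AutoModelForCausalLM.from_pretrained(ckpt_dir)
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)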

πŸ—‚οΈ Project Structure#

NeMo-Automodel/
β”œβ”€β”€ nemo_automodel/              # Core library
β”‚   β”œβ”€β”€ _peft/                   # PEFT implementations (LoRA)
β”‚   β”œβ”€β”€ _transformers/           # HF model integrations  
β”‚   β”œβ”€β”€ checkpoint/              # Distributed checkpointing
β”‚   β”œβ”€β”€ datasets/                # Dataset loaders
β”‚   β”‚   β”œβ”€β”€ llm/                 # LLM datasets (HellaSwag, SQuAD, etc.)
β”‚   β”‚   └── vlm/                 # VLM datasets (CORD-v2, rdr, etc.)
β”‚   β”œβ”€β”€ distributed/             # FSDP2, nvFSDP, parallelization
β”‚   β”œβ”€β”€ loss/                    # Optimized loss functions
β”‚   └── training/                # Training recipes and utilities
β”œβ”€β”€ recipes/                     # Ready-to-use training recipes
β”‚   β”œβ”€β”€ llm/                     # LLM fine-tuning recipes
β”‚   └── vlm/                     # VLM fine-tuning recipes  
└── tests/                       # Comprehensive test suite

🀝 Contributing#

We welcome contributions! Please see our Contributing Guide for details.


πŸ“„ License#

NVIDIA NeMo AutoModel is licensed under the Apache License 2.0.