Release Notes#
0.4.0 · 26.04 (2026-04-28) · PyPI · GH · NGC Docker#
Highlights#
Discrete-diffusion LLMs (dLLM). SFT and generation support for dLLM models, including Llada.
Embedding and retrieval training. Reranker training, biencoder datasets loaded directly from the Hugging Face Hub, in-batch negative sampling (see the sketch after this list), and ONNX export for biencoder models.
SkyPilot launcher. Native multi-node launch on cloud (SkyPilot, including Kubernetes), in addition to local interactive runs. SkyPilot and NeMo Run launchers are selected with YAML sections in the config; SLURM jobs use the sbatch slurm.sub workflow.
CLI install profile. The nemo-automodel[cli] extra declares pyyaml beyond the package's base dependencies for job-submission configs.
Refreshed CLI. automodel <config.yaml> (alias am) replaces the older automodel <command> <domain> -c <config> form.
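In-batch negative sampling means each query in a biencoder batch treats its own document as the positive and every other document in the batch as a negative. The snippet below is a minimal, generic sketch of that loss, not AutoModel's implementation; the tensor names and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb: torch.Tensor,
                           doc_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss where each query's positive is its paired document
    and every other document in the batch serves as a negative."""
    # Cosine-similarity logits between every query and every document.
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature            # [B, B]
    # The diagonal holds the matching (query, positive-document) pairs.
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)

# Toy usage: a batch of 4 query/document embedding pairs.
q = torch.randn(4, 128)
d = torch.randn(4, 128)
print(in_batch_negative_loss(q, d))
```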
New Models#
LLM: GLM-5, MiniMax-M2.5, Nemotron Super v3, Nemotron Nano 4B/8B.
MoE / VLM: Qwen3.5-MoE (397B-A17B, 35B-A3B).
VLM: Gemma 4, Mistral Small 4, Qwen3.5 small dense models.
Diffusion: FLUX.1-dev, Wan 2.1 T2V, HunyuanVideo 1.5; Wan multi-resolution and LoRA recipes for diffusion.
Distributed Training#
Context parallelism for Qwen3.5-MoE and Nemotron v3.
Pipeline parallelism for knowledge distillation.
HybridEP and UCCL-EP as alternative expert-parallel dispatchers.
FSDP2 weight prefetching and async TP optimization.
TP > 1 in knowledge distillation.
Performance and Kernels#
TE Linear layers enabled for PEFT/LoRA.
torch._grouped_mm expert backend.
fp32 RMSNorm backend and cast_model_to_dtype controls.
TP-aware KD loss with distributed softmax and T² scaling (see the sketch after this list).
FlashOptim optimizer integration.
Sequence-packing updates: Qwen3.5-MoE VLM neat-packing recipe with EP+PP; generic THD collation for chat datasets; CP/BSHD padding fixes.
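For context on the T² factor: in standard knowledge distillation, logits from both models are softened by a temperature T, and the KL term is multiplied by T² so its gradient magnitude stays comparable to the hard-label loss. The sketch below shows that plain single-GPU form only; it is not the TP-aware, distributed-softmax version shipped in this release.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened distributions.

    Dividing logits by T flattens both distributions; multiplying the
    resulting KL term by T**2 restores the gradient scale so it can be
    mixed with an ordinary cross-entropy term."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return kl * temperature ** 2

# Toy usage on random vocabulary logits.
s = torch.randn(8, 32000)
t = torch.randn(8, 32000)
print(kd_loss(s, t))
```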
PEFT#
MoE LoRA: rank scaling, torch_mm integration, expert-LoRA init using config.expert_dim.
merge_lora tool for materializing adapters into the base model.
QLoRA PEFT checkpoints saved with the HF adapter prefix.
Recipes and Workflow#
New recipes for Gemma 4 (LoRA), Nemotron Nano 4B SQuAD, Mistral Small 4, Tulu-3 E2E convergence, GPT-OSS 20B / Moonlight 16B convergence, and reranker / biencoder training.
MFU logging for LLM and dLLM train recipes.
Native Comet ML experiment tracking.
NEFTune noisy embeddings for instruction fine-tuning (see the sketch after this list).
Scheduler-driven manual garbage collection.
Common inference utility and .generate() with KV cache for Nemotron v3.
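NEFTune, as described in the original paper, perturbs token embeddings during fine-tuning with uniform noise scaled by alpha / sqrt(seq_len * dim). The sketch below illustrates that published recipe in isolation; it is not AutoModel's hook, and the alpha value and the requires_grad check (used here as a stand-in for "training mode") are illustrative.

```python
import torch

def neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune noise to a [batch, seq_len, dim] embedding tensor.

    Noise is sampled uniformly from [-1, 1] and scaled by
    alpha / sqrt(seq_len * dim), following the NEFTune paper."""
    if not embeddings.requires_grad:          # skip during eval / generation
        return embeddings
    seq_len, dim = embeddings.shape[-2], embeddings.shape[-1]
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1, 1) * scale
    return embeddings + noise

# Toy usage: noise is only applied to tensors that carry gradients.
emb = torch.randn(2, 16, 64, requires_grad=True)
print(neftune_noise(emb).shape)
```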
Checkpointing#
v4_compatible checkpoint format.
Diffusion full fine-tuning and pretraining examples use safetensors checkpoint format; diffusion LoRA examples use torch_save.
QLoRA / LoRA loading robustness; tied-weight handling moved out of _init_model.
Fixes#
FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params.
Activation checkpointing silently skipped on registered VLMs (ModuleList flattening).
Gradient checkpointing for MoE models on single GPU (ep_size=1).
Gradient clipping with torch_mm + EP (GPT-OSS 120B recipe).
Rotary embeddings for v4 models; inputs_embeds passthrough for Nano v3.
Breaking Changes#
A migration guide for the new CLI, the recipe YAML section, the SLURM sbatch-script workflow, and the nemo-automodel[cli] install profile is in Breaking Changes.
0.3.0 · 26.02 (2026-02-26) · PyPI · GH · NGC Docker#
Highlights#
Transformers v4 / v5 alignment. New transformers v4 API support and a v5 refactor for device-mesh-only model init.
Streaming safetensors writer for faster checkpoint export.
Faster fp8 dequant kernels with DTensor dequantization fixes for DSv3.
New Models#
LLM: DeepSeek V3.2, Step-3.5-Flash, MiniMax-M2.1, Nemotron-3-Nano-30B-A3B, Nemotron Flash 1B, GLM-4.7, Devstral-Small-2-24B.
MoE / VLM / Omni: Qwen3-VL (4B/8B), Qwen3-VL-MoE (30B/235B), Kimi-VL, Kimi-K2.5 VL, Nemotron-Parse VLM, InternVL3.5-4B, Ministral3 (3B/8B/14B), Phi-4-multimodal.
Distributed Training#
v5 refactor: device-mesh-only model init.
TP plan for Ministral; Ministral3 ported to transformers v4.
Pipeline-parallelism validation support.
Parallel diffusers generate().
Performance and Kernels#
TE fp8 for models that support it.
GroupedExpertsTE backend (prerequisite for MoE fp8).
TE RoPE fusion for custom MoE models; norm fusion and RoPE cache for dense models.
Improved import time.
PEFT#
DoRA implementation.
LoRA support for custom MoEs.
LoRA support in Biencoder.
Datasets and Workflow#
Databricks Delta Lake dataset support; consolidation for Databricks.
Parquet file support; inline text dataset format.
ColumnMapped: configurable special tokens, chat-template flags, and answer-only masking.
Hard negative mining (see the sketch after this list) and biencoder + inline-dataset tests.
nsys benchmark support and model-layer name scoping in the CLI.
Updated checkpoint auto-loading with explicit restore_from.
Dion optimizer.
Functiongemma + xlam tool-calling recipes.
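Hard negative mining generally means selecting, for each query, the highest-scoring corpus passages that are not its known positive, then using them as negatives in the next training round. The snippet below is a minimal generic sketch of that selection step; the function and tensor names are illustrative and the actual AutoModel pipeline may differ.

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(query_emb: torch.Tensor,
                        corpus_emb: torch.Tensor,
                        positive_idx: torch.Tensor,
                        k: int = 4) -> torch.Tensor:
    """Return indices of the k most similar corpus items per query,
    excluding each query's gold positive (these become hard negatives)."""
    query_emb = F.normalize(query_emb, dim=-1)
    corpus_emb = F.normalize(corpus_emb, dim=-1)
    scores = query_emb @ corpus_emb.T                      # [Q, N]
    # Mask out the gold positives so they can never be selected.
    scores.scatter_(1, positive_idx.unsqueeze(1), float("-inf"))
    return scores.topk(k, dim=1).indices                   # [Q, k]

# Toy usage: 3 queries against a 10-passage corpus.
q = torch.randn(3, 64)
c = torch.randn(10, 64)
pos = torch.tensor([0, 4, 7])
print(mine_hard_negatives(q, c, pos))
```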
Fixes#
inputs_embeds passthrough for Nano v3.
from_pretrained / from_config simplification with model-id pass-through.
Tied-embedding detection improvements.
0.2.0 · 25.11 (2025-12-04) · PyPI · GH · NGC Docker#
Highlights#
Async checkpointing. Checkpoint refactor with async DCP and HF safetensors backport / consolidation (see the sketch after this list).
Custom MoE optimizations. FSDP optimizations, packed-sequence + context parallel through TE, configurable router precision, fp32 lm_head, and fp32 apply_rope.
Performance documentation. New performance-summary doc and benchmarking recipe with configs.
Multinode + cluster guidance. Multinode configs and updated launcher docs.
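The core idea behind async checkpointing is to snapshot weights to CPU quickly and write them to disk from a background worker, so the training loop only pays for the copy rather than the full I/O. The sketch below shows that generic single-process pattern with plain torch.save and a thread; it is an illustration of the idea, not the async-DCP path this release refers to.

```python
import threading
import torch
import torch.nn as nn

def async_save(model: nn.Module, path: str) -> threading.Thread:
    """Copy parameters to CPU synchronously, then write them to disk
    from a background thread so the next training step can start."""
    # The CPU copy is the only part that must finish before training resumes.
    cpu_state = {k: v.detach().to("cpu", copy=True)
                 for k, v in model.state_dict().items()}
    writer = threading.Thread(target=torch.save, args=(cpu_state, path))
    writer.start()
    return writer          # join() before the next save or at shutdown

# Toy usage with a tiny model.
m = nn.Linear(8, 8)
handle = async_save(m, "/tmp/ckpt.pt")
handle.join()
```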
New Models#
MoE: Qwen3 MoE custom implementation, Qwen3 Next, GPT-OSS (custom implementation, dequantization, DGX Spark recipe), GLM 4 / 4.5 / 4.6 MoE, GLM 4.5 Air, Moonlight 2L test, Phi 4 (TP plan).
Omni / VLM: Qwen3-Omni OOTB recipe and custom implementation.
DeepSeek v3 with fp8 base checkpoint loading.
Sequence classification: Qwen3ForSequenceClassification registered; generic SFT sequence-classification recipe.
Distributed Training#
VLM expert-parallel recipe support.
PP for VLM; PEFT with PP.
Sharding optimization for SP / LoRA.
clip_grad_norm across all parallelism modes.
fully_shard_by_dtype option.
Out-of-tree (OOT) parallelism decorator.
Performance and Kernels#
Mask creation moved into the data pipeline for better performance.
TE attention for GPT-OSS.
Faster fp8 dequant; auto-detect base-weights dequant.
PEFT#
LoRA-aware ColwiseParallel / RowwiseParallel.
LoRA + TE.
MFU estimation for LoRA (see the sketch after this list).
Additional PEFT LoRA recipes.
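MFU (model FLOPs utilization) compares achieved training throughput against the hardware's peak compute. A common approximation charges roughly 6 FLOPs per parameter per trained token (forward + backward) for a dense transformer; for LoRA the exact count depends on whether frozen base-model FLOPs are included. The sketch below uses that generic approximation with illustrative numbers and is not AutoModel's estimator.

```python
def estimate_mfu(params: float,
                 tokens_per_second: float,
                 peak_flops_per_second: float) -> float:
    """Rough MFU using the ~6 * N FLOPs-per-token training approximation
    for a dense transformer with N parameters."""
    achieved_flops = 6.0 * params * tokens_per_second
    return achieved_flops / peak_flops_per_second

# Illustrative numbers only: an 8B-parameter model at 4,000 tokens/s
# on a GPU with ~989 TFLOP/s of bf16 peak compute.
print(f"MFU ~ {estimate_mfu(8e9, 4_000, 989e12):.1%}")
```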
Datasets and Recipes#
Multiturn chat dataset; VLM multiturn chat support.
Tool-calling dataset and recipe.
Streaming dataset.
Multiple validation datasets with per-dataset logging.
ColumnMapped: surface truncation and padding options.
Configurable max-clip-grad; configurable remote-logging frequency using step_scheduler.
Validation-loss checkpoint, run-val-at-ckpt, best-ckpt symlink.
InternVL recipe; Qwen3-VL 30B recipe; Llama-Embed-Nemotron-8B training.
Logging and Observability#
MLflow integration.
Metric logger with JSONL output.
YAML logging-to-stdout improvements.
Workflow#
Knowledge-distillation custom validation step; ScopedModuleOffloading to reduce memory.
Model Registry component.
SIGTERM handling.
NEMO_ENABLE_USER_MODULES for user-extension modules.
Rank-0 download for custom models.
Dereference env vars in YAML.
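Dereferencing environment variables in YAML typically means expanding $VAR / ${VAR} references in string values after parsing. Below is a minimal generic sketch using PyYAML and os.path.expandvars; AutoModel's own resolver may behave differently (for example around defaults or unset-variable errors).

```python
import os
import yaml

def expand_env_vars(node):
    """Recursively expand $VAR / ${VAR} references in parsed YAML values."""
    if isinstance(node, dict):
        return {k: expand_env_vars(v) for k, v in node.items()}
    if isinstance(node, list):
        return [expand_env_vars(v) for v in node]
    if isinstance(node, str):
        return os.path.expandvars(node)
    return node

# Toy usage.
os.environ["DATA_ROOT"] = "/datasets"
cfg = yaml.safe_load("train:\n  path: ${DATA_ROOT}/squad\n  epochs: 2\n")
print(expand_env_vars(cfg))   # {'train': {'path': '/datasets/squad', 'epochs': 2}}
```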
0.1.2 (2025-10-23) · PyPI · GH#
Patch release.
Fix: max_steps now set inside the constructor (#650).
Fix: step scheduler switched to zero-based indexing (#627).
Fix: sample-limit handling for ColumnMapped datasets (#521).
0.1.0 (2025-10-08) · PyPI · GH#
Initial public release of NeMo AutoModel.
Highlights#
PyTorch-native training framework for LLMs and VLMs with Hugging Face Transformers integration via NeMoAuto* wrapper classes.
YAML-driven recipes for SFT and PEFT.
FSDP2 / HSDP / DDP distributed training with DTensor sharding.
Megatron-FSDP available as the default heavy-duty sharding option (replaces the earlier nvFSDP path).
Knowledge distillation recipe.
MoE component with DeepSeek v3 model implementation.
ColumnMappedTextInstructionDataset for instruction tuning.
Gradient checkpointing.
SLURM launcher.
For the list of newly supported models per release, see the Model Coverage Release Log.