Release Notes
0.4.0 · 26.04 (2026-04-28) · PyPI · GH · NGC Docker
Highlights
- Discrete-diffusion LLMs (dLLM). SFT and generation support for dLLM models, including LLaDA.
- Embedding and retrieval training. Reranker training, biencoder datasets loaded directly from the Hugging Face Hub, in-batch negative sampling, and ONNX export for biencoder models.
- SkyPilot launcher. Native multi-node launch on cloud (SkyPilot, including Kubernetes), in addition to local interactive runs. SkyPilot and NeMo Run launchers are selected with YAML sections in the config; SLURM jobs use the sbatch slurm.sub workflow.
- CLI install profile. The nemo-automodel[cli] extra declares pyyaml beyond the package’s base dependencies for job-submission configs.
- Refreshed CLI. automodel <config.yaml> (alias am) replaces the older automodel <command> <domain> -c <config> form.
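In-batch negative sampling, listed under embedding and retrieval training above, treats every other passage in a batch as a negative for a given query. The following is a minimal, framework-free sketch of that idea using an InfoNCE-style loss. The function name, the temperature value, and the toy vectors are assumptions made for illustration and are not taken from the NeMo AutoModel API.

```python
import math

def in_batch_negative_loss(queries, passages, temperature=0.05):
    """InfoNCE-style loss: passage i is the positive for query i,
    and every other passage in the batch acts as a negative."""
    assert len(queries) == len(passages)
    losses = []
    for i, q in enumerate(queries):
        # Similarity of query i against every passage in the batch.
        logits = [sum(a * b for a, b in zip(q, p)) / temperature for p in passages]
        # Numerically stable log-sum-exp over the batch dimension.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the matching passage.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two (query, positive passage) pairs with unit-norm embeddings.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own passage
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the other passage
```

With these toy inputs the loss is close to zero for the aligned batch and large for the swapped one, which is the signal a biencoder would be trained on.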
New Models
- LLM: GLM-5, MiniMax-M2.5, Nemotron Super v3, Nemotron Nano 4B/8B.
- MoE / VLM: Qwen3.5-MoE (397B-A17B, 35B-A3B).
- VLM: Gemma 4, Mistral Small 4, Qwen3.5 small dense models.
- Diffusion: FLUX.1-dev, Wan 2.1 T2V, HunyuanVideo 1.5; Wan multi-resolution and LoRA recipes for diffusion.
Distributed Training
- Context parallelism for Qwen3.5-MoE and Nemotron v3.
- Pipeline parallelism for knowledge distillation.
- HybridEP and UCCL-EP as alternative expert-parallel dispatchers.
- FSDP2 weight prefetching and async TP optimization.
- TP > 1 in knowledge distillation.
Performance and Kernels
- TE Linear layers enabled for PEFT/LoRA.
- torch._grouped_mm expert backend.
- fp32 RMSNorm backend and cast_model_to_dtype controls.
- TP-aware KD loss with distributed softmax and T² scaling.
- FlashOptim optimizer integration.
- Sequence-packing updates: Qwen3.5-MoE VLM neat-packing recipe with EP+PP; generic THD collation for chat datasets; CP/BSHD padding fixes.
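The T² scaling mentioned for the KD loss refers to the common distillation convention of multiplying the temperature-softened KL term by the square of the temperature T, so that gradient magnitudes stay comparable as T changes. The sketch below shows that convention on a single device only; it does not reproduce the TP-aware distributed softmax, and the function names are hypothetical rather than part of the library.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of temperature-scaled logits (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, multiplied by T^2."""
    log_p = log_softmax(teacher_logits, temperature)   # teacher
    log_q = log_softmax(student_logits, temperature)   # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return (temperature ** 2) * kl
```

The loss is zero when student and teacher logits agree and grows as their softened distributions diverge.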
PEFT
- MoE LoRA: rank scaling, torch_mm integration, expert-LoRA init using config.expert_dim.
- merge_lora tool for materializing adapters into the base model.
- QLoRA PEFT checkpoints saved with the HF adapter prefix.
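Materializing a LoRA adapter into a base weight generally means adding the scaled low-rank product to the frozen matrix. The sketch below illustrates that arithmetic on plain nested lists, assuming the conventional alpha / r scaling. It is not the merge_lora tool's implementation, and merge_lora_weight is a hypothetical name used only here.

```python
def matmul(x, y):
    """Plain nested-list matrix product: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(x[i][t] * y[t][j] for t in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]

def merge_lora_weight(base, lora_a, lora_b, alpha):
    """Return base + (alpha / r) * (lora_b @ lora_a), where r is the adapter rank."""
    rank = len(lora_a)                 # lora_a has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)     # shape (out_features, in_features)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

# Toy 2x2 base weight with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # shape (r=1, in=2)
lora_b = [[0.5], [0.0]]    # shape (out=2, r=1)
merged = merge_lora_weight(base, lora_a, lora_b, alpha=2.0)
```

After merging, inference needs only the single dense matrix, so no adapter weights have to be loaded alongside the base model.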
Recipes and Workflow
- New recipes for Gemma 4 (LoRA), Nemotron Nano 4B SQuAD, Mistral Small 4, Tulu-3 E2E convergence, GPT-OSS 20B / Moonlight 16B convergence, and reranker / biencoder training.
- MFU logging for LLM and dLLM train recipes.
- Native Comet ML experiment tracking.
- NEFTune noisy embeddings for instruction fine-tuning.
- Scheduler-driven manual garbage collection.
- Common inference utility and .generate() with KV cache for Nemotron v3.
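NEFTune, listed above, perturbs token embeddings with uniform noise during instruction fine-tuning. A minimal sketch of the usual formulation follows, in which noise drawn from [-1, 1] is scaled by alpha / sqrt(seq_len * hidden_dim). The function name, the default alpha, and the toy shapes are assumptions for illustration and do not reflect how the recipe exposes the setting.

```python
import math
import random

def neftune_noise(embeddings, alpha=5.0, rng=None):
    """Add uniform noise scaled by alpha / sqrt(seq_len * hidden_dim)."""
    rng = rng or random.Random()
    seq_len, hidden_dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * hidden_dim)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]

# Toy (seq_len=4, hidden_dim=8) input of zeros, so the output is the noise itself.
embeddings = [[0.0] * 8 for _ in range(4)]
noisy = neftune_noise(embeddings, alpha=5.0, rng=random.Random(0))
```

The noise is applied only at training time; evaluation and generation use the unperturbed embeddings.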
Checkpointing
- v4_compatible checkpoint format.
- Diffusion full fine-tuning and pretraining examples use the safetensors checkpoint format; diffusion LoRA examples use torch_save.
- QLoRA / LoRA loading robustness; tied-weight handling moved out of _init_model.
Fixes
- FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params.
- Activation checkpointing silently skipped on registered VLMs (ModuleList flattening).
- Gradient checkpointing for MoE models on single GPU (ep_size=1).
- Gradient clipping with torch_mm + EP (GPT-OSS 120B recipe).
- Rotary embeddings for v4 models; inputs_embeds passthrough for Nano v3.
Breaking Changes
A migration guide for the new CLI, the recipe YAML section, the SLURM
sbatch-script workflow, and the nemo-automodel[cli] install profile is in
Breaking Changes.
0.3.0 · 26.02 (2026-02-26) · PyPI · GH · NGC Docker
Highlights
- Transformers v4 / v5 alignment. New transformers v4 API support and a v5 refactor for device-mesh-only model init.
- Streaming safetensors writer for faster checkpoint export.
- Faster fp8 dequant kernels with DTensor dequantization fixes for DSv3.
New Models
- LLM: DeepSeek V3.2, Step-3.5-Flash, MiniMax-M2.1, Nemotron-3-Nano-30B-A3B, Nemotron Flash 1B, GLM-4.7, Devstral-Small-2-24B.
- MoE / VLM / Omni: Qwen3-VL (4B/8B), Qwen3-VL-MoE (30B/235B), Kimi-VL, Kimi-K2.5 VL, Nemotron-Parse VLM, InternVL3.5-4B, Ministral3 (3B/8B/14B), Phi-4-multimodal.
Distributed Training
- v5 refactor: device-mesh-only model init.
- TP plan for Ministral; Ministral3 ported to transformers v4.
- Pipeline-parallelism validation support.
- Parallel diffusers generate().
Performance and Kernels
- TE fp8 for models that support it.
- GroupedExpertsTE backend (prerequisite for MoE fp8).
- TE RoPE fusion for custom MoE models; norm fusion and RoPE cache for dense models.
- Improved import time.
PEFT
- DoRA implementation.
- LoRA support for custom MoEs.
- LoRA support in Biencoder.
Datasets and Workflow
- Databricks Delta Lake dataset support; consolidation for Databricks.
- Parquet file support; inline text dataset format.
- ColumnMapped: configurable special tokens, chat-template flags, and answer-only masking.
- Hard negative mining and biencoder + inline-dataset tests.
- nsys benchmark support and model-layer name scoping in the CLI.
- Updated checkpoint auto-loading with explicit restore_from.
- Dion optimizer.
- FunctionGemma + xLAM tool-calling recipes.
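Answer-only masking, mentioned for ColumnMapped above, usually means that prompt tokens are excluded from the loss so that only answer tokens contribute. The sketch below shows one common way to express this, by writing the ignore index -100 (the default ignore_index of PyTorch's CrossEntropyLoss) into the label positions that belong to the prompt. The function and argument names are hypothetical and are not the dataset's actual interface.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default

def build_labels(prompt_ids, answer_ids, answer_only=True):
    """Concatenate prompt and answer tokens; optionally mask prompt positions."""
    input_ids = list(prompt_ids) + list(answer_ids)
    if answer_only:
        # Prompt positions are ignored by the loss; answer positions are kept.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    else:
        labels = list(input_ids)
    return {"input_ids": input_ids, "labels": labels}

example = build_labels(prompt_ids=[11, 12, 13], answer_ids=[21, 22])
```

Here the three prompt positions carry -100 and only the two answer tokens are scored.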
Fixes
- inputs_embeds passthrough for Nano v3.
- from_pretrained / from_config simplification with model-id pass-through.
- Tied-embedding detection improvements.
0.2.0 · 25.11 (2025-12-04) · PyPI · GH · NGC Docker
Highlights
- Async checkpointing. Checkpoint refactor with async DCP and HF safetensors backport / consolidation.
- Custom MoE optimizations. FSDP optimizations, packed-sequence + context parallel through TE, configurable router precision, fp32 lm_head and fp32 apply_rope.
- Performance documentation. New performance-summary doc and benchmarking recipe with configs.
- Multinode + cluster guidance. Multinode configs and updated launcher docs.
New Models
- MoE: Qwen3 MoE custom implementation, Qwen3 Next, GPT-OSS (custom implementation, dequantization, DGX Spark recipe), GLM 4 / 4.5 / 4.6 MoE, GLM 4.5 Air, Moonlight 2L test, Phi 4 (TP plan).
- Omni / VLM: Qwen3-Omni OOTB recipe and custom implementation.
- DeepSeek v3 with fp8 base checkpoint loading.
- Sequence classification: Qwen3ForSequenceClassification registered; generic SFT sequence-classification recipe.
Distributed Training
- VLM expert-parallel recipe support.
- PP for VLM; PEFT with PP.
- Sharding optimization for SP / LoRA.
- clip_grad_norm across all parallelism modes.
- fully_shard_by_dtype option.
- Out-of-tree (OOT) parallelism decorator.
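Gradient-norm clipping under sharded parallelism depends on a global norm: each rank holds only part of the gradients, so the per-shard squared norms have to be summed before the square root is taken. The sketch below simulates that reduction with plain lists standing in for ranks. It illustrates the arithmetic only, under the assumption of L2-norm clipping, and the helper names are hypothetical rather than the library's clip_grad_norm implementation.

```python
import math

def global_grad_norm(shard_grads):
    """L2 norm over all shards: sqrt of the summed per-shard squared norms."""
    # In a real run the sum over shards would be an all-reduce across ranks.
    return math.sqrt(sum(g * g for shard in shard_grads for g in shard))

def clip_coefficient(total_norm, max_norm, eps=1e-6):
    """Scale factor applied to every gradient; never scales gradients up."""
    return min(1.0, max_norm / (total_norm + eps))

shards = [[3.0, 0.0], [0.0, 4.0]]            # gradients held by two ranks
total = global_grad_norm(shards)
coef = clip_coefficient(total, max_norm=1.0)
clipped = [[g * coef for g in shard] for shard in shards]
```

Clipping each shard against its own local norm would apply a different scale per rank, which is why the reduction comes first.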
Performance and Kernels
- Mask creation moved into the data pipeline for better performance.
- TE attention for GPT-OSS.
- Faster fp8 dequant; auto-detect base-weights dequant.
PEFT
- LoRA-aware ColwiseParallel / RowwiseParallel.
- LoRA + TE.
- MFU estimation for LoRA.
- Additional PEFT LoRA recipes.
Datasets and Recipes
- Multiturn chat dataset; VLM multiturn chat support.
- Tool-calling dataset and recipe.
- Streaming dataset.
- Multiple validation datasets with per-dataset logging.
- ColumnMapped: surface truncating + padding options.
- Configurable max-clip-grad; configurable remote-logging frequency using step_scheduler.
- Validation-loss checkpoint, run-val-at-ckpt, best-ckpt symlink.
- InternVL recipe; Qwen3-VL 30B recipe; Llama-Embed-Nemotron-8B training.
Logging and Observability
- MLflow integration.
- Metric logger with JSONL output.
- YAML logging-to-stdout improvements.
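JSONL output for metrics means one JSON object per line, which keeps the file appendable during training and easy to tail or parse afterwards. The following is a minimal sketch of that pattern using only the standard library. The function names and the record fields (step, loss) are assumptions for illustration and do not describe the actual metric logger's schema.

```python
import json
import os
import tempfile

def log_metrics(path, step, **metrics):
    """Append one JSON object per line so the file can be streamed or tailed."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"step": step, **metrics}) + "\n")

def read_metrics(path):
    """Parse the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "metrics.jsonl")
log_metrics(path, step=0, loss=2.5)
log_metrics(path, step=1, loss=2.25)
records = read_metrics(path)
```

Because each line is self-contained, a partially written run still yields a readable prefix of records.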
Workflow
- Knowledge-distillation custom validation step; ScopedModuleOffloading to reduce memory.
- Model Registry component.
- SIGTERM handling.
- NEMO_ENABLE_USER_MODULES for user-extension modules.
- Rank-0 download for custom models.
- Dereference env vars in YAML.
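Dereferencing environment variables in YAML can be pictured as a pass over the parsed config that expands variable references in string values. The sketch below does this with os.path.expandvars on an already-parsed dictionary. The ${VAR} syntax, the config keys, and the DEMO_CKPT_ROOT variable are assumptions for illustration; these notes do not specify the syntax NeMo AutoModel accepts.

```python
import os

def resolve_env_vars(node):
    """Recursively expand $VAR / ${VAR} references in string values."""
    if isinstance(node, dict):
        return {k: resolve_env_vars(v) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_env_vars(v) for v in node]
    if isinstance(node, str):
        return os.path.expandvars(node)
    return node  # numbers, booleans, and None pass through unchanged

# Hypothetical variable and config keys, standing in for a parsed YAML file.
os.environ["DEMO_CKPT_ROOT"] = "/tmp/ckpts"
config = {"checkpoint": {"dir": "${DEMO_CKPT_ROOT}/run1", "keep": 3}}
resolved = resolve_env_vars(config)
```

Note that os.path.expandvars leaves references to unset variables untouched, so a stricter loader might prefer to raise an error instead.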
0.1.2 (2025-10-23) · PyPI · GH
Patch release.
- Fix: max_steps now set inside the constructor (#650).
- Fix: step scheduler switched to zero-based indexing (#627).
- Fix: sample-limit handling for ColumnMapped datasets (#521).
0.1.0 (2025-10-08) · PyPI · GH
Initial public release of NeMo AutoModel.
Highlights
- PyTorch-native training framework for LLMs and VLMs with Hugging Face Transformers integration via NeMoAuto* wrapper classes.
- YAML-driven recipes for SFT and PEFT.
- FSDP2 / HSDP / DDP distributed training with DTensor sharding.
- Megatron-FSDP available as the default heavy-duty sharding option (replaces the earlier nvFSDP path).
- Knowledge distillation recipe.
- MoE component with DeepSeek v3 model implementation.
- ColumnMappedTextInstructionDataset for instruction tuning.
- Gradient checkpointing.
- SLURM launcher.
For the list of newly supported models per release, see the Model Coverage Release Log.