Release Notes#
0.4.0 · 26.04 (2026-04-28) · PyPI · GH · NGC Docker#
Highlights#
Discrete-diffusion LLMs (dLLM). SFT and generation support for dLLM models, including Llada.
Embedding and retrieval training. Reranker training, biencoder datasets loaded directly from the Hugging Face Hub, in-batch negative sampling (see the sketch after this list), and ONNX export for biencoder models.
SkyPilot launcher. Native multi-node launch on cloud (SkyPilot, including Kubernetes), in addition to local interactive runs. SkyPilot and NeMo Run launchers are selected with YAML sections in the config; SLURM jobs use the sbatch slurm.sub workflow.
CLI install profile. The nemo-automodel[cli] extra declares pyyaml beyond the package's base dependencies for job-submission configs.
Refreshed CLI. automodel <config.yaml> (alias am) replaces the older automodel <command> <domain> -c <config> form.
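In-batch negative sampling means each query in a biencoder batch treats its own document as the positive and every other document in the batch as a negative. The snippet below is a minimal, generic sketch of that loss, not AutoModel's implementation; the tensor names and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb: torch.Tensor,
                           doc_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss where each query's positive is its paired document
    and every other document in the batch serves as a negative."""
    # Cosine-similarity logits between every query and every document.
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature            # [B, B]
    # The diagonal holds the matching (query, positive-document) pairs.
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)

# Toy usage: a batch of 4 query/document embedding pairs.
q = torch.randn(4, 128)
d = torch.randn(4, 128)
print(in_batch_negative_loss(q, d))
```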
New Models#
LLM: GLM-5, MiniMax-M2.5, Nemotron Super v3, Nemotron Nano 4B/8B.
MoE / VLM: Qwen3.5-MoE (397B-A17B, 35B-A3B).
VLM: Gemma 4, Mistral Small 4, Qwen3.5 small dense models.
Diffusion: FLUX.1-dev, Wan 2.1 T2V, HunyuanVideo 1.5; Wan multi-resolution and LoRA recipes for diffusion.
Distributed Training#
Context parallelism for Qwen3.5-MoE and Nemotron v3.
Pipeline parallelism for knowledge distillation.
HybridEP and UCCL-EP as alternative expert-parallel dispatchers.
FSDP2 weight prefetching and async TP optimization.
TP > 1 in knowledge distillation.
Performance and Kernels#
TE Linear layers enabled for PEFT/LoRA.
torch._grouped_mm expert backend.
fp32 RMSNorm backend and cast_model_to_dtype controls.
TP-aware KD loss with distributed softmax and T² scaling (see the sketch after this list).
FlashOptim optimizer integration.
Sequence-packing updates: Qwen3.5-MoE VLM neat-packing recipe with EP+PP; generic THD collation for chat datasets; CP/BSHD padding fixes.
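For context on the T² factor: in standard knowledge distillation, logits from both models are softened by a temperature T, and the KL term is multiplied by T² so its gradient magnitude stays comparable to the hard-label loss. The sketch below shows that plain single-GPU form only; it is not the TP-aware, distributed-softmax version shipped in this release.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened distributions.

    Dividing logits by T flattens both distributions; multiplying the
    resulting KL term by T**2 restores the gradient scale so it can be
    mixed with an ordinary cross-entropy term."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return kl * temperature ** 2

# Toy usage on random vocabulary logits.
s = torch.randn(8, 32000)
t = torch.randn(8, 32000)
print(kd_loss(s, t))
```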
PEFT#
MoE LoRA: rank scaling, torch_mm integration, expert-LoRA init using config.expert_dim.
merge_lora tool for materializing adapters into the base model.
QLoRA PEFT checkpoints saved with the HF adapter prefix.
Recipes and Workflow#
New recipes for Gemma 4 (LoRA), Nemotron Nano 4B SQuAD, Mistral Small 4, Tulu-3 E2E convergence, GPT-OSS 20B / Moonlight 16B convergence, and reranker / biencoder training.
MFU logging for LLM and dLLM train recipes.
Native Comet ML experiment tracking.
NEFTune noisy embeddings for instruction fine-tuning (see the sketch after this list).
Scheduler-driven manual garbage collection.
Common inference utility and .generate() with KV cache for Nemotron v3.
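NEFTune, as described in the original paper, perturbs token embeddings during fine-tuning with uniform noise scaled by alpha / sqrt(seq_len * dim). The sketch below illustrates that published recipe in isolation; it is not AutoModel's hook, and the alpha value and the requires_grad check (used here as a stand-in for "training mode") are illustrative.

```python
import torch

def neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune noise to a [batch, seq_len, dim] embedding tensor.

    Noise is sampled uniformly from [-1, 1] and scaled by
    alpha / sqrt(seq_len * dim), following the NEFTune paper."""
    if not embeddings.requires_grad:          # skip during eval / generation
        return embeddings
    seq_len, dim = embeddings.shape[-2], embeddings.shape[-1]
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1, 1) * scale
    return embeddings + noise

# Toy usage: noise is only applied to tensors that carry gradients.
emb = torch.randn(2, 16, 64, requires_grad=True)
print(neftune_noise(emb).shape)
```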
Checkpointing#
v4_compatible checkpoint format.
Diffusion full fine-tuning and pretraining examples use safetensors checkpoint format; diffusion LoRA examples use torch_save.
QLoRA / LoRA loading robustness; tied-weight handling moved out of _init_model.
Fixes#
FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params.
Activation checkpointing silently skipped on registered VLMs (ModuleList flattening).
Gradient checkpointing for MoE models on single GPU (ep_size=1).
Gradient clipping with torch_mm + EP (GPT-OSS 120B recipe).
Rotary embeddings for v4 models; inputs_embeds passthrough for Nano v3.
Breaking Changes#
A migration guide for the new CLI, the recipe YAML section, the SLURM sbatch-script workflow, and the nemo-automodel[cli] install profile is in Breaking Changes.
0.3.0 · 26.02 (2026-02-26) · PyPI · GH · NGC Docker#
Highlights#
Transformers v4 / v5 alignment. New transformers v4 API support and a v5 refactor for device-mesh-only model init.
Streaming safetensors writer for faster checkpoint export.
Faster fp8 dequant kernels with DTensor dequantization fixes for DSv3.
New Models#
LLM: DeepSeek V3.2, Step-3.5-Flash, MiniMax-M2.1, Nemotron-3-Nano-30B-A3B, Nemotron Flash 1B, GLM-4.7, Devstral-Small-2-24B.
MoE / VLM / Omni: Qwen3-VL (4B/8B), Qwen3-VL-MoE (30B/235B), Kimi-VL, Kimi-K2.5 VL, Nemotron-Parse VLM, InternVL3.5-4B, Ministral3 (3B/8B/14B), Phi-4-multimodal.
Distributed Training#
v5 refactor: device-mesh-only model init.
TP plan for Ministral; Ministral3 ported to transformers v4.
Pipeline-parallelism validation support.
Parallel diffusers generate().
Performance and Kernels#
TE fp8 for models that support it.
GroupedExpertsTE backend (prerequisite for MoE fp8).
TE RoPE fusion for custom MoE models; norm fusion and RoPE cache for dense models.
Improved import time.
PEFT#
DoRA implementation.
LoRA support for custom MoEs.
LoRA support in Biencoder.
Datasets and Workflow#
Databricks Delta Lake dataset support; consolidation for Databricks.
Parquet file support; inline text dataset format.
ColumnMapped: configurable special tokens, chat-template flags, and answer-only masking.
Hard negative mining (see the sketch after this list) and biencoder + inline-dataset tests.
nsys benchmark support and model-layer name scoping in the CLI.
Updated checkpoint auto-loading with explicit restore_from.
Dion optimizer.
Functiongemma + xlam tool-calling recipes.
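Hard negative mining generally means selecting, for each query, the highest-scoring corpus passages that are not its known positive, then using them as negatives in the next training round. The snippet below is a minimal generic sketch of that selection step; the function and tensor names are illustrative and the actual AutoModel pipeline may differ.

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(query_emb: torch.Tensor,
                        corpus_emb: torch.Tensor,
                        positive_idx: torch.Tensor,
                        k: int = 4) -> torch.Tensor:
    """Return indices of the k most similar corpus items per query,
    excluding each query's gold positive (these become hard negatives)."""
    query_emb = F.normalize(query_emb, dim=-1)
    corpus_emb = F.normalize(corpus_emb, dim=-1)
    scores = query_emb @ corpus_emb.T                      # [Q, N]
    # Mask out the gold positives so they can never be selected.
    scores.scatter_(1, positive_idx.unsqueeze(1), float("-inf"))
    return scores.topk(k, dim=1).indices                   # [Q, k]

# Toy usage: 3 queries against a 10-passage corpus.
q = torch.randn(3, 64)
c = torch.randn(10, 64)
pos = torch.tensor([0, 4, 7])
print(mine_hard_negatives(q, c, pos))
```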
Fixes#
inputs_embeds passthrough for Nano v3.
from_pretrained / from_config simplification with model-id pass-through.
Tied-embedding detection improvements.
0.2.0 · 25.11 (2025-12-04) · PyPI · GH · NGC Docker#
Highlights#
Async checkpointing. Checkpoint refactor with async DCP and HF safetensors backport / consolidation (see the sketch after this list).
Custom MoE optimizations. FSDP optimizations, packed-sequence + context parallel through TE, configurable router precision, fp32 lm_head, and fp32 apply_rope.
Performance documentation. New performance-summary doc and benchmarking recipe with configs.
Multinode + cluster guidance. Multinode configs and updated launcher docs.
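The core idea behind async checkpointing is to snapshot weights to CPU quickly and write them to disk from a background worker, so the training loop only pays for the copy rather than the full I/O. The sketch below shows that generic single-process pattern with plain torch.save and a thread; it is an illustration of the idea, not the async-DCP path this release refers to.

```python
import threading
import torch
import torch.nn as nn

def async_save(model: nn.Module, path: str) -> threading.Thread:
    """Copy parameters to CPU synchronously, then write them to disk
    from a background thread so the next training step can start."""
    # The CPU copy is the only part that must finish before training resumes.
    cpu_state = {k: v.detach().to("cpu", copy=True)
                 for k, v in model.state_dict().items()}
    writer = threading.Thread(target=torch.save, args=(cpu_state, path))
    writer.start()
    return writer          # join() before the next save or at shutdown

# Toy usage with a tiny model.
m = nn.Linear(8, 8)
handle = async_save(m, "/tmp/ckpt.pt")
handle.join()
```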
New Models#
MoE: Qwen3 MoE custom implementation, Qwen3 Next, GPT-OSS (custom implementation, dequantization, DGX Spark recipe), GLM 4 / 4.5 / 4.6 MoE, GLM 4.5 Air, Moonlight 2L test, Phi 4 (TP plan).
Omni / VLM: Qwen3-Omni OOTB recipe and custom implementation.
DeepSeek v3 with fp8 base checkpoint loading.
Sequence classification: Qwen3ForSequenceClassification registered; generic SFT sequence-classification recipe.
Distributed Training#
VLM expert-parallel recipe support.
PP for VLM; PEFT with PP.
Sharding optimization for SP / LoRA.
clip_grad_norm across all parallelism modes.
fully_shard_by_dtype option.
Out-of-tree (OOT) parallelism decorator.
Performance and Kernels#
Mask creation moved into the data pipeline for better performance.
TE attention for GPT-OSS.
Faster fp8 dequant; auto-detect base-weights dequant.
PEFT#
LoRA-aware ColwiseParallel / RowwiseParallel.
LoRA + TE.
MFU estimation for LoRA (see the sketch after this list).
Additional PEFT LoRA recipes.
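MFU (model FLOPs utilization) compares achieved training throughput against the hardware's peak compute. A common approximation charges roughly 6 FLOPs per parameter per trained token (forward + backward) for a dense transformer; for LoRA the exact count depends on whether frozen base-model FLOPs are included. The sketch below uses that generic approximation with illustrative numbers and is not AutoModel's estimator.

```python
def estimate_mfu(params: float,
                 tokens_per_second: float,
                 peak_flops_per_second: float) -> float:
    """Rough MFU using the ~6 * N FLOPs-per-token training approximation
    for a dense transformer with N parameters."""
    achieved_flops = 6.0 * params * tokens_per_second
    return achieved_flops / peak_flops_per_second

# Illustrative numbers only: an 8B-parameter model at 4,000 tokens/s
# on a GPU with ~989 TFLOP/s of bf16 peak compute.
print(f"MFU ~ {estimate_mfu(8e9, 4_000, 989e12):.1%}")
```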
Datasets and Recipes#
Multiturn chat dataset; VLM multiturn chat support.
Tool-calling dataset and recipe.
Streaming dataset.
Multiple validation datasets with per-dataset logging.
ColumnMapped: surface truncation and padding options.
Configurable max-clip-grad; configurable remote-logging frequency using step_scheduler.
Validation-loss checkpoint, run-val-at-ckpt, best-ckpt symlink.
InternVL recipe; Qwen3-VL 30B recipe; Llama-Embed-Nemotron-8B training.
Logging and Observability#
MLflow integration.
Metric logger with JSONL output.
YAML logging-to-stdout improvements.
Workflow#
Knowledge-distillation custom validation step; ScopedModuleOffloading to reduce memory.
Model Registry component.
SIGTERM handling.
NEMO_ENABLE_USER_MODULES for user-extension modules.
Rank-0 download for custom models.
Dereference env vars in YAML.
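Dereferencing environment variables in YAML typically means expanding $VAR / ${VAR} references in string values after parsing. Below is a minimal generic sketch using PyYAML and os.path.expandvars; AutoModel's own resolver may behave differently (for example around defaults or unset-variable errors).

```python
import os
import yaml

def expand_env_vars(node):
    """Recursively expand $VAR / ${VAR} references in parsed YAML values."""
    if isinstance(node, dict):
        return {k: expand_env_vars(v) for k, v in node.items()}
    if isinstance(node, list):
        return [expand_env_vars(v) for v in node]
    if isinstance(node, str):
        return os.path.expandvars(node)
    return node

# Toy usage.
os.environ["DATA_ROOT"] = "/datasets"
cfg = yaml.safe_load("train:\n  path: ${DATA_ROOT}/squad\n  epochs: 2\n")
print(expand_env_vars(cfg))   # {'train': {'path': '/datasets/squad', 'epochs': 2}}
```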
0.1.2 (2025-10-23) · PyPI · GH#
Patch release.
Fix: max_steps now set inside the constructor (#650).
Fix: step scheduler switched to zero-based indexing (#627).
Fix: sample-limit handling for ColumnMapped datasets (#521).
0.1.0 (2025-10-08) · PyPI · GH#
Initial public release of NeMo AutoModel.
Highlights#
PyTorch-native training framework for LLMs and VLMs with Hugging Face Transformers integration via NeMoAuto* wrapper classes.
YAML-driven recipes for SFT and PEFT.
FSDP2 / HSDP / DDP distributed training with DTensor sharding.
Megatron-FSDP available as the default heavy-duty sharding option (replaces the earlier nvFSDP path).
Knowledge distillation recipe.
MoE component with DeepSeek v3 model implementation.
ColumnMappedTextInstructionDataset for instruction tuning.
Gradient checkpointing.
SLURM launcher.
For the list of newly supported models per release, see the Model Coverage Release Log.