Release Notes

0.4.0 · 26.04 (2026-04-28) · PyPI · GH · NGC Docker

Highlights

  • Discrete-diffusion LLMs (dLLM). SFT and generation support for dLLM models, including LLaDA.
  • Embedding and retrieval training. Reranker training, biencoder datasets loaded directly from the Hugging Face Hub, in-batch negative sampling, and ONNX export for biencoder models.
  • SkyPilot launcher. Native multi-node launch on cloud (SkyPilot, including Kubernetes), in addition to local interactive runs. SkyPilot and NeMo Run launchers are selected with YAML sections in the config; SLURM jobs use the sbatch slurm.sub workflow.
  • CLI install profile. The nemo-automodel[cli] extra adds pyyaml on top of the package’s base dependencies, for use with job-submission configs.
  • Refreshed CLI. automodel <config.yaml> (alias am) replaces the older automodel <command> <domain> -c <config> form.
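
For readers unfamiliar with the in-batch negative sampling mentioned under embedding and retrieval training, the idea can be sketched in a few lines. This is a minimal, framework-free illustration and not the AutoModel implementation: the function names, the toy vectors, and the temperature value are all assumptions made for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_negative_loss(queries, passages, temperature=0.05):
    """Mean cross-entropy where passage i is the positive for query i
    and every other passage in the batch acts as a negative."""
    assert len(queries) == len(passages)
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)


# Hypothetical toy embeddings: two queries, two passages.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # passage i matches query i
swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives and negatives exchanged

loss_aligned = in_batch_negative_loss(queries, aligned)
loss_swapped = in_batch_negative_loss(queries, swapped)
```

The loss is small when each query scores its own passage highest and large when another passage in the batch wins, so no separately mined negatives are needed for this term.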

New Models

  • LLM: GLM-5, MiniMax-M2.5, Nemotron Super v3, Nemotron Nano 4B/8B.
  • MoE / VLM: Qwen3.5-MoE (397B-A17B, 35B-A3B).
  • VLM: Gemma 4, Mistral Small 4, Qwen3.5 small dense models.
  • Diffusion: FLUX.1-dev, Wan 2.1 T2V, HunyuanVideo 1.5; Wan multi-resolution and LoRA recipes for diffusion.

Distributed Training

  • Context parallelism for Qwen3.5-MoE and Nemotron v3.
  • Pipeline parallelism for knowledge distillation.
  • HybridEP and UCCL-EP as alternative expert-parallel dispatchers.
  • FSDP2 weight prefetching and async TP optimization.
  • TP > 1 in knowledge distillation.

Performance and Kernels

  • TE Linear layers enabled for PEFT/LoRA.
  • torch._grouped_mm expert backend.
  • fp32 RMSNorm backend and cast_model_to_dtype controls.
  • TP-aware KD loss with distributed softmax and T² scaling.
  • FlashOptim optimizer integration.
  • Sequence-packing updates: Qwen3.5-MoE VLM neat-packing recipe with EP+PP; generic THD collation for chat datasets; CP/BSHD padding fixes.
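
As background for the TP-aware KD loss item, the sketch below shows the general technique: a softmax computed over a vocabulary split across shards, followed by a KL-divergence loss scaled by T². It is a plain-Python illustration under stated assumptions, not the library's kernel. The function names are hypothetical, and the Python-level max and sum over shards stand in for the all-reduce operations a tensor-parallel group would perform.

```python
import math


def sharded_log_softmax(shards, temperature):
    """Log-softmax over a vocabulary that is split across shards."""
    scaled = [[x / temperature for x in shard] for shard in shards]
    # Stand-in for all-reduce(max) across the tensor-parallel group.
    global_max = max(max(shard) for shard in scaled)
    # Stand-in for all-reduce(sum) of the per-shard exponential sums.
    global_sum = sum(
        sum(math.exp(x - global_max) for x in shard) for shard in scaled
    )
    log_z = global_max + math.log(global_sum)
    return [[x - log_z for x in shard] for shard in scaled]


def kd_loss(student_shards, teacher_shards, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    s_logp = sharded_log_softmax(student_shards, temperature)
    t_logp = sharded_log_softmax(teacher_shards, temperature)
    kl = sum(
        math.exp(t) * (t - s)
        for s_shard, t_shard in zip(s_logp, t_logp)
        for s, t in zip(s_shard, t_shard)
    )
    return temperature ** 2 * kl


# Hypothetical logits for a 4-token vocabulary split over 2 shards.
teacher = [[2.0, 0.5], [-1.0, 0.0]]
student = [[1.5, 0.7], [-0.5, 0.1]]

loss_same = kd_loss(teacher, teacher)
loss_diff = kd_loss(student, teacher)
# The same logits on a single shard should give the same loss.
loss_unsharded = kd_loss([[1.5, 0.7, -0.5, 0.1]], [[2.0, 0.5, -1.0, 0.0]])
```

The T² factor is the usual convention for keeping gradient magnitudes comparable as the temperature changes, since softening the logits by T shrinks the gradients by roughly 1/T².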

PEFT

  • MoE LoRA: rank scaling, torch_mm integration, expert-LoRA init using config.expert_dim.
  • merge_lora tool for materializing adapters into the base model.
  • QLoRA PEFT checkpoints saved with the HF adapter prefix.
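
To illustrate what materializing an adapter into the base model means, here is a minimal sketch of the standard LoRA merge, W + (alpha / r) · B · A, in plain Python. It is not the merge_lora tool itself: the function names, shapes, and values are assumptions for illustration, and the tool's actual options and scaling may differ.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora_weight(base, lora_a, lora_b, alpha):
    """Return base + (alpha / r) * (lora_b @ lora_a), where r is the adapter rank."""
    rank = len(lora_a)            # lora_a has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


# Hypothetical 2x3 base weight with a rank-1 adapter.
base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # (out=2, in=3)
lora_a = [[1.0, 2.0, 3.0]]                   # (r=1, in=3)
lora_b = [[0.5], [-1.0]]                     # (out=2, r=1)

merged = merge_lora_weight(base, lora_a, lora_b, alpha=2.0)
```

After merging, the adapter matrices are no longer needed at inference time, because the merged weight produces the same outputs as the base weight plus the adapter path.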

Recipes and Workflow

  • New recipes for Gemma 4 (LoRA), Nemotron Nano 4B SQuAD, Mistral Small 4, Tulu-3 E2E convergence, GPT-OSS 20B / Moonlight 16B convergence, and reranker / biencoder training.
  • MFU logging for LLM and dLLM train recipes.
  • Native Comet ML experiment tracking.
  • NEFTune noisy embeddings for instruction fine-tuning.
  • Scheduler-driven manual garbage collection.
  • Common inference utility and .generate() with KV cache for Nemotron v3.
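
The NEFTune item above refers to adding bounded uniform noise to token embeddings during instruction fine-tuning. A minimal sketch of that noise rule follows, with noise drawn from Uniform(-1, 1) and scaled by alpha / sqrt(L · d) for sequence length L and embedding dimension d. The function name, the alpha value, and the toy shapes are assumptions; how the recipe exposes this setting is not shown here.

```python
import math
import random


def neftune_noise(embeddings, alpha=5.0, rng=None):
    """Add Uniform(-1, 1) noise scaled by alpha / sqrt(seq_len * dim)."""
    rng = rng or random.Random()
    seq_len = len(embeddings)
    dim = len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [
        [x + rng.uniform(-1.0, 1.0) * scale for x in row]
        for row in embeddings
    ]


# Hypothetical sequence of 4 tokens with 4-dimensional embeddings.
embeddings = [[0.0] * 4 for _ in range(4)]
noisy = neftune_noise(embeddings, alpha=5.0, rng=random.Random(0))

# Every perturbation is bounded by alpha / sqrt(L * d).
bound = 5.0 / math.sqrt(4 * 4)
max_abs = max(abs(x) for row in noisy for x in row)
```

The noise is applied only while training; evaluation and generation use the unperturbed embeddings.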

Checkpointing

  • v4_compatible checkpoint format.
  • Diffusion full fine-tuning and pretraining examples use safetensors checkpoint format; diffusion LoRA examples use torch_save.
  • QLoRA / LoRA loading robustness; tied-weight handling moved out of _init_model.

Fixes

  • FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params.
  • Activation checkpointing silently skipped on registered VLMs (ModuleList flattening).
  • Gradient checkpointing for MoE models on single GPU (ep_size=1).
  • Gradient clipping with torch_mm + EP (GPT-OSS 120B recipe).
  • Rotary embeddings for v4 models; inputs_embeds passthrough for Nano v3.

Breaking Changes

A migration guide for the new CLI, the recipe YAML section, the SLURM sbatch-script workflow, and the nemo-automodel[cli] install profile is in Breaking Changes.


0.3.0 · 26.02 (2026-02-26) · PyPI · GH · NGC Docker

Highlights

  • Transformers v4 / v5 alignment. New transformers v4 API support and a v5 refactor for device-mesh-only model init.
  • Streaming safetensors writer for faster checkpoint export.
  • Faster fp8 dequant kernels with DTensor dequantization fixes for DSv3.

New Models

  • LLM: DeepSeek V3.2, Step-3.5-Flash, MiniMax-M2.1, Nemotron-3-Nano-30B-A3B, Nemotron Flash 1B, GLM-4.7, Devstral-Small-2-24B.
  • MoE / VLM / Omni: Qwen3-VL (4B/8B), Qwen3-VL-MoE (30B/235B), Kimi-VL, Kimi-K2.5 VL, Nemotron-Parse VLM, InternVL3.5-4B, Ministral3 (3B/8B/14B), Phi-4-multimodal.

Distributed Training

  • v5 refactor: device-mesh-only model init.
  • TP plan for Ministral; Ministral3 ported to transformers v4.
  • Pipeline-parallelism validation support.
  • Parallel diffusers generate().

Performance and Kernels

  • TE fp8 for models that support it.
  • GroupedExpertsTE backend (prerequisite for MoE fp8).
  • TE RoPE fusion for custom MoE models; norm fusion and RoPE cache for dense models.
  • Improved import time.

PEFT

  • DoRA implementation.
  • LoRA support for custom MoEs.
  • LoRA support in Biencoder.

Datasets and Workflow

  • Databricks Delta Lake dataset support; consolidation for Databricks.
  • Parquet file support; inline text dataset format.
  • ColumnMapped: configurable special tokens, chat-template flags, and answer-only masking.
  • Hard negative mining and biencoder + inline-dataset tests.
  • nsys benchmark support and model-layer name scoping in the CLI.
  • Updated checkpoint auto-loading with explicit restore_from.
  • Dion optimizer.
  • Functiongemma + xlam tool-calling recipes.

Fixes

  • inputs_embeds passthrough for Nano v3.
  • from_pretrained / from_config simplification with model-id pass-through.
  • Tied-embedding detection improvements.

0.2.0 · 25.11 (2025-12-04) · PyPI · GH · NGC Docker

Highlights

  • Async checkpointing. Checkpoint refactor with async DCP and HF safetensors backport / consolidation.
  • Custom MoE optimizations. FSDP optimizations, packed-sequence + context parallel through TE, configurable router precision, fp32 lm_head and fp32 apply_rope.
  • Performance documentation. New performance-summary doc and benchmarking recipe with configs.
  • Multinode + cluster guidance. Multinode configs and updated launcher docs.

New Models

  • MoE: Qwen3 MoE custom implementation, Qwen3 Next, GPT-OSS (custom implementation, dequantization, DGX Spark recipe), GLM 4 / 4.5 / 4.6 MoE, GLM 4.5 Air, Moonlight 2L test, Phi 4 (TP plan).
  • Omni / VLM: Qwen3-Omni OOTB recipe and custom implementation.
  • DeepSeek v3 with fp8 base checkpoint loading.
  • Sequence classification: Qwen3ForSequenceClassification registered; generic SFT sequence-classification recipe.

Distributed Training

  • VLM expert-parallel recipe support.
  • PP for VLM; PEFT with PP.
  • Sharding optimization for SP / LoRA.
  • clip_grad_norm across all parallelism modes.
  • fully_shard_by_dtype option.
  • Out-of-tree (OOT) parallelism decorator.
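
Gradient-norm clipping across parallelism modes rests on one idea: the global norm has to be assembled from the squared norms of every shard before a single clipping coefficient is applied everywhere. The sketch below illustrates that idea in plain Python. It is not AutoModel's implementation; the function name and the eps value are assumptions, and the Python sum stands in for an all-reduce over the participating ranks.

```python
import math


def clip_sharded_grads(shard_grads, max_norm, eps=1e-6):
    """Clip gradients held on several shards by their combined L2 norm."""
    # Each rank computes the squared norm of its local gradients.
    local_sq = [sum(g * g for g in shard) for shard in shard_grads]
    # Stand-in for all-reduce(sum) across ranks.
    total_norm = math.sqrt(sum(local_sq))
    # One shared coefficient, never larger than 1.
    coef = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * coef for g in shard] for shard in shard_grads]
    return clipped, total_norm


# Hypothetical gradients split over two shards; the global norm is 5.
shards = [[3.0, 0.0], [0.0, 4.0]]
clipped, total_norm = clip_sharded_grads(shards, max_norm=1.0)
clipped_norm = math.sqrt(sum(g * g for shard in clipped for g in shard))
```

Clipping each shard by its own local norm would give a different and incorrect result, which is why the reduction over shards comes first.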

Performance and Kernels

  • Mask creation moved into the data pipeline for better performance.
  • TE attention for GPT-OSS.
  • Faster fp8 dequant; auto-detect base-weights dequant.

PEFT

  • LoRA-aware ColwiseParallel / RowwiseParallel.
  • LoRA + TE.
  • MFU estimation for LoRA.
  • Additional PEFT LoRA recipes.

Datasets and Recipes

  • Multiturn chat dataset; VLM multiturn chat support.
  • Tool-calling dataset and recipe.
  • Streaming dataset.
  • Multiple validation datasets with per-dataset logging.
  • ColumnMapped: expose truncation and padding options.
  • Configurable max-clip-grad; configurable remote-logging frequency using step_scheduler.
  • Validation-loss checkpoint, run-val-at-ckpt, best-ckpt symlink.
  • InternVL recipe; Qwen3-VL 30B recipe; Llama-Embed-Nemotron-8B training.

Logging and Observability

  • MLflow integration.
  • Metric logger with JSONL output.
  • YAML logging-to-stdout improvements.
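
JSONL output means one self-contained JSON object per line, which makes metrics easy to append during training and to parse afterwards. The sketch below shows the format only. The function name and the field names (step, loss, lr) are hypothetical and are not taken from the AutoModel metric logger.

```python
import io
import json


def log_metrics(stream, step, **metrics):
    """Append one JSON object per line (JSONL) to a text stream."""
    stream.write(json.dumps({"step": step, **metrics}) + "\n")


# An in-memory buffer stands in for a metrics file opened in append mode.
buffer = io.StringIO()
log_metrics(buffer, 0, loss=2.5, lr=1e-4)
log_metrics(buffer, 1, loss=2.25, lr=1e-4)

# Reading back: each line parses independently of the others.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Because every line stands alone, a partially written file from an interrupted run can still be read up to the last complete line.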

Workflow

  • Knowledge-distillation custom validation step; ScopedModuleOffloading to reduce memory.
  • Model Registry component.
  • SIGTERM handling.
  • NEMO_ENABLE_USER_MODULES for user-extension modules.
  • Rank-0 download for custom models.
  • Dereference env vars in YAML.

0.1.2 (2025-10-23) · PyPI · GH

Patch release.

  • Fix: max_steps now set inside the constructor (#650).
  • Fix: step scheduler switched to zero-based indexing (#627).
  • Fix: sample-limit handling for ColumnMapped datasets (#521).

0.1.0 (2025-10-08) · PyPI · GH

Initial public release of NeMo AutoModel.

Highlights

  • PyTorch-native training framework for LLMs and VLMs with Hugging Face Transformers integration via NeMoAuto* wrapper classes.
  • YAML-driven recipes for SFT and PEFT.
  • FSDP2 / HSDP / DDP distributed training with DTensor sharding.
  • Megatron-FSDP available as the default heavy-duty sharding option (replaces the earlier nvFSDP path).
  • Knowledge distillation recipe.
  • MoE component with DeepSeek v3 model implementation.
  • ColumnMappedTextInstructionDataset for instruction tuning.
  • Gradient checkpointing.
  • SLURM launcher.

For the list of newly supported models per release, see the Model Coverage Release Log.