Known Issues
We will release fixes for the following issues shortly:
In 24.12, NeMo switched from pytorch_lightning to lightning.pytorch. If you have custom code that imports pytorch_lightning, you should replace the import with lightning.pytorch. Failing to do so will result in an error that looks like this:
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/model_helpers.py", line 42, in is_overridden raise ValueError("Expected a parent") ValueError: Expected a parent
Similarly, when using a 24.12 container or later, if you run evaluations with the LM Evaluation Harness, be sure to upgrade the LM Evaluation Harness to a version that includes this commit. This can be done by following these install instructions. Failing to do so will result in an error that looks like this:
ValueError: You selected an invalid strategy name: `strategy=<nemo.collections.nlp.parts.nlp_overrides.NLPDDPStrategy object at 0x1554480d2410>`. It must be either a string or an instance of `pytorch_lightning.strategies.Strategy`. Example choices: auto, ddp, ddp_spawn, deepspeed, ... Find a complete list of options in our documentation at https://lightning.ai
Restoring the model context for NeMo 2.0 checkpoints produced with the NeMo 24.09 container fails when building the OptimizerConfig class from the megatron.core.optimizer.optimizer_config module, because the overlap_grad_reduce and overlap_param_gather parameters were moved out of that config API in Megatron Core. Use the update_io_context.py script to drop unknown parameters from the checkpoint context and make it compatible with the latest container.
Full fine-tuning of Griffin (NeMo 1.0) has checkpoint loading issues: the state dicts of the provided checkpoint and the initialized model do not match. Please use the 24.07 container if this model is needed.
NeMo_Forced_Aligner_Tutorial.ipynb raises an AttributeError. Please use the 24.09 container if this notebook is needed.
The Gemma 2 27B pretraining recipe needs at least 2 nodes, but the recipe currently has the default number of nodes set to 1; override the node count as sketched below.
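As a workaround, override the node count when building the recipe. A minimal sketch, assuming the recipe module is exposed as llm.gemma2_27b in your container (the module name and arguments may differ between releases):

from nemo.collections import llm

# Assumed recipe module name; check the recipes available in your container.
recipe = llm.gemma2_27b.pretrain_recipe(
    name="gemma2_27b_pretrain",
    num_nodes=2,           # the 27B model needs at least 2 nodes; the default is 1
    num_gpus_per_node=8,
)
# Equivalently, adjust an existing recipe object:
# recipe.trainer.num_nodes = 2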
The Megatron Core Distributed Optimizer currently lacks memory capacity optimization, resulting in higher model state memory usage at small data parallel sizes. We will include this optimization in the next patch.
The overlap of the data-parallel parameter AllGather with optimizer.step (overlap_param_gather_with_optimizer=true) does not work with distributed checkpointing. Support for distributed checkpointing will be available in the next public release.
Support for converting models from NeMo 2.0 to 1.0 is not yet available. This support will be needed to align models until NeMo Aligner natively supports 2.0.
Transformer Engine changed the way metadata is stored in checkpoints after v1.10, which can cause checkpoint incompatibilities when using a Transformer Engine version later than v1.10 to load a checkpoint trained with an earlier version. Errors of this form look similar to the following:
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/default_planner.py", line 315, in create_default_local_load_plan raise RuntimeError(f"Missing key in checkpoint state_dict: {fqn}.") RuntimeError: Missing key in checkpoint state_dict: model.decoder.layers.self_attention.core_attention._extra_state/shard_0_24.
To work around this issue, use model.dist_ckpt_load_strictness=log_all when working with Transformer Engine v1.10 or higher (a programmatic sketch follows below). You can find the Transformer Engine versions present in each NeMo container on the Software Component Versions page.
For data preparation of GPT models, use your own dataset or an online dataset legally approved by your organization.
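To apply the checkpoint-strictness workaround above in a NeMo 1.0-style Hydra/OmegaConf setup, append the override to the training command line or set it programmatically. A minimal sketch; cfg stands in for the config object your script actually receives:

from omegaconf import OmegaConf

# Equivalent command-line form for Hydra-based example scripts:
#   python <training_script>.py ... model.dist_ckpt_load_strictness=log_all
# Programmatic form; `cfg` is an illustrative stand-in for the script's config.
cfg = OmegaConf.create({"model": {}})
cfg.model.dist_ckpt_load_strictness = "log_all"  # log mismatched keys instead of failing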
A race condition in the NeMo experiment manager can occur when multiple processes or threads attempt to access and modify shared resources simultaneously, leading to unpredictable behavior or errors.
The Mistral and Mixtral tokenizers require a Hugging Face login.
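One way to authenticate before downloading these tokenizers is through the huggingface_hub library; the token and repository ID below are placeholders:

from huggingface_hub import login
from transformers import AutoTokenizer

# Log in with a token created at https://huggingface.co/settings/tokens.
# You may also need to accept the model's terms on its Hugging Face page.
login(token="hf_...")  # placeholder token

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example gated repo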
Exporting Gemma, Starcoder, and Falcon 7B models to TRT-LLM works only with a single GPU. If you attempt to export with multiple GPUs, no descriptive error message is shown.
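Until multi-GPU export is supported for these models, run the export on a single GPU. A sketch using the NeMo TensorRT-LLM exporter; the paths are placeholders and the exact keyword arguments (for example, the parallelism setting) may differ between releases, so check the export API in your container:

from nemo.export.tensorrt_llm import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/gemma_trtllm_engine")  # where the engine is written
exporter.export(
    nemo_checkpoint_path="/models/gemma_7b.nemo",  # placeholder checkpoint path
    model_type="gemma",
    tensor_parallelism_size=1,  # keep the export on a single GPU (argument name may vary by release)
)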
The following notebooks have functional issues and will be fixed in the next release:
ASR_with_NeMo.ipynb
ASR_with_Subword_Tokenization.ipynb
AudioTranslationSample.ipynb
Megatron_Synthetic_Tabular_Data_Generation.ipynb
SpellMapper_English_ASR_Customization.ipynb
FastPitch_ChineseTTS_Training.ipynb
NeVA Tutorial.ipynb
Export
Exporting Llama 70B to vLLM causes an out-of-memory issue; more time is needed for root-cause analysis.
vLLM export does not support LoRA and P-tuning; LoRA support will be added in the next release.
In-framework (PyTorch-level) deployment with 8 GPUs encounters an error; more time is needed to understand the cause.
The query script at scripts/deploy/nlp/query.py fails with the error An error occurred: 'output_generation_logits' in the 24.12 container. This will be fixed in the next container release.
Multimodal
LITA tutorial issue: the data preparation part of tutorials/multimodal/LITA_Tutorial.ipynb requires you to manually download the youmakeup dataset instead of using the provided script.
Add the argument exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True to the NeVA notebook pretraining procedure to ensure an end-to-end workflow.
ASR
Timestamp misalignment occurs in FastConformer ASR models when using the ASR decoder for diarization. Related Issue: #8438.