Known Issues
We will release fixes for the following issues shortly:
In 24.12, NeMo switched from pytorch_lightning to lightning.pytorch. If you have custom code that imports pytorch_lightning, you should replace the import with lightning.pytorch. Failing to do so will result in an error that looks like this:
File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/model_helpers.py", line 42, in is_overridden raise ValueError("Expected a parent") ValueError: Expected a parent
Similarly, when using a 24.12 container or later, if you run evaluations with the LM Evaluation Harness, be sure to upgrade the LM Evaluation Harness to a version that includes this commit. This can be done by following these install instructions. Failing to do so will result in an error that looks like this:
ValueError: You selected an invalid strategy name: `strategy=<nemo.collections.nlp.parts.nlp_overrides.NLPDDPStrategy object at 0x1554480d2410>`. It must be either a string or an instance of `pytorch_lightning.strategies.Strategy`. Example choices: auto, ddp, ddp_spawn, deepspeed, ... Find a complete list of options in our documentation at https://lightning.ai
Restoring the model context for NeMo 2.0 checkpoints produced with the NeMo 24.09 container fails when building the OptimizerConfig class from the megatron.core.optimizer.optimizer_config module, because the overlap_grad_reduce and overlap_param_gather parameters were moved out of that config API in Megatron Core. Use the update_io_context.py script to drop unknown parameters from the checkpoint context and make it compatible with the latest container.
Full fine-tuning of Griffin (NeMo 1.0) has checkpoint loading issues: the state dicts of the provided checkpoint and the initialized model do not match. Please use the 24.07 container if this model is needed.
NeMo_Forced_Aligner_Tutorial.ipynb raises an AttributeError. Please use the 24.09 container if this notebook is needed.
The Gemma 2 27B pretraining recipe needs at least 2 nodes, but the recipe currently has the default number of nodes set to 1; override the node count as sketched below.
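As a workaround, override the node count when building the recipe. A minimal sketch, assuming the recipe module is exposed as llm.gemma2_27b in your container (the module name and arguments may differ between releases):

from nemo.collections import llm

# Assumed recipe module name; check the recipes available in your container.
recipe = llm.gemma2_27b.pretrain_recipe(
    name="gemma2_27b_pretrain",
    num_nodes=2,           # the 27B model needs at least 2 nodes; the default is 1
    num_gpus_per_node=8,
)
# Equivalently, adjust an existing recipe object:
# recipe.trainer.num_nodes = 2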
The Megatron Core Distributed Optimizer currently lacks memory capacity optimization, resulting in higher model state memory usage at small data parallel sizes. We will include this optimization in the next patch.
The overlap of the data-parallel parameter AllGather with optimizer.step (overlap_param_gather_with_optimizer=true) does not work with distributed checkpointing. Support for distributed checkpointing will be available in the next public release.
Support for converting models from NeMo 2.0 to 1.0 is not yet available. This support will be needed to align models until NeMo Aligner natively supports 2.0.
Transformer Engine changed the way metadata is stored in checkpoints after v1.10, which can cause checkpoint incompatibilities when using a Transformer Engine version later than v1.10 to load a checkpoint trained with an earlier version. Errors of this form look similar to the following:
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/default_planner.py", line 315, in create_default_local_load_plan raise RuntimeError(f"Missing key in checkpoint state_dict: {fqn}.") RuntimeError: Missing key in checkpoint state_dict: model.decoder.layers.self_attention.core_attention._extra_state/shard_0_24.
To work around this issue, use model.dist_ckpt_load_strictness=log_all when working with Transformer Engine v1.10 or higher (a programmatic sketch follows below). You can find the Transformer Engine versions present in each NeMo container on the Software Component Versions page.
For data preparation of GPT models, use your own dataset or an online dataset legally approved by your organization.
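To apply the checkpoint-strictness workaround above in a NeMo 1.0-style Hydra/OmegaConf setup, append the override to the training command line or set it programmatically. A minimal sketch; cfg stands in for the config object your script actually receives:

from omegaconf import OmegaConf

# Equivalent command-line form for Hydra-based example scripts:
#   python <training_script>.py ... model.dist_ckpt_load_strictness=log_all
# Programmatic form; `cfg` is an illustrative stand-in for the script's config.
cfg = OmegaConf.create({"model": {}})
cfg.model.dist_ckpt_load_strictness = "log_all"  # log mismatched keys instead of failing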
A race condition in the NeMo experiment manager can occur when multiple processes or threads attempt to access and modify shared resources simultaneously, leading to unpredictable behavior or errors.
The Mistral and Mixtral tokenizers require a Hugging Face login.
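One way to authenticate before downloading these tokenizers is through the huggingface_hub library; the token and repository ID below are placeholders:

from huggingface_hub import login
from transformers import AutoTokenizer

# Log in with a token created at https://huggingface.co/settings/tokens.
# You may also need to accept the model's terms on its Hugging Face page.
login(token="hf_...")  # placeholder token

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example gated repo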
Exporting Gemma, Starcoder, and Falcon 7B models to TRT-LLM works only with a single GPU. If you attempt to export with multiple GPUs, no descriptive error message is shown.
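Until multi-GPU export is supported for these models, run the export on a single GPU. A sketch using the NeMo TensorRT-LLM exporter; the paths are placeholders and the exact keyword arguments (for example, the parallelism setting) may differ between releases, so check the export API in your container:

from nemo.export.tensorrt_llm import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/gemma_trtllm_engine")  # where the engine is written
exporter.export(
    nemo_checkpoint_path="/models/gemma_7b.nemo",  # placeholder checkpoint path
    model_type="gemma",
    tensor_parallelism_size=1,  # keep the export on a single GPU (argument name may vary by release)
)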
The following notebooks have functional issues and will be fixed in the next release:
ASR_with_NeMo.ipynb
ASR_with_Subword_Tokenization.ipynb
AudioTranslationSample.ipynb
Megatron_Synthetic_Tabular_Data_Generation.ipynb
SpellMapper_English_ASR_Customization.ipynb
FastPitch_ChineseTTS_Training.ipynb
NeVA Tutorial.ipynb
Export
Exporting Llama 70B to vLLM causes an out-of-memory issue; more time is needed for root-cause analysis.
vLLM export does not support LoRA and P-tuning; LoRA support will be added in the next release.
In-framework (PyTorch-level) deployment with 8 GPUs encounters an error; more time is needed to understand the cause.
The query script at scripts/deploy/nlp/query.py fails with the error An error occurred: 'output_generation_logits' in the 24.12 container. This will be fixed in the next container release.
Multimodal
LITA tutorial issue: the data preparation part of tutorials/multimodal/LITA_Tutorial.ipynb requires you to manually download the youmakeup dataset instead of using the provided script.
Add the argument exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True to the NeVA notebook pretraining procedure to ensure an end-to-end workflow.
ASR
Timestamp misalignment occurs in FastConformer ASR models when using the ASR decoder for diarization. Related Issue: #8438.