Known Issues

This page lists known issues and limitations in the current release.

26.06

The Moonlight and Nemotron v3 Nano training recipes show degraded performance with TP > 1. As a workaround, set TP=1 and use HybridEP. We have traced this regression to the base PyTorch image upgrade from 26.02 to 26.04 and are working on a fix.
Step-3.7-Flash forward-pass outputs have not been fully verified.
Some examples/scripts have known minor issues: MiniMax M2 (conversion/export saving), GLM-4.5V (exported tokenizer artifacts), FLUX (tokenizer setup), and WAN (inference setup/dependencies).

The following video / image decoding packages are no longer installed by default in the NeMo Framework 26.04 container (nvcr.io/nvidia/nemo:26.04) to mitigate CVEs in their vendored native binaries:
- av (PyAV)
- decord
- opencv-python-headless
Workflows that depend on any of these (for example, multimodal video pipelines, qwen-vl-utils video paths, or decord[av-decode]) must reinstall them at runtime. See docker/common/README.md for instructions.
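Before starting a job that relies on these packages, it can help to check which of them are actually importable in the running container. The following is a minimal sketch, not part of the container tooling: the helper name missing_decoders and the DECODER_MODULES mapping are hypothetical, and the mapping assumes the usual import names (av for PyAV, cv2 for opencv-python-headless). It only reports what is missing; the reinstall steps themselves remain those in docker/common/README.md.

```python
import importlib.util

# Import-module name -> pip package name for the packages listed above.
# Assumption: "cv2" is the module provided by opencv-python-headless,
# and "av" is the module provided by PyAV.
DECODER_MODULES = {
    "av": "av",
    "decord": "decord",
    "cv2": "opencv-python-headless",
}


def missing_decoders(modules=DECODER_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for module, package in modules.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_decoders()
if missing:
    print("Missing decoding packages:", ", ".join(missing))
else:
    print("All decoding packages are importable.")
```

Running the check at the start of a data-loading script surfaces a missing decoder immediately, rather than partway through a video pipeline.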

AWS EKS only: With AWS-OFI-NCCL v1.17.0, long-running jobs suffer a memory leak that degrades performance over time. Upgrading to v1.17.3 mitigates the issue.
Context parallelism with sequence packing is not supported for Qwen 3 VL in the r0.3.0 release. Fixed in 26.02.01 (r0.3.1).
DeepEP is not supported in the current NeMo Framework 26.02 container (nvcr.io/nvidia/nemo:26.02), which results in reduced DSv3 performance compared to the NeMo Framework 25.09 container (nvcr.io/nvidia/nemo:25.09) on H100 machines. For optimal H100 performance, we recommend using the NeMo Framework 25.09 container.
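For the AWS-OFI-NCCL item above, a small version check can flag clusters that still need the upgrade. This is a sketch under stated assumptions: the helper names parse_version and needs_upgrade are hypothetical, the version string is supplied by the caller (how it is obtained on a given cluster is not covered here), and the check assumes that every release from v1.17.0 up to but not including v1.17.3 is affected, whereas the note only names v1.17.0 explicitly.

```python
import re

AFFECTED_FROM = (1, 17, 0)  # version named in the known issue
FIXED_IN = (1, 17, 3)       # version named as the mitigation


def parse_version(text):
    """Extract (major, minor, patch) from strings like 'v1.17.0' or '1.17.0-aws'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(version_text):
    """Return True if the version lies in [1.17.0, 1.17.3).

    Assumption: 1.17.1 and 1.17.2 are treated as affected as well.
    """
    return AFFECTED_FROM <= parse_version(version_text) < FIXED_IN


for candidate in ("v1.16.3", "v1.17.0", "v1.17.3"):
    print(candidate, "needs upgrade:", needs_upgrade(candidate))
```

Comparing integer tuples rather than raw strings avoids the lexical ordering pitfall where "1.9.0" would sort after "1.17.0".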

DeepSeek V3 on H100 fails when using DeepEP, with RuntimeError: DeepEP error: timeout (dispatch CPU).
MODEL_TFLOP/s/GPU is printed as 0 to stdout for all Hybrid models, such as Nemotron-H 56B.

Pretraining DeepSeek in subchannel FP8 precision does not work. Pretraining DeepSeek with current scaling FP8 runs as a workaround, but the MTP loss does not converge.