# Model Quirks

This document outlines special cases and model-specific behaviors that require custom handling in NeMo RL. These special cases are controlled by the `ModelFlag` enum.
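As an illustrative sketch of that pattern (the member names, the `matches()` helper, and the `is_gemma_model()` check below are assumptions for demonstration, not the exact NeMo RL definitions), a flag enum for gating these quirks might look like this:

```python
# Illustrative sketch only: member and helper names are assumed and may
# differ from the actual NeMo RL ModelFlag enum.
from enum import Enum, auto


def is_gemma_model(model_name: str) -> bool:
    """Assumed helper: heuristically detect Gemma-3 checkpoints by name."""
    return "gemma-3" in model_name.lower()


class ModelFlag(Enum):
    SKIP_TIED_WEIGHTS_CHECK = auto()  # tied-weights quirk (DTensor, TP > 1)
    VLLM_LOAD_FORMAT_AUTO = auto()    # vLLM dummy-initialization quirk

    def matches(self, model_name: str) -> bool:
        # Both quirks described below currently apply only to Gemma-3 models.
        return is_gemma_model(model_name)
```

Call sites can then gate the special handling with a check such as `ModelFlag.VLLM_LOAD_FORMAT_AUTO.matches(model_name)`.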

## Gemma-3

### Tied Weights

Weight tying between the embedding layer (`model.embed_tokens`) and the output layer (`lm_head`) is currently not respected when using the FSDP1 policy, or when using the DTensor policy with TP > 1 (see this issue). To avoid errors, we only allow training models with tied weights using the DTensor policy at TP=1. For Llama-3 and Qwen2.5, weight tying is enabled only for the smaller models (< 2B parameters), which can typically be trained without tensor parallelism. Gemma-3, however, enables weight tying at every model size, including the larger models that require tensor parallelism. To support training these models, we special-case Gemma-3 and allow training with the DTensor policy at TP > 1.

Special Handling:

  - We skip the tied weights check for all Gemma-3 models when using the DTensor policy, allowing training with TP > 1.

  - We exclude `model.embed_tokens` and `lm_head` from the DTensor tensor parallel plan so that weight tying is preserved (see the sketch after this list).
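A minimal sketch of both workarounds, assuming the `ModelFlag` sketch above and a tensor parallel plan represented as a dict keyed by module name; the function names here are illustrative, not the actual NeMo RL API:

```python
# Sketch of the two tied-weights workarounds; names are illustrative.
def check_tied_weights(model_name: str, config, tensor_parallel_size: int) -> None:
    """Refuse TP > 1 for tied-weight models unless the model is flagged."""
    if (
        getattr(config, "tie_word_embeddings", False)
        and tensor_parallel_size > 1
        and not ModelFlag.SKIP_TIED_WEIGHTS_CHECK.matches(model_name)
    ):
        raise ValueError(
            "Models with tied embedding/lm_head weights are only supported "
            "with the DTensor policy at TP=1."
        )


def build_tp_plan(model_name: str, tp_plan: dict) -> dict:
    """Drop the tied modules from the tensor parallel plan so the embedding
    and output layers keep sharing a single replicated weight."""
    if ModelFlag.SKIP_TIED_WEIGHTS_CHECK.matches(model_name):
        tp_plan = {
            name: spec
            for name, spec in tp_plan.items()
            if name not in ("model.embed_tokens", "lm_head")
        }
    return tp_plan
```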

### vLLM Initialization

Gemma-3 models hit a vLLM bug during dummy weight initialization: vLLM creates a `normalizer` buffer that does not exist in the Hugging Face model, so the buffer is filled with dummy values at initialization and never updated with the correct values during model refit. As a workaround, we skip dummy weight initialization for Gemma-3 and instead use the `load_format="auto"` setting to load the full weights at initialization.

Special Handling:

  - We automatically use `load_format="auto"` for Gemma-3 models when initializing vLLM.

  - This avoids the dummy weight initialization issue, where the dummy values in the `normalizer` buffer would never be overwritten during refit (see the sketch below).
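A minimal sketch of the load-format selection, again assuming the `ModelFlag` sketch above; vLLM's `LLM` constructor accepts a `load_format` argument, but the wiring shown here is simplified relative to how NeMo RL actually builds its generation workers:

```python
from vllm import LLM


def make_vllm_engine(model_name: str) -> LLM:
    # Gemma-3 must load real weights: vLLM creates a `normalizer` buffer that
    # is never refreshed during refit, so dummy init would leave it incorrect.
    if ModelFlag.VLLM_LOAD_FORMAT_AUTO.matches(model_name):
        load_format = "auto"   # load full weights at initialization
    else:
        load_format = "dummy"  # fast dummy init; real weights arrive at refit
    return LLM(model=model_name, load_format=load_format)
```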