nemo_rl.models.generation.vllm.vllm_backend#
Module Contents#
Classes#
API#
- class nemo_rl.models.generation.vllm.vllm_backend.VllmInternalWorkerExtension#
- init_collective(rank_prefix: int, ip: str, port: int, world_size: int, train_world_size: int)#
Initialize the collective communication group used for weight updates.
- report_device_id() → str#
Retrieve the UUID of the current CUDA device.
- get_zmq_address()#
Get the ZMQ address for the current device.
- maybe_init_zmq()#
Initialize the ZMQ socket if it doesn’t exist.
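Taken together, get_zmq_address and maybe_init_zmq suggest a lazy, per-device socket setup. Below is a minimal sketch of that pattern; the ipc:// naming scheme, the ZmqMixin class, and the injected socket factory are all hypothetical stand-ins (the real extension uses pyzmq and the CUDA device UUID from report_device_id):

```python
class ZmqMixin:
    """Sketch of lazy ZMQ socket setup keyed on a per-device address.

    The address scheme and socket factory are hypothetical; the real
    worker extension binds an actual pyzmq socket.
    """

    def __init__(self, device_uuid, socket_factory):
        self._device_uuid = device_uuid
        self._socket_factory = socket_factory
        self._socket = None  # created lazily on first use

    def get_zmq_address(self):
        # Hypothetical scheme: one IPC endpoint per CUDA device,
        # named by the device UUID so concurrent workers never collide.
        return f"ipc:///tmp/vllm_refit_{self._device_uuid}.sock"

    def maybe_init_zmq(self):
        # Create and bind the socket only if it doesn't exist yet;
        # repeated calls return the same socket.
        if self._socket is None:
            self._socket = self._socket_factory(self.get_zmq_address())
        return self._socket


worker = ZmqMixin("GPU-1234", lambda addr: {"bound_to": addr})
print(worker.get_zmq_address())  # → ipc:///tmp/vllm_refit_GPU-1234.sock
```

Making initialization idempotent lets callers invoke maybe_init_zmq defensively before every transfer without paying for repeated socket creation.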
- prepare_refit_info(state_dict_info: dict[str, Any]) → None#
Prepare state dict metadata for weight refitting and IPC streaming.
- Parameters:
state_dict_info (dict) – A dictionary mapping each tensor name to its metadata, e.g. {tensor_name: (shape, dtype)}
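The sender builds this metadata from its state dict before streaming any weights, so the receiver knows every tensor's shape and dtype up front. A minimal sketch, using SimpleNamespace objects as stand-ins for torch tensors (only .shape and .dtype matter here):

```python
from types import SimpleNamespace

# Stand-ins for torch tensors; a real state dict would hold torch.Tensor
# objects whose .shape and .dtype carry the same information.
state_dict = {
    "embed.weight": SimpleNamespace(shape=(32000, 4096), dtype="torch.bfloat16"),
    "lm_head.weight": SimpleNamespace(shape=(32000, 4096), dtype="torch.bfloat16"),
}

# Metadata sent ahead of the actual weights so the receiving side can
# prepare for IPC streaming before any tensor data arrives.
state_dict_info = {name: (t.shape, t.dtype) for name, t in state_dict.items()}
print(state_dict_info["embed.weight"])  # → ((32000, 4096), 'torch.bfloat16')
```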
- _maybe_process_fp8_kv_cache() → None#
Process weights after loading for FP8 KV cache (static scales).
- static _split_policy_and_draft_weights(weights: list[tuple[str, torch.Tensor]])#
Split trainer-owned draft weights from policy weights.
This path is only used for the Eagle3 online-training flow, where the trainer exports draft parameters under a “draft.” prefix before sending them to vLLM. This implementation is specific to the Eagle model; for MTP, similar logic can be added to this function to split the weights and send them to the drafter. The “draft.” prefix is added in https://github.com/isomap/RL/blob/d3a5e1396d00f82fb888d9ec6800687a23bb4017/nemo_rl/models/policy/workers/megatron_policy_worker.py#L967-L997
- _load_draft_weights(draft_weights: list[tuple[str, torch.Tensor]])#
Load the draft-model weights into the drafter.
- update_weights_via_ipc_zmq() → bool#
Receive and update model weights via ZMQ IPC socket.
- Returns:
True if weights were successfully updated.
- Return type:
bool
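The receive side of such an IPC update is essentially a loop that consumes streamed (name, payload) messages until a sentinel, then reports success. The sketch below uses a FakeSocket stand-in and a pickled-tuple wire format, both hypothetical; the real extension speaks to a pyzmq socket and copies payloads into live model weights:

```python
import pickle

class FakeSocket:
    """Stand-in for a ZMQ socket; yields pre-canned messages on recv()."""

    def __init__(self, messages):
        self._messages = iter(messages)

    def recv(self):
        return next(self._messages)


def update_weights_via_ipc(socket):
    """Receive (name, payload) pairs until a None sentinel arrives.

    Returns True if at least one weight was updated. The wire format
    (pickled tuples, None as end-of-stream marker) is an assumption.
    """
    updated = {}
    while True:
        msg = pickle.loads(socket.recv())
        if msg is None:  # sentinel: sender has finished streaming
            break
        name, payload = msg
        updated[name] = payload  # real code would copy into the model weight
    return len(updated) > 0


msgs = [pickle.dumps(("layer.weight", b"\x00\x01")), pickle.dumps(None)]
print(update_weights_via_ipc(FakeSocket(msgs)))  # → True
```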
- update_weights_from_collective() → bool#
Update the model weights from collective communication.
- cleanup() None#
Shut down and clean up resources.
- start_gpu_profiling() None#
Start GPU profiling.
- stop_gpu_profiling() None#
Stop GPU profiling.