nemo_rl.modelopt.models.generation.vllm_quant_backend#

Module Contents#

Classes#

API#

class nemo_rl.modelopt.models.generation.vllm_quant_backend.VllmQuantInternalWorkerExtension#

Bases: nemo_rl.models.generation.vllm.vllm_backend.VllmInternalWorkerExtension

_patch_named_parameters_to_include_buffers(model)#

Temporarily patches model.named_parameters() to also yield input_quantizer buffers.

Weights arrive pre-folded from the Megatron side, so only input_quantizer amax buffers need to be loaded. Weight quantizer buffers are skipped.
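Such a patch can be realized as a context manager that temporarily replaces `named_parameters` so it also yields matching buffers. The sketch below is illustrative only — the toy `Quantizer`/`Layer` modules and the name filter are assumptions standing in for the real extension's internals:

```python
import contextlib

import torch
from torch import nn


@contextlib.contextmanager
def patch_named_parameters_to_include_buffers(model: nn.Module):
    """Temporarily make named_parameters() also yield input_quantizer amax buffers.

    Weight-quantizer buffers are deliberately skipped: the weights arrive
    pre-folded, so only input-side amax values still need loading.
    """
    original = model.named_parameters

    def patched(*args, **kwargs):
        yield from original(*args, **kwargs)
        for name, buf in model.named_buffers():
            # Assumed naming convention for input-side quantizer buffers.
            if "input_quantizer" in name and name.endswith("_amax"):
                yield name, buf

    model.named_parameters = patched
    try:
        yield model
    finally:
        model.named_parameters = original


# Toy modules standing in for real quantized layers (hypothetical structure).
class Quantizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("_amax", torch.tensor(1.0))


class Layer(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(4, 4))
        self.input_quantizer = Quantizer()
        self.weight_quantizer = Quantizer()


m = Layer()
with patch_named_parameters_to_include_buffers(m):
    names = [n for n, _ in m.named_parameters()]
```

Inside the `with` block, a weight loader that iterates `named_parameters()` will see the input amax buffers as loadable entries; on exit, the original method is restored.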

_load_weights(weights)#

Load pre-folded weights and input_quantizer amax buffers.

Weights arrive already folded from the Megatron side (weight_quantizer applied during export), so no fold_weight step is needed here.
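A minimal sketch of this loading rule, under the same assumed naming convention (toy modules, not the real worker code): copy tensors into matching parameters, copy `input_quantizer` amax buffers, and skip anything belonging to a weight quantizer, since folding already happened at export time.

```python
import torch
from torch import nn


def load_prefolded_weights(
    model: nn.Module, weights: dict[str, torch.Tensor]
) -> list[str]:
    """Load pre-folded weights plus input_quantizer amax buffers; return loaded names."""
    params = dict(model.named_parameters())
    buffers = dict(model.named_buffers())
    loaded = []
    with torch.no_grad():
        for name, tensor in weights.items():
            if "weight_quantizer" in name:
                continue  # already folded into the weight on the export side
            if name in params:
                params[name].copy_(tensor)
            elif name in buffers and "input_quantizer" in name:
                buffers[name].copy_(tensor)
            else:
                continue
            loaded.append(name)
    return loaded


# Toy layer with hypothetical quantizer submodules.
class Quantizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("_amax", torch.zeros(()))


class Layer(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(2, 2))
        self.input_quantizer = Quantizer()
        self.weight_quantizer = Quantizer()


layer = Layer()
loaded = load_prefolded_weights(
    layer,
    {
        "weight": torch.ones(2, 2),
        "input_quantizer._amax": torch.tensor(3.0),
        "weight_quantizer._amax": torch.tensor(9.0),  # skipped: weight is pre-folded
    },
)
```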

get_weight_snapshot(name: str) → torch.Tensor#

Return a CPU copy of a named parameter for before/after comparison.
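The key detail for a before/after comparison is that the snapshot must be a detached CPU clone; a plain reference would silently track later in-place updates. A sketch (helper name and free-function form are illustrative):

```python
import torch
from torch import nn


def get_weight_snapshot(model: nn.Module, name: str) -> torch.Tensor:
    """Detached CPU clone, so later in-place weight updates leave it unchanged."""
    return dict(model.named_parameters())[name].detach().cpu().clone()


lin = nn.Linear(2, 2, bias=False)
snap = get_weight_snapshot(lin, "weight")
with torch.no_grad():
    lin.weight.add_(1.0)  # simulate a weight refresh after the snapshot
```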

export_amax() → dict[str, torch.Tensor]#

Export amax buffers from the model for testing/debugging.
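Exporting amax buffers amounts to collecting CPU copies of every buffer matching the quantizer naming convention. A sketch with a hypothetical toy model (the `_amax` suffix filter is an assumption mirroring the examples above):

```python
import torch
from torch import nn


class Quantizer(nn.Module):
    def __init__(self, amax: float):
        super().__init__()
        self.register_buffer("_amax", torch.tensor(amax))


class Layer(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(2, 2))
        self.input_quantizer = Quantizer(2.0)


def export_amax(model: nn.Module) -> dict[str, torch.Tensor]:
    """CPU copies of every buffer whose name ends in ``_amax``."""
    return {
        name: buf.detach().cpu().clone()
        for name, buf in model.named_buffers()
        if name.endswith("_amax")
    }


amax = export_amax(Layer())
```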

get_quantizer_stats() → dict#

Return summary statistics for all TensorQuantizer modules.

Matches the interface of MegatronQuantPolicyWorker.get_quantizer_stats().
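The actual statistics gathered from TensorQuantizer modules are not specified here; the sketch below assumes a plausible summary (count plus amax range) over quantizer-like buffers, with the dict keys and toy modules as illustrative assumptions rather than the real interface shared with MegatronQuantPolicyWorker:

```python
import torch
from torch import nn


class Quantizer(nn.Module):
    def __init__(self, amax: float):
        super().__init__()
        self.register_buffer("_amax", torch.tensor(amax))


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.q1 = Quantizer(1.5)
        self.q2 = Quantizer(4.0)


def get_quantizer_stats(model: nn.Module) -> dict:
    """Count quantizer-like buffers and summarize their amax range."""
    values = [
        float(buf.max())
        for name, buf in model.named_buffers()
        if name.endswith("_amax")
    ]
    return {
        "num_quantizers": len(values),
        "amax_min": min(values) if values else None,
        "amax_max": max(values) if values else None,
    }


stats = get_quantizer_stats(Model())
```

Keeping the return shape identical on both the Megatron and vLLM sides lets a test harness compare quantizer state across the two backends with a single code path.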