nemo_rl.modelopt.models.generation.vllm_quant_backend#
Module Contents#
Classes#
VllmQuantInternalWorkerExtension
API#
- class nemo_rl.modelopt.models.generation.vllm_quant_backend.VllmQuantInternalWorkerExtension#
Bases: nemo_rl.models.generation.vllm.vllm_backend.VllmInternalWorkerExtension
- _patch_named_parameters_to_include_buffers(model)#
Temporarily patches model.named_parameters() to also yield input_quantizer buffers.
Weights arrive pre-folded from the Megatron side, so only input_quantizer amax buffers need to be loaded. Weight quantizer buffers are skipped.
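The patching pattern described above can be sketched in pure Python. This is a hypothetical stand-in, not the actual nemo_rl implementation: `FakeModule` mimics a `torch.nn.Module`'s `named_parameters()`/`named_buffers()` interface, and the context manager shows how `named_parameters()` can be temporarily rebound so it also yields input_quantizer buffers while skipping weight_quantizer ones.

```python
from contextlib import contextmanager

class FakeModule:
    """Minimal stand-in for a torch.nn.Module with parameters and buffers."""
    def __init__(self):
        self._params = {"linear.weight": [1.0, 2.0]}
        self._buffers = {
            "linear.input_quantizer._amax": [0.5],   # should be yielded
            "linear.weight_quantizer._amax": [0.9],  # should be skipped
        }
    def named_parameters(self):
        yield from self._params.items()
    def named_buffers(self):
        yield from self._buffers.items()

@contextmanager
def patch_named_parameters_to_include_buffers(model):
    """Temporarily make named_parameters() also yield input_quantizer buffers."""
    original = model.named_parameters
    def patched():
        yield from original()
        for name, buf in model.named_buffers():
            # Weight quantizer buffers are skipped: weights arrive pre-folded.
            if "input_quantizer" in name:
                yield name, buf
    model.named_parameters = patched  # shadow the method on the instance
    try:
        yield model
    finally:
        model.named_parameters = original  # always restore on exit
```

Inside the `with` block a generic weight loader that iterates `named_parameters()` will see the extra amax entries; outside it, the model behaves normally again.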
- _load_weights(weights)#
Load pre-folded weights and input_quantizer amax buffers.
Weights arrive already folded from the Megatron side (weight_quantizer applied during export), so no fold_weight step is needed here.
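The load step can be illustrated with a small sketch. Names and data structures here are hypothetical (plain dicts and lists stand in for the model state and tensors); the point is that pre-folded weights are copied verbatim, input_quantizer amax buffers are loaded, and weight_quantizer entries are dropped.

```python
def load_weights(state, weights):
    """Copy pre-folded weights and input_quantizer amax buffers into `state`.

    state:   dict mapping name -> list (stand-in for named tensors)
    weights: iterable of (name, value) pairs

    Weights are assumed already folded (weight_quantizer applied during
    export), so no fold step runs here; weight_quantizer buffers are skipped.
    Returns the list of names actually loaded.
    """
    loaded = []
    for name, value in weights:
        if "weight_quantizer" in name:
            continue  # folding already happened on the export side
        state[name] = list(value)  # verbatim copy, no transformation
        loaded.append(name)
    return loaded
```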
- get_weight_snapshot(name: str) -> torch.Tensor#
Return a CPU copy of a named parameter for before/after comparison.
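A sketch of the snapshot-and-compare usage, with a plain dict and list standing in for the model state and a tensor (hypothetical names, not the real API): the copy must be independent of the live weight so a later in-place update does not mutate the snapshot, which is what taking a detached CPU clone achieves in the real setting.

```python
def get_weight_snapshot(model_state, name):
    """Return an independent copy of the named weight for before/after
    comparison (stand-in for .detach().cpu().clone() on a tensor)."""
    return list(model_state[name])

def weight_changed(before, after, atol=1e-6):
    """True if any element moved by more than `atol` between snapshots."""
    return any(abs(a - b) > atol for a, b in zip(before, after))
```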
- export_amax() -> dict[str, torch.Tensor]#
Export amax buffers from the model for testing/debugging.
- get_quantizer_stats() -> dict#
Return summary statistics for all TensorQuantizer modules.
Matches the interface of MegatronQuantPolicyWorker.get_quantizer_stats().
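The kind of summary such a method might return can be sketched as follows. This is a hedged illustration with hypothetical keys, where a dict of name-to-amax floats stands in for the model's TensorQuantizer modules; the actual keys are defined by MegatronQuantPolicyWorker.get_quantizer_stats(), which the source says this method matches.

```python
def get_quantizer_stats(quantizers):
    """Summarize calibration state across quantizers.

    quantizers: dict mapping quantizer name -> amax value
    (stand-in for iterating TensorQuantizer modules).
    """
    amaxes = list(quantizers.values())
    return {
        "num_quantizers": len(amaxes),
        "amax_min": min(amaxes) if amaxes else None,
        "amax_max": max(amaxes) if amaxes else None,
    }
```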