nemo_rl.modelopt.models.generation.vllm_quant_worker#
Module Contents#
Classes#
API#
- class nemo_rl.modelopt.models.generation.vllm_quant_worker.VllmQuantGenerationWorker(*args, **kwargs)#
Bases:
nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorkerImpl
Initialization
Initialize a vLLM worker for distributed inference.
- Parameters:
config – Configuration dictionary for the policy
bundle_indices – List of local bundle indices within a node for parallelism. Only needed for the first worker in each tied worker group.
fraction_of_gpus – Fraction of GPUs to use for this worker
seed – Random seed for initialization
extra_env_vars – Additional environment variable names to forward into the vLLM worker subprocess (e.g. for quantization configs).
- _create_engine(llm_kwargs: dict[str, Any]) → None#
- _collective_rpc_or_empty(method: str) → dict[str, Any]#
Best-effort RPC call; returns {} on any failure.
collective_rpc can propagate arbitrary exceptions from the internal worker (RuntimeError, AttributeError, etc.), so broad except is intentional here – consistent with the base class pattern.
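The best-effort pattern described above can be sketched in isolation. This is a minimal illustration, not the worker's actual implementation; the `rpc` callable is a hypothetical stand-in for the engine's `collective_rpc` entry point:

```python
from typing import Any, Callable


def collective_rpc_or_empty(
    rpc: Callable[[str], dict[str, Any]], method: str
) -> dict[str, Any]:
    """Best-effort RPC: degrade to {} instead of raising.

    `rpc` stands in for the engine's collective_rpc call, which may raise
    arbitrary exception types (RuntimeError, AttributeError, ...) from the
    internal worker, hence the intentionally broad except.
    """
    try:
        return rpc(method)
    except Exception:
        return {}


# A failing RPC yields {} rather than propagating the worker's exception.
def failing_rpc(method: str) -> dict[str, Any]:
    raise RuntimeError(f"worker error in {method}")


print(collective_rpc_or_empty(failing_rpc, "export_amax"))  # -> {}
```

Callers can therefore treat an empty dict as "stats unavailable" without wrapping every call site in its own try/except.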
- export_amax() → dict[str, Any]#
Export amax buffers for testing/debugging.
- get_quantizer_stats() → dict[str, Any]#
Return quantizer statistics. Mirrors MegatronQuantPolicyWorker.get_quantizer_stats().
- get_weight_snapshot(name: str) → Any#
Return a CPU copy of a named parameter for before/after comparison.
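The before/after comparison this method enables can be sketched with a plain dict standing in for the model's named parameters (the real worker returns a CPU copy of a tensor; `params` and the parameter name here are hypothetical):

```python
import copy
from typing import Any


def get_weight_snapshot(params: dict[str, Any], name: str) -> Any:
    """Return a decoupled copy of a named parameter.

    A deep copy stands in for the real worker's CPU tensor copy: the
    snapshot must not alias the live weights, or the "before" value
    would silently track subsequent updates.
    """
    return copy.deepcopy(params[name])


params = {"layer0.weight": [1.0, 2.0, 3.0]}
before = get_weight_snapshot(params, "layer0.weight")
params["layer0.weight"][0] = 9.0  # simulate a weight refit/update
after = get_weight_snapshot(params, "layer0.weight")
assert before != after  # snapshot is decoupled from the live weights
```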
- class nemo_rl.modelopt.models.generation.vllm_quant_worker.VllmQuantAsyncGenerationWorker(*args, **kwargs)#
Bases:
nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorkerImpl
Initialization
Initialize a vLLM worker for distributed inference.
- Parameters:
config – Configuration dictionary for the policy
bundle_indices – List of local bundle indices within a node for parallelism. Only needed for the first worker in each tied worker group.
fraction_of_gpus – Fraction of GPUs to use for this worker
seed – Random seed for initialization
extra_env_vars – Additional environment variable names to forward into the vLLM worker subprocess (e.g. for quantization configs).
- _create_engine(llm_kwargs: dict[str, Any]) → None#
- async _collective_rpc_or_empty(method: str) → dict[str, Any]#
Best-effort async RPC call; returns {} on any failure.
See sync counterpart for rationale on broad except.
- async export_amax() → dict[str, Any]#
Export amax buffers for testing/debugging.
- async get_quantizer_stats() → dict[str, Any]#
Return quantizer statistics. Mirrors MegatronQuantPolicyWorker.get_quantizer_stats().
- async get_weight_snapshot(name: str) → Any#
Return a CPU copy of a named parameter for before/after comparison.
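The async variants follow the same best-effort shape as the sync worker, awaited instead of called directly. A minimal sketch, where `_rpc` is a hypothetical stand-in for the async engine call and the failure behavior is simulated:

```python
import asyncio
from typing import Any


async def _rpc(method: str) -> dict[str, Any]:
    """Hypothetical async engine call; fails for one method to show degradation."""
    if method == "export_amax":
        raise RuntimeError("quantizer not initialized")
    return {"method": method}


async def collective_rpc_or_empty(method: str) -> dict[str, Any]:
    """Async best-effort RPC: returns {} on any failure, as in the sync worker."""
    try:
        return await _rpc(method)
    except Exception:
        return {}


async def main() -> dict[str, Any]:
    stats = await collective_rpc_or_empty("get_quantizer_stats")
    amax = await collective_rpc_or_empty("export_amax")  # raises internally -> {}
    return {"stats": stats, "amax": amax}


result = asyncio.run(main())
```

As in the sync case, a failed call surfaces as an empty dict, so callers probing quantizer state never need to handle worker-side exception types themselves.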