nemo_rl.modelopt.models.generation.vllm_quant_worker#
Module Contents#
Classes#
API#
- class nemo_rl.modelopt.models.generation.vllm_quant_worker.VllmQuantGenerationWorker(*args, **kwargs)#
Bases:
nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorkerImpl
Initialization
Initialize a vLLM worker for distributed inference.
- Parameters:
config – Configuration dictionary for the policy
bundle_indices – List of local bundle indices within a node for parallelism. Only needed for the first worker in each tied worker group.
fraction_of_gpus – Fraction of GPUs to use for this worker
seed – Random seed for initialization
extra_env_vars – Additional environment variable names to forward into the vLLM worker subprocess (e.g. for quantization configs).
- _create_engine(llm_kwargs: dict[str, Any]) → None#
- _collective_rpc_or_empty(method: str) → dict[str, Any]#
Best-effort RPC call; returns {} on any failure.
collective_rpc can propagate arbitrary exceptions from the internal worker (RuntimeError, AttributeError, etc.), so broad except is intentional here – consistent with the base class pattern.
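The best-effort pattern described above can be sketched in isolation. This is a minimal illustration, not the worker's actual implementation; the `rpc` callable is a hypothetical stand-in for the engine's `collective_rpc` entry point:

```python
from typing import Any, Callable


def collective_rpc_or_empty(
    rpc: Callable[[str], dict[str, Any]], method: str
) -> dict[str, Any]:
    """Best-effort RPC: degrade to {} instead of raising.

    `rpc` stands in for the engine's collective_rpc call, which may raise
    arbitrary exception types (RuntimeError, AttributeError, ...) from the
    internal worker, hence the intentionally broad except.
    """
    try:
        return rpc(method)
    except Exception:
        return {}


# A failing RPC yields {} rather than propagating the worker's exception.
def failing_rpc(method: str) -> dict[str, Any]:
    raise RuntimeError(f"worker error in {method}")


print(collective_rpc_or_empty(failing_rpc, "export_amax"))  # -> {}
```

Callers can therefore treat an empty dict as "stats unavailable" without wrapping every call site in its own try/except.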
- export_amax() → dict[str, Any]#
Export amax buffers for testing/debugging.
- get_quantizer_stats() → dict[str, Any]#
Return quantizer statistics. Mirrors MegatronQuantPolicyWorker.get_quantizer_stats().
- get_weight_snapshot(name: str) → Any#
Return a CPU copy of a named parameter for before/after comparison.
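The before/after comparison this method enables can be sketched with a plain dict standing in for the model's named parameters (the real worker returns a CPU copy of a tensor; `params` and the parameter name here are hypothetical):

```python
import copy
from typing import Any


def get_weight_snapshot(params: dict[str, Any], name: str) -> Any:
    """Return a decoupled copy of a named parameter.

    A deep copy stands in for the real worker's CPU tensor copy: the
    snapshot must not alias the live weights, or the "before" value
    would silently track subsequent updates.
    """
    return copy.deepcopy(params[name])


params = {"layer0.weight": [1.0, 2.0, 3.0]}
before = get_weight_snapshot(params, "layer0.weight")
params["layer0.weight"][0] = 9.0  # simulate a weight refit/update
after = get_weight_snapshot(params, "layer0.weight")
assert before != after  # snapshot is decoupled from the live weights
```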
- class nemo_rl.modelopt.models.generation.vllm_quant_worker.VllmQuantAsyncGenerationWorker(*args, **kwargs)#
Bases:
nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorkerImpl
Initialization
Initialize a vLLM worker for distributed inference.
- Parameters:
config – Configuration dictionary for the policy
bundle_indices – List of local bundle indices within a node for parallelism. Only needed for the first worker in each tied worker group.
fraction_of_gpus – Fraction of GPUs to use for this worker
seed – Random seed for initialization
extra_env_vars – Additional environment variable names to forward into the vLLM worker subprocess (e.g. for quantization configs).
- _create_engine(llm_kwargs: dict[str, Any]) → None#
- async _collective_rpc_or_empty(method: str) → dict[str, Any]#
Best-effort async RPC call; returns {} on any failure.
See sync counterpart for rationale on broad except.
- async export_amax() → dict[str, Any]#
Export amax buffers for testing/debugging.
- async get_quantizer_stats() → dict[str, Any]#
Return quantizer statistics. Mirrors MegatronQuantPolicyWorker.get_quantizer_stats().
- async get_weight_snapshot(name: str) → Any#
Return a CPU copy of a named parameter for before/after comparison.
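The async variants follow the same best-effort shape as the sync worker, awaited instead of called directly. A minimal sketch, where `_rpc` is a hypothetical stand-in for the async engine call and the failure behavior is simulated:

```python
import asyncio
from typing import Any


async def _rpc(method: str) -> dict[str, Any]:
    """Hypothetical async engine call; fails for one method to show degradation."""
    if method == "export_amax":
        raise RuntimeError("quantizer not initialized")
    return {"method": method}


async def collective_rpc_or_empty(method: str) -> dict[str, Any]:
    """Async best-effort RPC: returns {} on any failure, as in the sync worker."""
    try:
        return await _rpc(method)
    except Exception:
        return {}


async def main() -> dict[str, Any]:
    stats = await collective_rpc_or_empty("get_quantizer_stats")
    amax = await collective_rpc_or_empty("export_amax")  # raises internally -> {}
    return {"stats": stats, "amax": amax}


result = asyncio.run(main())
```

As in the sync case, a failed call surfaces as an empty dict, so callers probing quantizer state never need to handle worker-side exception types themselves.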