nemo_rl.modelopt.models.generation.vllm_quant_worker#

Module Contents#

Classes#

API#

class nemo_rl.modelopt.models.generation.vllm_quant_worker.VllmQuantGenerationWorker(*args, **kwargs)#

Bases: nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorkerImpl

Initialization

Initialize a vLLM worker for distributed inference.

Parameters:
  • config – Configuration dictionary for the policy

  • bundle_indices – List of local bundle indices within a node for parallelism. Only needed for the first worker in each tied worker group.

  • fraction_of_gpus – Fraction of GPUs to use for this worker

  • seed – Random seed for initialization

  • extra_env_vars – Additional environment variable names to forward into the vLLM worker subprocess (e.g. for quantization configs).
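The extra_env_vars parameter names environment variables to copy from the parent process into the worker subprocess. A minimal sketch of how such forwarding could work; build_subprocess_env is a hypothetical helper name, not part of the actual worker API:

```python
import os


def build_subprocess_env(extra_env_vars: list[str]) -> dict[str, str]:
    """Hypothetical helper: copy only the named environment variables
    from the parent process into a child-process environment dict.
    Variables that are not set in the parent are silently skipped."""
    return {
        name: os.environ[name]
        for name in extra_env_vars
        if name in os.environ
    }
```

Forwarding an explicit allow-list (rather than the whole environment) keeps quantization-related settings reaching the vLLM subprocess without leaking unrelated parent state.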

_create_engine(llm_kwargs: dict[str, Any]) → None#
_collective_rpc_or_empty(method: str) → dict[str, Any]#

Best-effort RPC call; returns {} on any failure.

collective_rpc can propagate arbitrary exceptions from the internal worker (RuntimeError, AttributeError, etc.), so the broad except is intentional here, consistent with the base-class pattern.

export_amax() → dict[str, Any]#

Export amax buffers for testing/debugging.
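A rough sketch of what exporting amax buffers could look like, assuming a torch module whose quantizers register scalar buffers named with an "amax" suffix (the real worker's buffer naming and collection logic may differ):

```python
import torch


def export_amax(model: torch.nn.Module) -> dict[str, float]:
    """Collect every scalar buffer whose name ends in 'amax' as a CPU float.

    amax buffers record the running absolute-maximum activation/weight
    values that quantizers use to derive scaling factors.
    """
    return {
        name: buf.detach().cpu().item()
        for name, buf in model.named_buffers()
        if name.endswith("amax") and buf.numel() == 1
    }
```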

get_quantizer_stats() → dict[str, Any]#

Return quantizer statistics. Mirrors MegatronQuantPolicyWorker.get_quantizer_stats().

get_weight_snapshot(name: str) → Any#

Return a CPU copy of a named parameter for before/after comparison.

class nemo_rl.modelopt.models.generation.vllm_quant_worker.VllmQuantAsyncGenerationWorker(*args, **kwargs)#

Bases: nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorkerImpl

Initialization

Initialize a vLLM worker for distributed inference.

Parameters:
  • config – Configuration dictionary for the policy

  • bundle_indices – List of local bundle indices within a node for parallelism. Only needed for the first worker in each tied worker group.

  • fraction_of_gpus – Fraction of GPUs to use for this worker

  • seed – Random seed for initialization

  • extra_env_vars – Additional environment variable names to forward into the vLLM worker subprocess (e.g. for quantization configs).

_create_engine(llm_kwargs: dict[str, Any]) → None#
async _collective_rpc_or_empty(method: str) → dict[str, Any]#

Best-effort async RPC call; returns {} on any failure.

See sync counterpart for rationale on broad except.

async export_amax() → dict[str, Any]#

Export amax buffers for testing/debugging.

async get_quantizer_stats() → dict[str, Any]#

Return quantizer statistics. Mirrors MegatronQuantPolicyWorker.get_quantizer_stats().

async get_weight_snapshot(name: str) → Any#

Return a CPU copy of a named parameter for before/after comparison.