`nemo_rl.models.generation.vllm.utils`#

Module Contents#

Functions#

`format_prompt_for_vllm_generation`	Format a list of prompts for vllm generation (which requires a specific format for its own `generate` method).
`aggregate_spec_decode_counters`	Aggregate speculative decoding counters from multiple workers.
`compute_spec_decode_metrics`	Compute delta and derived metrics for speculative decoding.
`resolve_generation_worker_cls`	Return the quantized vLLM generation worker FQN if `quant_cfg` is set, else `default_cls`.

Data#

GENERATION_WORKER_OVERRIDES

API#

nemo_rl.models.generation.vllm.utils.format_prompt_for_vllm_generation( data: nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.models.generation.interfaces.GenerationDatumSpec], sample_idx: Optional[int] = None, ) → list[dict[str, Any]]#

Format a list of prompts for vllm generation (which requires a specific format for its own generate method).

See https://docs.vllm.ai/en/v0.9.1/features/multimodal_inputs.html for prompt format for multimodal inputs.
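
As a rough illustration of the kind of conversion this function performs, the sketch below turns a right-padded batch of token IDs into a list of per-sample prompt dictionaries. It is not the library's implementation. The helper name `format_prompts_sketch` and the use of plain Python lists for `input_ids` and `input_lengths` are assumptions made for the example, and only the text-only case is covered. The `prompt_token_ids` key is the one vLLM accepts for pre-tokenized prompts.

```python
from typing import Any, Optional


def format_prompts_sketch(
    input_ids: list[list[int]],
    input_lengths: list[int],
    sample_idx: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Hypothetical helper: strip right padding and wrap each row as a vLLM token prompt."""
    # Format every sample, or only the requested one (assumed meaning of sample_idx).
    indices = range(len(input_ids)) if sample_idx is None else [sample_idx]
    prompts: list[dict[str, Any]] = []
    for i in indices:
        valid_len = input_lengths[i]
        # Keep only the real tokens so padding never reaches the engine.
        prompts.append({"prompt_token_ids": input_ids[i][:valid_len]})
    return prompts


# Two samples padded to length 5 with token ID 0.
batch_ids = [[11, 12, 13, 0, 0], [21, 22, 23, 24, 25]]
batch_lengths = [3, 5]

all_prompts = format_prompts_sketch(batch_ids, batch_lengths)
one_prompt = format_prompts_sketch(batch_ids, batch_lengths, sample_idx=0)
```

The result is always a list, even when a single `sample_idx` is requested, which matches the documented return type `list[dict[str, Any]]`.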

nemo_rl.models.generation.vllm.utils.aggregate_spec_decode_counters( worker_metrics: list[dict[str, float | list[float]]], ) → dict[str | tuple[str, int], float]#

Aggregate speculative decoding counters from multiple workers.

Combines spec decode metrics collected from DP leader workers into a single aggregated counter dictionary.

Parameters:

worker_metrics – List of metric dictionaries from each worker. Each dict maps metric names to float values or lists of floats (for per-position metrics).

Returns:

Dictionary mapping metric names to their aggregated float values. Per-position metrics use (name, position) tuples as keys.

Example:

    >>> metrics_from_workers = policy_generation.get_metrics()
    >>> counters = aggregate_spec_decode_counters(metrics_from_workers)
    >>> print(counters.get("vllm:spec_decode_num_drafts", 0))
    1234.0
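
The aggregation described above can be sketched as follows. This is a minimal illustration, not the library's code. It assumes that "aggregate" means summing each counter across workers, and the function name `aggregate_counters_sketch` and the per-position metric name used in the sample data are chosen for the example.

```python
from collections import defaultdict
from typing import Union

CounterKey = Union[str, tuple[str, int]]


def aggregate_counters_sketch(
    worker_metrics: list[dict[str, Union[float, list[float]]]],
) -> dict[CounterKey, float]:
    """Hypothetical helper: sum counters across workers (summation is an assumption)."""
    totals: dict[CounterKey, float] = defaultdict(float)
    for metrics in worker_metrics:
        for name, value in metrics.items():
            if isinstance(value, list):
                # Per-position metric: one entry per draft position,
                # stored under a (name, position) tuple key.
                for position, item in enumerate(value):
                    totals[(name, position)] += float(item)
            else:
                totals[name] += float(value)
    return dict(totals)


# Sample data from two workers; the per-position metric name is an assumption.
workers = [
    {
        "vllm:spec_decode_num_drafts": 10.0,
        "vllm:spec_decode_num_accepted_tokens_per_pos": [8.0, 5.0],
    },
    {
        "vllm:spec_decode_num_drafts": 6.0,
        "vllm:spec_decode_num_accepted_tokens_per_pos": [4.0, 1.0],
    },
]
counters = aggregate_counters_sketch(workers)
```

Flattening list-valued metrics into `(name, position)` keys keeps every value in the result a plain float, so two snapshots can later be subtracted key by key.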

nemo_rl.models.generation.vllm.utils.compute_spec_decode_metrics( start_counters: dict[str | tuple[str, int], float], end_counters: dict[str | tuple[str, int], float], ) → dict[str, float]#

Compute delta and derived metrics for speculative decoding.

Calculates the difference between two counter snapshots and derives acceptance rate and acceptance length metrics for logging.

Parameters:

start_counters – Counter snapshot taken before generation.
end_counters – Counter snapshot taken after generation.

Returns:

Dictionary of metrics suitable for logging to wandb/tensorboard. Keys are prefixed with `vllm/` for namespace consistency. Includes:

- vllm/spec_num_drafts: Total number of draft batches
- vllm/spec_num_draft_tokens: Total draft tokens generated
- vllm/spec_num_accepted_tokens: Total tokens accepted
- vllm/spec_acceptance_length: Average number of accepted tokens per draft, plus 1
- vllm/spec_acceptance_rate: Ratio of accepted to draft tokens
- vllm/{metric}-{position}: Per-position acceptance counts
- vllm/spec_acceptance_rate-pos-{position}: Per-position acceptance rates
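
The arithmetic behind these metrics can be sketched as below. This is an illustration under stated assumptions rather than the library's implementation. The input counter names, the choice to divide per-position counts by the number of drafts, and the function name `compute_metrics_sketch` are all assumptions. Only the output key names and the two definitions (accepted tokens per draft plus 1, and accepted over draft tokens) come from the description above.

```python
from typing import Union

CounterKey = Union[str, tuple[str, int]]

# Assumed input counter names; the real keys may differ.
DRAFTS = "vllm:spec_decode_num_drafts"
DRAFT_TOKENS = "vllm:spec_decode_num_draft_tokens"
ACCEPTED = "vllm:spec_decode_num_accepted_tokens"
PER_POS = "vllm:spec_decode_num_accepted_tokens_per_pos"


def compute_metrics_sketch(
    start: dict[CounterKey, float], end: dict[CounterKey, float]
) -> dict[str, float]:
    """Hypothetical helper: delta two snapshots and derive acceptance metrics."""
    # Counters are cumulative, so the work done in between is end minus start.
    delta = {key: end[key] - start.get(key, 0.0) for key in end}
    drafts = delta.get(DRAFTS, 0.0)
    draft_tokens = delta.get(DRAFT_TOKENS, 0.0)
    accepted = delta.get(ACCEPTED, 0.0)

    metrics = {
        "vllm/spec_num_drafts": drafts,
        "vllm/spec_num_draft_tokens": draft_tokens,
        "vllm/spec_num_accepted_tokens": accepted,
    }
    if drafts > 0:
        # Accepted tokens per draft, plus 1.
        metrics["vllm/spec_acceptance_length"] = 1.0 + accepted / drafts
    if draft_tokens > 0:
        metrics["vllm/spec_acceptance_rate"] = accepted / draft_tokens
    for key, value in delta.items():
        if isinstance(key, tuple):
            name, position = key
            metrics[f"vllm/{name}-{position}"] = value
            if drafts > 0:
                # Assumption: per-position rate is accepted count over drafts.
                metrics[f"vllm/spec_acceptance_rate-pos-{position}"] = value / drafts
    return metrics


start_counters = {DRAFTS: 40.0, DRAFT_TOKENS: 120.0, ACCEPTED: 60.0,
                  (PER_POS, 0): 30.0, (PER_POS, 1): 20.0, (PER_POS, 2): 10.0}
end_counters = {DRAFTS: 140.0, DRAFT_TOKENS: 420.0, ACCEPTED: 210.0,
                (PER_POS, 0): 105.0, (PER_POS, 1): 70.0, (PER_POS, 2): 35.0}
step_metrics = compute_metrics_sketch(start_counters, end_counters)
```

With these sample snapshots the interval covers 100 drafts, 300 draft tokens, and 150 accepted tokens, giving an acceptance rate of 0.5 and an acceptance length of 2.5. Guarding each division keeps the sketch safe for intervals in which no drafts were produced.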

nemo_rl.models.generation.vllm.utils.GENERATION_WORKER_OVERRIDES#: None

nemo_rl.models.generation.vllm.utils.resolve_generation_worker_cls(default_cls: str, config: dict) → str#

Return the quantized vLLM generation worker FQN if quant_cfg is set, else default_cls.

Safe to call even when ModelOpt is not installed: the function returns default_cls unchanged whenever quant_cfg is None, so the core generation path never has to import ModelOpt.
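
The selection logic can be sketched as follows. This is a hedged illustration only. It assumes `quant_cfg` is a top-level key of `config`, and both worker class paths are hypothetical placeholders, since this page does not show the real override target or the contents of `GENERATION_WORKER_OVERRIDES`.

```python
from typing import Any

# Hypothetical placeholder; the real quantized worker FQN is not shown on this page.
HYPOTHETICAL_QUANT_WORKER_FQN = "my_pkg.quant.QuantVllmGenerationWorker"


def resolve_worker_cls_sketch(default_cls: str, config: dict[str, Any]) -> str:
    """Hypothetical helper: pick a worker class path based on quant_cfg."""
    # Assumption: quant_cfg sits at the top level of the config dict.
    if config.get("quant_cfg") is None:
        # No quantization requested: return the default untouched.
        return default_cls
    # Only a string is returned here, so nothing quantization-related
    # is imported by this function itself.
    return HYPOTHETICAL_QUANT_WORKER_FQN


default = "my_pkg.workers.VllmGenerationWorker"  # hypothetical default FQN
unquantized = resolve_worker_cls_sketch(default, {"quant_cfg": None})
missing_key = resolve_worker_cls_sketch(default, {})
quantized = resolve_worker_cls_sketch(default, {"quant_cfg": "NVFP4_DEFAULT_CFG"})
```

Returning a fully qualified name as a string, rather than a class object, defers the import of the quantized worker until something actually instantiates it. That is one way to keep an optional dependency off the default path.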
