nemo_rl.models.generation.vllm.utils

Module Contents

Functions

format_prompt_for_vllm_generation

Format a list of prompts for vLLM generation, which requires a specific input format for its generate method.

aggregate_spec_decode_counters

Aggregate speculative decoding counters from multiple workers.

compute_spec_decode_metrics

Compute delta and derived metrics for speculative decoding.

API

nemo_rl.models.generation.vllm.utils.format_prompt_for_vllm_generation(
    data: nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.models.generation.interfaces.GenerationDatumSpec],
    sample_idx: Optional[int] = None,
) -> list[dict[str, Any]]

Format a list of prompts for vLLM generation, which requires a specific input format for its generate method.

See https://docs.vllm.ai/en/v0.9.1/features/multimodal_inputs.html for prompt format for multimodal inputs.
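As a rough illustration of the kind of output described above, the following sketch turns right-padded token IDs into one prompt dictionary per sample. It is not the library's implementation: the helper name `format_prompts_sketch`, the plain-list inputs `input_ids` and `input_lengths`, and the reading of `sample_idx` as "format only this sample" are all assumptions made for the example. The `prompt_token_ids` key follows vLLM's token-prompt dictionary format; multimodal fields are omitted.

```python
from typing import Any, Optional


def format_prompts_sketch(
    input_ids: list[list[int]],
    input_lengths: list[int],
    sample_idx: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Hypothetical helper: build one vLLM-style prompt dict per sample."""
    # Assumption: sample_idx=None means "format every sample in the batch".
    indices = range(len(input_ids)) if sample_idx is None else [sample_idx]

    prompts: list[dict[str, Any]] = []
    for i in indices:
        # Assumption: sequences are right-padded, so the first
        # input_lengths[i] tokens are the real prompt.
        tokens = input_ids[i][: input_lengths[i]]
        prompts.append({"prompt_token_ids": tokens})
    return prompts


batch_ids = [[1, 2, 3, 0], [4, 5, 0, 0]]
batch_lengths = [3, 2]
all_prompts = format_prompts_sketch(batch_ids, batch_lengths)
one_prompt = format_prompts_sketch(batch_ids, batch_lengths, sample_idx=1)
```

Stripping the padding before handing tokens to the engine keeps pad tokens out of the prompt; the real function operates on a BatchedDataDict rather than plain lists.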

nemo_rl.models.generation.vllm.utils.aggregate_spec_decode_counters(
    worker_metrics: list[dict[str, float | list[float]]],
) -> dict[str | tuple[str, int], float]

Aggregate speculative decoding counters from multiple workers.

Combines speculative decoding metrics collected from the data-parallel (DP) leader workers into a single aggregated counter dictionary.

Parameters:

worker_metrics – List of metric dictionaries from each worker. Each dict maps metric names to float values or lists of floats (for per-position metrics).

Returns:

Dictionary mapping metric names to their aggregated float values. Per-position metrics use (name, position) tuples as keys.

Example:

    >>> metrics_from_workers = policy_generation.get_metrics()
    >>> counters = aggregate_spec_decode_counters(metrics_from_workers)
    >>> print(counters.get("vllm:spec_decode_num_drafts", 0))
    1234.0
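The key layout described above (plain names for scalar metrics, (name, position) tuples for list-valued metrics) can be sketched in a few lines. This is a minimal illustration that assumes aggregation means summing across workers; the function name `aggregate_counters_sketch` and the metric name `hypothetical_per_pos` are invented for the example and are not part of the library.

```python
from collections import defaultdict
from typing import Union

CounterKey = Union[str, tuple[str, int]]


def aggregate_counters_sketch(
    worker_metrics: list[dict[str, Union[float, list[float]]]],
) -> dict[CounterKey, float]:
    """Hypothetical helper: sum counters from several workers."""
    totals: dict[CounterKey, float] = defaultdict(float)
    for metrics in worker_metrics:
        for name, value in metrics.items():
            if isinstance(value, list):
                # Per-position metric: one (name, position) key per entry.
                for position, item in enumerate(value):
                    totals[(name, position)] += float(item)
            else:
                # Scalar metric: keyed by its name alone.
                totals[name] += float(value)
    return dict(totals)


workers = [
    {"vllm:spec_decode_num_drafts": 10.0, "hypothetical_per_pos": [8.0, 5.0]},
    {"vllm:spec_decode_num_drafts": 12.0, "hypothetical_per_pos": [9.0, 4.0]},
]
counters = aggregate_counters_sketch(workers)
```

Flattening list-valued metrics into tuple keys keeps every value a plain float, which makes it straightforward to subtract one snapshot from another later.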

nemo_rl.models.generation.vllm.utils.compute_spec_decode_metrics(
    start_counters: dict[str | tuple[str, int], float],
    end_counters: dict[str | tuple[str, int], float],
) -> dict[str, float]

Compute delta and derived metrics for speculative decoding.

Calculates the difference between two counter snapshots and derives acceptance rate and acceptance length metrics for logging.

Parameters:
  • start_counters – Counter snapshot taken before generation.

  • end_counters – Counter snapshot taken after generation.

Returns:

Dictionary of metrics suitable for logging to wandb/tensorboard. Keys are prefixed with “vllm/” for namespace consistency. Includes:

  • vllm/spec_num_drafts: Total number of draft batches

  • vllm/spec_num_draft_tokens: Total draft tokens generated

  • vllm/spec_num_accepted_tokens: Total tokens accepted

  • vllm/spec_acceptance_length: Average accepted tokens per draft, plus 1

  • vllm/spec_acceptance_rate: Ratio of accepted to draft tokens

  • vllm/{metric}-{position}: Per-position acceptance counts

  • vllm/spec_acceptance_rate-pos-{position}: Per-position acceptance rates
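The arithmetic behind these metrics can be sketched as follows. This is an illustration under stated assumptions, not the library's code: only the counter name `vllm:spec_decode_num_drafts` appears in this page, so the other counter names, the choice of the draft count as the denominator for per-position rates, the stripping of the `vllm:` prefix when building per-position keys, and the function name `spec_decode_metrics_sketch` are all assumed for the example.

```python
from typing import Union

CounterKey = Union[str, tuple[str, int]]

DRAFTS = "vllm:spec_decode_num_drafts"
# Assumed counter names (not confirmed by this page):
DRAFT_TOKENS = "vllm:spec_decode_num_draft_tokens"
ACCEPTED = "vllm:spec_decode_num_accepted_tokens"
PER_POS = "vllm:spec_decode_num_accepted_tokens_per_pos"


def spec_decode_metrics_sketch(
    start: dict[CounterKey, float],
    end: dict[CounterKey, float],
) -> dict[str, float]:
    """Hypothetical helper: counter deltas plus derived ratios."""

    def delta(key: CounterKey) -> float:
        return end.get(key, 0.0) - start.get(key, 0.0)

    drafts, draft_tokens, accepted = delta(DRAFTS), delta(DRAFT_TOKENS), delta(ACCEPTED)
    metrics = {
        "vllm/spec_num_drafts": drafts,
        "vllm/spec_num_draft_tokens": draft_tokens,
        "vllm/spec_num_accepted_tokens": accepted,
    }
    if drafts > 0:
        # Average accepted tokens per draft, plus 1.
        metrics["vllm/spec_acceptance_length"] = accepted / drafts + 1.0
    if draft_tokens > 0:
        metrics["vllm/spec_acceptance_rate"] = accepted / draft_tokens

    for key in end:
        if isinstance(key, tuple):
            name, position = key
            short_name = name.split(":", 1)[-1]  # assumed key formatting
            count = delta(key)
            metrics[f"vllm/{short_name}-{position}"] = count
            if drafts > 0:
                # Assumption: per-position rate is relative to the draft count.
                metrics[f"vllm/spec_acceptance_rate-pos-{position}"] = count / drafts
    return metrics


start = {DRAFTS: 100.0, DRAFT_TOKENS: 300.0, ACCEPTED: 150.0, (PER_POS, 0): 60.0}
end = {DRAFTS: 200.0, DRAFT_TOKENS: 600.0, ACCEPTED: 390.0, (PER_POS, 0): 155.0}
result = spec_decode_metrics_sketch(start, end)
```

Working from deltas rather than raw totals means the reported ratios describe only the generation that happened between the two snapshots, even though the underlying counters keep growing for the lifetime of the engine.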