nemo_rl.models.generation.vllm.vllm_worker_async#
Module Contents#
Classes#
Functions#
_replace_prefix_tokens – This is a subroutine used inside the vLLM Chat Completion server.
API#
- nemo_rl.models.generation.vllm.vllm_worker_async._replace_prefix_tokens(
- tokenizer,
- model_prefix_token_ids: list[int],
- template_prefix_token_ids: list[int],
- template_token_ids: list[int],
This is a subroutine used inside the vLLM Chat Completion server.
This function fixes up the chat-template-tokenized message history so that it matches the model output tokenization up to the last assistant turn, in order to preserve the monotonic-tokens property needed for optimized multi-turn training.
Some environments (namely Penguin) require an OpenAI-compatible server endpoint rather than an inference engine handle. This is fine for the most part, but it may cause issues when the environment is used as part of training.
RL training frameworks train models on token IDs, but the OpenAI-compatible server communicates in what is essentially de-tokenized text. When multiple model calls are made to the OpenAI-compatible server in a single trajectory, model generations from previous calls may be re-tokenized into something different from what was originally generated. This is not known to be a major issue at inference time, but the log probs the model produces for the differently re-tokenized generation differ enough that training becomes off-policy. Being off-policy isn't necessarily a bad thing in isolation, but this source of off-policy-ness may cause unexpected issues if not properly accounted for. It also misaligns the token ID sequences across model calls, which feels very strange during training.
There are real cases where the model output string does not match the chat template tokenization of the parsed model output. A concrete example is inconsistent whitespace tokens around tool call special tokens.
TODO: When NeMo RL supports training image generation models, we want to revisit and possibly update this function. This issue occurs when the model generates tokens that are de-tokenized into text or images and then re-tokenized into tokens. So if a similar situation arises with images and image tokenization is non-unique, we will need to update this function.
Example (turn-by-turn, concise; eos_token_id = 2):
Turn 1:
- prefill_T1 (template prefill) = [11, 12, 13, 40, 41]
- model output = [220, 17, 2] # decodes to " 4" + EOS
- model_prefix_token_ids = prefill_T1 + model output => [11, 12, 13, 40, 41, 220, 17, 2]
Turn 2 (template retokenizes prior assistant text differently):
- template_prefix_token_ids = [11, 12, 13, 40, 41, 1001, 2] # 1001 decodes to " 4"
- template_token_ids = [11, 12, 13, 40, 41, 1001, 2, 21, 22, 40, 41]
_replace_prefix_tokens keeps the exact prior model tokens up to EOS and resumes from the template after that EOS:
output => [11, 12, 13, 40, 41, 220, 17, 2, 21, 22, 40, 41]
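The sketch below mirrors the worked example above to make the splicing behavior concrete. It is a simplified illustration, not the actual NeMo RL implementation: the real _replace_prefix_tokens also receives the tokenizer and handles messier alignment cases, and the helper name replace_prefix_tokens_sketch and the hard-coded token IDs are assumptions taken from the example.

```python
def replace_prefix_tokens_sketch(
    model_prefix_token_ids: list[int],
    template_prefix_token_ids: list[int],
    template_token_ids: list[int],
    eos_token_id: int,
) -> list[int]:
    # Sanity check: the prior assistant turn should end with EOS.
    assert model_prefix_token_ids[-1] == eos_token_id
    # The template token IDs must start with the template's own (re-tokenized) prefix.
    assert template_token_ids[: len(template_prefix_token_ids)] == template_prefix_token_ids
    # Keep the model's exact prior tokens up to and including EOS, then resume with
    # whatever the template appends after its version of that prefix.
    template_suffix = template_token_ids[len(template_prefix_token_ids):]
    return model_prefix_token_ids + template_suffix


# Numbers from the turn-by-turn example above (eos_token_id = 2):
model_prefix = [11, 12, 13, 40, 41, 220, 17, 2]
template_prefix = [11, 12, 13, 40, 41, 1001, 2]
template_full = [11, 12, 13, 40, 41, 1001, 2, 21, 22, 40, 41]
assert replace_prefix_tokens_sketch(model_prefix, template_prefix, template_full, 2) == [
    11, 12, 13, 40, 41, 220, 17, 2, 21, 22, 40, 41
]
```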
- class nemo_rl.models.generation.vllm.vllm_worker_async.VllmAsyncGenerationWorker(
- config: nemo_rl.models.generation.vllm.config.VllmConfig,
- bundle_indices: Optional[list[int]] = None,
- fraction_of_gpus: float = 1.0,
- seed: Optional[int] = None,
Bases:
nemo_rl.models.generation.vllm.vllm_worker.BaseVllmGenerationWorker
Initialization
Initialize a vLLM worker for distributed inference.
- Parameters:
config – Configuration dictionary for the policy
bundle_indices – List of local bundle indices within a node for parallelism. Only needed for the first worker in each tied worker group.
fraction_of_gpus – Fraction of GPUs to use for this worker
seed – Random seed for initialization
- _create_engine(llm_kwargs: dict[str, Any]) None#
- async post_init_async()#
- async report_dp_openai_server_base_url() Optional[str]#
- _setup_vllm_openai_api_server(app: fastapi.FastAPI) fastapi.FastAPI#
- _setup_vllm_server() tuple[threading.Thread, str, uvicorn.Server]#
- async init_collective_async(
- rank_prefix: int,
- ip: str,
- port: int,
- world_size: int,
- train_world_size: int,
- async generate_async(
- data: nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.models.generation.interfaces.GenerationDatumSpec],
- greedy: bool = False,
Generate a batch of data using vLLM's AsyncLLMEngine, yielding results as they are ready.
- Parameters:
data – BatchedDataDict with input_ids and input_lengths
greedy – Whether to use greedy decoding instead of sampling
- Yields:
Tuple of (original_index, BatchedDataDict conforming to GenerationOutputSpec for the single sequence)
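A hedged consumption sketch for generate_async is shown below. It assumes direct (non-Ray) access to an already-initialized worker and a prepared BatchedDataDict; the names worker, data, and consume are illustrative, not part of the NeMo RL API.

```python
import asyncio

async def consume(worker, data):
    # Results are yielded per sequence as each one finishes, not in batch order,
    # so they are re-keyed here by the original batch index.
    results = {}
    async for original_index, single_output in worker.generate_async(data, greedy=False):
        results[original_index] = single_output  # conforms to GenerationOutputSpec
    return results

# Example driver (assuming `worker` and `data` exist):
# all_outputs = asyncio.run(consume(worker, data))
```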
- async generate_text_async(
- data: nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.models.generation.interfaces.GenerationDatumSpec],
- greedy: bool = False,
Generate text responses asynchronously, yielding results as they are ready.
- Parameters:
data – BatchedDataDict containing prompts with text strings
greedy – Whether to use greedy decoding instead of sampling
- Yields:
Tuple of (original_index, BatchedDataDict containing single text response)
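generate_text_async follows the same per-sequence streaming pattern but consumes text prompts and yields text responses. The short sketch below is illustrative only, with assumed names.

```python
async def collect_text(worker, text_data):
    texts = {}
    async for original_index, single_output in worker.generate_text_async(text_data, greedy=True):
        texts[original_index] = single_output  # BatchedDataDict with a single text response
    return texts
```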
- async report_device_id_async() list[str]#
Async version of report_device_id.
- async prepare_refit_info_async(
- state_dict_info: dict[str, Any],
Async version of prepare_refit_info.
- async update_weights_via_ipc_zmq_async() bool#
Async version of update_weights_via_ipc_zmq.
- async update_weights_from_collective_async() bool#
Async version of update_weights_from_collective.
- async reset_prefix_cache_async()#
Async version of reset_prefix_cache.
- async sleep_async()#
Async version of sleep.
- async wake_up_async(**kwargs)#
Async version of wake_up.
- async shutdown() bool#
Clean up vLLM resources.
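The async helpers above (refit info, weight updates, prefix-cache reset, sleep/wake) are typically orchestrated by the training loop. The sketch below shows one plausible ordering inferred from the method names; it is an assumption, not the authoritative NeMo RL flow, and state_dict_info is assumed to be prepared by the training side.

```python
async def refit_and_rollout_sketch(worker, state_dict_info):
    # Wake the engine if it was put to sleep to free GPU memory during training.
    await worker.wake_up_async()

    # Tell vLLM what weights to expect, then push the updated weights
    # (via ZMQ IPC here; update_weights_from_collective_async is the alternative path).
    await worker.prepare_refit_info_async(state_dict_info)
    if not await worker.update_weights_via_ipc_zmq_async():
        raise RuntimeError("weight update failed")

    # Cached prefixes were computed with the old weights, so drop them.
    await worker.reset_prefix_cache_async()

    # ... run generate_async(...) / generate_text_async(...) for the next rollout ...

    # Put the engine back to sleep before the next training step.
    await worker.sleep_async()
```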