nemo_rl.experience.rollouts#

Module Contents#

Functions#

generate_responses

Generate responses from the policy using synchronous generation.

generate_responses_async

Async version of generate_responses that awaits generate_async instead of blocking on synchronous generation.

calculate_rewards

Calculate rewards for generated responses and get environment feedback.

run_multi_turn_rollout

Runs a multi-turn rollout loop, interacting with the environment.

async_generate_response_for_sample_turn

Generate a response for a single sample’s turn using async generation.

run_sample_multi_turn_rollout

Run a multi-turn rollout for a single sample.

run_async_multi_turn_rollout

Run multi-turn rollouts with sample-level processing.

Data#

API#

nemo_rl.experience.rollouts.TokenizerType#

Type alias for the tokenizer accepted by the rollout functions (e.g., a Hugging Face tokenizer).

nemo_rl.experience.rollouts.generate_responses(
policy_generation: nemo_rl.models.generation.interfaces.GenerationInterface,
generation_input_data: nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.models.generation.interfaces.GenerationDatumSpec],
batch: nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.data.interfaces.DatumSpec],
tokenizer: nemo_rl.experience.rollouts.TokenizerType,
input_lengths: torch.Tensor,
include_logprobs: bool = True,
greedy: bool = False,
) → tuple[nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.data.interfaces.DatumSpec], list[torch.Tensor], dict[str, float | int]][source]#

Generate responses from the policy using synchronous generation.
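
A minimal usage sketch of a single synchronous generation step. The policy, prepared generation input, batch, and tokenizer are assumed to come from the surrounding training loop; the helper name sync_generation_step and the "input_lengths" key lookup are illustrative, not part of the documented API.

```python
import torch

from nemo_rl.experience.rollouts import generate_responses


def sync_generation_step(policy, gen_input, batch, tokenizer):
    """Run one blocking generation step; all arguments come from the caller."""
    # Per-sample prompt lengths; assumed to be carried in the generation
    # input data alongside the input ids (verify the key in your version).
    input_lengths: torch.Tensor = gen_input["input_lengths"]

    updated_batch, generated_tokens, gen_metrics = generate_responses(
        policy_generation=policy,
        generation_input_data=gen_input,
        batch=batch,
        tokenizer=tokenizer,
        input_lengths=input_lengths,
        include_logprobs=True,  # attach per-token logprobs to the message log
        greedy=False,           # sample rather than decode greedily
    )
    return updated_batch, generated_tokens, gen_metrics
```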

async nemo_rl.experience.rollouts.generate_responses_async(
policy_generation: nemo_rl.models.generation.interfaces.GenerationInterface,
generation_input_data: nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.models.generation.interfaces.GenerationDatumSpec],
batch: nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.data.interfaces.DatumSpec],
tokenizer: nemo_rl.experience.rollouts.TokenizerType,
input_lengths: torch.Tensor,
include_logprobs: bool = True,
greedy: bool = False,
) → tuple[nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.data.interfaces.DatumSpec], list[torch.Tensor], dict[str, float | int]][source]#

Async version of generate_responses that awaits generate_async instead of blocking on synchronous generation.
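
The same step driven asynchronously, sketched under the same assumptions as the synchronous example above; the coroutine name is illustrative.

```python
import asyncio

from nemo_rl.experience.rollouts import generate_responses_async


async def async_generation_step(policy, gen_input, batch, tokenizer, input_lengths):
    # Awaits the policy's generate_async path rather than blocking the loop,
    # which lets several generation steps be scheduled concurrently.
    return await generate_responses_async(
        policy_generation=policy,
        generation_input_data=gen_input,
        batch=batch,
        tokenizer=tokenizer,
        input_lengths=input_lengths,
        include_logprobs=True,
        greedy=False,
    )

# Typically driven from an existing event loop, e.g.:
# updated_batch, tokens, metrics = asyncio.run(
#     async_generation_step(policy, gen_input, batch, tokenizer, input_lengths)
# )
```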

nemo_rl.experience.rollouts.calculate_rewards(
batch: nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.data.interfaces.DatumSpec],
task_to_env: dict[str, nemo_rl.environments.interfaces.EnvironmentInterface],
) → nemo_rl.environments.interfaces.EnvironmentReturn[source]#

Calculate rewards for generated responses and get environment feedback.

Parameters:
  • batch – Batch containing message_log (LLMMessageLogType) with generated responses

  • task_to_env – Dictionary mapping task names to their corresponding environments

Returns:

  • observations: List of observations from the environment for the next turn.

  • metadata: List of extracted metadata from the environment.

  • next_stop_strings: List of stop strings for the next generation step.

  • rewards: Tensor of rewards for the last turn.

  • terminateds: Tensor of booleans indicating whether each episode ended naturally.

Return type:

EnvironmentReturn namedtuple containing the fields listed above
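
A short sketch of scoring a batch and unpacking the returned namedtuple by field name, assuming batch and task_to_env are prepared as described above; the helper name score_batch is illustrative.

```python
from nemo_rl.experience.rollouts import calculate_rewards


def score_batch(batch, task_to_env):
    """Score the latest responses and unpack the EnvironmentReturn fields."""
    env_return = calculate_rewards(batch, task_to_env)

    # Field names follow the Returns section above.
    rewards = env_return.rewards                # last-turn rewards (tensor)
    terminateds = env_return.terminateds        # bool tensor: natural episode end
    next_obs = env_return.observations          # observations for the next turn
    next_stops = env_return.next_stop_strings   # stop strings for the next step

    mean_reward = rewards.float().mean().item()
    return mean_reward, terminateds, next_obs, next_stops
```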

nemo_rl.experience.rollouts.run_multi_turn_rollout(
policy_generation: nemo_rl.models.generation.interfaces.GenerationInterface,
input_batch: nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.data.interfaces.DatumSpec],
tokenizer: nemo_rl.experience.rollouts.TokenizerType,
task_to_env: dict[str, nemo_rl.environments.interfaces.EnvironmentInterface],
max_seq_len: int,
max_rollout_turns: int = 999999,
greedy: bool = False,
) → tuple[nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.data.interfaces.DatumSpec], dict[str, Any]][source]#

Runs a multi-turn rollout loop, interacting with the environment.

Parameters:
  • policy_generation – The generation interface (policy).

  • input_batch – The starting batch containing initial message logs.

  • tokenizer – The tokenizer.

  • task_to_env – Dictionary mapping task names to environment instances.

  • max_seq_len – Maximum sequence length allowed.

  • max_rollout_turns – Maximum number of agent-environment interaction turns.

  • greedy – Whether to use greedy decoding.

Returns:

  • BatchedDataDict with the full interaction history and accumulated rewards

  • Dictionary of rollout metrics

Return type:

Tuple containing the batch and metrics listed above
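
An end-to-end sketch of a synchronous multi-turn rollout; the inputs are assumed to come from the surrounding setup, and the max_seq_len / max_rollout_turns values are arbitrary examples.

```python
from nemo_rl.experience.rollouts import run_multi_turn_rollout


def collect_rollouts(policy, input_batch, tokenizer, task_to_env):
    final_batch, rollout_metrics = run_multi_turn_rollout(
        policy_generation=policy,
        input_batch=input_batch,
        tokenizer=tokenizer,
        task_to_env=task_to_env,
        max_seq_len=4096,     # cap on total tokens per sample
        max_rollout_turns=4,  # stop after four agent-environment turns
        greedy=False,
    )
    # final_batch carries the full message logs plus accumulated rewards;
    # rollout_metrics is a plain dict of scalars suitable for logging.
    return final_batch, rollout_metrics
```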

async nemo_rl.experience.rollouts.async_generate_response_for_sample_turn(
policy_generation: nemo_rl.models.generation.interfaces.GenerationInterface,
sample_message_log: list[dict],
sample_stop_strings: list[str] | None,
tokenizer: nemo_rl.experience.rollouts.TokenizerType,
max_seq_len: int,
greedy: bool = False,
) → tuple[list[dict], torch.Tensor, torch.Tensor, dict[str, float]][source]#

Generate a response for a single sample’s turn using async generation.

Parameters:
  • policy_generation – The generation interface to use

  • sample_message_log – Message log for a single sample

  • sample_stop_strings – Stop strings for this sample

  • tokenizer – Tokenizer to use

  • max_seq_len – Maximum sequence length

  • greedy – Whether to use greedy decoding

Returns:

Tuple of (updated_message_log, generated_tokens, input_lengths, generation_metrics)
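
A sketch of generating one turn for one sample. The message-log entry shown is illustrative; in practice entries may carry additional fields (for example token ids) beyond role and content.

```python
from nemo_rl.experience.rollouts import async_generate_response_for_sample_turn


async def one_turn(policy, tokenizer, sample_message_log):
    updated_log, generated_tokens, input_lengths, gen_metrics = (
        await async_generate_response_for_sample_turn(
            policy_generation=policy,
            sample_message_log=sample_message_log,
            sample_stop_strings=None,  # no extra stop strings this turn
            tokenizer=tokenizer,
            max_seq_len=2048,
            greedy=True,  # deterministic decoding for this sample
        )
    )
    return updated_log, gen_metrics
```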

async nemo_rl.experience.rollouts.run_sample_multi_turn_rollout(
sample_idx: int,
initial_sample_state: dict,
policy_generation: nemo_rl.models.generation.interfaces.GenerationInterface,
tokenizer: nemo_rl.experience.rollouts.TokenizerType,
task_to_env: dict[str, nemo_rl.environments.interfaces.EnvironmentInterface],
max_seq_len: int,
max_rollout_turns: int = 999999,
greedy: bool = False,
) → tuple[dict, dict[str, Any]][source]#

Run a multi-turn rollout for a single sample.

This function manages the complete lifecycle of one sample’s interaction. Async generation is used internally when available.

Parameters:
  • sample_idx – Index of this sample in the original batch

  • initial_sample_state – Initial state containing message_log, extra_env_info, etc.

  • policy_generation – The generation interface

  • tokenizer – Tokenizer to use

  • task_to_env – Environment mapping

  • max_seq_len – Maximum sequence length

  • max_rollout_turns – Maximum number of turns

  • greedy – Whether to use greedy decoding

Returns:

Tuple of (final_sample_state, sample_metrics)
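
A sketch of rolling out one sample end to end; the exact contents of initial_sample_state beyond message_log and extra_env_info follow from the parameter description above and should be verified against your version.

```python
from nemo_rl.experience.rollouts import run_sample_multi_turn_rollout


async def rollout_one_sample(policy, tokenizer, task_to_env, sample_state):
    final_state, sample_metrics = await run_sample_multi_turn_rollout(
        sample_idx=0,                       # position in the original batch
        initial_sample_state=sample_state,  # dict with message_log, extra_env_info, ...
        policy_generation=policy,
        tokenizer=tokenizer,
        task_to_env=task_to_env,
        max_seq_len=4096,
        max_rollout_turns=8,
    )
    return final_state, sample_metrics
```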

nemo_rl.experience.rollouts.run_async_multi_turn_rollout(
policy_generation: nemo_rl.models.generation.interfaces.GenerationInterface,
input_batch: nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.data.interfaces.DatumSpec],
tokenizer: nemo_rl.experience.rollouts.TokenizerType,
task_to_env: dict[str, nemo_rl.environments.interfaces.EnvironmentInterface],
max_seq_len: int,
max_rollout_turns: int = 999999,
greedy: bool = False,
) → tuple[nemo_rl.distributed.batched_data_dict.BatchedDataDict[nemo_rl.data.interfaces.DatumSpec], dict[str, Any]][source]#

Run multi-turn rollouts with sample-level processing.

Each sample in the batch proceeds through its interaction independently. Async generation is used internally when available, but the function itself is synchronous.

Parameters:
  • policy_generation – The generation interface (policy)

  • input_batch – The starting batch containing initial message logs

  • tokenizer – The tokenizer

  • task_to_env – Dictionary mapping task names to environment instances

  • max_seq_len – Maximum sequence length allowed

  • max_rollout_turns – Maximum number of agent-environment interaction turns

  • greedy – Whether to use greedy decoding

Returns:

  • BatchedDataDict with the full interaction history and accumulated rewards

  • Dictionary of rollout metrics

Return type:

Tuple containing the batch and metrics listed above
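
A batch-level sketch. Because the function is synchronous at the call site, no event-loop management is needed even though generation runs asynchronously underneath; the wrapper name and limits are illustrative.

```python
from nemo_rl.experience.rollouts import run_async_multi_turn_rollout


def collect_async_rollouts(policy, input_batch, tokenizer, task_to_env):
    final_batch, rollout_metrics = run_async_multi_turn_rollout(
        policy_generation=policy,
        input_batch=input_batch,
        tokenizer=tokenizer,
        task_to_env=task_to_env,
        max_seq_len=4096,
        max_rollout_turns=8,
        greedy=False,
    )
    # Samples proceed independently, so a slow sample does not block
    # faster ones during generation.
    return final_batch, rollout_metrics
```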