Rollout Collection#
A rollout is a complete record of a task instance execution that captures:
What the model was asked to do (input)
How the model reasoned (internal processing)
What tools were used (tool calls and tool responses)
How well the task was achieved (verification scores)
The final response (output to user)
Generating Your First Rollouts#
Let’s generate rollouts using the Example Multi Step resource server, which tests reading comprehension across long documents.
head -1 resources_servers/example_multi_step/data/example.jsonl | python -m json.tool
What this dataset contains: Complex reading comprehension tasks where agents must find specific information (“needles”) within long documents (“haystacks”).
Each line in the input JSONL file follows the schema below.
Key components:
responses_create_params: The original task and available tools. Required.
metadata (e.g. expected_synonyms, minefield_label): Additional fields used by the resources server to set up the task or perform verification.
{
    "responses_create_params": {
        "input": [
            {
                "role": "user",
                "content": "What factors contribute to a region experiencing extremely high temperatures, and how do these factors interact?"
            }
        ]
    },
    "expected_synonyms": [
        "Blazing",
        "Warm"
    ],
    "minefield_label": "Hot"
}
Start the example_multi_step agent server
config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/example_multi_step/configs/example_multi_step.yaml"
ng_run "+config_paths=[${config_paths}]"
✅ Success Check: You should see 3 servers running including the example_multi_step_simple_agent.
In a separate terminal, run:
ng_collect_rollouts +agent_name=example_multi_step_simple_agent \
+input_jsonl_fpath=resources_servers/example_multi_step/data/example.jsonl \
+output_jsonl_fpath=results/example_multi_step_rollouts.jsonl \
+limit=5 \
+num_repeats=2 \
+num_samples_in_parallel=3 \
+responses_create_params.max_output_tokens=8192
What’s happening:
limit=5: Process only the first 5 examples (for quick testing)
num_repeats=2: Generate 2 rollouts per example (10 total rollouts)
num_samples_in_parallel=3: Process 3 requests simultaneously
max_output_tokens=8192: Allow longer responses for complex reasoning
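Once the run finishes, you can confirm that the expected number of rollouts was written: with limit=5 and num_repeats=2 the output file should contain 5 × 2 = 10 lines. A minimal sketch, assuming the output path used above:
import json

# With limit=5 and num_repeats=2, expect 5 * 2 = 10 rollouts in the output file.
output_path = "results/example_multi_step_rollouts.jsonl"

with open(output_path) as f:
    rollouts = [json.loads(line) for line in f]

print(f"Collected {len(rollouts)} rollouts")  # expected: 10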
Launch the rollout viewer
ng_viewer +jsonl_fpath=results/example_multi_step_rollouts.jsonl
Then visit http://127.0.0.1:7860
What you’ll see: An interactive viewer showing reasoning, tool calls, and verification scores for each rollout.
Key components:
reward: Verification score from the resource server. Required on output
response: Complete output conversation including tool calls and responses
metadata (e.g. parsed_synonym_values, set_overlap): Additional metrics for analysis
{
    "responses_create_params": {
        "input": [
            {
                "content": "What factors contribute to a region experiencing extremely high temperatures, and how do these factors interact?",
                "role": "user",
                "type": "message"
            }
        ]
    },
    "response": {
        "output": [
            {
                "arguments": "{\"synonym\":\"Blazing\"}",
                "name": "get_synonym_value",
                "type": "function_call"
            },
            "..."
        ]
    },
    "reward": 1.0,
    "parsed_synonym_values": [
        711,
        407
    ],
    "accuracy": true,
    "set_overlap": 1.0,
    "original_term_minefield_hit": false,
    "order_instruction_following_failure": false
}
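Beyond browsing individual rollouts in the viewer, it is often useful to aggregate scores across the whole file. The sketch below computes the mean reward and an accuracy rate; reward is the required output field, while accuracy is one of the example_multi_step metadata fields shown above and may not exist for other resource servers.
import json

output_path = "results/example_multi_step_rollouts.jsonl"

with open(output_path) as f:
    rollouts = [json.loads(line) for line in f]

# reward is required on every output record; accuracy is resource-server-specific metadata.
mean_reward = sum(r["reward"] for r in rollouts) / len(rollouts)
accuracy_rate = sum(1 for r in rollouts if r.get("accuracy")) / len(rollouts)

print(f"Mean reward:   {mean_reward:.3f}")
print(f"Accuracy rate: {accuracy_rate:.2%}")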
Rollout Generation Parameters#
Essential
ng_collect_rollouts \
+agent_name=your_agent_name \ # Which agent to use
+input_jsonl_fpath=input/tasks.jsonl \ # Input dataset
+output_jsonl_fpath=output/rollouts.jsonl # Where to save results
Data Control
+limit=100 \ # Limit examples processed (null = all)
+num_repeats=3 \ # Rollouts per example (null = 1)
+num_samples_in_parallel=5 # Concurrent requests (null = default)
Model Behavior
+responses_create_params.max_output_tokens=4096 \ # Response length limit
+responses_create_params.temperature=0.7 \ # Randomness (0-1)
+responses_create_params.top_p=0.9 # Nucleus sampling
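Because every option is passed on the command line, parameter sweeps are easy to script. The sketch below is one way to run the same collection at several temperatures; it assumes ng_collect_rollouts is on your PATH, the agent server from the earlier steps is already running, and the per-temperature output filenames are just an illustrative naming choice.
import subprocess

# Illustrative temperature sweep over the example dataset.
for temperature in (0.3, 0.7, 1.0):
    subprocess.run(
        [
            "ng_collect_rollouts",
            "+agent_name=example_multi_step_simple_agent",
            "+input_jsonl_fpath=resources_servers/example_multi_step/data/example.jsonl",
            f"+output_jsonl_fpath=results/rollouts_temp_{temperature}.jsonl",
            "+limit=5",
            "+num_repeats=2",
            f"+responses_create_params.temperature={temperature}",
        ],
        check=True,
    )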