Prepare and Validate Data#
Format and validate JSONL datasets for NeMo Gym training using ng_prepare_data.
Goal: Validate data format and prepare datasets for training.
Time: ~15 minutes
In this guide, you will:
- Validate datasets with ng_prepare_data
- Generate training and validation splits
- Understand the JSONL data format
Prerequisites:
NeMo Gym installed (Detailed Setup Guide)
Quick Start#
From the repository root:
ng_prepare_data \
"+config_paths=[resources_servers/example_multi_step/configs/example_multi_step.yaml]" \
+output_dirpath=data/test \
+mode=example_validation
Success output:
####################################################################################################
#
# Finished!
#
####################################################################################################
This generates two types of output:
- Per-dataset metrics: resources_servers/example_multi_step/data/example_metrics.json (alongside the source JSONL)
- Aggregated metrics: data/test/example_metrics.json (in the output directory)
Data Format#
NeMo Gym uses JSONL files. Each line requires a responses_create_params field following the OpenAI Responses API schema.
Minimal Format#
{"responses_create_params": {"input": [{"role": "user", "content": "What is 2+2?"}]}}
With Verification Fields#
Most resources servers add fields for reward computation:
{
  "responses_create_params": {
    "input": [{"role": "user", "content": "What is 15 * 7? Put your answer in \\boxed{}."}]
  },
  "question": "What is 15 * 7?",
  "expected_answer": "105"
}
Tip
Check resources_servers/<name>/README.md for required fields specific to each resources server.
Key Properties#
| Property | Type | Description |
|---|---|---|
| input | string or list | Required. User query or message list |
| tools | list | Tool definitions for function calling |
| parallel_tool_calls | bool | Allow parallel tool calls (default: true) |
| temperature | float | Sampling temperature |
| max_output_tokens | int | Maximum response tokens |
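For illustration, a hypothetical sample that combines several of these properties might look like the following (the get_weather function tool is made up for this example):

{
  "responses_create_params": {
    "input": [{"role": "user", "content": "What is the weather in Paris today?"}],
    "tools": [
      {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    ],
    "parallel_tool_calls": false,
    "temperature": 0.6,
    "max_output_tokens": 512
  }
}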
Message Roles#
| Role | Use |
|---|---|
| user | User queries |
| assistant | Model responses (multi-turn) |
| developer | System instructions (preferred) |
| system | System instructions (legacy) |
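For example, a multi-turn sample (made up for illustration) can carry earlier assistant turns in input, with a developer message holding the instructions:

{
  "responses_create_params": {
    "input": [
      {"role": "developer", "content": "Answer concisely."},
      {"role": "user", "content": "What is the capital of France?"},
      {"role": "assistant", "content": "Paris."},
      {"role": "user", "content": "And of Italy?"}
    ]
  }
}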
Preprocess Raw Datasets#
If your dataset doesn’t have responses_create_params, you need to preprocess it before using ng_prepare_data.
When to preprocess:
- Downloaded datasets without the NeMo Gym format
- Custom data needing system prompts
- Data that needs train/validation splits
Add responses_create_params#
The responses_create_params field wraps your input in the Responses API format. This typically includes a system prompt and the user content.
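For example, assuming the raw data stores its text in a problem field, a raw line such as

{"problem": "What is 12 + 30?", "expected_answer": "42"}

would be wrapped into:

{"problem": "What is 12 + 30?", "expected_answer": "42", "responses_create_params": {"input": [{"role": "developer", "content": "Your task is to solve a math problem. Put the answer inside \\boxed{}."}, {"role": "user", "content": "What is 12 + 30?"}]}}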
Preprocessing script (preprocess.py)
Save this script as preprocess.py. It reads a raw JSONL file, adds responses_create_params, and splits into train/validation:
import json
import os

# Configuration: customize these for your dataset
INPUT_FIELD = "problem"  # Field containing the input text (e.g., "problem", "question", "prompt")
FILENAME = "raw_data.jsonl"
SYSTEM_PROMPT = "Your task is to solve a math problem. Put the answer inside \\boxed{}."
TRAIN_RATIO = 0.999  # 99.9% train, 0.1% validation

dirpath = os.path.dirname(FILENAME) or "."

with open(FILENAME, "r", encoding="utf-8") as fin, \
     open(os.path.join(dirpath, "train.jsonl"), "w", encoding="utf-8") as ftrain, \
     open(os.path.join(dirpath, "validation.jsonl"), "w", encoding="utf-8") as fval:
    lines = list(fin)
    split_idx = int(len(lines) * TRAIN_RATIO)
    for i, line in enumerate(lines):
        if not line.strip():
            continue
        row = json.loads(line)

        # Remove fields not needed for training (optional)
        row.pop("generated_solution", None)
        row.pop("problem_source", None)

        # Add responses_create_params
        row["responses_create_params"] = {
            "input": [
                {"role": "developer", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row.get(INPUT_FIELD, "")},
            ]
        }

        out = json.dumps(row) + "\n"
        (ftrain if i < split_idx else fval).write(out)
Important
You must customize these variables for your dataset:
- INPUT_FIELD: the field name containing your input text. Common values: "problem" (math), "question" (QA), "prompt" (general), "instruction" (instruction-following)
- SYSTEM_PROMPT: task-specific instructions for the model
- TRAIN_RATIO: train/validation split ratio
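For instance, a question-answering dataset whose rows store their text in a question field might use settings like these (hypothetical values):

# Hypothetical configuration for a QA dataset where each row has a "question" field
INPUT_FIELD = "question"
SYSTEM_PROMPT = "Answer the question. Put the final answer inside \\boxed{}."
TRAIN_RATIO = 0.95  # keep 5% for validation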
Run and verify:
uv run preprocess.py
wc -l train.jsonl validation.jsonl
Create Config for Custom Data#
After preprocessing, create a config file to point ng_prepare_data at your local files.
Example config: custom_data.yaml
custom_resources_server:
  resources_servers:
    custom_server:
      entrypoint: app.py
      domain: math  # math | coding | agent | knowledge | other
      description: Custom math dataset
      verified: false

custom_simple_agent:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server:
        type: resources_servers
        name: custom_resources_server
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
        - name: train
          type: train
          jsonl_fpath: train.jsonl
          license: Creative Commons Attribution 4.0 International
        - name: validation
          type: validation
          jsonl_fpath: validation.jsonl
          license: Creative Commons Attribution 4.0 International
Run data preparation:
config_paths="responses_api_models/vllm_model/configs/vllm_model_for_training.yaml,custom_data.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" +mode=train_preparation +output_dirpath=data
This validates your data and adds the agent_ref field to each row, routing samples to your resources server.
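As a quick sanity check, you can confirm that the collated rows carry agent_ref. This is a sketch that assumes +output_dirpath=data as in the command above:

import json

# Read the first collated row from the prepared training file
with open("data/train.jsonl", encoding="utf-8") as f:
    row = json.loads(f.readline())

print(sorted(row.keys()))    # should include "agent_ref" and "responses_create_params"
print(row.get("agent_ref"))  # the agent this sample is routed through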
Validation Modes#
| Mode | Purpose | Validates |
|---|---|---|
| example_validation | PR submission | Example datasets |
| train_preparation | Training prep | Train and validation datasets |
Example Validation#
ng_prepare_data "+config_paths=[resources_servers/example_multi_step/configs/example_multi_step.yaml]" \
+output_dirpath=data/example_multi_step \
+mode=example_validation
Training Preparation#
ng_prepare_data "+config_paths=[resources_servers/workplace_assistant/configs/workplace_assistant.yaml]" \
+output_dirpath=data/workplace_assistant \
+mode=train_preparation \
+should_download=true
CLI Parameters#
| Parameter | Required | Description |
|---|---|---|
| config_paths | Yes | YAML config paths |
| output_dirpath | Yes | Output directory |
| mode | Yes | example_validation or train_preparation |
| should_download | No | Download missing datasets (default: false) |
|  | No |  |
Troubleshooting#
| Issue | Symptom | Fix |
|---|---|---|
| Missing responses_create_params | Sample silently skipped | Add the field with a valid input list |
| Invalid JSON | Sample skipped | Fix JSON syntax |
| Invalid role | Sample skipped | Use user, assistant, developer, or system |
| Missing dataset file | File check fails | Create the file or set should_download=true |
Warning
Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data format.
Find invalid samples
import json

def validate_sample(line: str) -> tuple[bool, str]:
    try:
        data = json.loads(line)
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON: {e}"
    if "responses_create_params" not in data:
        return False, "Missing 'responses_create_params'"
    if "input" not in data["responses_create_params"]:
        return False, "Missing 'input' in responses_create_params"
    return True, "OK"

with open("your_data.jsonl") as f:
    for i, line in enumerate(f, 1):
        valid, msg = validate_sample(line)
        if not valid:
            print(f"Line {i}: {msg}")
Validation Process#
ng_prepare_data performs these steps:
1. Load configs — Parse server configs, identify datasets
2. Check files — Verify dataset files exist
3. Validate samples — Parse each line, validate against schema
4. Compute metrics — Aggregate statistics
5. Collate — Combine samples with agent references
Output Locations#
Metrics files are written to two locations:
- Per-dataset: {dataset_jsonl_path}_metrics.json — alongside each source JSONL file
- Aggregated: {output_dirpath}/{type}_metrics.json — combined metrics per dataset type
Re-Running#
- Output files (train.jsonl, validation.jsonl) are overwritten in output_dirpath
- Metrics files (*_metrics.json) are compared — delete them if your data changed
Generated Metrics#
| Metric | Description |
|---|---|
| Number of examples | Valid sample count |
| Number of tools | Tool count stats (avg/min/max/stddev) |
| Number of turns | User messages per sample |
| Temperature | Temperature parameter stats |
Example metrics file
{
  "name": "example",
  "type": "example",
  "jsonl_fpath": "resources_servers/example_multi_step/data/example.jsonl",
  "Number of examples": 5,
  "Number of tools": {
    "Total # non-null values": 5,
    "Average": 2.0,
    "Min": 2.0,
    "Max": 2.0
  }
}
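Because invalid samples are skipped rather than failing the run, one way to confirm nothing was dropped is to compare the reported count with the raw line count. A sketch using the example paths above:

import json

jsonl_fpath = "resources_servers/example_multi_step/data/example.jsonl"
metrics_fpath = "resources_servers/example_multi_step/data/example_metrics.json"

# Count non-empty lines in the source JSONL
with open(jsonl_fpath, encoding="utf-8") as f:
    raw_count = sum(1 for line in f if line.strip())

# Compare with the valid-sample count recorded by ng_prepare_data
with open(metrics_fpath, encoding="utf-8") as f:
    metrics = json.load(f)

print(raw_count, metrics["Number of examples"])  # a mismatch means some samples were skipped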
Dataset Configuration#
Define datasets in your server’s YAML config:
datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/my_server/data/train.jsonl
    license: Apache 2.0
  - name: validation
    type: validation
    jsonl_fpath: resources_servers/my_server/data/validation.jsonl
    license: Apache 2.0
  - name: example
    type: example
    jsonl_fpath: resources_servers/my_server/data/example.jsonl
| Type | Purpose | Required for |
|---|---|---|
| example | Small sample (~5 rows) for format checks | PR submission |
| train | Training data | RL training |
| validation | Evaluation during training | RL training |
Next Steps#
Generate training examples by running your agent on prepared data.
Use validated data with NeMo RL for GRPO training.