Prepare and Validate
Format and validate JSONL datasets for NeMo Gym training using ng_prepare_data.
Goal: Validate data format and prepare datasets for training.
Time: ~15 minutes
In this guide, you will:
- Validate datasets with `ng_prepare_data`
- Generate training and validation splits
- Understand the JSONL data format
Prerequisites:
- NeMo Gym installed (Installation)
- `policy_base_url`, `policy_api_key`, and `policy_model_name` set in `env.yaml`
Quick Start
From the repository root:
Success output:
This generates two types of output:
- Per-dataset metrics: `resources_servers/example_multi_step/data/example_metrics.json` (alongside the source JSONL)
- Aggregated metrics: `data/test/example_metrics.json` (in the output directory)
Data Format
NeMo Gym uses JSONL files: each line is a JSON object that must contain a `responses_create_params` field following the OpenAI Responses API schema.
Minimal Format
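As a sketch, built here in Python for clarity, a minimal row holds a single user message inside `responses_create_params` (the prompt text is just an example):

```python
import json

# Minimal JSONL row: one JSON object whose "responses_create_params"
# field follows the OpenAI Responses API schema (here, one user message).
row = {
    "responses_create_params": {
        "input": [
            {"role": "user", "content": "What is 7 * 6?"}
        ]
    }
}

line = json.dumps(row)
print(line)
```

Each such line is appended to a `.jsonl` file: one JSON object per line, with no trailing commas or multi-line objects.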
With Verification Fields
Most resources servers add fields for reward computation:
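Field names vary by server, so the `expected_answer` field below is purely hypothetical; it only illustrates how a verification field sits next to `responses_create_params` on the same row:

```python
import json

# "expected_answer" is a hypothetical verification field used for
# illustration; each resources server defines its own required fields.
row = {
    "responses_create_params": {
        "input": [{"role": "user", "content": "What is 7 * 6?"}]
    },
    "expected_answer": "42",
}
print(json.dumps(row))
```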
Check `resources_servers/<name>/README.md` for the fields required by that resources server.
Key Properties
Message Roles
Preprocess Raw Datasets
If your dataset doesn’t have responses_create_params, you need to preprocess it before using ng_prepare_data.
When to preprocess:
- Downloaded datasets without NeMo Gym format
- Custom data needing system prompts
- Need to split into train/validation sets
Add responses_create_params
The responses_create_params field wraps your input in the Responses API format. This typically includes a system prompt and the user content.
Preprocessing script (preprocess.py)
Save this script as preprocess.py. It reads a raw JSONL file, adds responses_create_params, and splits into train/validation:
You must customize these variables for your dataset:
- `INPUT_FIELD`: the field name containing your input text. Common values: `"problem"` (math), `"question"` (QA), `"prompt"` (general), `"instruction"` (instruction-following)
- `SYSTEM_PROMPT`: task-specific instructions for the model
- `TRAIN_RATIO`: train/validation split ratio
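A sketch of such a script, assuming a raw file at `raw.jsonl` and the variable names above (the structure is an assumption; adapt it to your dataset and to the fields your resources server expects):

```python
"""Sketch of preprocess.py: wraps a raw JSONL dataset in the
responses_create_params format and splits it into train/validation."""
import json
import random

INPUT_FIELD = "problem"   # field holding the input text in your raw data
SYSTEM_PROMPT = "Solve the problem step by step."  # task-specific instructions
TRAIN_RATIO = 0.9         # fraction of rows written to train.jsonl


def wrap(raw_row: dict) -> dict:
    """Wrap one raw row in the Responses API format."""
    return {
        "responses_create_params": {
            "input": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": raw_row[INPUT_FIELD]},
            ]
        }
    }


def main(raw_path: str = "raw.jsonl",
         train_path: str = "train.jsonl",
         val_path: str = "validation.jsonl") -> None:
    with open(raw_path) as f:
        rows = [wrap(json.loads(line)) for line in f if line.strip()]

    random.seed(0)  # deterministic split across runs
    random.shuffle(rows)
    split = int(len(rows) * TRAIN_RATIO)

    for path, subset in [(train_path, rows[:split]), (val_path, rows[split:])]:
        with open(path, "w") as f:
            for row in subset:
                f.write(json.dumps(row) + "\n")
```

Call `main()` after pointing `raw_path` at your dataset; seeding the shuffle keeps the train/validation split reproducible.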
Run and verify:
Create Config for Custom Data
After preprocessing, create a config file to point ng_prepare_data at your local files.
Example config: custom_data.yaml
Run data preparation:
This validates your data and adds the agent_ref field to each row, routing samples to your resources server.
Validation Modes
Example Validation
Training Preparation
CLI Parameters
Troubleshooting
Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data format.
Find invalid samples
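This does not reproduce `ng_prepare_data`'s full schema validation; it is an approximation covering the two most common failure modes, reporting unparseable lines and rows missing `responses_create_params`:

```python
import json


def find_invalid(path: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) for rows that fail basic checks:
    unparseable JSON or a missing responses_create_params field."""
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc}"))
                continue
            if "responses_create_params" not in row:
                problems.append((lineno, "missing responses_create_params"))
    return problems
```

Run it against the same JSONL you pass to `ng_prepare_data`, then fix or drop the reported lines.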
Validation Process
ng_prepare_data performs these steps:
- Load configs — Parse server configs, identify datasets
- Check files — Verify dataset files exist
- Validate samples — Parse each line, validate against schema
- Compute metrics — Aggregate statistics
- Collate — Combine samples with agent references
Output Locations
Metrics files are written to two locations:
- Per-dataset: `{dataset_jsonl_path}_metrics.json`, alongside each source JSONL file
- Aggregated: `{output_dirpath}/{type}_metrics.json`, combined metrics per dataset type
Re-Running
- Output files (`train.jsonl`, `validation.jsonl`) are overwritten in `output_dirpath`
- Metrics files (`*_metrics.json`) are compared against existing ones; delete them if your data changed
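Because stale metrics are compared rather than regenerated, a small helper like this (directory names are only examples) can clear them before a re-run:

```python
from pathlib import Path


def clear_metrics(*dirs: str) -> int:
    """Delete *_metrics.json files so ng_prepare_data recomputes them."""
    deleted = 0
    for d in dirs:
        for metrics_file in Path(d).glob("*_metrics.json"):
            metrics_file.unlink()
            deleted += 1
    return deleted


# Example (illustrative paths):
# clear_metrics("resources_servers/example_multi_step/data", "data/test")
```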
Generated Metrics
Example metrics file
Dataset Configuration
Define datasets in your server’s YAML config: