Prepare and Validate

Format and validate JSONL datasets for NeMo Gym training using ng_prepare_data.

Goal: Validate data format and prepare datasets for training.

Time: ~15 minutes

In this guide, you will:

  1. Validate datasets with ng_prepare_data
  2. Generate training and validation splits
  3. Understand the JSONL data format

Prerequisites:

  • NeMo Gym installed (Installation)
  • policy_base_url, policy_api_key, and policy_model_name set in env.yaml
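
A minimal env.yaml sketch (the key names come from the list above; the values and flat layout are placeholders, so adapt them to your deployment):

```yaml
policy_base_url: https://api.openai.com/v1   # placeholder endpoint
policy_api_key: YOUR_API_KEY                 # placeholder; keep real keys out of version control
policy_model_name: gpt-4o-mini               # placeholder model name
```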

Quick Start

From the repository root:

```shell
config_paths="resources_servers/example_multi_step/configs/example_multi_step.yaml,\
responses_api_models/openai_model/configs/openai_model.yaml"
ng_prepare_data \
  "+config_paths=[$config_paths]" \
  +output_dirpath=data/test \
  +mode=example_validation
```

Success output:

```text
####################################################################################################
#
# Finished!
#
####################################################################################################
```

This generates two types of output:

  • Per-dataset metrics: resources_servers/example_multi_step/data/example_metrics.json (alongside source JSONL)
  • Aggregated metrics: data/test/example_metrics.json (in output directory)

Data Format

NeMo Gym uses JSONL files. Each line requires a responses_create_params field following the OpenAI Responses API schema.

Minimal Format

```json
{"responses_create_params": {"input": [{"role": "user", "content": "What is 2+2?"}]}}
```

With Verification Fields

Most resources servers add fields for reward computation:

```json
{
  "responses_create_params": {
    "input": [{"role": "user", "content": "What is 15 * 7? Put your answer in \\boxed{}."}]
  },
  "question": "What is 15 * 7?",
  "expected_answer": "105"
}
```

Check resources_servers/<name>/README.md for required fields specific to each resources server.
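
Such rows can also be generated programmatically. A minimal sketch mirroring the example above (the extra `question`/`expected_answer` fields are specific to math-style verification; your resources server may expect different ones):

```python
import json

# Build one verification-style row; the extra fields mirror the example above.
# Check your resources server's README for its actual requirements.
row = {
    "responses_create_params": {
        "input": [
            {"role": "user", "content": "What is 15 * 7? Put your answer in \\boxed{}."}
        ]
    },
    "question": "What is 15 * 7?",
    "expected_answer": "105",
}
line = json.dumps(row)  # one JSONL line, ready to append to a dataset file
print(line)
```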

Key Properties

| Property | Type | Description |
|---|---|---|
| `input` | string or list | **Required.** User query or message list |
| `tools` | list | Tool definitions for function calling |
| `parallel_tool_calls` | bool | Allow parallel tool calls (default: `true`) |
| `temperature` | float | Sampling temperature |
| `max_output_tokens` | int | Maximum response tokens |
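
A row exercising the optional properties might look like the sketch below. The flattened function-tool shape follows the OpenAI Responses API; treat the tool itself (`get_weather`) as a made-up example:

```python
import json

# Sketch: one row using tools, sampling, and output-length controls
params = {
    "input": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    "parallel_tool_calls": False,
    "temperature": 0.7,
    "max_output_tokens": 512,
}
print(json.dumps({"responses_create_params": params}))
```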

Message Roles

| Role | Use |
|---|---|
| `user` | User queries |
| `assistant` | Model responses (multi-turn) |
| `developer` | System instructions (preferred) |
| `system` | System instructions (legacy) |
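
A multi-turn sample combining these roles could be sketched as follows (the conversation content is illustrative):

```python
import json

# Sketch: developer instructions plus prior assistant turns in one sample
params = {
    "input": [
        {"role": "developer", "content": "Answer concisely."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
        {"role": "user", "content": "And of Italy?"},
    ]
}
roles = {m["role"] for m in params["input"]}
# Every role must come from the table above
assert roles <= {"user", "assistant", "developer", "system"}
print(json.dumps({"responses_create_params": params}))
```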

Preprocess Raw Datasets

If your dataset doesn't include a `responses_create_params` field, preprocess it before running ng_prepare_data.

When to preprocess:

  • Downloaded datasets without NeMo Gym format
  • Custom data needing system prompts
  • Need to split into train/validation sets

Add responses_create_params

The responses_create_params field wraps your input in the Responses API format. This typically includes a system prompt and the user content.

Save this script as preprocess.py. It reads a raw JSONL file, adds responses_create_params, and splits into train/validation:

```python
import json
import os

# Configuration: customize these for your dataset
INPUT_FIELD = "problem"  # Field containing the input text (e.g., "problem", "question", "prompt")
FILENAME = "raw_data.jsonl"
SYSTEM_PROMPT = "Your task is to solve a math problem. Put the answer inside \\boxed{}."
TRAIN_RATIO = 0.999  # 99.9% train, 0.1% validation

dirpath = os.path.dirname(FILENAME) or "."
with open(FILENAME, "r", encoding="utf-8") as fin, \
        open(os.path.join(dirpath, "train.jsonl"), "w", encoding="utf-8") as ftrain, \
        open(os.path.join(dirpath, "validation.jsonl"), "w", encoding="utf-8") as fval:

    lines = list(fin)
    split_idx = int(len(lines) * TRAIN_RATIO)

    for i, line in enumerate(lines):
        if not line.strip():
            continue
        row = json.loads(line)

        # Remove fields not needed for training (optional)
        row.pop("generated_solution", None)
        row.pop("problem_source", None)

        # Add responses_create_params
        row["responses_create_params"] = {
            "input": [
                {"role": "developer", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row.get(INPUT_FIELD, "")},
            ]
        }

        out = json.dumps(row) + "\n"
        (ftrain if i < split_idx else fval).write(out)
```

You must customize these variables for your dataset:

  • INPUT_FIELD: The field name containing your input text. Common values: "problem" (math), "question" (QA), "prompt" (general), "instruction" (instruction-following)
  • SYSTEM_PROMPT: Task-specific instructions for the model
  • TRAIN_RATIO: Train/validation split ratio

Run and verify:

```shell
uv run preprocess.py
wc -l train.jsonl validation.jsonl
```

Create Config for Custom Data

After preprocessing, create a config file (for example, custom_data.yaml) to point ng_prepare_data at your local files.

```yaml
custom_resources_server:
  resources_servers:
    custom_server:
      entrypoint: app.py
      domain: math  # math | coding | agent | knowledge | other
      description: Custom math dataset
      verified: false

custom_simple_agent:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server:
        type: resources_servers
        name: custom_resources_server
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
        - name: train
          type: train
          jsonl_fpath: train.jsonl
          license: Creative Commons Attribution 4.0 International
        - name: validation
          type: validation
          jsonl_fpath: validation.jsonl
          license: Creative Commons Attribution 4.0 International
```

Run data preparation:

```shell
config_paths="custom_data.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" +mode=train_preparation +output_dirpath=data
```

This validates your data and adds the agent_ref field to each row, routing samples to your resources server.


Validation Modes

| Mode | Purpose | Validates |
|---|---|---|
| `example_validation` | PR submission | `example` datasets |
| `train_preparation` | Training prep | `train`, `validation` datasets |

Example Validation

```shell
ng_prepare_data "+config_paths=[resources_servers/example_multi_step/configs/example_multi_step.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]" \
  +output_dirpath=data/example_multi_step \
  +mode=example_validation
```

Training Preparation

```shell
ng_prepare_data "+config_paths=[resources_servers/workplace_assistant/configs/workplace_assistant.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]" \
  +output_dirpath=data/workplace_assistant \
  +mode=train_preparation \
  +should_download=true
```

CLI Parameters

| Parameter | Required | Description |
|---|---|---|
| `+config_paths` | Yes | YAML config paths |
| `+output_dirpath` | Yes | Output directory |
| `+mode` | Yes | `example_validation` or `train_preparation` |
| `+should_download` | No | Download missing datasets (default: `false`) |
| `+data_source` | No | `huggingface` (default) or `gitlab` |

Troubleshooting

| Issue | Symptom | Fix |
|---|---|---|
| Missing `responses_create_params` | Sample silently skipped | Add field with valid `input` |
| Invalid JSON | Sample skipped | Fix JSON syntax |
| Invalid role | Sample skipped | Use `user`, `assistant`, `system`, or `developer` |
| Missing dataset file | `AssertionError` | Create file or set `+should_download=true` |

Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data format. This script runs a quick local check:

```python
import json

def validate_sample(line: str) -> tuple[bool, str]:
    try:
        data = json.loads(line)
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON: {e}"

    if "responses_create_params" not in data:
        return False, "Missing 'responses_create_params'"

    if "input" not in data["responses_create_params"]:
        return False, "Missing 'input' in responses_create_params"

    return True, "OK"

with open("your_data.jsonl") as f:
    for i, line in enumerate(f, 1):
        valid, msg = validate_sample(line)
        if not valid:
            print(f"Line {i}: {msg}")
```
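
Invalid roles also cause silent skips. A similar check, using the role set from the Message Roles table above, can be sketched as:

```python
import json

VALID_ROLES = {"user", "assistant", "system", "developer"}

def check_roles(line: str) -> tuple[bool, str]:
    """Flag messages whose role would be rejected.

    Assumes the line already passed the structural checks above, so
    responses_create_params.input is present.
    """
    messages = json.loads(line)["responses_create_params"]["input"]
    if isinstance(messages, str):
        return True, "OK"  # plain-string input has no roles to check
    for m in messages:
        if m.get("role") not in VALID_ROLES:
            return False, f"Invalid role: {m.get('role')!r}"
    return True, "OK"

ok, msg = check_roles(
    '{"responses_create_params": {"input": [{"role": "operator", "content": "hi"}]}}'
)
assert not ok and "operator" in msg
```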

Validation Process

ng_prepare_data performs these steps:

  1. Load configs — Parse server configs, identify datasets
  2. Check files — Verify dataset files exist
  3. Validate samples — Parse each line, validate against schema
  4. Compute metrics — Aggregate statistics
  5. Collate — Combine samples with agent references
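
Steps 3-4 above can be sketched in miniature (a simplified model of the behavior, not the actual ng_prepare_data implementation):

```python
import json

def prepare(lines):
    """Validate each line and count valid samples (sketch of steps 3-4)."""
    valid = []
    for line in lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # invalid JSON: silently skipped
        if "input" not in row.get("responses_create_params", {}):
            continue  # missing required field: silently skipped
        valid.append(row)
    metrics = {"Number of examples": len(valid)}
    return valid, metrics

lines = [
    '{"responses_create_params": {"input": "ok"}}',
    'not json',
    '{"no_params": true}',
]
valid, metrics = prepare(lines)
print(metrics)  # only the first line survives
```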

Output Locations

Metrics files are written to two locations:

  • Per-dataset: {dataset_jsonl_path}_metrics.json — alongside each source JSONL file
  • Aggregated: {output_dirpath}/{type}_metrics.json — combined metrics per dataset type

Re-Running

  • Output files (train.jsonl, validation.jsonl) are overwritten in output_dirpath
  • Metrics files (*_metrics.json) are compared — delete them if your data changed
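
Deleting stale metrics files can be done with a short sweep like the sketch below, demonstrated in a scratch directory (point `root` at your repo in practice):

```python
import tempfile
from pathlib import Path

# Scratch directory standing in for a repo with one dataset + stale metrics
root = Path(tempfile.mkdtemp())
(root / "train.jsonl").touch()
(root / "train_metrics.json").touch()

# Remove every *_metrics.json so ng_prepare_data recomputes them
for stale in root.rglob("*_metrics.json"):
    stale.unlink()

print(sorted(p.name for p in root.iterdir()))  # ['train.jsonl']
```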

Generated Metrics

| Metric | Description |
|---|---|
| Number of examples | Valid sample count |
| Number of tools | Tool count stats (avg/min/max/stddev) |
| Number of turns | User messages per sample |
| Temperature | Temperature parameter stats |

Example metrics file:

```json
{
  "name": "example",
  "type": "example",
  "jsonl_fpath": "resources_servers/example_multi_step/data/example.jsonl",
  "Number of examples": 5,
  "Number of tools": {
    "Total # non-null values": 5,
    "Average": 2.0,
    "Min": 2.0,
    "Max": 2.0
  }
}
```

Dataset Configuration

Define datasets in your server’s YAML config:

```yaml
datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/my_server/data/train.jsonl
    license: Apache 2.0
  - name: validation
    type: validation
    jsonl_fpath: resources_servers/my_server/data/validation.jsonl
    license: Apache 2.0
  - name: example
    type: example
    jsonl_fpath: resources_servers/my_server/data/example.jsonl
```

| Type | Purpose | Required for |
|---|---|---|
| `example` | Small sample (~5 rows) for format checks | PR submission |
| `train` | Training data | RL training |
| `validation` | Evaluation during training | RL training |

Next Steps