Prepare and Validate

Format and validate JSONL datasets for NeMo Gym training using ng_prepare_data.

Goal: Validate data format and prepare datasets for training.

Time: ~15 minutes

In this guide, you will:

  1. Validate datasets with ng_prepare_data
  2. Generate training and validation splits
  3. Understand the JSONL data format

Prerequisites:

  • NeMo Gym installed (Installation)
  • policy_base_url, policy_api_key, and policy_model_name set in env.yaml
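
A minimal env.yaml sketch (the key names come from the list above; the values and flat layout are placeholders, so adapt them to your deployment):

```yaml
policy_base_url: https://api.openai.com/v1   # placeholder endpoint
policy_api_key: YOUR_API_KEY                 # placeholder; keep real keys out of version control
policy_model_name: gpt-4o-mini               # placeholder model name
```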

Quick Start

From the repository root:

```shell
config_paths="resources_servers/example_multi_step/configs/example_multi_step.yaml,\
responses_api_models/openai_model/configs/openai_model.yaml"
ng_prepare_data \
  "+config_paths=[$config_paths]" \
  +output_dirpath=data/test \
  +mode=example_validation
```

Success output:

```text
####################################################################################################
#
# Finished!
#
####################################################################################################
```

This generates two types of output:

  • Per-dataset metrics: resources_servers/example_multi_step/data/example_metrics.json (alongside source JSONL)
  • Aggregated metrics: data/test/example_metrics.json (in output directory)

Data Format

NeMo Gym uses JSONL files. Each line requires a responses_create_params field following the OpenAI Responses API schema.

Minimal Format

```json
{"responses_create_params": {"input": [{"role": "user", "content": "What is 2+2?"}]}}
```

With Verification Fields

Most resources servers add fields for reward computation:

```json
{
  "responses_create_params": {
    "input": [{"role": "user", "content": "What is 15 * 7? Put your answer in \\boxed{}."}]
  },
  "question": "What is 15 * 7?",
  "expected_answer": "105"
}
```

Check resources_servers/<name>/README.md for required fields specific to each resources server.
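
Such rows can also be generated programmatically. A minimal sketch mirroring the example above (the extra `question`/`expected_answer` fields are specific to math-style verification; your resources server may expect different ones):

```python
import json

# Build one verification-style row; the extra fields mirror the example above.
# Check your resources server's README for its actual requirements.
row = {
    "responses_create_params": {
        "input": [
            {"role": "user", "content": "What is 15 * 7? Put your answer in \\boxed{}."}
        ]
    },
    "question": "What is 15 * 7?",
    "expected_answer": "105",
}
line = json.dumps(row)  # one JSONL line, ready to append to a dataset file
print(line)
```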

Key Properties

| Property | Type | Description |
|---|---|---|
| `input` | string or list | **Required.** User query or message list |
| `tools` | list | Tool definitions for function calling |
| `parallel_tool_calls` | bool | Allow parallel tool calls (default: `true`) |
| `temperature` | float | Sampling temperature |
| `max_output_tokens` | int | Maximum response tokens |
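
A row exercising the optional properties might look like the sketch below. The flattened function-tool shape follows the OpenAI Responses API; treat the tool itself (`get_weather`) as a made-up example:

```python
import json

# Sketch: one row using tools, sampling, and output-length controls
params = {
    "input": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    "parallel_tool_calls": False,
    "temperature": 0.7,
    "max_output_tokens": 512,
}
print(json.dumps({"responses_create_params": params}))
```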

Message Roles

| Role | Use |
|---|---|
| `user` | User queries |
| `assistant` | Model responses (multi-turn) |
| `developer` | System instructions (preferred) |
| `system` | System instructions (legacy) |
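
A multi-turn sample combining these roles could be sketched as follows (the conversation content is illustrative):

```python
import json

# Sketch: developer instructions plus prior assistant turns in one sample
params = {
    "input": [
        {"role": "developer", "content": "Answer concisely."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
        {"role": "user", "content": "And of Italy?"},
    ]
}
roles = {m["role"] for m in params["input"]}
# Every role must come from the table above
assert roles <= {"user", "assistant", "developer", "system"}
print(json.dumps({"responses_create_params": params}))
```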

Preprocess Raw Datasets

If your dataset doesn't include a `responses_create_params` field, preprocess it before running ng_prepare_data.

When to preprocess:

  • Downloaded datasets without NeMo Gym format
  • Custom data needing system prompts
  • Need to split into train/validation sets

Add responses_create_params

The responses_create_params field wraps your input in the Responses API format. This typically includes a system prompt and the user content.

Save this script as preprocess.py. It reads a raw JSONL file, adds responses_create_params, and splits into train/validation:

```python
import json
import os

# Configuration: customize these for your dataset
INPUT_FIELD = "problem"  # Field containing the input text (e.g., "problem", "question", "prompt")
FILENAME = "raw_data.jsonl"
SYSTEM_PROMPT = "Your task is to solve a math problem. Put the answer inside \\boxed{}."
TRAIN_RATIO = 0.999  # 99.9% train, 0.1% validation

dirpath = os.path.dirname(FILENAME) or "."
with open(FILENAME, "r", encoding="utf-8") as fin, \
        open(os.path.join(dirpath, "train.jsonl"), "w", encoding="utf-8") as ftrain, \
        open(os.path.join(dirpath, "validation.jsonl"), "w", encoding="utf-8") as fval:

    lines = list(fin)
    split_idx = int(len(lines) * TRAIN_RATIO)

    for i, line in enumerate(lines):
        if not line.strip():
            continue
        row = json.loads(line)

        # Remove fields not needed for training (optional)
        row.pop("generated_solution", None)
        row.pop("problem_source", None)

        # Add responses_create_params
        row["responses_create_params"] = {
            "input": [
                {"role": "developer", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row.get(INPUT_FIELD, "")},
            ]
        }

        out = json.dumps(row) + "\n"
        (ftrain if i < split_idx else fval).write(out)
```

You must customize these variables for your dataset:

  • INPUT_FIELD: The field name containing your input text. Common values: "problem" (math), "question" (QA), "prompt" (general), "instruction" (instruction-following)
  • SYSTEM_PROMPT: Task-specific instructions for the model
  • TRAIN_RATIO: Train/validation split ratio

Run and verify:

```shell
uv run preprocess.py
wc -l train.jsonl validation.jsonl
```

Create Config for Custom Data

After preprocessing, create a config file (for example, custom_data.yaml) to point ng_prepare_data at your local files.

```yaml
custom_resources_server:
  resources_servers:
    custom_server:
      entrypoint: app.py
      domain: math  # math | coding | agent | knowledge | other
      description: Custom math dataset
      verified: false

custom_simple_agent:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server:
        type: resources_servers
        name: custom_resources_server
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
        - name: train
          type: train
          jsonl_fpath: train.jsonl
          license: Creative Commons Attribution 4.0 International
        - name: validation
          type: validation
          jsonl_fpath: validation.jsonl
          license: Creative Commons Attribution 4.0 International
```

Run data preparation:

```shell
config_paths="custom_data.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" +mode=train_preparation +output_dirpath=data
```

This validates your data and adds the agent_ref field to each row, routing samples to your resources server.


Validation Modes

| Mode | Purpose | Validates |
|---|---|---|
| `example_validation` | PR submission | `example` datasets |
| `train_preparation` | Training prep | `train`, `validation` datasets |

Example Validation

```shell
ng_prepare_data "+config_paths=[resources_servers/example_multi_step/configs/example_multi_step.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]" \
  +output_dirpath=data/example_multi_step \
  +mode=example_validation
```

Training Preparation

```shell
ng_prepare_data "+config_paths=[resources_servers/workplace_assistant/configs/workplace_assistant.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]" \
  +output_dirpath=data/workplace_assistant \
  +mode=train_preparation \
  +should_download=true
```

CLI Parameters

| Parameter | Required | Description |
|---|---|---|
| `+config_paths` | Yes | YAML config paths |
| `+output_dirpath` | Yes | Output directory |
| `+mode` | Yes | `example_validation` or `train_preparation` |
| `+should_download` | No | Download missing datasets (default: `false`) |
| `+data_source` | No | `huggingface` (default) or `gitlab` |

Troubleshooting

| Issue | Symptom | Fix |
|---|---|---|
| Missing `responses_create_params` | Sample silently skipped | Add field with valid `input` |
| Invalid JSON | Sample skipped | Fix JSON syntax |
| Invalid role | Sample skipped | Use `user`, `assistant`, `system`, or `developer` |
| Missing dataset file | `AssertionError` | Create file or set `+should_download=true` |

Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data format. This script runs a quick local check:

```python
import json

def validate_sample(line: str) -> tuple[bool, str]:
    try:
        data = json.loads(line)
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON: {e}"

    if "responses_create_params" not in data:
        return False, "Missing 'responses_create_params'"

    if "input" not in data["responses_create_params"]:
        return False, "Missing 'input' in responses_create_params"

    return True, "OK"

with open("your_data.jsonl") as f:
    for i, line in enumerate(f, 1):
        valid, msg = validate_sample(line)
        if not valid:
            print(f"Line {i}: {msg}")
```
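
Invalid roles also cause silent skips. A similar check, using the role set from the Message Roles table above, can be sketched as:

```python
import json

VALID_ROLES = {"user", "assistant", "system", "developer"}

def check_roles(line: str) -> tuple[bool, str]:
    """Flag messages whose role would be rejected.

    Assumes the line already passed the structural checks above, so
    responses_create_params.input is present.
    """
    messages = json.loads(line)["responses_create_params"]["input"]
    if isinstance(messages, str):
        return True, "OK"  # plain-string input has no roles to check
    for m in messages:
        if m.get("role") not in VALID_ROLES:
            return False, f"Invalid role: {m.get('role')!r}"
    return True, "OK"

ok, msg = check_roles(
    '{"responses_create_params": {"input": [{"role": "operator", "content": "hi"}]}}'
)
assert not ok and "operator" in msg
```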

Validation Process

ng_prepare_data performs these steps:

  1. Load configs — Parse server configs, identify datasets
  2. Check files — Verify dataset files exist
  3. Validate samples — Parse each line, validate against schema
  4. Compute metrics — Aggregate statistics
  5. Collate — Combine samples with agent references
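
Steps 3-4 above can be sketched in miniature (a simplified model of the behavior, not the actual ng_prepare_data implementation):

```python
import json

def prepare(lines):
    """Validate each line and count valid samples (sketch of steps 3-4)."""
    valid = []
    for line in lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # invalid JSON: silently skipped
        if "input" not in row.get("responses_create_params", {}):
            continue  # missing required field: silently skipped
        valid.append(row)
    metrics = {"Number of examples": len(valid)}
    return valid, metrics

lines = [
    '{"responses_create_params": {"input": "ok"}}',
    'not json',
    '{"no_params": true}',
]
valid, metrics = prepare(lines)
print(metrics)  # only the first line survives
```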

Output Locations

Metrics files are written to two locations:

  • Per-dataset: {dataset_jsonl_path}_metrics.json — alongside each source JSONL file
  • Aggregated: {output_dirpath}/{type}_metrics.json — combined metrics per dataset type

Re-Running

  • Output files (train.jsonl, validation.jsonl) are overwritten in output_dirpath
  • Metrics files (*_metrics.json) are compared — delete them if your data changed
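
Deleting stale metrics files can be done with a short sweep like the sketch below, demonstrated in a scratch directory (point `root` at your repo in practice):

```python
import tempfile
from pathlib import Path

# Scratch directory standing in for a repo with one dataset + stale metrics
root = Path(tempfile.mkdtemp())
(root / "train.jsonl").touch()
(root / "train_metrics.json").touch()

# Remove every *_metrics.json so ng_prepare_data recomputes them
for stale in root.rglob("*_metrics.json"):
    stale.unlink()

print(sorted(p.name for p in root.iterdir()))  # ['train.jsonl']
```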

Generated Metrics

| Metric | Description |
|---|---|
| Number of examples | Valid sample count |
| Number of tools | Tool count stats (avg/min/max/stddev) |
| Number of turns | User messages per sample |
| Temperature | Temperature parameter stats |

Example metrics file:

```json
{
  "name": "example",
  "type": "example",
  "jsonl_fpath": "resources_servers/example_multi_step/data/example.jsonl",
  "Number of examples": 5,
  "Number of tools": {
    "Total # non-null values": 5,
    "Average": 2.0,
    "Min": 2.0,
    "Max": 2.0
  }
}
```

Dataset Configuration

Define datasets in your server’s YAML config:

```yaml
datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/my_server/data/train.jsonl
    license: Apache 2.0
  - name: validation
    type: validation
    jsonl_fpath: resources_servers/my_server/data/validation.jsonl
    license: Apache 2.0
  - name: example
    type: example
    jsonl_fpath: resources_servers/my_server/data/example.jsonl
```

| Type | Purpose | Required for |
|---|---|---|
| `example` | Small sample (~5 rows) for format checks | PR submission |
| `train` | Training data | RL training |
| `validation` | Evaluation during training | RL training |

Next Steps