Prepare and Validate Data

Format and validate JSONL datasets for NeMo Gym training using ng_prepare_data.

Goal: Validate data format and prepare datasets for training.

Time: ~15 minutes

In this guide, you will:

Validate datasets with ng_prepare_data
Generate training and validation splits
Understand the JSONL data format

Prerequisites:

NeMo Gym installed (Detailed Setup Guide)

Quick Start

From the repository root:

ng_prepare_data \
    "+config_paths=[resources_servers/example_multi_step/configs/example_multi_step.yaml]" \
    +output_dirpath=data/test \
    +mode=example_validation

Success output:

####################################################################################################
#
# Finished!
#
####################################################################################################

This generates two types of output:

Per-dataset metrics: resources_servers/example_multi_step/data/example_metrics.json (alongside source JSONL)
Aggregated metrics: data/test/example_metrics.json (in output directory)

Data Format

NeMo Gym uses JSONL files. Each line requires a responses_create_params field following the OpenAI Responses API schema.

Minimal Format

{"responses_create_params": {"input": [{"role": "user", "content": "What is 2+2?"}]}}

With Verification Fields

Most resources servers add fields for reward computation:

{
  "responses_create_params": {
    "input": [{"role": "user", "content": "What is 15 * 7? Put your answer in \\boxed{}."}]
  },
  "question": "What is 15 * 7?",
  "expected_answer": "105"
}
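
Rows in this shape could be generated from Python along the following lines. This is only a sketch: the helper names `make_row` and `write_jsonl` are illustrative and not part of NeMo Gym, and the `question` and `expected_answer` fields simply mirror the example above, so they may differ for your resources server.

```python
import json


def make_row(question: str, expected_answer: str) -> dict:
    """Build one dataset row (hypothetical helper, not a NeMo Gym API)."""
    # The doubled braces produce a literal "{}" inside the f-string.
    prompt = f"{question} Put your answer in \\boxed{{}}."
    return {
        "responses_create_params": {
            "input": [{"role": "user", "content": prompt}]
        },
        "question": question,
        "expected_answer": expected_answer,
    }


def write_jsonl(rows: list, path: str) -> None:
    """Write one compact JSON object per line, as JSONL requires."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


rows = [make_row("What is 15 * 7?", "105")]
```

Calling `write_jsonl(rows, "example.jsonl")` would then produce a file with one JSON object per line.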

Tip

Check resources_servers/<name>/README.md for required fields specific to each resources server.

Key Properties

Property	Type	Description
`input`	string or list	Required. User query or message list
`tools`	list	Tool definitions for function calling
`parallel_tool_calls`	bool	Allow parallel tool calls (default: `true`)
`temperature`	float	Sampling temperature
`max_output_tokens`	int	Maximum response tokens
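
To illustrate how these properties fit together, here is a sketch that builds a `responses_create_params` object with a single function tool. The tool (`get_weather`), the prompt text, and the parameter values are invented for illustration; the tool layout follows the function-tool shape of the OpenAI Responses API.

```python
import json

# Function tool definition in the Responses API style.
# The tool itself (get_weather) is a made-up example.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

responses_create_params = {
    "input": [
        {"role": "developer", "content": "Use tools when they help."},
        {"role": "user", "content": "What is the weather in Paris?"},
    ],
    "tools": [weather_tool],
    "parallel_tool_calls": False,  # override the default of true
    "temperature": 0.7,
    "max_output_tokens": 512,
}

# One dataset row serialized as a single JSONL line.
line = json.dumps({"responses_create_params": responses_create_params})
```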

Message Roles

Role	Use
`user`	User queries
`assistant`	Model responses (multi-turn)
`developer`	System instructions (preferred)
`system`	System instructions (legacy)

Preprocess Raw Datasets

If your dataset doesn’t have responses_create_params, you need to preprocess it before using ng_prepare_data.

When to preprocess:

Downloaded datasets that are not in NeMo Gym format
Custom data that needs system prompts
Data that must be split into train/validation sets

Add `responses_create_params`

The responses_create_params field wraps your input in the Responses API format. This typically includes a system prompt and the user content.
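
A minimal sketch of what such a preprocessing script might look like. It assumes a raw file named `raw.jsonl` whose records contain a `question` field; the filename, the field name, the placeholder system prompt, and the 90/10 split are all hypothetical and should be adapted to your dataset.

```python
import json
import random
from pathlib import Path

SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder prompt


def to_gym_row(raw: dict) -> dict:
    """Wrap a raw record in responses_create_params, keeping its other fields."""
    return {
        **raw,
        "responses_create_params": {
            "input": [
                {"role": "developer", "content": SYSTEM_PROMPT},
                {"role": "user", "content": raw["question"]},  # assumed field
            ]
        },
    }


def split_rows(rows, validation_fraction=0.1, seed=0):
    """Shuffle deterministically, then split into (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def write_jsonl(rows, path):
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def preprocess(raw_path, out_dir="."):
    """Convert raw_path and write train.jsonl and validation.jsonl to out_dir."""
    with open(raw_path, encoding="utf-8") as f:
        rows = [to_gym_row(json.loads(line)) for line in f if line.strip()]
    train, validation = split_rows(rows)
    write_jsonl(train, Path(out_dir) / "train.jsonl")
    write_jsonl(validation, Path(out_dir) / "validation.jsonl")
    return len(train), len(validation)


if __name__ == "__main__":
    raw = Path("raw.jsonl")  # hypothetical input path
    if raw.exists():
        print(preprocess(raw))
    else:
        print(f"{raw} not found; nothing to do")
```

Placing the original fields alongside `responses_create_params` keeps any verification fields (such as an expected answer) available to the resources server.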

Run your preprocessing script (here named preprocess.py) and verify the line counts of the output files:

uv run preprocess.py
wc -l train.jsonl validation.jsonl

Create Config for Custom Data

After preprocessing, create a config file (custom_data.yaml in the command below) whose dataset entries point ng_prepare_data at your local files.

Run data preparation:

config_paths="responses_api_models/vllm_model/configs/vllm_model_for_training.yaml,custom_data.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" +mode=train_preparation +output_dirpath=data

This validates your data and adds the agent_ref field to each row, which routes samples to your resources server.

Validation Modes

Mode	Purpose	Validates
`example_validation`	PR submission	`example` datasets
`train_preparation`	Training prep	`train`, `validation` datasets

Example Validation

ng_prepare_data "+config_paths=[resources_servers/example_multi_step/configs/example_multi_step.yaml]" \
    +output_dirpath=data/example_multi_step \
    +mode=example_validation

Training Preparation

ng_prepare_data "+config_paths=[resources_servers/workplace_assistant/configs/workplace_assistant.yaml]" \
    +output_dirpath=data/workplace_assistant \
    +mode=train_preparation \
    +should_download=true

CLI Parameters

Parameter	Required	Description
`+config_paths`	Yes	YAML config paths
`+output_dirpath`	Yes	Output directory
`+mode`	Yes	`example_validation` or `train_preparation`
`+should_download`	No	Download missing datasets (default: `false`)
`+data_source`	No	`huggingface` (default) or `gitlab`

Troubleshooting

Issue	Symptom	Fix
Missing `responses_create_params`	Sample silently skipped	Add field with valid `input`
Invalid JSON	Sample skipped	Fix JSON syntax
Invalid role	Sample skipped	Use `user`, `assistant`, `system`, or `developer`
Missing dataset file	`AssertionError`	Create file or set `+should_download=true`

Warning

Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data format.
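
Because skipped samples produce no error, it can help to run a quick local check before ng_prepare_data. The sketch below only approximates the issues listed in the table above (invalid JSON, a missing `responses_create_params`, invalid roles); it is not the validation logic ng_prepare_data itself uses, which checks against the full schema. The function names are hypothetical.

```python
import json
from typing import Optional

VALID_ROLES = {"user", "assistant", "system", "developer"}


def check_line(line: str) -> Optional[str]:
    """Return a reason the line looks invalid, or None if it passes."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError:
        return "invalid JSON"
    if not isinstance(sample, dict):
        return "not a JSON object"
    params = sample.get("responses_create_params")
    if not isinstance(params, dict):
        return "missing responses_create_params"
    messages = params.get("input")
    if isinstance(messages, str):
        return None  # a plain string input is allowed
    if not isinstance(messages, list):
        return "missing or invalid input"
    for message in messages:
        # Only items that carry a role are checked; other item types pass.
        role = message.get("role") if isinstance(message, dict) else None
        if role is not None and role not in VALID_ROLES:
            return f"invalid role: {role!r}"
    return None


def report(path: str) -> dict:
    """Map 1-based line numbers to the reason each line would likely be skipped."""
    problems = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            reason = check_line(line)
            if reason:
                problems[lineno] = reason
    return problems
```

An empty result from `report("train.jsonl")` means no line failed these basic checks; it does not guarantee that every sample passes full schema validation.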

Validation Process

ng_prepare_data performs these steps:

1. Load configs: Parse server configs, identify datasets
2. Check files: Verify dataset files exist
3. Validate samples: Parse each line, validate against schema
4. Compute metrics: Aggregate statistics
5. Collate: Combine samples with agent references

Output Locations

Metrics files are written to two locations:

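Per-dataset: {dataset_jsonl_path}_metrics.json, written alongside each source JSONL file with the .jsonl extension dropped (for example, data/example.jsonl produces data/example_metrics.json)
Aggregated: {output_dirpath}/{type}_metrics.json, combined metrics per dataset type (example, train, or validation)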

Re-Running

Output files (train.jsonl, validation.jsonl) in output_dirpath are overwritten on each run.
Newly computed metrics are compared against any existing metrics files (*_metrics.json); if your data has changed, delete the old metrics files before re-running.
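
One way to clear old metrics files is a small helper like the following sketch. The function name is hypothetical, it only looks at files directly inside the directories you pass, and it defaults to a dry run so that nothing is deleted until you have reviewed the matches.

```python
from pathlib import Path


def remove_stale_metrics(*dirs: str, dry_run: bool = True) -> list:
    """Find *_metrics.json files directly inside each directory.

    With dry_run=True (the default) nothing is deleted; the matches are
    only returned so they can be reviewed first.
    """
    matches = []
    for d in dirs:
        matches.extend(sorted(Path(d).glob("*_metrics.json")))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

For example, `remove_stale_metrics("resources_servers/my_server/data", "data")` lists the per-dataset and aggregated metrics files for a server whose data lives in those directories; passing `dry_run=False` removes them.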

Generated Metrics

Metric	Description
Number of examples	Valid sample count
Number of tools	Tool count stats (avg/min/max/stddev)
Number of turns	User messages per sample
Temperature	Temperature parameter stats
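
To sanity-check the reported numbers, comparable statistics can be computed locally. The sketch below is an approximation under stated assumptions: the dictionary keys reuse the labels from the table above rather than the actual keys in the metrics JSON, a plain string `input` is counted as one user turn, and population standard deviation is used because the exact formula ng_prepare_data applies is not specified here.

```python
import statistics


def count_user_turns(sample: dict) -> int:
    """Count user messages in one sample."""
    messages = sample["responses_create_params"]["input"]
    if isinstance(messages, str):
        return 1  # assumption: a plain string input is one user turn
    return sum(
        1 for m in messages if isinstance(m, dict) and m.get("role") == "user"
    )


def summarize(values: list) -> dict:
    """avg/min/max/stddev over a non-empty list of numbers."""
    return {
        "avg": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "stddev": statistics.pstdev(values),  # population stddev (assumption)
    }


def dataset_stats(samples: list) -> dict:
    """Approximate the table's metrics for a non-empty list of parsed samples."""
    tool_counts = [
        len(s["responses_create_params"].get("tools") or []) for s in samples
    ]
    return {
        "Number of examples": len(samples),
        "Number of tools": summarize(tool_counts),
        "Number of turns": summarize([count_user_turns(s) for s in samples]),
    }
```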

Dataset Configuration

Define datasets in your server’s YAML config:

datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/my_server/data/train.jsonl
    license: Apache 2.0
  - name: validation
    type: validation
    jsonl_fpath: resources_servers/my_server/data/validation.jsonl
    license: Apache 2.0
  - name: example
    type: example
    jsonl_fpath: resources_servers/my_server/data/example.jsonl

Type	Purpose	Required for
`example`	Small sample (~5 rows) for format checks	PR submission
`train`	Training data	RL training
`validation`	Evaluation during training	RL training

Next Steps

Collect Rollouts: Generate training examples by running your agent on prepared data. See Rollout Collection.

NeMo RL Integration: Use validated data with NeMo RL for GRPO training. See RL Training with NeMo RL using GRPO.