Format and validate JSONL datasets for NeMo Gym training using ng_prepare_data.
Goal: Validate data format and prepare datasets for training.
Time: ~15 minutes
In this guide, you will:
ng_prepare_dataPrerequisites:
policy_base_url, policy_api_key, and policy_model_name set in env.yamlFrom the repository root:
Success output:
This generates two types of output:
resources_servers/example_multi_step/data/example_metrics.json (alongside source JSONL)data/test/example_metrics.json (in output directory)NeMo Gym uses JSONL files. Each line requires a responses_create_params field following the OpenAI Responses API schema.
Most resources servers add fields for reward computation:
Check resources_servers/<name>/README.md for required fields specific to each resources server.
If your dataset doesn’t have responses_create_params, you need to preprocess it before using ng_prepare_data.
When to preprocess:
responses_create_paramsThe responses_create_params field wraps your input in the Responses API format. This typically includes a system prompt and the user content.
Save this script as preprocess.py. It reads a raw JSONL file, adds responses_create_params, and splits into train/validation:
You must customize these variables for your dataset:
INPUT_FIELD: The field name containing your input text. Common values: "problem" (math), "question" (QA), "prompt" (general), "instruction" (instruction-following)SYSTEM_PROMPT: Task-specific instructions for the modelTRAIN_RATIO: Train/validation split ratioRun and verify:
After preprocessing, create a config file to point ng_prepare_data at your local files.
Run data preparation:
This validates your data and adds the agent_ref field to each row, routing samples to your resources server.
Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data format.
ng_prepare_data performs these steps:
Metrics files are written to two locations:
{dataset_jsonl_path}_metrics.json — alongside each source JSONL file{output_dirpath}/{type}_metrics.json — combined metrics per dataset typetrain.jsonl, validation.jsonl) are overwritten in output_dirpath*_metrics.json) are compared — delete them if your data changedDefine datasets in your server’s YAML config: