Prepare Data

NeMo Gym datasets use JSONL format for reinforcement learning (RL) training. Each dataset is tied to an agent server, which orchestrates agent-environment interactions and routes requests to a resources server, which provides tools and computes rewards.

Prerequisites

  • NeMo Gym installed: See Installation
  • Repository cloned (for built-in datasets):
    $ git clone https://github.com/NVIDIA-NeMo/Gym.git
    $ cd Gym

NeMo Gym uses OpenAI-compatible schemas for model server compatibility. No OpenAI account required—local servers like vLLM use the same format.

Data Format

Each JSONL line requires a responses_create_params field following the OpenAI Responses API schema:

{"responses_create_params": {"input": [{"role": "user", "content": "What is 2+2?"}]}}

Additional fields like expected_answer vary by resources server—the component that provides tools and reward signals.

Required Fields

| Field | Added By | Description |
| --- | --- | --- |
| responses_create_params | User | Input to the model during training. Contains input (messages) and optional tools, temperature, etc. |
| agent_ref | gym dataset collate | Routes each row to its agent server. Auto-generated during data preparation. |

Optional Fields

| Field | Description |
| --- | --- |
| expected_answer | Ground truth for verification (task-specific). |
| question | Original question text (for reference). |
| id | Tracking identifier. |

Check resources_servers/<name>/README.md for fields required by each resources server’s verify() method.

The agent_ref Field

The agent_ref field maps each row to a specific agent server, which in turn knows its resources server from the YAML config. A training dataset can blend multiple agent servers in a single file—agent_ref tells NeMo Gym which server handles each row.

{
  "responses_create_params": {"input": [{"role": "user", "content": "..."}]},
  "agent_ref": {"type": "responses_api_agents", "name": "math_with_judge_simple_agent"}
}

You don’t create agent_ref manually. The gym dataset collate tool adds it automatically based on your config file. The tool matches the agent type (responses_api_agents) with the agent name from the config.
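To see how a blended file is split across agent servers, the rows can be tallied by their agent_ref. The following is a minimal sketch, not part of NeMo Gym: count_rows_per_agent is a hypothetical helper, and it assumes each row's agent_ref has the type and name keys shown above. Rows without agent_ref (for example, a file that has not yet been through gym dataset collate) are counted under None.

```python
import json
from collections import Counter
from typing import Iterable


def count_rows_per_agent(lines: Iterable[str]) -> Counter:
    """Tally JSONL rows by (agent type, agent name).

    Hypothetical helper for inspection only; not a NeMo Gym API.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        ref = json.loads(line).get("agent_ref")
        # Rows that have not been collated yet carry no agent_ref.
        key = (ref["type"], ref["name"]) if ref else None
        counts[key] += 1
    return counts
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so the file is read one line at a time.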

Example Data

{"responses_create_params": {"input": [{"role": "user", "content": "What is 2+2?"}]}, "expected_answer": "4"}
{"responses_create_params": {"input": [{"role": "user", "content": "What is 3*5?"}]}, "expected_answer": "15"}
{"responses_create_params": {"input": [{"role": "user", "content": "What is 10/2?"}]}, "expected_answer": "5"}
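Rows like these can also be generated from a script. The sketch below is illustrative only: make_row and write_jsonl are hypothetical helper names, not NeMo Gym functions, and the expected_answer field is included on the assumption that the target resources server uses it.

```python
from __future__ import annotations

import json
from pathlib import Path


def make_row(question: str, expected_answer: str) -> dict:
    """Build one dataset row in the format shown above (hypothetical helper)."""
    return {
        "responses_create_params": {
            "input": [{"role": "user", "content": question}]
        },
        "expected_answer": expected_answer,
    }


def write_jsonl(rows: list[dict], path: str) -> int:
    """Write one JSON object per line; return the number of rows written."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for row in rows:
            # json.dumps emits a single line, which is what JSONL requires.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)
```

For instance, write_jsonl([make_row("What is 2+2?", "4")], path) writes a one-row file to whatever path you choose; agent_ref is left out on purpose because gym dataset collate adds it.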

Quick Start

Run this command from the repository root:

$ gym dataset collate \
    --config responses_api_models/vllm_model/configs/vllm_model_for_training.yaml \
    --resources-server example_multi_step \
    --output-dir data/test \
    --mode example_validation

On success, the command prints a Finished! message and creates data/test/example_metrics.json.

Dataset Types

| Type | Purpose | License |
| --- | --- | --- |
| example | Testing and development | Not required |
| train | RL training data | Required |
| validation | Evaluation during training | Required |

Configuration

Define datasets in your agent server’s YAML config:

datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/workplace_assistant/data/train.jsonl
    # Unified dataset source. `type` selects the backend; the other fields are backend-specific.
    source:
      type: huggingface
      repo_id: nvidia/Nemotron-RL-agent-workplace_assistant
      artifact_fpath: train.jsonl
    license: Apache 2.0

| Field | Required | Description |
| --- | --- | --- |
| name | Yes | Dataset identifier |
| type | Yes | example, train, or validation |
| jsonl_fpath | Yes | Path to data file |
| license | Train/validation | See valid values below |
| source | No | Where to fetch the data from when it’s missing locally (see below) |
| num_repeats | No | Repeat count (default: 1) |

Dataset source

source is the unified way to declare where a dataset is fetched from. type selects the backend; the remaining fields are backend-specific:

# Hugging Face Hub
source:
  type: huggingface
  repo_id: nvidia/Nemotron-RL-agent-workplace_assistant
  artifact_fpath: train.jsonl

# GitLab dataset registry
source:
  type: gitlab
  dataset_name: example_multi_step
  version: 0.0.1
  artifact_fpath: train.jsonl

The legacy huggingface_identifier: / gitlab_identifier: blocks still work but emit a deprecation warning, so existing configs keep running. New configs should use source:.

Valid Licenses

Apache 2.0 · MIT · GNU General Public License v3.0 · Creative Commons Attribution 4.0 International · Creative Commons Attribution-ShareAlike 4.0 International · TBD · NVIDIA Internal Use Only, Do Not Distribute
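The dataset-entry rules on this page (required fields, allowed types, and the license list) can be checked before running the CLI. The following is a rough sketch under stated assumptions: dataset_entry_problems is a hypothetical helper, it takes an already-parsed config entry as a plain dict, and it assumes license names must match the list above exactly, including case. The real tool's checks and messages may differ.

```python
from typing import Any, Dict, List

VALID_TYPES = {"example", "train", "validation"}

# Copied from the license list on this page.
VALID_LICENSES = {
    "Apache 2.0",
    "MIT",
    "GNU General Public License v3.0",
    "Creative Commons Attribution 4.0 International",
    "Creative Commons Attribution-ShareAlike 4.0 International",
    "TBD",
    "NVIDIA Internal Use Only, Do Not Distribute",
}


def dataset_entry_problems(entry: Dict[str, Any]) -> List[str]:
    """Return human-readable problems for one dataset entry (hypothetical check)."""
    problems: List[str] = []
    for field in ("name", "type", "jsonl_fpath"):
        if not entry.get(field):
            problems.append(f"missing required field: {field}")

    dataset_type = entry.get("type")
    if dataset_type is not None and dataset_type not in VALID_TYPES:
        problems.append(f"unknown type: {dataset_type!r}")

    license_name = entry.get("license")
    if license_name is None:
        # Only train and validation datasets need a license.
        if dataset_type in ("train", "validation"):
            problems.append("A license is required")
    elif license_name not in VALID_LICENSES:
        # Assumption: exact, case-sensitive match against the list.
        problems.append(f"unrecognized license: {license_name!r}")
    return problems
```

An empty list means the entry passed these particular checks; it does not guarantee that gym dataset collate will accept the config.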

Workflow

Validation Modes

| Mode | Scope | Use Case |
| --- | --- | --- |
| example_validation | example datasets | Format check before contributing |
| train_preparation | train + validation | Full prep for RL training |

To prepare training data with auto-download:

$ gym dataset collate \
    --config responses_api_models/vllm_model/configs/vllm_model_for_training.yaml \
    --resources-server workplace_assistant \
    --output-dir data/workplace_assistant \
    --mode train_preparation \
    --download

HuggingFace downloads require authentication. Set hf_token in env.yaml or export HF_TOKEN.
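A wrapper script can fail early when the token is missing instead of waiting for the download to be rejected. This is a small sketch with a hypothetical helper name, find_hf_token; it inspects only the HF_TOKEN environment variable and does not read env.yaml, so a token configured there will not be seen.

```python
import os
from typing import Mapping, Optional


def find_hf_token(environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return HF_TOKEN from the environment, or None if it is unset or blank.

    Hypothetical pre-check; it does not look at env.yaml.
    """
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN", "").strip()
    return token or None
```

Calling find_hf_token() with no arguments checks the current process environment; passing a mapping makes the function easy to test.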

Common Errors

| Error | Cause | Fix |
| --- | --- | --- |
| JSON parse error at line N | Invalid JSON | Check quotes, commas, brackets at line N |
| ValidationError: responses_create_params | Missing field | Add responses_create_params.input |
| A license is required | Missing license | Add license to dataset config |
| Missing local datasets | File not found | Check path or add --download |
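The first two errors in the table can be caught with a quick local pass over the file. The sketch below is not the NeMo Gym validator: find_row_problems is a hypothetical helper, its messages are its own, and it checks only that each line is valid JSON and contains responses_create_params.input. Skipping blank lines is a choice made here, not documented behavior.

```python
import json
from typing import Iterable, List, Tuple


def find_row_problems(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, message) pairs for rows that fail basic checks."""
    problems: List[Tuple[int, str]] = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        params = row.get("responses_create_params") if isinstance(row, dict) else None
        if not isinstance(params, dict):
            problems.append((line_no, "missing responses_create_params"))
        elif "input" not in params:
            problems.append((line_no, "missing responses_create_params.input"))
    return problems
```

Passing an open file handle processes the data one line at a time, so the check stays cheap on large files.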

Guides

CLI Commands

| Command | Description |
| --- | --- |
| gym dataset collate | Validate and generate metrics |
| gym dataset download | Download from HuggingFace |

See CLI Commands for details.

Large Datasets

  • Validation streams line-by-line (memory-efficient)
  • Single-threaded; >100K samples may take minutes
  • Use num_repeats instead of duplicating JSONL lines