# Prepare and Validate

Format and validate JSONL datasets for NeMo Gym training using `ng_prepare_data`.

**Goal**: Validate data format and prepare datasets for training.

**Time**: \~15 minutes

**In this guide, you will**:

1. Validate datasets with `ng_prepare_data`
2. Generate training and validation splits
3. Understand the JSONL data format

**Prerequisites**:

* NeMo Gym installed ([Installation](/get-started/installation))
* `policy_base_url`, `policy_api_key`, and `policy_model_name` set in `env.yaml`

***

## Quick Start

From the repository root:

```bash
config_paths="resources_servers/example_multi_step/configs/example_multi_step.yaml,\
responses_api_models/openai_model/configs/openai_model.yaml"
ng_prepare_data \
    "+config_paths=[$config_paths]" \
    +output_dirpath=data/test \
    +mode=example_validation
```

Success output:

```text
####################################################################################################
#
# Finished!
#
####################################################################################################
```

This generates two types of output:

* **Per-dataset metrics**: `resources_servers/example_multi_step/data/example_metrics.json` (alongside source JSONL)
* **Aggregated metrics**: `data/test/example_metrics.json` (in output directory)

***

## Data Format

NeMo Gym uses JSONL files. Each line requires a `responses_create_params` field following the [OpenAI Responses API schema](https://platform.openai.com/docs/api-reference/responses/create).

### Minimal Format

```json
{"responses_create_params": {"input": [{"role": "user", "content": "What is 2+2?"}]}}
```

### With Verification Fields

Most resources servers add fields for reward computation:

```json
{
  "responses_create_params": {
    "input": [{"role": "user", "content": "What is 15 * 7? Put your answer in \\boxed{}."}]
  },
  "question": "What is 15 * 7?",
  "expected_answer": "105"
}
```

Check `resources_servers/<name>/README.md` for required fields specific to each resources server.
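
As a rough sketch of how such rows could be produced programmatically, the snippet below builds one row from a question/answer pair and serializes it as a single JSONL line. The helper name `make_row` and the prompt suffix are illustrative assumptions, not part of NeMo Gym; the fields your resources server actually expects are listed in its README.

```python
import json


def make_row(question: str, expected_answer: str) -> dict:
    """Build one dataset row (hypothetical helper, not a NeMo Gym API)."""
    prompt = f"{question} Put your answer in \\boxed{{}}."
    return {
        "responses_create_params": {
            "input": [{"role": "user", "content": prompt}]
        },
        # Extra top-level fields, mirroring the example above.
        "question": question,
        "expected_answer": expected_answer,
    }


row = make_row("What is 15 * 7?", "105")
line = json.dumps(row)  # one JSON object per line, no embedded newlines
print(line)
```

Because `json.dumps` escapes newlines inside strings, each row stays on a single line even when the prompt text itself spans several lines.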

### Key Properties

| Property              | Type           | Description                                 |
| --------------------- | -------------- | ------------------------------------------- |
| `input`               | string or list | **Required.** User query or message list    |
| `tools`               | list           | Tool definitions for function calling       |
| `parallel_tool_calls` | bool           | Allow parallel tool calls (default: `true`) |
| `temperature`         | float          | Sampling temperature                        |
| `max_output_tokens`   | int            | Maximum response tokens                     |
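
To illustrate how these properties sit together, here is a sketch of one row that sets each of them. The `get_weather` tool, its parameters, and the specific values are made up for illustration; the tool layout follows the function-tool format of the OpenAI Responses API.

```python
import json

# Hypothetical tool definition, used only for illustration.
get_weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

row = {
    "responses_create_params": {
        "input": "What is the weather in Paris?",  # string form of `input`
        "tools": [get_weather_tool],
        "parallel_tool_calls": False,  # override the default of true
        "temperature": 0.7,
        "max_output_tokens": 512,
    }
}

print(json.dumps(row))
```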

### Message Roles

| Role        | Use                             |
| ----------- | ------------------------------- |
| `user`      | User queries                    |
| `assistant` | Model responses (multi-turn)    |
| `developer` | System instructions (preferred) |
| `system`    | System instructions (legacy)    |
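
The sketch below shows a multi-turn `input` that uses three of these roles, plus a small helper that counts the user messages in a row. The helper `count_user_turns` and the conversation content are illustrative assumptions.

```python
def count_user_turns(row: dict) -> int:
    """Count messages with role "user" in a row's input (hypothetical helper)."""
    items = row["responses_create_params"]["input"]
    if isinstance(items, str):
        return 1  # a bare string input is treated here as a single user query
    return sum(
        1 for m in items if isinstance(m, dict) and m.get("role") == "user"
    )


row = {
    "responses_create_params": {
        "input": [
            {"role": "developer", "content": "You are a concise math tutor."},
            {"role": "user", "content": "What is 2+2?"},
            {"role": "assistant", "content": "4"},
            {"role": "user", "content": "And 3+3?"},
        ]
    }
}

print(count_user_turns(row))
```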

***

## Preprocess Raw Datasets

If your dataset doesn't have `responses_create_params`, you need to preprocess it before using `ng_prepare_data`.

**When to preprocess**:

* Downloaded datasets that are not in NeMo Gym format
* Custom data that needs a system prompt
* Data that must be split into train/validation sets

### Add `responses_create_params`

The `responses_create_params` field wraps your input in the Responses API format. This typically includes a system prompt and the user content.

Save this script as `preprocess.py`. It reads a raw JSONL file, adds `responses_create_params`, and splits into train/validation:

```python
import json
import os

# Configuration: customize these for your dataset
INPUT_FIELD = "problem"  # Field containing the input text (e.g., "problem", "question", "prompt")
FILENAME = "raw_data.jsonl"
SYSTEM_PROMPT = "Your task is to solve a math problem. Put the answer inside \\boxed{}."
TRAIN_RATIO = 0.999  # 99.9% train, 0.1% validation

dirpath = os.path.dirname(FILENAME) or "."

# Read all rows up front, dropping blank lines so they don't skew the split index
with open(FILENAME, "r", encoding="utf-8") as fin:
    lines = [line for line in fin if line.strip()]
split_idx = int(len(lines) * TRAIN_RATIO)

with open(os.path.join(dirpath, "train.jsonl"), "w", encoding="utf-8") as ftrain, \
    open(os.path.join(dirpath, "validation.jsonl"), "w", encoding="utf-8") as fval:

    for i, line in enumerate(lines):
        row = json.loads(line)

        # Remove fields not needed for training (optional)
        row.pop("generated_solution", None)
        row.pop("problem_source", None)

        # Add responses_create_params
        row["responses_create_params"] = {
            "input": [
                {"role": "developer", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row.get(INPUT_FIELD, "")},
            ]
        }

        # The first split_idx rows go to train, the rest to validation
        out = json.dumps(row) + "\n"
        (ftrain if i < split_idx else fval).write(out)
```

You must customize these variables for your dataset:

* `INPUT_FIELD`: The field name containing your input text. Common values: `"problem"` (math), `"question"` (QA), `"prompt"` (general), `"instruction"` (instruction-following)
* `SYSTEM_PROMPT`: Task-specific instructions for the model
* `TRAIN_RATIO`: Train/validation split ratio
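
To see how `TRAIN_RATIO` translates into row counts, the sketch below reproduces the script's `int(len(lines) * TRAIN_RATIO)` arithmetic. The function name `split_sizes` is an illustrative assumption.

```python
def split_sizes(num_rows: int, train_ratio: float) -> tuple[int, int]:
    """Return (train_rows, validation_rows) using the same truncation as the script."""
    split_idx = int(num_rows * train_ratio)  # int() truncates toward zero
    return split_idx, num_rows - split_idx


for n in (10, 100, 2500):
    train, val = split_sizes(n, 0.999)
    print(f"{n} rows -> {train} train, {val} validation")
```

With a ratio of 0.999, small datasets end up with only a handful of validation rows, so consider lowering `TRAIN_RATIO` if you need a larger validation set.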

Run and verify:

```bash
uv run preprocess.py
wc -l train.jsonl validation.jsonl
```

### Create Config for Custom Data

After preprocessing, create a config file (saved here as `custom_data.yaml`) that points `ng_prepare_data` at your local files.

```yaml
custom_resources_server:
  resources_servers:
    custom_server:
      entrypoint: app.py
      domain: math  # math | coding | agent | knowledge | other
      description: Custom math dataset
      verified: false

custom_simple_agent:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server:
        type: resources_servers
        name: custom_resources_server
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
      - name: train
        type: train
        jsonl_fpath: train.jsonl
        license: Creative Commons Attribution 4.0 International
      - name: validation
        type: validation
        jsonl_fpath: validation.jsonl
        license: Creative Commons Attribution 4.0 International
```

Run data preparation:

```bash
config_paths="custom_data.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" +mode=train_preparation +output_dirpath=data
```

This validates your data and adds the `agent_ref` field to each row, routing samples to your resources server.
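
A quick way to confirm that the prepared rows carry `agent_ref` is sketched below. The helper `rows_missing_field` is hypothetical, and the path `data/train.jsonl` is an assumption based on `+output_dirpath=data`; the sketch only checks that the key is present and makes no assumption about its contents.

```python
import json


def rows_missing_field(path: str, field: str = "agent_ref") -> list[int]:
    """Return 1-based line numbers of non-blank rows that lack `field`."""
    missing = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if line.strip() and field not in json.loads(line):
                missing.append(lineno)
    return missing


# Example usage (assumed output path):
# print(rows_missing_field("data/train.jsonl"))
```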

***

## Validation Modes

| Mode                 | Purpose       | Validates                      |
| -------------------- | ------------- | ------------------------------ |
| `example_validation` | PR submission | `example` datasets             |
| `train_preparation`  | Training prep | `train`, `validation` datasets |

### Example Validation

```bash
ng_prepare_data "+config_paths=[resources_servers/example_multi_step/configs/example_multi_step.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]" \
    +output_dirpath=data/example_multi_step \
    +mode=example_validation
```

### Training Preparation

```bash
ng_prepare_data "+config_paths=[resources_servers/workplace_assistant/configs/workplace_assistant.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]" \
    +output_dirpath=data/workplace_assistant \
    +mode=train_preparation \
    +should_download=true
```

### CLI Parameters

| Parameter          | Required | Description                                  |
| ------------------ | -------- | -------------------------------------------- |
| `+config_paths`    | Yes      | YAML config paths                            |
| `+output_dirpath`  | Yes      | Output directory                             |
| `+mode`            | Yes      | `example_validation` or `train_preparation`  |
| `+should_download` | No       | Download missing datasets (default: `false`) |
| `+data_source`     | No       | `huggingface` (default) or `gitlab`          |
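
If you drive `ng_prepare_data` from a script, the arguments can be assembled as in the sketch below. The helper `build_prepare_cmd` is hypothetical; it only uses the parameters listed in the table above.

```python
def build_prepare_cmd(
    config_paths: list[str],
    output_dirpath: str,
    mode: str,
    should_download: bool = False,
) -> list[str]:
    """Assemble an ng_prepare_data argument list (hypothetical helper)."""
    if mode not in ("example_validation", "train_preparation"):
        raise ValueError(f"Unknown mode: {mode}")
    cmd = [
        "ng_prepare_data",
        f"+config_paths=[{','.join(config_paths)}]",
        f"+output_dirpath={output_dirpath}",
        f"+mode={mode}",
    ]
    if should_download:
        cmd.append("+should_download=true")
    return cmd


cmd = build_prepare_cmd(
    ["custom_data.yaml", "responses_api_models/vllm_model/configs/vllm_model.yaml"],
    output_dirpath="data",
    mode="train_preparation",
)
print(" ".join(cmd))
# To execute: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues around the square brackets in `+config_paths`.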

***

## Troubleshooting

| Issue                             | Symptom                 | Fix                                               |
| --------------------------------- | ----------------------- | ------------------------------------------------- |
| Missing `responses_create_params` | Sample silently skipped | Add field with valid `input`                      |
| Invalid JSON                      | Sample skipped          | Fix JSON syntax                                   |
| Invalid role                      | Sample skipped          | Use `user`, `assistant`, `system`, or `developer` |
| Missing dataset file              | `AssertionError`        | Create file or set `+should_download=true`        |

Invalid samples are silently skipped. If metrics show fewer examples than expected, check your data format. The following script reports which lines fail basic structural checks:

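```python
import json

def validate_sample(line: str) -> tuple[bool, str]:
    """Check one JSONL line for the minimal structure described above."""
    try:
        data = json.loads(line)
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON: {e}"

    # A row must be a JSON object, not a list, string, or number
    if not isinstance(data, dict):
        return False, "Row is not a JSON object"

    if "responses_create_params" not in data:
        return False, "Missing 'responses_create_params'"

    params = data["responses_create_params"]
    if not isinstance(params, dict) or "input" not in params:
        return False, "Missing 'input' in responses_create_params"

    return True, "OK"

with open("your_data.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        valid, msg = validate_sample(line)
        if not valid:
            print(f"Line {i}: {msg}")
```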
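
The "Invalid role" row in the table can be checked in a similar way. The sketch below is an assumption-laden helper, not the check `ng_prepare_data` runs: it flags any message whose `role` is outside the four roles listed above and ignores input items that have no `role` key.

```python
ALLOWED_ROLES = ("user", "assistant", "system", "developer")


def find_invalid_roles(row: dict) -> list:
    """Return the role values in a row's input that are not in ALLOWED_ROLES."""
    items = row.get("responses_create_params", {}).get("input", [])
    if isinstance(items, str):
        return []  # a plain string input carries no role
    return [
        m["role"]
        for m in items
        if isinstance(m, dict) and "role" in m and m["role"] not in ALLOWED_ROLES
    ]


row = {
    "responses_create_params": {
        "input": [
            {"role": "human", "content": "hi"},
            {"role": "user", "content": "What is 2+2?"},
        ]
    }
}
print(find_invalid_roles(row))
```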

***

## Validation Process

`ng_prepare_data` performs these steps:

1. **Load configs** — Parse server configs, identify datasets
2. **Check files** — Verify dataset files exist
3. **Validate samples** — Parse each line, validate against schema
4. **Compute metrics** — Aggregate statistics
5. **Collate** — Combine samples with agent references

### Output Locations

Metrics files are written to two locations:

* **Per-dataset**: `{dataset_jsonl_path_without_extension}_metrics.json`, alongside each source JSONL file (for example, `data/example.jsonl` produces `data/example_metrics.json`)
* **Aggregated**: `{output_dirpath}/{type}_metrics.json`, combined metrics per dataset type
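
The two locations can be derived as in the sketch below. The naming rule is inferred from the Quick Start paths (`example.jsonl` alongside `example_metrics.json`), and both helper names are hypothetical.

```python
from pathlib import PurePosixPath


def per_dataset_metrics_path(jsonl_fpath: str) -> str:
    """Metrics file next to the source JSONL: foo.jsonl -> foo_metrics.json."""
    p = PurePosixPath(jsonl_fpath)
    return str(p.with_name(f"{p.stem}_metrics.json"))


def aggregated_metrics_path(output_dirpath: str, dataset_type: str) -> str:
    """Metrics file in the output directory, one per dataset type."""
    return str(PurePosixPath(output_dirpath) / f"{dataset_type}_metrics.json")


print(per_dataset_metrics_path("resources_servers/example_multi_step/data/example.jsonl"))
print(aggregated_metrics_path("data/test", "example"))
```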

### Re-Running

* **Output files** (`train.jsonl`, `validation.jsonl`) are overwritten in `output_dirpath`
* **Metrics files** (`*_metrics.json`) are compared against the newly computed metrics; delete the existing files if your data changed

### Generated Metrics

| Metric             | Description                           |
| ------------------ | ------------------------------------- |
| Number of examples | Valid sample count                    |
| Number of tools    | Tool count stats (avg/min/max/stddev) |
| Number of turns    | User messages per sample              |
| Temperature        | Temperature parameter stats           |

```json
{
    "name": "example",
    "type": "example",
    "jsonl_fpath": "resources_servers/example_multi_step/data/example.jsonl",
    "Number of examples": 5,
    "Number of tools": {
        "Total # non-null values": 5,
        "Average": 2.0,
        "Min": 2.0,
        "Max": 2.0
    }
}
```
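
For intuition, the sketch below computes a tool-count summary with the same keys as the example above. It is an illustrative approximation, not the `ng_prepare_data` implementation: the function names are hypothetical, and treating rows without a `tools` field as null (excluded from the statistics) is an assumption.

```python
import statistics


def summarize(values: list) -> dict:
    """Summary statistics using the key names from the example metrics file."""
    if not values:
        return {"Total # non-null values": 0}
    return {
        "Total # non-null values": len(values),
        "Average": statistics.fmean(values),
        "Min": float(min(values)),
        "Max": float(max(values)),
    }


def tool_count_metrics(rows: list) -> dict:
    """Count tools per row, skipping rows that define no `tools` field."""
    counts = [
        len(row["responses_create_params"]["tools"])
        for row in rows
        if row["responses_create_params"].get("tools") is not None
    ]
    return {"Number of examples": len(rows), "Number of tools": summarize(counts)}


# Five rows with two (placeholder) tools each
rows = [
    {"responses_create_params": {"input": "hi", "tools": [{}, {}]}}
    for _ in range(5)
]
print(tool_count_metrics(rows))
```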

***

## Dataset Configuration

Define datasets in your server's YAML config:

```yaml
datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/my_server/data/train.jsonl
    license: Apache 2.0
  - name: validation
    type: validation
    jsonl_fpath: resources_servers/my_server/data/validation.jsonl
    license: Apache 2.0
  - name: example
    type: example
    jsonl_fpath: resources_servers/my_server/data/example.jsonl
```

| Type         | Purpose                                   | Required for  |
| ------------ | ----------------------------------------- | ------------- |
| `example`    | Small sample (\~5 rows) for format checks | PR submission |
| `train`      | Training data                             | RL training   |
| `validation` | Evaluation during training                | RL training   |
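
A dataset entry can be sanity-checked before running `ng_prepare_data`, as in the sketch below. The helper is hypothetical, and treating `name`, `type`, and `jsonl_fpath` as the expected keys is an assumption based on the example config above, not a statement of the full schema.

```python
ALLOWED_TYPES = ("example", "train", "validation")
EXPECTED_KEYS = ("name", "type", "jsonl_fpath")  # assumption, see lead-in


def check_dataset_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one `datasets` entry."""
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in entry]
    if "type" in entry and entry["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type: {entry['type']}")
    return problems


entry = {
    "name": "train",
    "type": "train",
    "jsonl_fpath": "resources_servers/my_server/data/train.jsonl",
    "license": "Apache 2.0",
}
print(check_dataset_entry(entry))
```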

***

## Next Steps

Use validated data for RL training or fine-tuning.