Download from Hugging Face#
Download JSONL datasets from Hugging Face Hub for NeMo Gym training.
Goal: Download a dataset from Hugging Face Hub in JSONL format for training.
Prerequisites: NeMo Gym installed (Detailed Setup Guide)
Quick Start#
ng_download_dataset_from_hf \
+repo_id=nvidia/Nemotron-RL-math-OpenMathReasoning \
+split=train \
+output_fpath=./data/train.jsonl
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
Note
NeMo Gym uses Hydra for configuration. Arguments use `+key=value` syntax.
Options#
| Option | Description |
|---|---|
| `repo_id` | Required. Hugging Face repository (e.g., `nvidia/Nemotron-RL-math-OpenMathReasoning`) |
| `output_dirpath` | Output directory. Files are named after each split (e.g., `train.jsonl`) |
| `output_fpath` | Exact output file path. Requires `split` or `artifact_fpath` |
| `artifact_fpath` | Download a specific file from the repo (raw file mode) |
| `split` | Dataset split to download (e.g., `train`, `validation`) |
| `hf_token` | Authentication token for private/gated repositories |
Download Methods#
Structured Dataset
Downloads the dataset using the `datasets` library and converts it to JSONL.
Use when: Repository uses Hugging Face’s standard dataset format.
All splits:
ng_download_dataset_from_hf \
+repo_id=nvidia/Nemotron-RL-knowledge-mcqa \
+output_dirpath=./data/
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
[Nemo-Gym] - Downloaded validation split to: ./data/validation.jsonl
Single split:
ng_download_dataset_from_hf \
+repo_id=SWE-Gym/SWE-Gym \
+split=train \
+output_fpath=./data/train.jsonl
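If you are unsure which splits a repository provides, you can list them first. A minimal sketch using the `datasets` library (the repository name is just the one from the example above):
from datasets import get_dataset_split_names

# Lists the split names (e.g., ["train"]) without downloading the data itself.
splits = get_dataset_split_names("SWE-Gym/SWE-Gym")
print(splits)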
Raw File
Downloads a specific file directly, without conversion.
Use when: Repository contains pre-formatted JSONL files.
ng_download_dataset_from_hf \
+repo_id=nvidia/nemotron-RL-coding-competitive_coding \
+artifact_fpath=opencodereasoning_filtered_25k_train.jsonl \
+output_fpath=./data/train.jsonl
[Nemo-Gym] - Downloaded opencodereasoning_filtered_25k_train.jsonl to: ./data/train.jsonl
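Raw file mode is a plain file fetch; if you prefer Python, a rough equivalent uses `huggingface_hub.hf_hub_download` (repository and filename mirror the example above). Note that this returns the cached path rather than copying to `./data/train.jsonl`:
from huggingface_hub import hf_hub_download

# Fetches one file from a dataset repo into the local cache and
# returns the path to the cached copy.
path = hf_hub_download(
    repo_id="nvidia/nemotron-RL-coding-competitive_coding",
    filename="opencodereasoning_filtered_25k_train.jsonl",
    repo_type="dataset",
)
print(path)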
Python Script (Streaming)
Downloads using the `datasets` library directly, with streaming support.
Use when: You need custom preprocessing, streaming for large datasets, or specific split handling.
import json

from datasets import load_dataset

output_file = "train.jsonl"
dataset_name = "nvidia/OpenMathInstruct-2"
split_name = "train_1M"  # Check the dataset page for available splits

# Stream records one at a time instead of loading the full dataset into
# memory, writing each record as a single JSON line.
with open(output_file, "w", encoding="utf-8") as f:
    for line in load_dataset(dataset_name, split=split_name, streaming=True):
        f.write(json.dumps(line) + "\n")
Save the script as `download.py` and run it:
uv run download.py
Verify the download:
wc -l train.jsonl
# Expected: 1000000 train.jsonl
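Beyond counting lines, you may want to confirm that every line parses as JSON. A small sketch:
import json

# Fail fast on the first malformed line, reporting its line number.
with open("train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON on line {i}") from e
print("All lines are valid JSON")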
Streaming benefits:
- Memory-efficient for large datasets (millions of rows)
- Progress is visible during download
Note
For gated or private datasets, authenticate first:
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
Or use huggingface-cli login before running the script.
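You can also authenticate from Python before calling `load_dataset`, using `huggingface_hub.login`:
import os

from huggingface_hub import login

# Reuses HF_TOKEN if set; falls back to an interactive prompt otherwise.
login(token=os.environ.get("HF_TOKEN"))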
NVIDIA Datasets#
Ready-to-use datasets for common training tasks:
| Dataset | Repository | Domain |
|---|---|---|
| OpenMathReasoning | `nvidia/Nemotron-RL-math-OpenMathReasoning` | Math |
| Competitive Coding | `nvidia/nemotron-RL-coding-competitive_coding` | Code |
| Workplace Assistant | | Agent |
| Structured Outputs | | Instruction |
| MCQA | `nvidia/Nemotron-RL-knowledge-mcqa` | Knowledge |
Troubleshooting#
Authentication Failed (401)
huggingface_hub.utils.HfHubHTTPError: 401 Client Error
Fix: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.
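To check whether your token is valid at all, a quick sketch using `huggingface_hub.whoami`:
from huggingface_hub import whoami

# Raises an HTTP error if the token is invalid; otherwise returns account info.
info = whoami(token="hf_xxxxxxxxxxxxxxxxxxxxxxxxx")
print(info["name"])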
Repository Not Found (404)
huggingface_hub.utils.HfHubHTTPError: 404 Client Error
Fix: Check that the `repo_id` format is `organization/dataset-name`. Verify that the repository exists and is public (or that you have access).
Validation Error: Output Path
ValueError: Either output_dirpath or output_fpath must be provided
Fix: Add +output_dirpath=./data/ or +output_fpath=./data/train.jsonl.
Validation Error: Conflicting Options
ValueError: Cannot specify both artifact_fpath and split
Fix: Use `artifact_fpath` for raw files or `split` for structured datasets, not both.
Private Repositories#
Warning
Avoid passing tokens on the command line; they appear in shell history.
Recommended — Use environment variable:
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
ng_download_dataset_from_hf \
+repo_id=my-org/private-dataset \
+output_dirpath=./data/
Get your token at huggingface.co/settings/tokens. Use a read-only token.
Alternative: Pass token directly
Not recommended for shared systems:
ng_download_dataset_from_hf \
+repo_id=my-org/private-dataset \
+hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx \
+output_dirpath=./data/
Automatic Downloads During Data Preparation
NeMo Gym can automatically download missing datasets during data preparation. Configure huggingface_identifier in your resources server config:
datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/code_gen/data/train.jsonl
    huggingface_identifier:
      repo_id: nvidia/nemotron-RL-coding-competitive_coding
      artifact_fpath: opencodereasoning_filtered_25k_train.jsonl
    license: Apache 2.0
Run with download enabled:
config_paths="resources_servers/code_gen/configs/code_gen.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" \
+output_dirpath=./data/prepared \
+mode=train_preparation \
+should_download=true \
+data_source=huggingface
If jsonl_fpath doesn’t exist locally, NeMo Gym downloads from huggingface_identifier before processing.
Caching Behavior
Downloads use Hugging Face’s cache at ~/.cache/huggingface/.
- Structured datasets: Reads from the cache (fast) and overwrites the output file
- Raw files: Uses the cached copy, then copies it to the output path
To force fresh download:
rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
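To see what is currently cached (and how much disk it uses) before deleting anything, one option is `huggingface_hub.scan_cache_dir`:
from huggingface_hub import scan_cache_dir

# Walks ~/.cache/huggingface/hub and reports each cached dataset repo's size.
for repo in scan_cache_dir().repos:
    if repo.repo_type == "dataset":
        print(repo.repo_id, repo.size_on_disk_str)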
Next Steps#
- Preprocess raw data, run ng_prepare_data, and add agent_ref routing.
- Generate training examples by running your agent on prepared data.
- Use validated data with NeMo RL for GRPO training.