Download from Hugging Face

Download JSONL datasets from Hugging Face Hub for NeMo Gym training.

Goal: Download a dataset from Hugging Face Hub in JSONL format for training.

Prerequisites: NeMo Gym installed (Installation)

Quick Start

$ gym dataset download \
>     --repo-id nvidia/Nemotron-RL-math-OpenMathReasoning \
>     --split train \
>     --output ./data/train.jsonl

[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl

NeMo Gym uses Hydra for configuration. Hydra overrides use +key=value syntax, for example +output_dirpath=./data/.
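After the download finishes, it can help to confirm that the output really is one JSON object per line. The following is a minimal sketch using only the Python standard library; the helper name `summarize_jsonl` is hypothetical and is not part of NeMo Gym.

```python
import json
from pathlib import Path


def summarize_jsonl(path, max_records=None):
    """Count records and collect the top-level keys seen in a JSONL file.

    Hypothetical helper for a quick sanity check; not a NeMo Gym API.
    """
    count = 0
    keys = set()
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)  # raises ValueError on malformed JSON
            if isinstance(record, dict):
                keys.update(record)  # adds the record's top-level keys
            count += 1
            if max_records is not None and count >= max_records:
                break  # stop early when only a sample is wanted
    return count, sorted(keys)
```

For example, `summarize_jsonl("./data/train.jsonl", max_records=100)` inspects only the first 100 records, which keeps the check fast on large files.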

Options

Option	Description
`repo_id`	Required. Hugging Face repository (e.g., `nvidia/Nemotron-RL-math-OpenMathReasoning`)
`output_dirpath`	Output directory. Files named `{split}.jsonl`. Use this OR `output_fpath`.
`output_fpath`	Exact output file path. Requires `split` or `artifact_fpath`. Use this OR `output_dirpath`.
`artifact_fpath`	Download a specific file from the repo (raw file mode)
`split`	Dataset split: `train`, `validation`, or `test`. Omit to download all.
`hf_token`	Authentication token for private/gated repositories

Download Methods

Structured Dataset (Recommended)

Downloads using the datasets library and converts to JSONL.

Use when: Repository uses Hugging Face’s standard dataset format.
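The load-then-convert flow described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the code NeMo Gym runs: `write_jsonl` and `download_split` are hypothetical names, and the only third-party call is `datasets.load_dataset`, which picks up credentials from the `HF_TOKEN` environment variable when one is set.

```python
import json
from pathlib import Path


def write_jsonl(rows, output_fpath):
    """Write an iterable of dict rows to a JSONL file, one object per line."""
    path = Path(output_fpath)
    path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with path.open("w", encoding="utf-8") as f:  # "w" overwrites an existing file
        for row in rows:
            # default=str is an assumption: it stringifies values (such as
            # timestamps) that the json module cannot serialize natively.
            f.write(json.dumps(row, ensure_ascii=False, default=str) + "\n")
            written += 1
    return written


def download_split(repo_id, split, output_fpath):
    """Load one split with the `datasets` library and save it as JSONL."""
    # Imported lazily so the module loads even without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split)  # needs network or a warm cache
    return write_jsonl(dataset, output_fpath)  # iterating yields one dict per row
```

A call such as `download_split("SWE-Gym/SWE-Gym", "train", "./data/train.jsonl")` mirrors the single-split command, although the resulting file is not guaranteed to match the CLI's output byte for byte.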

All splits:

$ gym dataset download \
>     --repo-id nvidia/Nemotron-RL-knowledge-mcqa \
>     --output-dir ./data/

[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
[Nemo-Gym] - Downloaded validation split to: ./data/validation.jsonl

Single split:

$ gym dataset download \
>     --repo-id SWE-Gym/SWE-Gym \
>     --split train \
>     --output ./data/train.jsonl

NVIDIA Datasets

Ready-to-use datasets for common training tasks:

Dataset	Repository	Domain
OpenMathReasoning	`nvidia/Nemotron-RL-math-OpenMathReasoning`	Math
Competitive Coding	`nvidia/nemotron-RL-coding-competitive_coding`	Code
Workplace Assistant	`nvidia/Nemotron-RL-agent-workplace_assistant`	Agent
Structured Outputs	`nvidia/Nemotron-RL-instruction_following-structured_outputs`	Instruction
MCQA	`nvidia/Nemotron-RL-knowledge-mcqa`	Knowledge

Troubleshooting

Authentication Failed (401)

huggingface_hub.utils.HfHubHTTPError: 401 Client Error

Fix: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.

Repository Not Found (404)

huggingface_hub.utils.HfHubHTTPError: 404 Client Error

Fix: Check that repo_id has the form organization/dataset-name. Verify that the repository exists and is public, or that your token has access to it.
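When downloads run inside a script, the two HTTP errors above can be mapped to their fixes programmatically. The sketch below assumes the raised exception carries a `response` attribute with a `status_code`, as `HfHubHTTPError` does; `FIX_HINTS`, `status_code_of`, and `hint_for` are hypothetical names used only for illustration.

```python
# Hints copied from the troubleshooting entries above, keyed by HTTP status.
FIX_HINTS = {
    401: "Verify your token is valid. For gated datasets, accept the license on Hugging Face first.",
    404: "Check that repo_id is organization/dataset-name and that the repository exists and is accessible.",
}


def status_code_of(error):
    """Return the HTTP status attached to an exception, or None if absent."""
    response = getattr(error, "response", None)
    return getattr(response, "status_code", None)


def hint_for(error):
    """Translate a download error into the matching fix, when one is known."""
    return FIX_HINTS.get(status_code_of(error), "No specific hint; see the full traceback.")
```

In practice this would sit in an `except HfHubHTTPError as err:` block (imported from `huggingface_hub.utils`), printing `hint_for(err)` before re-raising.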

Validation Error: Output Path

ValueError: Either output_dirpath or output_fpath must be provided

Fix: Add +output_dirpath=./data/ or +output_fpath=./data/train.jsonl.

Validation Error: Conflicting Options

ValueError: Cannot specify both artifact_fpath and split

Fix: Use artifact_fpath for raw files OR split for structured datasets—not both.
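The option constraints behind these two validation errors can be summarized in a short sketch. It is an illustration of the documented rules, not the contents of `nemo_gym/config_types.py`: the `DownloadOptions` class and `validate` function are hypothetical, and the third error message is invented wording for the rule that `output_fpath` requires `split` or `artifact_fpath`. Whether passing both `output_dirpath` and `output_fpath` is rejected is not stated here, so the sketch does not check it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DownloadOptions:
    """Hypothetical container mirroring the options table above."""
    repo_id: str
    output_dirpath: Optional[str] = None
    output_fpath: Optional[str] = None
    artifact_fpath: Optional[str] = None
    split: Optional[str] = None
    hf_token: Optional[str] = None


def validate(options):
    """Apply the documented constraints and return the options if they pass."""
    if not options.output_dirpath and not options.output_fpath:
        raise ValueError("Either output_dirpath or output_fpath must be provided")
    if options.artifact_fpath and options.split:
        raise ValueError("Cannot specify both artifact_fpath and split")
    if options.output_fpath and not (options.split or options.artifact_fpath):
        # Message wording is an assumption; only the rule itself is documented.
        raise ValueError("output_fpath requires split or artifact_fpath")
    return options
```

Checking the missing-output case first matches the order in which the errors are listed in this section; the actual evaluation order in NeMo Gym may differ.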

Private Repositories

Avoid passing tokens on the command line—they appear in shell history.

Recommended: Use an environment variable

$ export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
$ gym dataset download \
>     --repo-id my-org/private-dataset \
>     --output-dir ./data/

Get your token at huggingface.co/settings/tokens. Use a read-only token.
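Scripts that wrap the download can follow the same practice by reading the token from the environment and never echoing it in full. A minimal sketch, assuming the token lives in `HF_TOKEN` as in the command above; `resolve_hf_token` and `mask_token` are hypothetical helper names.

```python
import os


def resolve_hf_token(environ=None):
    """Return the token from HF_TOKEN, or None when it is unset or blank."""
    environ = os.environ if environ is None else environ
    token = environ.get("HF_TOKEN", "").strip()
    return token or None


def mask_token(token):
    """Produce a log-safe rendering that reveals only the ends of the token."""
    if not token:
        return "<none>"
    if len(token) <= 8:
        return "***"  # too short to reveal any part safely
    return token[:3] + "..." + token[-4:]
```

Logging `mask_token(resolve_hf_token())` confirms which token is in use without writing the secret to logs or shell history.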

Alternative: Pass token directly

Not recommended for shared systems:

$ gym dataset download \
>     --repo-id my-org/private-dataset \
>     --output-dir ./data/ \
>     +hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx

Automatic Downloads During Data Preparation

NeMo Gym can automatically download missing datasets during data preparation. Declare a source: block (type: huggingface) in your resources server config:

datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/code_gen/data/train.jsonl
    source:
      type: huggingface
      repo_id: nvidia/nemotron-RL-coding-competitive_coding
      artifact_fpath: opencodereasoning_filtered_25k_train.jsonl
    license: Apache 2.0

Run with download enabled:

$ gym dataset collate \
>     --resources-server code_gen \
>     --output-dir ./data/prepared \
>     --mode train_preparation \
>     --download \
>     +data_source=huggingface

If jsonl_fpath doesn’t exist locally, NeMo Gym downloads from the dataset’s source: (here, the Hugging Face repo_id) before processing. The legacy huggingface_identifier: block still works but is deprecated in favor of source:.
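The download-if-missing behavior can be pictured with the sketch below. It is an approximation under stated assumptions rather than the logic in `nemo_gym/train_data_utils.py`: `ensure_local` and `fetch_raw_file` are hypothetical names, the dataset entry is a plain dict shaped like the YAML above, and the raw-file fetch relies on `huggingface_hub.hf_hub_download`.

```python
import shutil
from pathlib import Path


def ensure_local(dataset_entry, fetch):
    """Return (path, downloaded) for a dataset entry, fetching it if missing.

    `fetch(source, destination)` is injected so the check can be exercised
    without network access.
    """
    path = Path(dataset_entry["jsonl_fpath"])
    if path.exists():
        return path, False  # already present locally; nothing to download
    source = dataset_entry.get("source")
    if not source or source.get("type") != "huggingface":
        raise FileNotFoundError(f"{path} is missing and no huggingface source is declared")
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(source, path)
    return path, True


def fetch_raw_file(source, destination):
    """Download one file from a dataset repo and copy it to the destination."""
    # Imported lazily; requires `pip install huggingface_hub`.
    from huggingface_hub import hf_hub_download

    cached = hf_hub_download(
        repo_id=source["repo_id"],
        filename=source["artifact_fpath"],
        repo_type="dataset",
    )
    shutil.copyfile(cached, destination)  # copy out of the Hugging Face cache
```

Calling `ensure_local(entry, fetch_raw_file)` for each item under `datasets:` would then leave every declared `jsonl_fpath` in place before processing starts.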

Caching Behavior

Downloads use Hugging Face’s cache at ~/.cache/huggingface/.

Structured datasets: Reads from cache (fast), overwrites output file
Raw files: Uses cached copy, then copies to output path

To force fresh download:

$ rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
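The directory name in that command follows the Hugging Face Hub cache convention of replacing the slash in the repository ID with a double hyphen. A small sketch of the mapping, assuming the default cache location given above (a custom cache location configured through environment variables is not handled); `hub_cache_dir` and `clear_cached_dataset` are hypothetical helpers.

```python
import shutil
from pathlib import Path


def hub_cache_dir(repo_id, cache_root=None):
    """Return the Hub cache directory for a dataset repository."""
    if cache_root is None:
        cache_root = Path.home() / ".cache" / "huggingface" / "hub"
    # "org/name" is stored as "datasets--org--name" inside the cache root.
    return Path(cache_root) / ("datasets--" + repo_id.replace("/", "--"))


def clear_cached_dataset(repo_id, cache_root=None):
    """Delete the cached copy so the next download fetches fresh files.

    Returns True if a cache directory existed and was removed.
    """
    target = hub_cache_dir(repo_id, cache_root)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

For `nvidia/Nemotron-RL-knowledge-mcqa`, the helper resolves to a directory named `datasets--nvidia--Nemotron-RL-knowledge-mcqa` under the cache root.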

Source References

Section	Source
Config schema	`nemo_gym/config_types.py:306-349`
Download logic	`nemo_gym/hf_utils.py:57-115`
Validation rules	`nemo_gym/config_types.py:334-349`
Auto-download	`nemo_gym/train_data_utils.py:476-494`

Next Steps

Prepare and Validate

Preprocess raw data, run gym dataset collate, and add agent_ref routing.

Training Tutorials

Use validated data to train with your preferred RL framework.