Download from Hugging Face

Download JSONL datasets from Hugging Face Hub for NeMo Gym training.

Goal: Download a dataset from Hugging Face Hub in JSONL format for training.

Prerequisites: NeMo Gym installed (see Detailed Setup)

Quick Start

$ ng_download_dataset_from_hf \
>     +repo_id=nvidia/Nemotron-RL-math-OpenMathReasoning \
>     +split=train \
>     +output_fpath=./data/train.jsonl

[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl

NeMo Gym uses Hydra for configuration. Pass arguments with +key=value syntax; in Hydra, the leading + adds a new key to the configuration.

Options

| Option | Description |
| --- | --- |
| `repo_id` | Required. Hugging Face repository (e.g., `nvidia/Nemotron-RL-math-OpenMathReasoning`) |
| `output_dirpath` | Output directory. Files named `{split}.jsonl`. Use this OR `output_fpath`. |
| `output_fpath` | Exact output file path. Requires `split` or `artifact_fpath`. Use this OR `output_dirpath`. |
| `artifact_fpath` | Download a specific file from the repo (raw file mode) |
| `split` | Dataset split: `train`, `validation`, or `test`. Omit to download all. |
| `hf_token` | Authentication token for private/gated repositories |

Download Methods

Three download methods are available: Structured Dataset (recommended), Raw File, and Python Script.

Structured Dataset (Recommended)

Downloads the dataset with the Hugging Face datasets library and converts it to JSONL.

Use when: The repository uses Hugging Face’s standard dataset format.

All splits:

$ ng_download_dataset_from_hf \
>     +repo_id=nvidia/Nemotron-RL-knowledge-mcqa \
>     +output_dirpath=./data/

[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
[Nemo-Gym] - Downloaded validation split to: ./data/validation.jsonl

Single split:

$ ng_download_dataset_from_hf \
>     +repo_id=SWE-Gym/SWE-Gym \
>     +split=train \
>     +output_fpath=./data/train.jsonl

NVIDIA Datasets

Ready-to-use datasets for common training tasks:

| Dataset | Repository | Domain |
| --- | --- | --- |
| OpenMathReasoning | `nvidia/Nemotron-RL-math-OpenMathReasoning` | Math |
| Competitive Coding | `nvidia/nemotron-RL-coding-competitive_coding` | Code |
| Workplace Assistant | `nvidia/Nemotron-RL-agent-workplace_assistant` | Agent |
| Structured Outputs | `nvidia/Nemotron-RL-instruction_following-structured_outputs` | Instruction |
| MCQA | `nvidia/Nemotron-RL-knowledge-mcqa` | Knowledge |
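
The structured download method can also be approximated from Python for any of the repositories listed above. The following is a minimal sketch, not NeMo Gym's implementation: `write_jsonl` and `download_split` are hypothetical helper names, and the sketch assumes the `datasets` package is installed, that every column is JSON-serializable, and that any required token is supplied through the `HF_TOKEN` environment variable.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Mapping


def write_jsonl(rows: Iterable[Mapping[str, Any]], output_fpath: str) -> int:
    """Write one JSON object per line and return the number of rows written."""
    path = Path(output_fpath)
    path.parent.mkdir(parents=True, exist_ok=True)  # create ./data/ if needed
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            # Assumes plain JSON-serializable values (text, numbers, lists, dicts).
            f.write(json.dumps(dict(row), ensure_ascii=False) + "\n")
            count += 1
    return count


def download_split(repo_id: str, split: str, output_fpath: str) -> int:
    """Hypothetical helper: load one split from the Hub and save it as JSONL."""
    # Imported lazily so write_jsonl stays usable without the third-party package.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split)  # iterating yields one dict per row
    return write_jsonl(dataset, output_fpath)
```

Calling `download_split("nvidia/Nemotron-RL-math-OpenMathReasoning", "train", "./data/train.jsonl")` would roughly mirror the Quick Start command, although the CLI may handle column types and authentication differently.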

Troubleshooting

Authentication Failed (401)

huggingface_hub.utils.HfHubHTTPError: 401 Client Error

Fix: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.

Repository Not Found (404)

huggingface_hub.utils.HfHubHTTPError: 404 Client Error

Fix: Check that repo_id follows the organization/dataset-name format. Verify that the repository exists and is public (or that you have access).

Validation Error: Output Path

ValueError: Either output_dirpath or output_fpath must be provided

Fix: Add +output_dirpath=./data/ or +output_fpath=./data/train.jsonl.

Validation Error: Conflicting Options

ValueError: Cannot specify both artifact_fpath and split

Fix: Use artifact_fpath for raw files or split for structured datasets, not both.
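
The two validation errors above, together with the constraints in the Options table, can be summarized as a small check. This is an illustrative sketch only: `validate_download_args` is a hypothetical function, not the code in `nemo_gym/config_types.py`, and the messages for the second and fourth checks are assumed wording.

```python
from typing import Optional


def validate_download_args(
    output_dirpath: Optional[str] = None,
    output_fpath: Optional[str] = None,
    artifact_fpath: Optional[str] = None,
    split: Optional[str] = None,
) -> None:
    """Raise ValueError when the option combination is inconsistent."""
    # An output location is always required.
    if not output_dirpath and not output_fpath:
        raise ValueError("Either output_dirpath or output_fpath must be provided")
    # Assumption: the two output options are mutually exclusive ("Use this OR ...").
    if output_dirpath and output_fpath:
        raise ValueError("Use output_dirpath or output_fpath, not both")
    # Raw file mode and structured mode cannot be combined.
    if artifact_fpath and split:
        raise ValueError("Cannot specify both artifact_fpath and split")
    # An exact file path needs exactly one source to write into it.
    if output_fpath and not (split or artifact_fpath):
        raise ValueError("output_fpath requires split or artifact_fpath")
```

Running a check like this before starting a long download surfaces a conflicting option set immediately rather than after the transfer begins.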

Private Repositories

Avoid passing tokens as command-line arguments: they appear in shell history.

Recommended: Use an environment variable.

$ export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
$ ng_download_dataset_from_hf \
>     +repo_id=my-org/private-dataset \
>     +output_dirpath=./data/

Get your token at huggingface.co/settings/tokens. Use a read-only token.

Alternative: Pass the token directly

Not recommended for shared systems:

$ ng_download_dataset_from_hf \
>     +repo_id=my-org/private-dataset \
>     +hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx \
>     +output_dirpath=./data/

Automatic Downloads During Data Preparation

NeMo Gym can automatically download missing datasets during data preparation. Configure huggingface_identifier in your resources server config:

datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/code_gen/data/train.jsonl
    huggingface_identifier:
      repo_id: nvidia/nemotron-RL-coding-competitive_coding
      artifact_fpath: opencodereasoning_filtered_25k_train.jsonl
    license: Apache 2.0

Run with download enabled:

$ config_paths="resources_servers/code_gen/configs/code_gen.yaml"
$ ng_prepare_data "+config_paths=[${config_paths}]" \
>     +output_dirpath=./data/prepared \
>     +mode=train_preparation \
>     +should_download=true \
>     +data_source=huggingface

If jsonl_fpath doesn’t exist locally, NeMo Gym downloads from huggingface_identifier before processing.
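
The download-if-missing rule can be pictured as a filter over the `datasets` entries in the config. The sketch below is an assumption about the shape of that logic, not the code in `nemo_gym/train_data_utils.py`; `datasets_needing_download` is a hypothetical helper that takes the entries as plain dictionaries.

```python
from pathlib import Path
from typing import Any, Dict, List


def datasets_needing_download(datasets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return entries whose jsonl_fpath is missing locally but that name a Hub source."""
    pending = []
    for entry in datasets:
        has_source = bool(entry.get("huggingface_identifier"))
        is_missing = not Path(entry["jsonl_fpath"]).exists()
        # Only entries that are both missing and downloadable qualify.
        if has_source and is_missing:
            pending.append(entry)
    return pending
```

Entries whose file already exists are left untouched, which matches the behavior described above: an existing local `jsonl_fpath` is used as-is.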

Caching Behavior

Downloads use Hugging Face’s cache at ~/.cache/huggingface/.

Structured datasets: Reads from cache (fast), overwrites output file
Raw files: Uses cached copy, then copies to output path
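
The raw-file behavior (fetch into the cache, then copy to the output path) can be sketched with `hf_hub_download` from the `huggingface_hub` package. This is a hedged illustration rather than NeMo Gym's own code: `copy_to_output` and `download_raw_file` are hypothetical names, and the sketch assumes the repository is a dataset repository.

```python
import shutil
from pathlib import Path


def copy_to_output(cached_fpath: str, output_fpath: str) -> Path:
    """Copy a cached file to the requested output path, creating parent folders."""
    destination = Path(output_fpath)
    destination.parent.mkdir(parents=True, exist_ok=True)
    # copyfile follows symlinks, so the file contents are copied, not the link.
    shutil.copyfile(cached_fpath, destination)
    return destination


def download_raw_file(repo_id: str, artifact_fpath: str, output_fpath: str) -> Path:
    """Hypothetical helper: fetch one file from a dataset repo and copy it out."""
    # Imported lazily so copy_to_output works without the third-party package.
    from huggingface_hub import hf_hub_download

    # Returns the path of the file inside the local Hugging Face cache.
    cached_fpath = hf_hub_download(
        repo_id=repo_id, filename=artifact_fpath, repo_type="dataset"
    )
    return copy_to_output(cached_fpath, output_fpath)
```

Because the file is served from the cache when it is already present, repeated calls mostly cost a local copy.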

To force a fresh download:

$ rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
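
The `datasets--<org>--<dataset>` folder name follows directly from the repository ID. A small sketch for computing the directory to delete is shown below; `dataset_cache_dir` is a hypothetical helper, and the default location is an assumption that does not hold if the cache has been moved with the `HF_HOME` or `HF_HUB_CACHE` environment variables.

```python
from pathlib import Path


def dataset_cache_dir(repo_id: str, hub_cache: str = "~/.cache/huggingface/hub") -> Path:
    """Return the hub cache folder for a dataset repo such as 'org/name'."""
    # The hub cache replaces "/" with "--" and prefixes the repo type.
    folder = "datasets--" + repo_id.replace("/", "--")
    return Path(hub_cache).expanduser() / folder
```

For `nvidia/Nemotron-RL-knowledge-mcqa`, the folder name is `datasets--nvidia--Nemotron-RL-knowledge-mcqa`.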

Source References

| Section | Source |
| --- | --- |
| Config schema | `nemo_gym/config_types.py:306-349` |
| Download logic | `nemo_gym/hf_utils.py:57-115` |
| Validation rules | `nemo_gym/config_types.py:334-349` |
| Auto-download | `nemo_gym/train_data_utils.py:476-494` |

Next Steps

Prepare and Validate

Preprocess raw data, run ng_prepare_data, and add agent_ref routing.

Collect Rollouts

Generate training examples by running your agent on prepared data.

Train with NeMo RL

Use validated data with NeMo RL for GRPO training.
