Download from Hugging Face

Download JSONL datasets from Hugging Face Hub for NeMo Gym training.

Goal: Download a dataset from Hugging Face Hub in JSONL format for training.

Prerequisites: NeMo Gym installed (Installation)


Quick Start

$ gym dataset download \
    --repo-id nvidia/Nemotron-RL-math-OpenMathReasoning \
    --split train \
    --output ./data/train.jsonl
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl

NeMo Gym uses Hydra for configuration. Hydra overrides are written in +key=value form, for example +output_dirpath=./data/.
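To make the +key=value form concrete, here is a minimal sketch of how such override strings map to key/value pairs. The helper `parse_overrides` is hypothetical and is not part of NeMo Gym or Hydra; Hydra performs its own, richer parsing (types, nesting, deletion), so treat this only as an illustration of the syntax.

```python
def parse_overrides(args):
    """Split Hydra-style '+key=value' strings into a dict (illustrative only)."""
    overrides = {}
    for arg in args:
        if not arg.startswith("+") or "=" not in arg:
            raise ValueError(f"Expected +key=value, got: {arg!r}")
        # Split on the first '=' only, so values may themselves contain '='.
        key, value = arg[1:].split("=", 1)
        if not key:
            raise ValueError(f"Missing key in override: {arg!r}")
        overrides[key] = value
    return overrides
```

For example, `parse_overrides(["+split=train", "+output_dirpath=./data/"])` yields a dict with the keys `split` and `output_dirpath`.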


Options

| Option | Description |
| --- | --- |
| repo_id | Required. Hugging Face repository (e.g., nvidia/Nemotron-RL-math-OpenMathReasoning) |
| output_dirpath | Output directory. Files named {split}.jsonl. Use this OR output_fpath. |
| output_fpath | Exact output file path. Requires split or artifact_fpath. Use this OR output_dirpath. |
| artifact_fpath | Download a specific file from the repo (raw file mode) |
| split | Dataset split: train, validation, or test. Omit to download all. |
| hf_token | Authentication token for private/gated repositories |
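The {split}.jsonl naming used with output_dirpath can be sketched as follows. `split_output_paths` is a hypothetical helper written for illustration, not NeMo Gym's implementation, and it assumes that "all splits" means train, validation, and test.

```python
from pathlib import Path

# Assumption: omitting `split` covers these three splits.
ALL_SPLITS = ("train", "validation", "test")


def split_output_paths(output_dirpath, split=None):
    """Map each requested split to <output_dirpath>/<split>.jsonl."""
    splits = ALL_SPLITS if split is None else (split,)
    return {name: Path(output_dirpath) / f"{name}.jsonl" for name in splits}
```

Calling `split_output_paths("./data/", split="train")` returns a single entry pointing at `data/train.jsonl`; omitting `split` returns one entry per split.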

Download Methods


NVIDIA Datasets

Ready-to-use datasets for common training tasks:

| Dataset | Repository | Domain |
| --- | --- | --- |
| OpenMathReasoning | nvidia/Nemotron-RL-math-OpenMathReasoning | Math |
| Competitive Coding | nvidia/nemotron-RL-coding-competitive_coding | Code |
| Workplace Assistant | nvidia/Nemotron-RL-agent-workplace_assistant | Agent |
| Structured Outputs | nvidia/Nemotron-RL-instruction_following-structured_outputs | Instruction |
| MCQA | nvidia/Nemotron-RL-knowledge-mcqa | Knowledge |

Troubleshooting

huggingface_hub.utils.HfHubHTTPError: 401 Client Error

Fix: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.

huggingface_hub.utils.HfHubHTTPError: 404 Client Error

Fix: Check repo_id format is organization/dataset-name. Verify the repository exists and is public (or you have access).
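A quick local check of the organization/dataset-name shape can catch typos before any request is made. The sketch below is an approximation: the pattern and the helper name `looks_like_repo_id` are assumptions for illustration, and the Hub's actual naming rules are stricter. Passing this check does not mean the repository exists or that you have access.

```python
import re

# Approximate shape of "organization/dataset-name": exactly one slash,
# each part starting with an alphanumeric character.
_REPO_ID_RE = re.compile(r"[A-Za-z0-9][\w.\-]*/[A-Za-z0-9][\w.\-]*")


def looks_like_repo_id(repo_id):
    """Return True if repo_id has the organization/dataset-name shape."""
    return _REPO_ID_RE.fullmatch(repo_id) is not None
```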

ValueError: Either output_dirpath or output_fpath must be provided

Fix: Add +output_dirpath=./data/ or +output_fpath=./data/train.jsonl.

ValueError: Cannot specify both artifact_fpath and split

Fix: Use artifact_fpath for raw files OR split for structured datasets—not both.
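The two ValueError cases above, together with the "Requires split or artifact_fpath" rule for output_fpath, can be summarized in a small validator. This is a sketch of the rules as this page describes them, not the code in nemo_gym/config_types.py; the function name and the wording of the third message are assumptions.

```python
def validate_download_options(output_dirpath=None, output_fpath=None,
                              split=None, artifact_fpath=None):
    """Raise ValueError if the option combination is invalid (illustrative)."""
    if output_dirpath is None and output_fpath is None:
        raise ValueError("Either output_dirpath or output_fpath must be provided")
    if artifact_fpath is not None and split is not None:
        raise ValueError("Cannot specify both artifact_fpath and split")
    if output_fpath is not None and split is None and artifact_fpath is None:
        # Message wording is assumed; the rule comes from the options table.
        raise ValueError("output_fpath requires split or artifact_fpath")
```

Running a check like this before starting a download surfaces configuration mistakes without touching the network.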


Private Repositories

Avoid passing tokens on the command line—they appear in shell history.

Recommended — Use environment variable:

$ export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
$ gym dataset download \
    --repo-id my-org/private-dataset \
    --output-dir ./data/

Get your token at huggingface.co/settings/tokens. Use a read-only token.
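If you wrap downloads in your own script, the same advice applies: read the token from the environment rather than hard-coding it, and never print it in full. The helpers below are hypothetical and only illustrate that pattern; the precedence shown (explicit value first, then HF_TOKEN) is an assumption, not documented NeMo Gym behavior.

```python
import os


def resolve_hf_token(explicit_token=None, env=None):
    """Return an explicit token if given, else HF_TOKEN from the environment."""
    env = os.environ if env is None else env
    return explicit_token or env.get("HF_TOKEN")


def mask_token(token):
    """Return a log-safe form that reveals only the 'hf_' style prefix."""
    if not token:
        return "<none>"
    return token[:3] + "***"
```

Logging `mask_token(resolve_hf_token())` confirms that a token was found without exposing its value.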

Not recommended for shared systems:

$ gym dataset download \
    --repo-id my-org/private-dataset \
    --output-dir ./data/ \
    +hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx

NeMo Gym can automatically download missing datasets during data preparation. Declare a source: block (type: huggingface) in your resources server config:

datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/code_gen/data/train.jsonl
    source:
      type: huggingface
      repo_id: nvidia/nemotron-RL-coding-competitive_coding
      artifact_fpath: opencodereasoning_filtered_25k_train.jsonl
    license: Apache 2.0

Run with download enabled:

$ gym dataset collate \
    --resources-server code_gen \
    --output-dir ./data/prepared \
    --mode train_preparation \
    --download \
    +data_source=huggingface

If jsonl_fpath doesn’t exist locally, NeMo Gym downloads from the dataset’s source: (here, the Hugging Face repo_id) before processing. The legacy huggingface_identifier: block still works but is deprecated in favor of source:.
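The decision described above can be sketched as a small function over one dataset entry. This is an illustration of the behavior as stated on this page, not the code in nemo_gym/train_data_utils.py; the function name and the assumption that huggingface_identifier is a mapping with the same keys as source: are hypothetical.

```python
from pathlib import Path


def pick_download_source(dataset_cfg):
    """Return the source mapping to download from, or None if no download is needed."""
    if Path(dataset_cfg["jsonl_fpath"]).exists():
        return None  # The local file is already present, so nothing is downloaded.
    source = dataset_cfg.get("source")
    if source is not None:
        return source
    # Deprecated fallback; assumed here to carry repo_id / artifact_fpath keys.
    legacy = dataset_cfg.get("huggingface_identifier")
    if legacy is not None:
        return {"type": "huggingface", **legacy}
    return None
```

A `None` result means either the file already exists or the entry declares no source to fetch it from.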

Downloads use Hugging Face’s cache at ~/.cache/huggingface/.

  • Structured datasets: Reads from cache (fast), overwrites output file
  • Raw files: Uses cached copy, then copies to output path

To force a fresh download:

$ rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
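To find the directory for a given repo_id before removing it, the datasets--<org>--<dataset> pattern above can be computed as follows. This sketch assumes the default cache location shown on this page; if your environment relocates the Hugging Face cache (for example through HF_HOME), pass the actual hub directory as `cache_root`. The helper name is hypothetical.

```python
from pathlib import Path


def dataset_cache_dir(repo_id, cache_root=None):
    """Return the hub cache directory for a dataset repo (default layout assumed)."""
    if cache_root is None:
        root = Path.home() / ".cache" / "huggingface" / "hub"
    else:
        root = Path(cache_root)
    # "org/name" becomes "datasets--org--name".
    return root / ("datasets--" + repo_id.replace("/", "--"))
```

Printing the result and checking that it exists before running rm -rf avoids deleting the wrong directory.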

| Section | Source |
| --- | --- |
| Config schema | nemo_gym/config_types.py:306-349 |
| Download logic | nemo_gym/hf_utils.py:57-115 |
| Validation rules | nemo_gym/config_types.py:334-349 |
| Auto-download | nemo_gym/train_data_utils.py:476-494 |

Next Steps