Download from Hugging Face


Download JSONL datasets from Hugging Face Hub for NeMo Gym training.

Goal: Download a dataset from Hugging Face Hub in JSONL format for training.

Prerequisites: NeMo Gym installed (see Installation)


Quick Start

$ ng_download_dataset_from_hf \
> +repo_id=nvidia/Nemotron-RL-math-OpenMathReasoning \
> +split=train \
> +output_fpath=./data/train.jsonl
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl

NeMo Gym uses Hydra for configuration. Arguments use +key=value syntax.
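
Once the Quick Start command has finished, it can help to confirm that the output really is line-delimited JSON before using it for training. The sketch below is a minimal check built on the Python standard library only. The helper name `peek_jsonl` is hypothetical and not part of NeMo Gym, and no assumption is made about which fields a record contains.

```python
import json
from itertools import islice
from pathlib import Path


def peek_jsonl(path, n=3):
    """Return the first n records of a JSONL file as parsed Python objects.

    Hypothetical helper (not a NeMo Gym API). Raises json.JSONDecodeError
    if a non-blank line is not valid JSON.
    """
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        # Ignore blank lines, then parse at most n of the remaining lines.
        non_blank = (line for line in f if line.strip())
        for line in islice(non_blank, n):
            records.append(json.loads(line))
    return records
```

Calling `peek_jsonl("./data/train.jsonl")` on the Quick Start output returns up to three parsed records. A `json.JSONDecodeError` suggests the file is not valid JSONL.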


Options

| Option | Description |
| --- | --- |
| repo_id | Required. Hugging Face repository (e.g., nvidia/Nemotron-RL-math-OpenMathReasoning) |
| output_dirpath | Output directory. Files named {split}.jsonl. Use this OR output_fpath. |
| output_fpath | Exact output file path. Requires split or artifact_fpath. Use this OR output_dirpath. |
| artifact_fpath | Download a specific file from the repo (raw file mode) |
| split | Dataset split: train, validation, or test. Omit to download all. |
| hf_token | Authentication token for private/gated repositories |
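
The options above interact, and the combination rules are easier to see when written out as checks. The sketch below restates them as a plain Python function. It is an illustration of the rules described on this page, not the code in nemo_gym/config_types.py: the function names are hypothetical, only the two error messages quoted under Troubleshooting come from this page, and rejecting both output options together is an assumption based on the "Use this OR" wording.

```python
from pathlib import Path


def validate_download_options(repo_id, output_dirpath=None, output_fpath=None,
                              artifact_fpath=None, split=None):
    """Check option combinations as described on this page (illustrative only)."""
    if not repo_id:
        raise ValueError("repo_id is required")
    if not output_dirpath and not output_fpath:
        raise ValueError("Either output_dirpath or output_fpath must be provided")
    if output_dirpath and output_fpath:
        # Assumption: the page says "Use this OR ...", so treat both as an error.
        raise ValueError("Use output_dirpath or output_fpath, not both")
    if artifact_fpath and split:
        raise ValueError("Cannot specify both artifact_fpath and split")
    if output_fpath and not (split or artifact_fpath):
        raise ValueError("output_fpath requires split or artifact_fpath")


def split_output_path(output_dirpath, split):
    """With output_dirpath, each split is written as {split}.jsonl."""
    return Path(output_dirpath) / f"{split}.jsonl"
```

For example, `validate_download_options("nvidia/Nemotron-RL-knowledge-mcqa", output_fpath="./data/train.jsonl")` raises, because an exact file path needs either a split or an artifact to identify what is written there.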

Download Methods


NVIDIA Datasets

Ready-to-use datasets for common training tasks:

| Dataset | Repository | Domain |
| --- | --- | --- |
| OpenMathReasoning | nvidia/Nemotron-RL-math-OpenMathReasoning | Math |
| Competitive Coding | nvidia/nemotron-RL-coding-competitive_coding | Code |
| Workplace Assistant | nvidia/Nemotron-RL-agent-workplace_assistant | Agent |
| Structured Outputs | nvidia/Nemotron-RL-instruction_following-structured_outputs | Instruction |
| MCQA | nvidia/Nemotron-RL-knowledge-mcqa | Knowledge |

Troubleshooting

huggingface_hub.utils.HfHubHTTPError: 401 Client Error

Fix: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.

huggingface_hub.utils.HfHubHTTPError: 404 Client Error

Fix: Check repo_id format is organization/dataset-name. Verify the repository exists and is public (or you have access).

ValueError: Either output_dirpath or output_fpath must be provided

Fix: Add +output_dirpath=./data/ or +output_fpath=./data/train.jsonl.

ValueError: Cannot specify both artifact_fpath and split

Fix: Use artifact_fpath for raw files OR split for structured datasets—not both.
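
For the 404 case above, a quick local check of the repo_id shape can rule out typos (a pasted URL, a missing organization) before any network call is made. The following is a minimal sketch: the helper name and the regular expression are assumptions for illustration, and the pattern is a loose sanity check rather than Hugging Face's own validation rules.

```python
import re

# Assumed pattern: "<organization>/<dataset-name>", where each part starts with
# a letter or digit and may also contain ".", "_" or "-". This is a local
# sanity check only, not Hugging Face's validation logic.
_REPO_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def looks_like_repo_id(repo_id):
    """Return True if repo_id has the organization/dataset-name shape."""
    return _REPO_ID_RE.fullmatch(repo_id) is not None
```

A value that passes this check can still produce a 404 if the repository does not exist or is not accessible with the current token.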


Private Repositories

Avoid passing tokens on the command line—they appear in shell history.

Recommended (use an environment variable):

$ export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
$ ng_download_dataset_from_hf \
> +repo_id=my-org/private-dataset \
> +output_dirpath=./data/

Get your token at huggingface.co/settings/tokens. Use a read-only token.
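
A wrapper script that launches the download can check that HF_TOKEN is set before starting, and can log a masked form of the token instead of the token itself. The sketch below uses only the standard library. Both helper names are hypothetical, and the masking format is an arbitrary choice for illustration.

```python
import os


def read_hf_token(env=None):
    """Return the HF_TOKEN value from the environment, or None if unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def mask_token(token, visible=4):
    """Mask a token for logging, keeping only the first few characters."""
    if not token:
        return "<not set>"
    return token[:visible] + "*" * max(len(token) - visible, 0)
```

Printing `mask_token(read_hf_token())` in a log shows whether a token was picked up without exposing it.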

Not recommended for shared systems:

$ ng_download_dataset_from_hf \
> +repo_id=my-org/private-dataset \
> +hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx \
> +output_dirpath=./data/

NeMo Gym can automatically download missing datasets during data preparation. Configure huggingface_identifier in your resources server config:

datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/code_gen/data/train.jsonl
    huggingface_identifier:
      repo_id: nvidia/nemotron-RL-coding-competitive_coding
      artifact_fpath: opencodereasoning_filtered_25k_train.jsonl
    license: Apache 2.0

Run with download enabled:

$ config_paths="resources_servers/code_gen/configs/code_gen.yaml"
$ ng_prepare_data "+config_paths=[${config_paths}]" \
> +output_dirpath=./data/prepared \
> +mode=train_preparation \
> +should_download=true \
> +data_source=huggingface

If jsonl_fpath doesn’t exist locally, NeMo Gym downloads from huggingface_identifier before processing.
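
The download-if-missing behavior described above can be pictured as a small guard around the download step. The sketch below is an illustration under stated assumptions, not the logic in nemo_gym/train_data_utils.py: `ensure_local_jsonl` is a hypothetical name, and `download` is a caller-supplied callable standing in for the real Hugging Face download.

```python
from pathlib import Path


def ensure_local_jsonl(jsonl_fpath, download):
    """Call download(path) only when jsonl_fpath is missing; return True if it ran.

    `download` is a caller-supplied callable standing in for the real
    Hugging Face download step (hypothetical; not a NeMo Gym function).
    """
    path = Path(jsonl_fpath)
    if path.exists():
        return False  # The local file wins; nothing is downloaded.
    path.parent.mkdir(parents=True, exist_ok=True)
    download(path)
    if not path.exists():
        raise FileNotFoundError(f"download did not create {path}")
    return True
```

The practical consequence is that an existing local file is never refreshed by this path: to pick up a newer version of the dataset, the local jsonl_fpath has to be removed first.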

Downloads use Hugging Face’s cache at ~/.cache/huggingface/.

  • Structured datasets: Reads from cache (fast), overwrites output file
  • Raw files: Uses cached copy, then copies to output path

To force a fresh download:

$ rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
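
The same cleanup can be scripted. The sketch below maps a repo_id to the hub cache folder name used in the command above (the slash becomes `--`, prefixed with `datasets--`) and removes that folder. It assumes the default cache location stated on this page; if the HF_HOME or HF_HUB_CACHE environment variables relocate the cache, pass the actual directory as `hub_cache`. The helper names are hypothetical.

```python
import shutil
from pathlib import Path


def hub_dataset_cache_dir(repo_id, hub_cache="~/.cache/huggingface/hub"):
    """Map 'org/dataset' to the hub cache folder 'datasets--org--dataset'."""
    return Path(hub_cache).expanduser() / ("datasets--" + repo_id.replace("/", "--"))


def clear_dataset_cache(repo_id, hub_cache="~/.cache/huggingface/hub"):
    """Delete the cached copy of one dataset; return True if something was removed."""
    cache_dir = hub_dataset_cache_dir(repo_id, hub_cache)
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True
```

Limiting the deletion to a single dataset folder leaves other cached models and datasets untouched.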

| Section | Source |
| --- | --- |
| Config schema | nemo_gym/config_types.py:306-349 |
| Download logic | nemo_gym/hf_utils.py:57-115 |
| Validation rules | nemo_gym/config_types.py:334-349 |
| Auto-download | nemo_gym/train_data_utils.py:476-494 |

Next Steps