Download from Hugging Face#

Download JSONL datasets from Hugging Face Hub for NeMo Gym training.

Goal: Download a dataset from Hugging Face Hub in JSONL format for training.

Prerequisites: NeMo Gym installed (Detailed Setup Guide)


Quick Start#

ng_download_dataset_from_hf \
    +repo_id=nvidia/Nemotron-RL-math-OpenMathReasoning \
    +split=train \
    +output_fpath=./data/train.jsonl
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl

Note

NeMo Gym uses Hydra for configuration. Arguments use +key=value syntax.


Options#

Option

Description

repo_id

Required. Hugging Face repository (e.g., nvidia/Nemotron-RL-math-OpenMathReasoning)

output_dirpath

Output directory. Files named {split}.jsonl. Use this OR output_fpath.

output_fpath

Exact output file path. Requires split or artifact_fpath. Use this OR output_dirpath.

artifact_fpath

Download a specific file from the repo (raw file mode)

split

Dataset split: train, validation, or test. Omit to download all.

hf_token

Authentication token for private/gated repositories


Download Methods#

Downloads using the datasets library and converts to JSONL.

Use when: Repository uses Hugging Face’s standard dataset format.

All splits:

ng_download_dataset_from_hf \
    +repo_id=nvidia/Nemotron-RL-knowledge-mcqa \
    +output_dirpath=./data/
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
[Nemo-Gym] - Downloaded validation split to: ./data/validation.jsonl

Single split:

ng_download_dataset_from_hf \
    +repo_id=SWE-Gym/SWE-Gym \
    +split=train \
    +output_fpath=./data/train.jsonl

Downloads a specific file directly without conversion.

Use when: Repository contains pre-formatted JSONL files.

ng_download_dataset_from_hf \
    +repo_id=nvidia/nemotron-RL-coding-competitive_coding \
    +artifact_fpath=opencodereasoning_filtered_25k_train.jsonl \
    +output_fpath=./data/train.jsonl
[Nemo-Gym] - Downloaded opencodereasoning_filtered_25k_train.jsonl to: ./data/train.jsonl

Downloads using the datasets library directly with streaming support.

Use when: You need custom preprocessing, streaming for large datasets, or specific split handling.

import json
from datasets import load_dataset

output_file = "train.jsonl"
dataset_name = "nvidia/OpenMathInstruct-2"
split_name = "train_1M"  # Check dataset page for available splits

with open(output_file, "w", encoding="utf-8") as f:
    for line in load_dataset(dataset_name, split=split_name, streaming=True):
        f.write(json.dumps(line) + "\n")

Run the script:

uv run download.py

Verify the download:

wc -l train.jsonl
# Expected: 1000000 train.jsonl

Streaming benefits:

  • Memory-efficient for large datasets (millions of rows)

  • Progress visible during download

Note

For gated or private datasets, authenticate first:

export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx

Or use huggingface-cli login before running the script.


NVIDIA Datasets#

Ready-to-use datasets for common training tasks:

Dataset

Repository

Domain

OpenMathReasoning

nvidia/Nemotron-RL-math-OpenMathReasoning

Math

Competitive Coding

nvidia/nemotron-RL-coding-competitive_coding

Code

Workplace Assistant

nvidia/Nemotron-RL-agent-workplace_assistant

Agent

Structured Outputs

nvidia/Nemotron-RL-instruction_following-structured_outputs

Instruction

MCQA

nvidia/Nemotron-RL-knowledge-mcqa

Knowledge


Troubleshooting#

Authentication Failed (401)
huggingface_hub.utils.HfHubHTTPError: 401 Client Error

Fix: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.

Repository Not Found (404)
huggingface_hub.utils.HfHubHTTPError: 404 Client Error

Fix: Check repo_id format is organization/dataset-name. Verify the repository exists and is public (or you have access).

Validation Error: Output Path
ValueError: Either output_dirpath or output_fpath must be provided

Fix: Add +output_dirpath=./data/ or +output_fpath=./data/train.jsonl.

Validation Error: Conflicting Options
ValueError: Cannot specify both artifact_fpath and split

Fix: Use artifact_fpath for raw files OR split for structured datasets—not both.


Private Repositories#

Warning

Avoid passing tokens on the command line—they appear in shell history.

Recommended — Use environment variable:

export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
ng_download_dataset_from_hf \
    +repo_id=my-org/private-dataset \
    +output_dirpath=./data/

Get your token at huggingface.co/settings/tokens. Use a read-only token.

Alternative: Pass token directly

Not recommended for shared systems:

ng_download_dataset_from_hf \
    +repo_id=my-org/private-dataset \
    +hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx \
    +output_dirpath=./data/
Automatic Downloads During Data Preparation

NeMo Gym can automatically download missing datasets during data preparation. Configure huggingface_identifier in your resources server config:

datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/code_gen/data/train.jsonl
    huggingface_identifier:
      repo_id: nvidia/nemotron-RL-coding-competitive_coding
      artifact_fpath: opencodereasoning_filtered_25k_train.jsonl
    license: Apache 2.0

Run with download enabled:

config_paths="resources_servers/code_gen/configs/code_gen.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" \
    +output_dirpath=./data/prepared \
    +mode=train_preparation \
    +should_download=true \
    +data_source=huggingface

If jsonl_fpath doesn’t exist locally, NeMo Gym downloads from huggingface_identifier before processing.

Caching Behavior

Downloads use Hugging Face’s cache at ~/.cache/huggingface/.

  • Structured datasets: Reads from cache (fast), overwrites output file

  • Raw files: Uses cached copy, then copies to output path

To force fresh download:

rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
Source References

Section

Source

Config schema

nemo_gym/config_types.py:306-349

Download logic

nemo_gym/hf_utils.py:57-115

Validation rules

nemo_gym/config_types.py:334-349

Auto-download

nemo_gym/train_data_utils.py:476-494

Next Steps#

Prepare and Validate

Preprocess raw data, run ng_prepare_data, and add agent_ref routing.

Prepare and Validate Data
Collect Rollouts

Generate training examples by running your agent on prepared data.

Rollout Collection
Train with NeMo RL

Use validated data with NeMo RL for GRPO training.

RL Training with NeMo RL using GRPO