Download from Hugging Face#
Download JSONL datasets from Hugging Face Hub for NeMo Gym training.
Goal: Download a dataset from Hugging Face Hub in JSONL format for training.
Prerequisites: NeMo Gym installed (Detailed Setup Guide)
Quick Start#
ng_download_dataset_from_hf \
+repo_id=nvidia/Nemotron-RL-math-OpenMathReasoning \
+split=train \
+output_fpath=./data/train.jsonl
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
Note
NeMo Gym uses Hydra for configuration. Arguments use `+key=value` syntax.
Options#
| Option | Description |
|---|---|
| `repo_id` | Required. Hugging Face repository (e.g., `nvidia/Nemotron-RL-math-OpenMathReasoning`) |
| `output_dirpath` | Output directory. Files are named after each split (e.g., `train.jsonl`) |
| `output_fpath` | Exact output file path. Requires `split` or `artifact_fpath` |
| `artifact_fpath` | Download a specific file from the repo (raw file mode) |
| `split` | Dataset split to download (e.g., `train`, `validation`) |
| `hf_token` | Authentication token for private/gated repositories |
Download Methods#
Structured Dataset
Downloads the dataset using the `datasets` library and converts it to JSONL.
Use when: Repository uses Hugging Face’s standard dataset format.
All splits:
ng_download_dataset_from_hf \
+repo_id=nvidia/Nemotron-RL-knowledge-mcqa \
+output_dirpath=./data/
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
[Nemo-Gym] - Downloaded validation split to: ./data/validation.jsonl
Single split:
ng_download_dataset_from_hf \
+repo_id=SWE-Gym/SWE-Gym \
+split=train \
+output_fpath=./data/train.jsonl
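If you are unsure which splits a repository provides, you can list them first. A minimal sketch using the `datasets` library (the repository name is just the one from the example above):
from datasets import get_dataset_split_names

# Lists the split names (e.g., ["train"]) without downloading the data itself.
splits = get_dataset_split_names("SWE-Gym/SWE-Gym")
print(splits)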
Raw File
Downloads a specific file directly, without conversion.
Use when: Repository contains pre-formatted JSONL files.
ng_download_dataset_from_hf \
+repo_id=nvidia/nemotron-RL-coding-competitive_coding \
+artifact_fpath=opencodereasoning_filtered_25k_train.jsonl \
+output_fpath=./data/train.jsonl
[Nemo-Gym] - Downloaded opencodereasoning_filtered_25k_train.jsonl to: ./data/train.jsonl
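Raw file mode is a plain file fetch; if you prefer Python, a rough equivalent uses `huggingface_hub.hf_hub_download` (repository and filename mirror the example above). Note that this returns the cached path rather than copying to `./data/train.jsonl`:
from huggingface_hub import hf_hub_download

# Fetches one file from a dataset repo into the local cache and
# returns the path to the cached copy.
path = hf_hub_download(
    repo_id="nvidia/nemotron-RL-coding-competitive_coding",
    filename="opencodereasoning_filtered_25k_train.jsonl",
    repo_type="dataset",
)
print(path)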
Python Script (Streaming)
Downloads using the `datasets` library directly, with streaming support.
Use when: You need custom preprocessing, streaming for large datasets, or specific split handling.
import json

from datasets import load_dataset

output_file = "train.jsonl"
dataset_name = "nvidia/OpenMathInstruct-2"
split_name = "train_1M"  # Check the dataset page for available splits

# Stream records one at a time instead of loading the full dataset into
# memory, writing each record as a single JSON line.
with open(output_file, "w", encoding="utf-8") as f:
    for line in load_dataset(dataset_name, split=split_name, streaming=True):
        f.write(json.dumps(line) + "\n")
Save the script as `download.py` and run it:
uv run download.py
Verify the download:
wc -l train.jsonl
# Expected: 1000000 train.jsonl
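Beyond counting lines, you may want to confirm that every line parses as JSON. A small sketch:
import json

# Fail fast on the first malformed line, reporting its line number.
with open("train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON on line {i}") from e
print("All lines are valid JSON")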
Streaming benefits:
- Memory-efficient for large datasets (millions of rows)
- Progress is visible during download
Note
For gated or private datasets, authenticate first:
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
Or use huggingface-cli login before running the script.
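You can also authenticate from Python before calling `load_dataset`, using `huggingface_hub.login`:
import os

from huggingface_hub import login

# Reuses HF_TOKEN if set; falls back to an interactive prompt otherwise.
login(token=os.environ.get("HF_TOKEN"))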
NVIDIA Datasets#
Ready-to-use datasets for common training tasks:
| Dataset | Repository | Domain |
|---|---|---|
| OpenMathReasoning | `nvidia/Nemotron-RL-math-OpenMathReasoning` | Math |
| Competitive Coding | `nvidia/nemotron-RL-coding-competitive_coding` | Code |
| Workplace Assistant | | Agent |
| Structured Outputs | | Instruction |
| MCQA | `nvidia/Nemotron-RL-knowledge-mcqa` | Knowledge |
Troubleshooting#
Authentication Failed (401)
huggingface_hub.utils.HfHubHTTPError: 401 Client Error
Fix: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.
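To check whether your token is valid at all, a quick sketch using `huggingface_hub.whoami`:
from huggingface_hub import whoami

# Raises an HTTP error if the token is invalid; otherwise returns account info.
info = whoami(token="hf_xxxxxxxxxxxxxxxxxxxxxxxxx")
print(info["name"])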
Repository Not Found (404)
huggingface_hub.utils.HfHubHTTPError: 404 Client Error
Fix: Check that the `repo_id` format is `organization/dataset-name`. Verify that the repository exists and is public (or that you have access).
Validation Error: Output Path
ValueError: Either output_dirpath or output_fpath must be provided
Fix: Add +output_dirpath=./data/ or +output_fpath=./data/train.jsonl.
Validation Error: Conflicting Options
ValueError: Cannot specify both artifact_fpath and split
Fix: Use `artifact_fpath` for raw files or `split` for structured datasets, not both.
Private Repositories#
Warning
Avoid passing tokens on the command line; they appear in shell history.
Recommended — Use environment variable:
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
ng_download_dataset_from_hf \
+repo_id=my-org/private-dataset \
+output_dirpath=./data/
Get your token at huggingface.co/settings/tokens. Use a read-only token.
Alternative: Pass token directly
Not recommended for shared systems:
ng_download_dataset_from_hf \
+repo_id=my-org/private-dataset \
+hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx \
+output_dirpath=./data/
Automatic Downloads During Data Preparation
NeMo Gym can automatically download missing datasets during data preparation. Configure huggingface_identifier in your resources server config:
datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/code_gen/data/train.jsonl
    huggingface_identifier:
      repo_id: nvidia/nemotron-RL-coding-competitive_coding
      artifact_fpath: opencodereasoning_filtered_25k_train.jsonl
    license: Apache 2.0
Run with download enabled:
config_paths="resources_servers/code_gen/configs/code_gen.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" \
+output_dirpath=./data/prepared \
+mode=train_preparation \
+should_download=true \
+data_source=huggingface
If jsonl_fpath doesn’t exist locally, NeMo Gym downloads from huggingface_identifier before processing.
Caching Behavior
Downloads use Hugging Face’s cache at ~/.cache/huggingface/.
- Structured datasets: Reads from the cache (fast) and overwrites the output file
- Raw files: Uses the cached copy, then copies it to the output path
To force fresh download:
rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
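To see what is currently cached (and how much disk it uses) before deleting anything, one option is `huggingface_hub.scan_cache_dir`:
from huggingface_hub import scan_cache_dir

# Walks ~/.cache/huggingface/hub and reports each cached dataset repo's size.
for repo in scan_cache_dir().repos:
    if repo.repo_type == "dataset":
        print(repo.repo_id, repo.size_on_disk_str)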
Next Steps#
- Preprocess raw data, run ng_prepare_data, and add agent_ref routing.
- Generate training examples by running your agent on prepared data.
- Use validated data with NeMo RL for GRPO training.