Download from Hugging Face
Download JSONL datasets from Hugging Face Hub for NeMo Gym training.
Goal: Download a dataset from Hugging Face Hub in JSONL format for training.
Prerequisites: NeMo Gym installed (Installation)
Quick Start
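NeMo Gym uses Hydra for configuration, so command-line arguments use Hydra's +key=value override syntax. The + prefix adds a key that is not already present in the config.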
Options
Download Methods
Structured Dataset (Recommended)
Raw File
Python Script
Downloads using the datasets library and converts to JSONL.
Use when: Repository uses Hugging Face’s standard dataset format.
All splits:
Single split:
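As a rough illustration of what the structured path amounts to, the sketch below loads a dataset with the datasets library and writes JSONL for a single split or for every split. It is not NeMo Gym's implementation: the function names and the one-file-per-split naming are assumptions made for this example, and records containing non-JSON types (bytes, timestamps) would need extra handling.

```python
import json
from pathlib import Path


def write_jsonl(records, output_fpath):
    """Write an iterable of dict-like records as JSONL and return the row count."""
    path = Path(output_fpath)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # One JSON object per line is what makes the file JSONL.
            f.write(json.dumps(dict(record), ensure_ascii=False) + "\n")
            count += 1
    return count


def download_split(repo_id, split, output_fpath):
    """Single split: load one split and write it to one JSONL file."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split)
    return write_jsonl(dataset, output_fpath)


def download_all_splits(repo_id, output_dirpath):
    """All splits: write <split>.jsonl per split (file naming is an assumption)."""
    from datasets import load_dataset

    # Without a split argument, load_dataset returns a mapping of split name to dataset.
    dataset_dict = load_dataset(repo_id)
    return {
        name: write_jsonl(split_ds, Path(output_dirpath) / f"{name}.jsonl")
        for name, split_ds in dataset_dict.items()
    }
```

Both download functions need network access and the datasets package; write_jsonl alone works on any list of dictionaries.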
NVIDIA Datasets
Ready-to-use datasets for common training tasks:
Troubleshooting
Authentication Failed (401)
Fix: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.
Repository Not Found (404)
Fix: Check that repo_id has the form organization/dataset-name. Verify that the repository exists and is public (or that you have access).
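A quick local check of the repo_id shape can rule out typos before a request is made. The pattern below is a simplified assumption for illustration; Hugging Face applies additional naming rules, and a well-formed repo_id can still return 404 if the repository is private or does not exist.

```python
import re

# Simplified, hypothetical pattern: "<organization>/<dataset-name>", where each
# part starts with a letter or digit and may contain ".", "_" or "-".
_REPO_ID_PATTERN = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*"
)


def looks_like_repo_id(repo_id):
    """Return True if repo_id has the organization/dataset-name shape."""
    return _REPO_ID_PATTERN.fullmatch(repo_id) is not None
```

A full URL or a bare dataset name fails this check, which covers the most common causes of the 404.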
Validation Error: Output Path
Fix: Add +output_dirpath=./data/ or +output_fpath=./data/train.jsonl.
Validation Error: Conflicting Options
Fix: Use artifact_fpath for raw files or split for structured datasets, not both.
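The two validation errors above reduce to two rules: at least one output path must be set, and artifact_fpath and split are mutually exclusive. The function below is a hypothetical restatement of those rules for illustration only; it is not NeMo Gym's validator, and the real error messages will differ.

```python
def validate_download_options(output_dirpath=None, output_fpath=None,
                              artifact_fpath=None, split=None):
    """Raise ValueError for the two misconfigurations described above."""
    # Rule 1: the download needs somewhere to write.
    if output_dirpath is None and output_fpath is None:
        raise ValueError("Set output_dirpath or output_fpath.")
    # Rule 2: raw-file mode and structured-dataset mode cannot be combined.
    if artifact_fpath is not None and split is not None:
        raise ValueError(
            "Use artifact_fpath (raw file) or split (structured dataset), not both."
        )
```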
Private Repositories
Avoid passing tokens on the command line, because they appear in shell history.
Recommended: Use an environment variable:
Get your token at huggingface.co/settings/tokens. Use a read-only token.
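For scripts that call the Hugging Face libraries directly, the same idea looks roughly like the sketch below. It assumes the token is stored in HF_TOKEN, the variable that huggingface_hub reads by default; whether NeMo Gym consumes the variable in exactly this way is not confirmed here, and the helper names are made up for the example.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token from the environment, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def download_private_file(repo_id, filename):
    """Fetch one file from a private dataset repo without putting the token in argv."""
    # Imported lazily; requires the huggingface_hub package and network access.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",
        token=get_hf_token(),  # None falls back to any locally saved login
    )
```

Because the token never appears as a command-line argument, it stays out of shell history and process listings.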
Alternative: Pass token directly
Not recommended for shared systems:
Automatic Downloads During Data Preparation
NeMo Gym can automatically download missing datasets during data preparation. Configure huggingface_identifier in your resources server config:
Run with download enabled:
If jsonl_fpath doesn’t exist locally, NeMo Gym downloads from huggingface_identifier before processing.
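The fallback described above can be pictured as a check-then-download step. This is a minimal sketch of that logic, not NeMo Gym's code: ensure_local_jsonl and download_fn are hypothetical names, and the downloader is passed in as a callable so the example stays independent of any particular download API.

```python
from pathlib import Path


def ensure_local_jsonl(jsonl_fpath, huggingface_identifier, download_fn):
    """Return (path, downloaded), downloading only if the local file is missing."""
    path = Path(jsonl_fpath)
    if path.exists():
        # The local file wins; nothing is fetched.
        return path, False
    if huggingface_identifier is None:
        raise FileNotFoundError(
            f"{path} is missing and no huggingface_identifier is configured."
        )
    # download_fn is a caller-supplied callable: (identifier, destination path).
    download_fn(huggingface_identifier, path)
    if not path.exists():
        raise FileNotFoundError(f"Download did not produce {path}.")
    return path, True
```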
Caching Behavior
Downloads use Hugging Face’s cache at ~/.cache/huggingface/.
- Structured datasets: Reads from cache (fast), overwrites output file
- Raw files: Uses cached copy, then copies to output path
To force a fresh download:
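When calling the Hugging Face libraries directly from Python, a re-download can be requested explicitly instead of clearing the cache by hand. The sketch below uses the download_mode option of load_dataset and the force_download option of hf_hub_download; the wrapper function names are hypothetical, and the cache-root helper is a simplification that only considers HF_HOME.

```python
import os
from pathlib import Path


def hf_cache_root():
    """Return the cache root: HF_HOME if set, else ~/.cache/huggingface.

    Simplified: other variables that can relocate the cache are ignored here.
    """
    return Path(os.environ.get("HF_HOME", "~/.cache/huggingface")).expanduser()


def reload_structured(repo_id, split):
    """Structured dataset: bypass the cached copy and download again."""
    from datasets import load_dataset  # lazy import; needs network access

    return load_dataset(repo_id, split=split, download_mode="force_redownload")


def refetch_raw_file(repo_id, filename):
    """Raw file: re-download even if a cached copy already exists."""
    from huggingface_hub import hf_hub_download  # lazy import; needs network access

    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",
        force_download=True,
    )
```

Deleting the relevant directory under the cache root has a similar effect, at the cost of re-downloading everything stored there.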