Download JSONL datasets from Hugging Face Hub for NeMo Gym training.
Goal: Download a dataset from Hugging Face Hub in JSONL format for training.
Prerequisites: NeMo Gym installed (see Detailed Setup).
NeMo Gym uses Hydra for configuration, so command-line arguments are passed as Hydra overrides in +key=value form.
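To make the +key=value form concrete, the sketch below splits such arguments into a dictionary. It is a simplified illustration, not Hydra's actual parser, which also handles nested keys, quoting, and typed values. The helper name parse_overrides and the example values are hypothetical; only the key names repo_id and output_dirpath come from this guide.

```python
def parse_overrides(args):
    """Split Hydra-style '+key=value' arguments into a dict of strings.

    Illustrative only: Hydra's real parser also handles nested keys,
    quoting, lists, and type conversion.
    """
    parsed = {}
    for arg in args:
        if not arg.startswith("+") or "=" not in arg:
            raise ValueError(f"expected +key=value, got {arg!r}")
        key, value = arg[1:].split("=", 1)  # split on the first '=' only
        if not key:
            raise ValueError(f"missing key in {arg!r}")
        parsed[key] = value
    return parsed


# Hypothetical values; the key names are taken from this guide.
example = parse_overrides(["+repo_id=my-org/my-dataset", "+output_dirpath=./data/"])
```

Splitting on the first "=" only keeps values that themselves contain "=" intact.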
Downloads the dataset with the Hugging Face datasets library and converts it to JSONL.
Use when: The repository uses Hugging Face’s standard dataset format.
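As a rough picture of what "download with the datasets library and convert to JSONL" involves, here is a minimal Python sketch. It is not NeMo Gym's implementation: write_jsonl and export_split are hypothetical helper names, and the sketch assumes every record contains only JSON-serializable values.

```python
import json


def write_jsonl(rows, path):
    """Write an iterable of dict records to `path`, one JSON object per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


def export_split(repo_id, split, out_path):
    """Download one split with the `datasets` library and save it as JSONL.

    Requires `pip install datasets` and network access. Assumes the records
    hold only JSON-serializable values (no images, timestamps, and so on).
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    dataset = load_dataset(repo_id, split=split)  # iterating yields dict rows
    return write_jsonl(dataset, out_path)
```

The datasets library can also do the conversion itself: Dataset.to_json writes JSON Lines by default, which is usually the simpler choice when no per-record processing is needed.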
All splits:
Single split:
Ready-to-use datasets for common training tasks:
Fix: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.
Fix: Check repo_id format is organization/dataset-name. Verify the repository exists and is public (or you have access).
Fix: Add +output_dirpath=./data/ or +output_fpath=./data/train.jsonl.
Fix: Use either artifact_fpath (for raw files) or split (for structured datasets), not both.
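The argument-related fixes above can be summarized as a small pre-flight check. The sketch below is an assumption-laden illustration, not NeMo Gym's own validation: check_download_args is a hypothetical helper, the repo_id pattern is a simplified approximation of the Hub's naming rules, and the messages are invented for this example.

```python
import re

# Simplified pattern for "organization/dataset-name". The Hub's real naming
# rules differ in details, so treat this as a quick sanity check only.
REPO_ID_PATTERN = re.compile(r"^[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*$", re.ASCII)


def check_download_args(repo_id, output_dirpath=None, output_fpath=None,
                        artifact_fpath=None, split=None):
    """Return a list of problems found in the arguments (empty if none)."""
    problems = []
    if not REPO_ID_PATTERN.match(repo_id):
        problems.append("repo_id must look like organization/dataset-name")
    if output_dirpath is None and output_fpath is None:
        problems.append("set output_dirpath or output_fpath")
    if artifact_fpath is not None and split is not None:
        problems.append("use artifact_fpath or split, not both")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report every issue in one pass.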
Avoid passing tokens on the command line, where they are recorded in shell history.
Recommended: Use an environment variable:
Get your token at huggingface.co/settings/tokens. Use a read-only token.
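For scripts that call the Hub directly, the token can be read from the environment instead of being hard-coded. The sketch below assumes the variable is named HF_TOKEN, which is the name the huggingface_hub library reads; whether the tool you run honors the same variable is an assumption. get_hf_token is a hypothetical helper.

```python
import os


def get_hf_token(env=None):
    """Return the Hugging Face token from the environment, or None if unset.

    HF_TOKEN is the variable read by the huggingface_hub library; that other
    tools honor the same name is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()  # tolerate stray whitespace
    return token or None
```

Passing a mapping as env makes the function easy to exercise without touching the real environment.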
Not recommended for shared systems:
NeMo Gym can automatically download missing datasets during data preparation. Configure huggingface_identifier in your resources server config:
Run with download enabled:
If jsonl_fpath doesn’t exist locally, NeMo Gym downloads from huggingface_identifier before processing.
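The fallback described here amounts to "use the local file if present, otherwise download first". A minimal sketch of that logic, under stated assumptions: ensure_dataset is a hypothetical helper, and download stands in for whatever actually performs the transfer; neither is part of NeMo Gym's API.

```python
from pathlib import Path


def ensure_dataset(jsonl_fpath, huggingface_identifier, download):
    """Return the local path, calling `download` only if the file is missing.

    `download(identifier, destination)` is a hypothetical callable supplied
    by the caller; it stands in for the real transfer step.
    """
    path = Path(jsonl_fpath)
    if path.exists():
        return path  # local file wins; no network access needed
    if huggingface_identifier is None:
        raise FileNotFoundError(
            f"{path} is missing and no huggingface_identifier is set"
        )
    path.parent.mkdir(parents=True, exist_ok=True)
    download(huggingface_identifier, path)
    return path
```

Because the existence check comes first, a dataset that is already on disk is never fetched again.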
Downloads use Hugging Face’s cache at ~/.cache/huggingface/.
To force a fresh download:
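When working from Python rather than the command line, the Hugging Face libraries offer two relevant hooks, sketched below. hf_cache_root and reload_split are hypothetical helper names. The cache lookup is simplified (it honors HF_HOME but ignores other overrides such as XDG_CACHE_HOME), and whether NeMo Gym exposes an equivalent of force_redownload is not covered here.

```python
import os
from pathlib import Path


def hf_cache_root(env=None):
    """Return the Hugging Face cache root: $HF_HOME if set, else ~/.cache/huggingface.

    Simplified: other overrides (for example XDG_CACHE_HOME) are ignored.
    """
    env = os.environ if env is None else env
    custom = env.get("HF_HOME")
    if custom:
        return Path(custom).expanduser()
    return Path.home() / ".cache" / "huggingface"


def reload_split(repo_id, split):
    """Re-download a split, ignoring any cached copy.

    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    return load_dataset(repo_id, split=split, download_mode="force_redownload")
```

Knowing the cache root is useful for inspecting or clearing stale entries by hand, while force_redownload avoids deleting anything that other projects may still rely on.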