
Copyright © 2026, NVIDIA Corporation.

Download from Hugging Face


Download JSONL datasets from Hugging Face Hub for NeMo Gym training.

Goal: Download a dataset from Hugging Face Hub in JSONL format for training.

Prerequisites: NeMo Gym installed (Detailed Setup)


Quick Start

$ ng_download_dataset_from_hf \
> +repo_id=nvidia/Nemotron-RL-math-OpenMathReasoning \
> +split=train \
> +output_fpath=./data/train.jsonl
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl

NeMo Gym uses Hydra for configuration. Arguments use +key=value syntax.
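
As a rough illustration of the +key=value form, the sketch below builds the Quick Start argument list from a Python dict, which can be handy when launching the command from a script. The helper name build_overrides is hypothetical and not part of NeMo Gym; only the command name and option keys come from this page.

```python
import shlex


def build_overrides(options: dict[str, str]) -> list[str]:
    """Turn {"key": "value"} pairs into Hydra-style "+key=value" arguments."""
    return [f"+{key}={value}" for key, value in options.items()]


# Hypothetical helper usage: rebuild the Quick Start command line.
argv = ["ng_download_dataset_from_hf"] + build_overrides(
    {
        "repo_id": "nvidia/Nemotron-RL-math-OpenMathReasoning",
        "split": "train",
        "output_fpath": "./data/train.jsonl",
    }
)

# shlex.join quotes arguments where needed before printing the command.
print(shlex.join(argv))
```

The resulting argv list can be handed to subprocess.run(argv, check=True) on a machine where NeMo Gym is installed.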


Options

| Option | Description |
| --- | --- |
| repo_id | Required. Hugging Face repository (e.g., nvidia/Nemotron-RL-math-OpenMathReasoning) |
| output_dirpath | Output directory. Files named {split}.jsonl. Use this OR output_fpath. |
| output_fpath | Exact output file path. Requires split or artifact_fpath. Use this OR output_dirpath. |
| artifact_fpath | Download a specific file from the repo (raw file mode) |
| split | Dataset split: train, validation, or test. Omit to download all. |
| hf_token | Authentication token for private/gated repositories |
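
The option rules above can be sketched as a small validator. This is not NeMo Gym's implementation: DownloadOptions and validate are hypothetical names, and treating output_dirpath and output_fpath as mutually exclusive is an assumption drawn from the "Use this OR" wording. Two of the error messages are the ones shown under Troubleshooting; the other two are illustrative wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DownloadOptions:
    """Hypothetical stand-in for the download config; fields follow the Options table."""

    repo_id: str
    output_dirpath: Optional[str] = None
    output_fpath: Optional[str] = None
    artifact_fpath: Optional[str] = None
    split: Optional[str] = None
    hf_token: Optional[str] = None


def validate(opts: DownloadOptions) -> None:
    """Raise ValueError when the option combination breaks a documented rule."""
    if not opts.output_dirpath and not opts.output_fpath:
        raise ValueError("Either output_dirpath or output_fpath must be provided")
    # Assumption: "Use this OR ..." is read as mutually exclusive.
    if opts.output_dirpath and opts.output_fpath:
        raise ValueError("Specify only one of output_dirpath or output_fpath")
    if opts.artifact_fpath and opts.split:
        raise ValueError("Cannot specify both artifact_fpath and split")
    # Illustrative message for the "Requires split or artifact_fpath" rule.
    if opts.output_fpath and not (opts.split or opts.artifact_fpath):
        raise ValueError("output_fpath requires split or artifact_fpath")


# A combination that satisfies every rule passes silently.
validate(DownloadOptions(repo_id="SWE-Gym/SWE-Gym", split="train", output_fpath="./data/train.jsonl"))
```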

Download Methods

Three methods are available: Structured Dataset (recommended), Raw File, and Python Script.

Structured Dataset (Recommended)

Downloads using the datasets library and converts to JSONL.

Use when: Repository uses Hugging Face’s standard dataset format.

All splits:

$ ng_download_dataset_from_hf \
> +repo_id=nvidia/Nemotron-RL-knowledge-mcqa \
> +output_dirpath=./data/
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
[Nemo-Gym] - Downloaded validation split to: ./data/validation.jsonl

Single split:

$ ng_download_dataset_from_hf \
> +repo_id=SWE-Gym/SWE-Gym \
> +split=train \
> +output_fpath=./data/train.jsonl
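
For readers who want to see roughly what "download with the datasets library and convert to JSONL" involves, here is a minimal sketch. It is not the code behind ng_download_dataset_from_hf. The function names write_jsonl and download_split are hypothetical, and the sketch assumes every row contains only JSON-serializable values.

```python
import json
from pathlib import Path
from typing import Iterable, Mapping


def write_jsonl(rows: Iterable[Mapping], output_fpath: str) -> int:
    """Write one JSON object per line and return the number of rows written."""
    path = Path(output_fpath)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(dict(row), ensure_ascii=False) + "\n")
            count += 1
    return count


def download_split(repo_id: str, split: str, output_fpath: str) -> int:
    """Load one split with the datasets library and save it as JSONL."""
    # Imported lazily so write_jsonl stays usable without the datasets package.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split)  # needs network access
    return write_jsonl(dataset, output_fpath)
```

Calling download_split("SWE-Gym/SWE-Gym", "train", "./data/train.jsonl") would mirror the single-split command above, assuming the datasets package is installed and the repository is reachable.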

NVIDIA Datasets

Ready-to-use datasets for common training tasks:

| Dataset | Repository | Domain |
| --- | --- | --- |
| OpenMathReasoning | nvidia/Nemotron-RL-math-OpenMathReasoning | Math |
| Competitive Coding | nvidia/nemotron-RL-coding-competitive_coding | Code |
| Workplace Assistant | nvidia/Nemotron-RL-agent-workplace_assistant | Agent |
| Structured Outputs | nvidia/Nemotron-RL-instruction_following-structured_outputs | Instruction |
| MCQA | nvidia/Nemotron-RL-knowledge-mcqa | Knowledge |

Troubleshooting

Authentication Failed (401)
huggingface_hub.utils.HfHubHTTPError: 401 Client Error

Fix: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.

Repository Not Found (404)
huggingface_hub.utils.HfHubHTTPError: 404 Client Error

Fix: Check that repo_id has the form organization/dataset-name. Verify that the repository exists and is public, or that you have access to it.
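
A quick local check of the organization/dataset-name shape can catch typos before any network request is made. The sketch below is an assumption-laden illustration: looks_like_repo_id is a hypothetical helper, and the character set in the pattern is a guess at commonly seen names, not the Hub's official naming rule.

```python
import re

# Assumption: one "organization/dataset-name" pair built from letters, digits,
# ".", "_" and "-". This is not the Hub's official validation rule.
_REPO_ID_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def looks_like_repo_id(repo_id: str) -> bool:
    """Return True when repo_id has the organization/dataset-name shape."""
    return _REPO_ID_PATTERN.fullmatch(repo_id) is not None


for candidate in ["nvidia/Nemotron-RL-knowledge-mcqa", "Nemotron-RL-knowledge-mcqa", "nvidia/"]:
    status = "ok" if looks_like_repo_id(candidate) else "check the format"
    print(f"{candidate!r}: {status}")
```

A passing check only confirms the shape of the string; a 404 can still occur if the repository does not exist or is private.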

Validation Error: Output Path
ValueError: Either output_dirpath or output_fpath must be provided

Fix: Add +output_dirpath=./data/ or +output_fpath=./data/train.jsonl.

Validation Error: Conflicting Options
ValueError: Cannot specify both artifact_fpath and split

Fix: Use artifact_fpath for raw files OR split for structured datasets—not both.


Private Repositories

Avoid passing tokens on the command line—they appear in shell history.

Recommended — Use environment variable:

$ export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
$ ng_download_dataset_from_hf \
> +repo_id=my-org/private-dataset \
> +output_dirpath=./data/

Get your token at huggingface.co/settings/tokens. Use a read-only token.
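
When the download is launched from a Python script, the same advice applies: keep the token in the environment and out of the argument list. The following is a minimal sketch under that assumption. require_hf_token and run_private_download are hypothetical helpers, not NeMo Gym functions.

```python
import os
import subprocess
from typing import Mapping


def require_hf_token(env: Mapping[str, str]) -> str:
    """Return the HF_TOKEN value from env, or raise if it is missing or blank."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before downloading private data")
    return token


def run_private_download(repo_id: str, output_dirpath: str) -> None:
    """Run the download with the token in the environment, not in argv."""
    require_hf_token(os.environ)
    subprocess.run(
        [
            "ng_download_dataset_from_hf",
            f"+repo_id={repo_id}",
            f"+output_dirpath={output_dirpath}",
        ],
        check=True,  # the child process inherits HF_TOKEN from this process
    )
```

Because the token never appears in the argument list, it stays out of shell history and process listings.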

Alternative: Pass token directly

Not recommended for shared systems:

$ ng_download_dataset_from_hf \
> +repo_id=my-org/private-dataset \
> +hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx \
> +output_dirpath=./data/

Automatic Downloads During Data Preparation

NeMo Gym can automatically download missing datasets during data preparation. Configure huggingface_identifier in your resources server config:

datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/code_gen/data/train.jsonl
    huggingface_identifier:
      repo_id: nvidia/nemotron-RL-coding-competitive_coding
      artifact_fpath: opencodereasoning_filtered_25k_train.jsonl
    license: Apache 2.0

Run with download enabled:

$ config_paths="resources_servers/code_gen/configs/code_gen.yaml"
$ ng_prepare_data "+config_paths=[${config_paths}]" \
> +output_dirpath=./data/prepared \
> +mode=train_preparation \
> +should_download=true \
> +data_source=huggingface

If jsonl_fpath doesn’t exist locally, NeMo Gym downloads from huggingface_identifier before processing.
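
The download-if-missing behavior can be pictured with the sketch below. It is an illustration, not the logic in nemo_gym/train_data_utils.py. ensure_local_jsonl and its download callback are hypothetical, and raising FileNotFoundError when downloads are disabled is an assumption.

```python
from pathlib import Path
from typing import Callable


def ensure_local_jsonl(
    jsonl_fpath: str,
    should_download: bool,
    download: Callable[[Path], None],
) -> Path:
    """Return the dataset path, calling download(path) first when the file is absent."""
    path = Path(jsonl_fpath)
    if path.exists():
        return path  # local copy wins; nothing is fetched
    if not should_download:
        # Assumption: a missing file with downloads disabled is treated as an error.
        raise FileNotFoundError(f"{path} is missing and should_download is false")
    path.parent.mkdir(parents=True, exist_ok=True)
    download(path)
    return path
```

In this sketch, an existing local file short-circuits the download entirely.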

Caching Behavior

Downloads use Hugging Face’s cache at ~/.cache/huggingface/.

  • Structured datasets: Reads from cache (fast), overwrites output file
  • Raw files: Uses cached copy, then copies to output path

To force fresh download:

$ rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
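
The same cleanup can be scripted. The sketch below maps a repo_id to the datasets--<org>--<dataset> folder name used in the command above and removes it if present. Both helper names are hypothetical, and the default location assumes the cache has not been relocated (for example through Hugging Face environment variables such as HF_HOME).

```python
import os
import shutil
from pathlib import Path

# Default Hub cache location named in this section; an assumption if the
# cache has been moved elsewhere on your machine.
DEFAULT_HUB_CACHE = Path(os.path.expanduser("~/.cache/huggingface/hub"))


def hub_dataset_cache_dir(repo_id: str, cache_root: Path = DEFAULT_HUB_CACHE) -> Path:
    """Map "org/dataset" to the datasets--org--dataset folder under the Hub cache."""
    return cache_root / ("datasets--" + repo_id.replace("/", "--"))


def clear_dataset_cache(repo_id: str, cache_root: Path = DEFAULT_HUB_CACHE) -> bool:
    """Delete the cached copy if present; return True when something was removed."""
    target = hub_dataset_cache_dir(repo_id, cache_root)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

Calling clear_dataset_cache with a repo_id before re-running the download forces a fresh fetch of that dataset only, leaving other cached repositories untouched.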

Source References

| Section | Source |
| --- | --- |
| Config schema | nemo_gym/config_types.py:306-349 |
| Download logic | nemo_gym/hf_utils.py:57-115 |
| Validation rules | nemo_gym/config_types.py:334-349 |
| Auto-download | nemo_gym/train_data_utils.py:476-494 |

Next Steps

Prepare and Validate

Preprocess raw data, run ng_prepare_data, and add agent_ref routing.

Collect Rollouts

Generate training examples by running your agent on prepared data.

Train with NeMo RL

Use validated data with NeMo RL for GRPO training.