# Download from Hugging Face

Download JSONL datasets from Hugging Face Hub for NeMo Gym training.

**Goal**: Download a dataset from Hugging Face Hub in JSONL format for training.

**Prerequisites**: NeMo Gym installed ([Installation](/get-started/installation))

***

## Quick Start

```bash
ng_download_dataset_from_hf \
    +repo_id=nvidia/Nemotron-RL-math-OpenMathReasoning \
    +split=train \
    +output_fpath=./data/train.jsonl
```

```text
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
```

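NeMo Gym uses [Hydra](https://hydra.cc/) for configuration. Arguments use Hydra's `+key=value` override syntax; in Hydra, the `+` prefix adds a key that is not already present in the base configuration.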

***

## Options

| Option           | Description                                                                                     |
| ---------------- | ----------------------------------------------------------------------------------------------- |
| `repo_id`        | **Required.** Hugging Face repository (e.g., `nvidia/Nemotron-RL-math-OpenMathReasoning`)       |
| `output_dirpath` | Output directory. Files named `{split}.jsonl`. **Use this OR `output_fpath`.**                  |
| `output_fpath`   | Exact output file path. Requires `split` or `artifact_fpath`. **Use this OR `output_dirpath`.** |
| `artifact_fpath` | Download a specific file from the repo (raw file mode)                                          |
| `split`          | Dataset split: `train`, `validation`, or `test`. Omit to download all.                          |
| `hf_token`       | Authentication token for private/gated repositories                                             |

***

## Download Methods

Structured dataset mode downloads with the `datasets` library and converts each split to JSONL.

**Use when**: Repository uses Hugging Face's standard dataset format.

**All splits**:

```bash
ng_download_dataset_from_hf \
    +repo_id=nvidia/Nemotron-RL-knowledge-mcqa \
    +output_dirpath=./data/
```

```text
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
[Nemo-Gym] - Downloaded validation split to: ./data/validation.jsonl
```

**Single split**:

```bash
ng_download_dataset_from_hf \
    +repo_id=SWE-Gym/SWE-Gym \
    +split=train \
    +output_fpath=./data/train.jsonl
```

Raw file mode downloads a specific file directly, without conversion.

**Use when**: Repository contains pre-formatted JSONL files.

```bash
ng_download_dataset_from_hf \
    +repo_id=nvidia/nemotron-RL-coding-competitive_coding \
    +artifact_fpath=opencodereasoning_filtered_25k_train.jsonl \
    +output_fpath=./data/train.jsonl
```

```text
[Nemo-Gym] - Downloaded opencodereasoning_filtered_25k_train.jsonl to: ./data/train.jsonl
```
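
For readers who want the same behavior from Python, raw file mode can be approximated with `hf_hub_download` from the `huggingface_hub` library. This is a minimal sketch and not NeMo Gym's implementation: `fetch_raw_file` and its `download_fn` parameter are hypothetical names, and the copy step assumes the file should be duplicated out of the Hugging Face cache.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def fetch_raw_file(
    repo_id: str,
    artifact_fpath: str,
    output_fpath: str,
    download_fn: Optional[Callable[..., str]] = None,
) -> Path:
    """Download one file from a dataset repo and copy it to output_fpath.

    Hypothetical helper for illustration only.
    """
    if download_fn is None:
        # Imported lazily so the helper can be exercised without the library.
        from huggingface_hub import hf_hub_download

        download_fn = hf_hub_download

    # hf_hub_download returns the path of the file inside the local cache.
    cached_path = download_fn(
        repo_id=repo_id,
        filename=artifact_fpath,
        repo_type="dataset",
    )

    # Copy the cached file to the requested location, creating parent dirs.
    destination = Path(output_fpath)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(cached_path, destination)
    return destination
```

Calling `fetch_raw_file("nvidia/nemotron-RL-coding-competitive_coding", "opencodereasoning_filtered_25k_train.jsonl", "./data/train.jsonl")` would mirror the command above. The injectable `download_fn` is a design choice that lets the copy logic be tested offline.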

A custom Python script calls the `datasets` library directly, with streaming support.

**Use when**: You need custom preprocessing, streaming for large datasets, or specific split handling.

```python
import json
from datasets import load_dataset

output_file = "train.jsonl"
dataset_name = "nvidia/OpenMathInstruct-2"
split_name = "train_1M"  # Check dataset page for available splits

with open(output_file, "w", encoding="utf-8") as f:
    for line in load_dataset(dataset_name, split=split_name, streaming=True):
        f.write(json.dumps(line) + "\n")
```

Save the script as `download.py`, then run it:

```bash
uv run download.py
```

Verify the download:

```bash
wc -l train.jsonl
# Expected: 1000000 train.jsonl
```
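
A line count confirms the row total but not that each row is well-formed. The sketch below parses every line as well; `count_jsonl_rows` is a hypothetical helper, and it assumes each row should be a JSON object, which holds for rows written from `datasets` records.

```python
import json
from pathlib import Path


def count_jsonl_rows(path: str) -> int:
    """Return the number of rows, raising ValueError on the first bad line."""
    rows = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines such as a trailing newline
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
            # Assumption: every row is a JSON object (a dict in Python).
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_number}: expected a JSON object")
            rows += 1
    return rows
```

For the script above, `count_jsonl_rows("train.jsonl")` should match the `wc -l` result when the file contains no blank lines.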

**Streaming benefits**:

* Memory-efficient for large datasets (millions of rows)
* Progress visible during download
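
Streaming also makes it straightforward to filter, reshape, or sample rows while writing. The following is a sketch under stated assumptions: `write_jsonl`, `transform`, and `limit` are hypothetical names introduced here, and the function accepts any iterable of dictionaries, which includes the object returned by `load_dataset(..., streaming=True)`.

```python
import json
from itertools import islice
from typing import Callable, Iterable, Optional


def write_jsonl(
    rows: Iterable[dict],
    output_file: str,
    transform: Optional[Callable[[dict], Optional[dict]]] = None,
    limit: Optional[int] = None,
) -> int:
    """Write rows to output_file one at a time and return the count written.

    transform may return a modified row, or None to drop the row.
    limit caps how many input rows are read from the stream.
    """
    written = 0
    with open(output_file, "w", encoding="utf-8") as f:
        # islice with limit=None consumes the whole stream.
        for row in islice(rows, limit):
            if transform is not None:
                row = transform(row)
                if row is None:
                    continue
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            written += 1
    return written


# Example with a streamed dataset (requires the datasets library and network):
# from datasets import load_dataset
# stream = load_dataset("nvidia/OpenMathInstruct-2", split="train_1M", streaming=True)
# write_jsonl(stream, "sample.jsonl", limit=1000)
```

Because only one row is held in memory at a time, the same function works for a quick 1,000-row sample and for a full multi-million-row export.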

For gated or private datasets, authenticate first:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
```

Or use `huggingface-cli login` before running the script.

***

## NVIDIA Datasets

Ready-to-use datasets for common training tasks:

| Dataset             | Repository                                                    | Domain      |
| ------------------- | ------------------------------------------------------------- | ----------- |
| OpenMathReasoning   | `nvidia/Nemotron-RL-math-OpenMathReasoning`                   | Math        |
| Competitive Coding  | `nvidia/nemotron-RL-coding-competitive_coding`                | Code        |
| Workplace Assistant | `nvidia/Nemotron-RL-agent-workplace_assistant`                | Agent       |
| Structured Outputs  | `nvidia/Nemotron-RL-instruction_following-structured_outputs` | Instruction |
| MCQA                | `nvidia/Nemotron-RL-knowledge-mcqa`                           | Knowledge   |

***

## Troubleshooting

```text
huggingface_hub.utils.HfHubHTTPError: 401 Client Error
```

**Fix**: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.

```text
huggingface_hub.utils.HfHubHTTPError: 404 Client Error
```

**Fix**: Check that `repo_id` has the form `organization/dataset-name`. Verify that the repository exists and is public (or that you have access).

```text
ValueError: Either output_dirpath or output_fpath must be provided
```

**Fix**: Add `+output_dirpath=./data/` or `+output_fpath=./data/train.jsonl`.

```text
ValueError: Cannot specify both artifact_fpath and split
```

**Fix**: Use `artifact_fpath` for raw files or `split` for structured datasets, not both.

***

## Private Repositories

Avoid passing tokens on the command line, where they are saved in shell history.

**Recommended**: Use an environment variable:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
ng_download_dataset_from_hf \
    +repo_id=my-org/private-dataset \
    +output_dirpath=./data/
```

Get your token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). Use a **read-only** token.

Not recommended for shared systems:

```bash
ng_download_dataset_from_hf \
    +repo_id=my-org/private-dataset \
    +hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx \
    +output_dirpath=./data/
```

***

## Automatic Downloads

NeMo Gym can automatically download missing datasets during data preparation. Configure `huggingface_identifier` in your resources server config:

```yaml
datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/code_gen/data/train.jsonl
    huggingface_identifier:
      repo_id: nvidia/nemotron-RL-coding-competitive_coding
      artifact_fpath: opencodereasoning_filtered_25k_train.jsonl
    license: Apache 2.0
```

Run with download enabled:

```bash
config_paths="resources_servers/code_gen/configs/code_gen.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" \
    +output_dirpath=./data/prepared \
    +mode=train_preparation \
    +should_download=true \
    +data_source=huggingface
```

If `jsonl_fpath` doesn't exist locally, NeMo Gym downloads from `huggingface_identifier` before processing.
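
The decision described above can be sketched in a few lines. This is an illustration, not the code in `nemo_gym/train_data_utils.py`: `datasets_needing_download` is a hypothetical function, the config is shown as an already-parsed dictionary, and raising an error for a missing file without a `huggingface_identifier` is an assumption.

```python
from pathlib import Path
from typing import List


def datasets_needing_download(config: dict) -> List[dict]:
    """Return dataset entries whose jsonl_fpath is missing but can be fetched."""
    pending = []
    for entry in config.get("datasets", []):
        if Path(entry["jsonl_fpath"]).exists():
            continue  # already on disk, nothing to download
        if entry.get("huggingface_identifier") is None:
            # Assumption: a missing file with no download source is an error.
            raise FileNotFoundError(
                f"{entry['jsonl_fpath']} is missing and no huggingface_identifier is set"
            )
        pending.append(entry)
    return pending


# Parsed form of the YAML config shown earlier.
config = {
    "datasets": [
        {
            "name": "train",
            "type": "train",
            "jsonl_fpath": "resources_servers/code_gen/data/train.jsonl",
            "huggingface_identifier": {
                "repo_id": "nvidia/nemotron-RL-coding-competitive_coding",
                "artifact_fpath": "opencodereasoning_filtered_25k_train.jsonl",
            },
        }
    ]
}

for entry in datasets_needing_download(config):
    source = entry["huggingface_identifier"]["repo_id"]
    print(f"would download {source} -> {entry['jsonl_fpath']}")
```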

***

## Caching

Downloads use Hugging Face's cache at `~/.cache/huggingface/`.

* **Structured datasets**: Reads from cache (fast), overwrites output file
* **Raw files**: Uses cached copy, then copies to output path

To force fresh download:

```bash
rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
```
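
The `<org>` and `<dataset>` placeholders follow from the repository ID, so the directory can be computed rather than typed by hand. A minimal sketch, assuming the default cache layout and honoring only the `HF_HOME` environment variable; `hf_dataset_cache_dir` and `clear_dataset_cache` are hypothetical helpers.

```python
import os
import shutil
from pathlib import Path


def hf_dataset_cache_dir(repo_id: str) -> Path:
    """Return the hub cache directory for a dataset repository."""
    # HF_HOME relocates the cache root; the default is ~/.cache/huggingface.
    hf_home = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
    # "org/name" becomes "datasets--org--name" inside the hub folder.
    return hf_home / "hub" / ("datasets--" + repo_id.replace("/", "--"))


def clear_dataset_cache(repo_id: str) -> bool:
    """Delete the cached copy if present; return True when something was removed."""
    cache_dir = hf_dataset_cache_dir(repo_id)
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


print(hf_dataset_cache_dir("nvidia/Nemotron-RL-knowledge-mcqa"))
```

Calling `clear_dataset_cache("nvidia/Nemotron-RL-knowledge-mcqa")` has the same effect as the `rm -rf` command above for that repository.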

| Section          | Source                                 |
| ---------------- | -------------------------------------- |
| Config schema    | `nemo_gym/config_types.py:306-349`     |
| Download logic   | `nemo_gym/hf_utils.py:57-115`          |
| Validation rules | `nemo_gym/config_types.py:334-349`     |
| Auto-download    | `nemo_gym/train_data_utils.py:476-494` |

## Next Steps

Preprocess raw data, run `ng_prepare_data`, and add `agent_ref` routing.

Use validated data to train with your preferred RL framework.