# Download from Hugging Face

Download JSONL datasets from Hugging Face Hub for NeMo Gym training.

**Goal**: Download a dataset from Hugging Face Hub in JSONL format for training.

**Prerequisites**: NeMo Gym installed ([Installation](/latest/get-started/installation))

***

## Quick Start

```bash
ng_download_dataset_from_hf \
    +repo_id=nvidia/Nemotron-RL-math-OpenMathReasoning \
    +split=train \
    +output_fpath=./data/train.jsonl
```

```text
[Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
```

<Note>
  NeMo Gym uses [Hydra](https://hydra.cc/) for configuration. Arguments use `+key=value` syntax; the leading `+` tells Hydra to add a key that is not already present in the config.
</Note>

***

## Options

| Option           | Description                                                                                     |
| ---------------- | ----------------------------------------------------------------------------------------------- |
| `repo_id`        | **Required.** Hugging Face repository (e.g., `nvidia/Nemotron-RL-math-OpenMathReasoning`)       |
| `output_dirpath` | Output directory. Files named `{split}.jsonl`. **Use this OR `output_fpath`.**                  |
| `output_fpath`   | Exact output file path. Requires `split` or `artifact_fpath`. **Use this OR `output_dirpath`.** |
| `artifact_fpath` | Download a specific file from the repo (raw file mode)                                          |
| `split`          | Dataset split: `train`, `validation`, or `test`. Omit to download all.                          |
| `hf_token`       | Authentication token for private/gated repositories                                             |
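
The constraints in the table can be read as a small set of validation rules. The sketch below is illustrative only: `validate_download_options` is a hypothetical helper, not part of NeMo Gym. Rejecting the case where both output options are set is an assumption drawn from the "use this OR" wording. The wording of any error message not quoted elsewhere on this page is also invented.

```python
from typing import Optional


def validate_download_options(
    repo_id: str,
    output_dirpath: Optional[str] = None,
    output_fpath: Optional[str] = None,
    artifact_fpath: Optional[str] = None,
    split: Optional[str] = None,
) -> None:
    """Raise ValueError if the options break a rule described in the table.

    Hypothetical helper for illustration; not part of NeMo Gym.
    """
    if not repo_id:
        raise ValueError("repo_id is required")
    if output_dirpath is None and output_fpath is None:
        raise ValueError("Either output_dirpath or output_fpath must be provided")
    if output_dirpath is not None and output_fpath is not None:
        # Assumption: the table says to use one option OR the other.
        raise ValueError("Specify only one of output_dirpath and output_fpath")
    if artifact_fpath is not None and split is not None:
        raise ValueError("Cannot specify both artifact_fpath and split")
    if output_fpath is not None and artifact_fpath is None and split is None:
        # An exact file path only makes sense for a single split or file.
        raise ValueError("output_fpath requires split or artifact_fpath")
```

Under these assumptions, `output_dirpath` alone passes (all splits), while `output_fpath` alone is rejected because no single split or file has been selected.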

***

## Download Methods

<Tabs>
  <Tab title="Structured Dataset (Recommended)">
    Downloads using the `datasets` library and converts to JSONL.

    **Use when**: Repository uses Hugging Face's standard dataset format.

    **All splits**:

    ```bash
    ng_download_dataset_from_hf \
        +repo_id=nvidia/Nemotron-RL-knowledge-mcqa \
        +output_dirpath=./data/
    ```

    ```text
    [Nemo-Gym] - Downloaded train split to: ./data/train.jsonl
    [Nemo-Gym] - Downloaded validation split to: ./data/validation.jsonl
    ```

    **Single split**:

    ```bash
    ng_download_dataset_from_hf \
        +repo_id=SWE-Gym/SWE-Gym \
        +split=train \
        +output_fpath=./data/train.jsonl
    ```
  </Tab>

  <Tab title="Raw File">
    Downloads a specific file directly without conversion.

    **Use when**: Repository contains pre-formatted JSONL files.

    ```bash
    ng_download_dataset_from_hf \
        +repo_id=nvidia/nemotron-RL-coding-competitive_coding \
        +artifact_fpath=opencodereasoning_filtered_25k_train.jsonl \
        +output_fpath=./data/train.jsonl
    ```

    ```text
    [Nemo-Gym] - Downloaded opencodereasoning_filtered_25k_train.jsonl to: ./data/train.jsonl
    ```
  </Tab>

  <Tab title="Python Script">
    Downloads using the `datasets` library directly with streaming support.

    **Use when**: You need custom preprocessing, streaming for large datasets, or specific split handling.

    ```python
    import json
    from datasets import load_dataset

    output_file = "train.jsonl"
    dataset_name = "nvidia/OpenMathInstruct-2"
    split_name = "train_1M"  # Check dataset page for available splits

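    # streaming=True yields rows one at a time instead of loading the full split
    with open(output_file, "w", encoding="utf-8") as f:
        for row in load_dataset(dataset_name, split=split_name, streaming=True):
            # Each row is a dict; write it as one JSON object per line
            f.write(json.dumps(row) + "\n")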
    ```

    Save the script as `download.py`, then run it:

    ```bash
    uv run download.py
    ```

    Verify the download:

    ```bash
    wc -l train.jsonl
    # Expected: 1000000 train.jsonl
    ```

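    **Streaming benefits**:

    * Memory-efficient for large datasets (millions of rows), because rows are written one at a time instead of being loaded all at once
    * Progress is visible during the download, because the output file grows as rows arrive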

    <Note>
      For gated or private datasets, authenticate first:

      ```bash
      export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
      ```

      Or use `huggingface-cli login` before running the script.
    </Note>
  </Tab>
</Tabs>
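
Whichever method you use, it can help to confirm that the output really is line-delimited JSON before training. The following is a minimal sketch, not a NeMo Gym utility: `count_jsonl_rows` is a hypothetical name, and the check that every row is a JSON object is an assumption about the expected row shape.

```python
import json
from pathlib import Path


def count_jsonl_rows(path: str) -> int:
    """Return the number of rows, raising ValueError on the first bad line."""
    rows = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
            if not isinstance(record, dict):
                # Assumption: each row should be a JSON object.
                raise ValueError(f"{path}:{line_number}: expected a JSON object")
            rows += 1
    return rows
```

Calling `count_jsonl_rows("./data/train.jsonl")` reads the file line by line, so it stays memory-efficient on large downloads.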

***

## NVIDIA Datasets

Ready-to-use datasets for common training tasks:

| Dataset             | Repository                                                    | Domain      |
| ------------------- | ------------------------------------------------------------- | ----------- |
| OpenMathReasoning   | `nvidia/Nemotron-RL-math-OpenMathReasoning`                   | Math        |
| Competitive Coding  | `nvidia/nemotron-RL-coding-competitive_coding`                | Code        |
| Workplace Assistant | `nvidia/Nemotron-RL-agent-workplace_assistant`                | Agent       |
| Structured Outputs  | `nvidia/Nemotron-RL-instruction_following-structured_outputs` | Instruction |
| MCQA                | `nvidia/Nemotron-RL-knowledge-mcqa`                           | Knowledge   |

***

## Troubleshooting

<Accordion title="Authentication Failed (401)">
  ```text
  huggingface_hub.utils.HfHubHTTPError: 401 Client Error
  ```

  **Fix**: Verify your token is valid. For gated datasets, accept the license on Hugging Face first.
</Accordion>

<Accordion title="Repository Not Found (404)">
  ```text
  huggingface_hub.utils.HfHubHTTPError: 404 Client Error
  ```

  **Fix**: Check that `repo_id` follows the `organization/dataset-name` format. Verify that the repository exists and that it is public or you have access to it.
</Accordion>

<Accordion title="Validation Error: Output Path">
  ```text
  ValueError: Either output_dirpath or output_fpath must be provided
  ```

  **Fix**: Add `+output_dirpath=./data/` or `+output_fpath=./data/train.jsonl`.
</Accordion>

<Accordion title="Validation Error: Conflicting Options">
  ```text
  ValueError: Cannot specify both artifact_fpath and split
  ```

  **Fix**: Use `artifact_fpath` for raw files OR `split` for structured datasets—not both.
</Accordion>

***

## Private Repositories

<Warning>
  Avoid passing tokens on the command line, because they are saved in shell history.
</Warning>

**Recommended**: Use an environment variable:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxx
ng_download_dataset_from_hf \
    +repo_id=my-org/private-dataset \
    +output_dirpath=./data/
```

Get your token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). Use a **read-only** token.
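
If you drive the download from a Python script rather than a shell, the same principle applies: keep the token out of the argument list and pass it through the environment. The sketch below assumes only the CLI name and options shown on this page; `build_download_command` and `run_download` are hypothetical helper names.

```python
import os
import subprocess
from typing import Dict, List, Optional


def build_download_command(repo_id: str, output_dirpath: str) -> List[str]:
    """Build the CLI invocation without any token in the argument list."""
    return [
        "ng_download_dataset_from_hf",
        f"+repo_id={repo_id}",
        f"+output_dirpath={output_dirpath}",
    ]


def run_download(
    repo_id: str, output_dirpath: str, token: Optional[str] = None
) -> None:
    """Run the download; the token travels via HF_TOKEN in the child environment."""
    env: Dict[str, str] = dict(os.environ)
    if token is not None:
        env["HF_TOKEN"] = token
    subprocess.run(
        build_download_command(repo_id, output_dirpath), env=env, check=True
    )
```

With `HF_TOKEN` already exported, `run_download("my-org/private-dataset", "./data/")` inherits it from the parent environment and needs no `token` argument.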

<Accordion title="Alternative: Pass token directly">
  Not recommended for shared systems:

  ```bash
  ng_download_dataset_from_hf \
      +repo_id=my-org/private-dataset \
      +hf_token=hf_xxxxxxxxxxxxxxxxxxxxxxxxx \
      +output_dirpath=./data/
  ```
</Accordion>

<Accordion title="Automatic Downloads During Data Preparation">
  NeMo Gym can automatically download missing datasets during data preparation. Configure `huggingface_identifier` in your resources server config:

  ```yaml
  datasets:
    - name: train
      type: train
      jsonl_fpath: resources_servers/code_gen/data/train.jsonl
      huggingface_identifier:
        repo_id: nvidia/nemotron-RL-coding-competitive_coding
        artifact_fpath: opencodereasoning_filtered_25k_train.jsonl
      license: Apache 2.0
  ```

  Run with download enabled:

  ```bash
  config_paths="resources_servers/code_gen/configs/code_gen.yaml"
  ng_prepare_data "+config_paths=[${config_paths}]" \
      +output_dirpath=./data/prepared \
      +mode=train_preparation \
      +should_download=true \
      +data_source=huggingface
  ```

  If `jsonl_fpath` doesn't exist locally, NeMo Gym downloads from `huggingface_identifier` before processing.
</Accordion>
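
The automatic-download rule described above (fetch only when `jsonl_fpath` is missing locally) can be sketched as a simple check. This is a hypothetical illustration of the decision, not NeMo Gym's implementation: `datasets_needing_download` is an invented name, and the input is assumed to be the YAML config already parsed into a dictionary.

```python
from pathlib import Path
from typing import Any, Dict, List


def datasets_needing_download(config: Dict[str, Any]) -> List[str]:
    """Return names of datasets whose jsonl_fpath is missing but can be fetched."""
    pending = []
    for entry in config.get("datasets", []):
        # Only entries with a huggingface_identifier have a download source.
        has_source = "huggingface_identifier" in entry
        if has_source and not Path(entry["jsonl_fpath"]).exists():
            pending.append(entry["name"])
    return pending
```

Entries whose file already exists are skipped, which matches the described behavior of downloading only when the local file is absent.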

<Accordion title="Caching Behavior">
  Downloads use Hugging Face's cache at `~/.cache/huggingface/`.

  * **Structured datasets**: Read from the cache when available (fast); the output file is overwritten
  * **Raw files**: The cached copy is reused, then copied to the output path

  To force a fresh download, delete the cached repository:

  ```bash
  rm -rf ~/.cache/huggingface/hub/datasets--<org>--<dataset>
  ```
</Accordion>
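
The cache directory name in the command above is derived from the repository ID. As a rough sketch, the mapping can be written as follows; `hub_cache_dir` is a hypothetical helper, and it assumes the default cache location shown above with no environment overrides.

```python
from pathlib import Path


def hub_cache_dir(repo_id: str) -> Path:
    """Map 'org/dataset' to ~/.cache/huggingface/hub/datasets--org--dataset."""
    org, _, name = repo_id.partition("/")
    if not org or not name or "/" in name:
        raise ValueError(f"expected 'organization/dataset-name', got {repo_id!r}")
    # Assumption: default cache root, as shown in the command above.
    hub_root = Path.home() / ".cache" / "huggingface" / "hub"
    return hub_root / f"datasets--{org}--{name}"
```

For example, `hub_cache_dir("nvidia/Nemotron-RL-knowledge-mcqa")` points at the directory to remove before forcing a fresh download of that dataset.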

<Accordion title="Source References">
  | Section          | Source                                 |
  | ---------------- | -------------------------------------- |
  | Config schema    | `nemo_gym/config_types.py:306-349`     |
  | Download logic   | `nemo_gym/hf_utils.py:57-115`          |
  | Validation rules | `nemo_gym/config_types.py:334-349`     |
  | Auto-download    | `nemo_gym/train_data_utils.py:476-494` |
</Accordion>

## Next Steps

<Cards>
  <Card title="Prepare and Validate" href="/latest/data/prepare-validate">
    Preprocess raw data, run `ng_prepare_data`, and add `agent_ref` routing.
  </Card>

  <Card title="Train with NeMo RL" href="/latest/training-tutorials/nemo-rl-grpo">
    Use validated data with NeMo RL for GRPO training.
  </Card>
</Cards>