Data Preparation

NVIDIA provides utilities to download and prepare the Pile dataset. This dataset comprises 22 smaller datasets, blended using the mix described in the paper "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" by Gao et al.

NVIDIA recommends that the NeMo-Framework-Launcher repository and the datasets be stored in a file system shared by all of the nodes during training. For Base Command Platform-based clusters, a shared workspace with read-write permissions is recommended.

The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each shard takes approximately one hour assuming a 30 MB/sec download speed. The data preparation can be parallelized by using up to 30 nodes.
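
The scaling implied by these figures can be sketched as a rough estimate. The helper below is hypothetical and not part of the launcher. It assumes that shards are spread evenly across nodes, that each shard takes about one hour end to end, and that nodes do not compete for download bandwidth.

```python
import math

TOTAL_SHARDS = 30        # The Pile is split into 30 shards
HOURS_PER_SHARD = 1.0    # approximate figure quoted above, at about 30 MB/sec


def estimate_wall_clock_hours(num_nodes, shards=TOTAL_SHARDS,
                              hours_per_shard=HOURS_PER_SHARD):
    """Estimate total preparation time when shards are split across nodes.

    Assumption: every node processes its shards one after another, so the
    total time is set by the node that receives the most shards.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    # Nodes beyond the shard count have nothing to do.
    effective_nodes = min(num_nodes, shards)
    shards_on_busiest_node = math.ceil(shards / effective_nodes)
    return shards_on_busiest_node * hours_per_shard


if __name__ == "__main__":
    for nodes in (1, 4, 10, 30):
        hours = estimate_wall_clock_hours(nodes)
        print(f"{nodes:>2} node(s): about {hours:.0f} hour(s)")
```

Under these assumptions, a single node needs about 30 hours, while 30 nodes finish in about one hour. Adding nodes beyond 30 gives no further benefit.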


The Pile dataset is no longer available for download. Data preparation will be updated with a replacement dataset.

Run Data Preparation

  1. To run data preparation, update conf/config.yaml:

     defaults:
       - data_preparation: nemotron/download_nemotron_pile

     stages:
       - data_preparation

  2. Execute the launcher pipeline: python3

Configure the Dataset

You can find default configurations for data preparation in conf/data_preparation/nemotron/download_nemotron_pile.yaml.

To configure the dataset, edit the following settings:

run:
  name: download_nemotron_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.

dataset: pile
download_the_pile: True  # Whether to download the pile dataset from the internet.
the_pile_url: ""  # Source URL to download The Pile dataset from.
file_numbers: "0-29"  # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True  # True to preprocess the data from a jsonl file, False otherwise.
download_tokenizer_url: null
tokenizer_library: "sentencepiece"
tokenizer_save_dir: ${data_dir}/nemotron
tokenizer_model:  ${.tokenizer_save_dir}/nemotron_tokenizer.model
rm_downloaded: False # Extract script will remove downloaded zst after extraction
rm_extracted: False # Preprocess script will remove extracted files after preproc.

The file_numbers parameter sets the portion of the dataset to download. The value must be a series of numbers separated by hyphens ('-') or commas (','), where a hyphen denotes an inclusive range. For example, to download and prepare files 0, 3, 5, 6, and 7, specify the following:

file_numbers: "0,3,5-7"
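
To illustrate how such a value expands into individual file indices, here is a minimal sketch. The function parse_file_numbers is a hypothetical name introduced for this example. It is not the launcher's own parser, which may validate input differently.

```python
def parse_file_numbers(spec):
    """Expand a string such as "0,3,5-7" into a list of file indices.

    Hypothetical helper for illustration only: commas separate entries,
    and a hyphen marks an inclusive range.
    """
    numbers = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas or surrounding whitespace
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


if __name__ == "__main__":
    print(parse_file_numbers("0,3,5-7"))  # [0, 3, 5, 6, 7]
```

With the default value "0-29", the same sketch yields all 30 indices from 0 through 29.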