Data Preparation

NVIDIA provides utilities to download and prepare the Pile dataset. This dataset comprises 22 smaller datasets, blended using the mix described in the paper "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" by Gao et al., mirrored here.

NVIDIA recommends storing the NeMo-Framework-Launcher repository and the datasets in a file system shared by all of the nodes during training. For Base Command Platform-based clusters, a shared workspace with read-write permissions is recommended.

The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each shard takes approximately one hour assuming a 30 MB/sec download speed. The data preparation can be parallelized by using up to 30 nodes.
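As a rough planning aid, the estimate above can be turned into a wall-clock calculation. The sketch below assumes that each shard takes about one hour and that shards are spread evenly across nodes. The function name estimate_hours is hypothetical and is not part of the launcher.

```python
import math

TOTAL_SHARDS = 30        # The Pile is split into 30 shards
HOURS_PER_SHARD = 1.0    # approximate; assumes ~30 MB/sec download speed


def estimate_hours(num_nodes: int, shards: int = TOTAL_SHARDS,
                   hours_per_shard: float = HOURS_PER_SHARD) -> float:
    """Estimate wall-clock hours when shards are spread evenly over nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    # Nodes beyond the shard count sit idle, so cap at one node per shard.
    effective_nodes = min(num_nodes, shards)
    # The busiest node processes ceil(shards / nodes) shards in sequence.
    shards_per_node = math.ceil(shards / effective_nodes)
    return shards_per_node * hours_per_shard


if __name__ == "__main__":
    for nodes in (1, 4, 30):
        print(f"{nodes:>2} node(s): ~{estimate_hours(nodes):.0f} hour(s)")
```

Under these assumptions, going beyond 30 nodes gives no further speedup, because each shard is processed by a single node.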

Note

The Pile dataset is no longer available for download. Data preparation will be updated with a replacement dataset.

Run Data Preparation

  1. To run data preparation, update conf/config.yaml:

defaults:
  - data_preparation: nemotron/download_nemotron_pile

stages:
  - data_preparation

  2. Execute the launcher pipeline: python3 main.py.

Configure the Dataset

You can find default configurations for data preparation in conf/data_preparation/nemotron/download_nemotron_pile.yaml.

To configure the dataset, edit the following settings in that file:

run:
  name: download_nemotron_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.

dataset: pile
download_the_pile: True  # Whether to download the pile dataset from the internet.
the_pile_url: "https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/"  # Source URL to download The Pile dataset from.
file_numbers: "0-29"  # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True  # True to preprocess the data from a jsonl file, False otherwise.
download_tokenizer_url: null
tokenizer_library: "sentencepiece"
tokenizer_save_dir: ${data_dir}/nemotron
tokenizer_model:  ${.tokenizer_save_dir}/nemotron_tokenizer.model
rm_downloaded: False # Extract script will remove downloaded zst after extraction
rm_extracted: False # Preprocess script will remove extracted files after preproc.

The file_numbers parameter sets the portion of the dataset to download. The value must be a list of file numbers or ranges, where ranges are written with a hyphen ('-') and list items are separated by commas (','). For example, to download and prepare files 0, 3, 5, 6, and 7, specify the following:

file_numbers: "0,3,5-7"
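To make the format concrete, here is a minimal sketch of how such a value expands into individual file numbers. The helper parse_file_numbers is hypothetical and written only for illustration; the launcher's own parsing may differ in details such as validation or ordering.

```python
from typing import List


def parse_file_numbers(spec: str) -> List[int]:
    """Expand a spec such as "0,3,5-7" into a sorted list of file numbers."""
    numbers = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A hyphenated item is an inclusive range, e.g. "5-7" -> 5, 6, 7.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            numbers.update(range(start, end + 1))
        else:
            numbers.add(int(part))
    return sorted(numbers)


if __name__ == "__main__":
    print(parse_file_numbers("0,3,5-7"))    # [0, 3, 5, 6, 7]
    print(len(parse_file_numbers("0-29")))  # 30
```

The default value "0-29" therefore selects all 30 files.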