Data Preparation

The Pile: NVIDIA provides utilities to download and prepare the Pile dataset. This dataset comprises 22 smaller datasets, blended according to the mix described in the paper The Pile: An 800GB Dataset of Diverse Text for Language Modeling, by Gao et al., mirrored here.

NVIDIA recommends that the NeMo-Framework-Launcher repository and the datasets be stored in a file system shared by all of the nodes during training. For Base Command Platform-based clusters, a shared workspace with read-write permissions is recommended.

The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each shard takes approximately one hour, assuming a download speed of 30 MB/s. Data preparation can be parallelized across up to 30 nodes, one shard per node.
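
As a rough illustration of the arithmetic above, the following sketch estimates wall-clock time for a given node count. The function name and the model are assumptions made for illustration: it assumes each node handles one shard at a time, that every shard takes about the same time, and that nodes do not compete for download bandwidth. It is not part of the launcher.

```python
import math


def estimate_wall_clock_hours(num_shards: int, num_nodes: int,
                              hours_per_shard: float = 1.0) -> float:
    """Rough wall-clock estimate for preparing `num_shards` shards.

    Hypothetical helper: assumes one shard per node at a time and
    roughly equal processing time per shard.
    """
    if num_shards < 0:
        raise ValueError("num_shards must be non-negative")
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    # Shards are processed in "rounds"; each round handles up to num_nodes shards.
    rounds = math.ceil(num_shards / num_nodes)
    return rounds * hours_per_shard


if __name__ == "__main__":
    for nodes in (1, 8, 30):
        hours = estimate_wall_clock_hours(30, nodes)
        print(f"{nodes} node(s): about {hours:g} hour(s)")
```

Under these assumptions, adding nodes beyond the number of shards gives no further speedup, which is why 30 nodes is the useful upper bound for 30 shards.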

Note

The Pile dataset is no longer available for download. Data preparation will be updated with a replacement dataset.

Run Data Preparation

To run data preparation, update conf/config.yaml:

defaults:
  - data_preparation: mistral/download_mistral_pile

stages:
  - data_preparation

Execute the launcher pipeline: python3 main.py.

Configuration

Default configurations for data preparation can be found in conf/data_preparation/mistral/download_mistral_pile.yaml.

run:
  name: download_mistral_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and is about 2x faster.

dataset: pile
download_the_pile: True  # Whether to download the pile dataset from the internet.
the_pile_url: "https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/"  # Source URL to download The Pile dataset from.
file_numbers: "0-29"  # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True  # True to preprocess the data from a jsonl file, False otherwise.
download_tokenizer_url: "https://huggingface.co/decapoda-research/mistral-7b-hf/resolve/main/tokenizer.model"
tokenizer_library: "sentencepiece"
tokenizer_save_dir: ${data_dir}/mistral
tokenizer_model:  ${.tokenizer_save_dir}/mistral_tokenizer.model
rm_downloaded: False # If True, the extract script removes the downloaded .zst files after extraction.
rm_extracted: False # If True, the preprocess script removes the extracted files after preprocessing.

file_numbers sets the portion of the dataset to download. The value must be a series of numbers separated by hyphens ('-') or commas (','); a hyphen denotes an inclusive range. For example, to download and prepare files 0, 3, 5, 6, and 7:

file_numbers: "0,3,5-7"
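
To make the format concrete, here is a minimal sketch of how such a string could be expanded into individual file numbers. The function name parse_file_numbers is hypothetical and the logic is an illustration of the format described above, not the launcher's actual parser.

```python
from typing import List


def parse_file_numbers(spec: str) -> List[int]:
    """Expand a spec such as "0,3,5-7" into a sorted list of file numbers.

    Hypothetical helper: commas separate entries, and a hyphen marks an
    inclusive range, following the format described in the text.
    """
    numbers = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas or surrounding whitespace
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            numbers.update(range(start, end + 1))  # ranges are inclusive
        else:
            numbers.add(int(part))
    return sorted(numbers)


if __name__ == "__main__":
    print(parse_file_numbers("0,3,5-7"))
```

With the example value "0,3,5-7" this yields files 0, 3, 5, 6, and 7, and the default "0-29" expands to all 30 files.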