Data Preparation

  • The Pile: NVIDIA provides utilities to download and prepare the Pile dataset. This dataset comprises 22 smaller datasets, blended using the mix described in the paper The Pile: An 800GB Dataset of Diverse Text for Language Modeling, by Gao et al., mirrored here.

    NVIDIA recommends that the NeMo-Megatron-Launcher repository and the datasets be stored in a file system shared by all of the nodes used during training. On Base Command Platform-based clusters, a shared workspace with read-write permissions is recommended.

    The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each shard takes approximately one hour assuming a 30 MB/sec download speed. The data preparation can be parallelized by using up to 30 nodes.
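The parallelization claim above can be turned into a rough time estimate. The sketch below is illustrative only and is not part of the launcher: the function name estimated_wall_clock_hours is hypothetical, and it assumes that every shard takes the same time (about one hour at 30 MB/sec, as stated above) and that shards are spread evenly across nodes.

```python
import math


def estimated_wall_clock_hours(num_shards=30, num_nodes=1, hours_per_shard=1.0):
    """Rough wall-clock estimate for preparing the shards in parallel.

    Hypothetical helper: assumes each shard takes `hours_per_shard` and
    that every node processes one shard at a time.
    """
    if num_shards < 0:
        raise ValueError("num_shards must be non-negative")
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_shards == 0:
        return 0.0
    # Nodes beyond the number of shards have nothing to do.
    busy_nodes = min(num_nodes, num_shards)
    # Each round, every busy node handles one shard.
    rounds = math.ceil(num_shards / busy_nodes)
    return rounds * hours_per_shard


if __name__ == "__main__":
    for nodes in (1, 8, 30):
        hours = estimated_wall_clock_hours(num_nodes=nodes)
        print(f"{nodes} node(s): about {hours:g} hour(s)")
```

Under these assumptions, one node needs about 30 hours, while 30 nodes finish in about one hour; adding nodes beyond 30 brings no further gain because there are only 30 shards.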


The Pile dataset is no longer available for download. Data preparation will be updated with a replacement dataset.

To run data preparation, update conf/config.yaml:


defaults:
  - data_preparation: falcon/download_falcon_pile

stages:
  - data_preparation

Execute launcher pipeline: python3


Default configurations for data preparation can be found in conf/data_preparation/falcon/download_falcon_pile.yaml.


run:
  name: download_falcon_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.

dataset: pile
download_the_pile: True # Whether to download the pile dataset from the internet.
the_pile_url: "" # Source URL to download The Pile dataset from.
file_numbers: "0-29" # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True # True to preprocess the data from a jsonl file, False otherwise.
tokenizer_library: "huggingface"
tokenizer_type: tiiuae/falcon-7b
rm_downloaded: False # Extract script will remove downloaded zst after extraction
rm_extracted: False # Preprocess script will remove extracted files after preproc.

file_numbers sets the portion of the dataset to download. The value must be a series of numbers separated by hyphens ('-') or commas (','), where a hyphen denotes an inclusive range. For example, to download and prepare files 0, 3, 5, 6, and 7:


file_numbers: "0,3,5-7"
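To make the file_numbers syntax concrete, here is a minimal sketch of how such a value could be expanded into individual file indices. It is not the launcher's own parser: parse_file_numbers is a hypothetical helper written for illustration, and it assumes that a hyphen denotes an inclusive range, as in the example above.

```python
from typing import List


def parse_file_numbers(spec: str) -> List[int]:
    """Expand a value such as "0,3,5-7" into a list of file indices.

    Hypothetical helper for illustration; assumes "a-b" is an
    inclusive range and that items are separated by commas.
    """
    numbers: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # Tolerate stray commas or surrounding whitespace.
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


if __name__ == "__main__":
    print(parse_file_numbers("0,3,5-7"))  # [0, 3, 5, 6, 7]
    print(len(parse_file_numbers("0-29")))  # 30
```

With the default value "0-29", this expansion yields all 30 files, which matches the shard count stated earlier.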

