The Pile: NVIDIA provides utilities to download and prepare the Pile dataset. This dataset comprises 22 smaller datasets, blended using the mix described in the paper The Pile: An 800GB Dataset of Diverse Text for Language Modeling, by Gao et al.
NVIDIA recommends that the NeMo-Megatron-Launcher repository and the datasets be stored in a file system shared by all of the nodes during training. For Base Command Platform-based clusters, a shared workspace with read-write permissions is recommended.
The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each shard takes approximately one hour assuming a 30 MB/sec download speed. The data preparation can be parallelized by using up to 30 nodes.
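The arithmetic behind the parallelization claim can be sketched as follows. This is only a rough estimate: it assumes each node processes one shard at a time at roughly one hour per shard, and the function name `estimate_wall_clock_hours` is hypothetical, not part of the launcher.

```python
import math


def estimate_wall_clock_hours(num_shards: int = 30,
                              num_nodes: int = 1,
                              hours_per_shard: float = 1.0) -> float:
    """Rough wall-clock estimate for preparing all shards.

    Assumption: every node handles one shard at a time, and each shard
    takes about `hours_per_shard` to download, extract, and preprocess.
    """
    if num_shards < 1 or num_nodes < 1:
        raise ValueError("num_shards and num_nodes must be at least 1")
    # Nodes beyond the number of shards have nothing to do.
    effective_nodes = min(num_nodes, num_shards)
    # Shards are processed in "rounds" of effective_nodes shards each.
    rounds = math.ceil(num_shards / effective_nodes)
    return rounds * hours_per_shard


if __name__ == "__main__":
    for nodes in (1, 8, 30):
        print(f"{nodes} node(s): ~{estimate_wall_clock_hours(num_nodes=nodes)} h")
```

Under these assumptions, one node needs about 30 hours, 8 nodes about 4 hours, and 30 nodes about 1 hour. Adding more than 30 nodes gives no further speedup because there are only 30 shards.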
The Pile dataset is no longer available for download. Data preparation will be updated with a replacement dataset.
To run data preparation, update conf/config.yaml:
defaults:
- data_preparation: llama/download_llama_pile
stages:
- data_preparation
Execute the launcher pipeline: python3 main.py
Configuration
Default configurations for data preparation can be found in conf/data_preparation/download_llama_pile.yaml.
run:
  name: download_llama_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.
dataset: pile
download_the_pile: True # Whether to download the pile dataset from the internet.
the_pile_url: "https://mystic.the-eye.eu/public/AI/pile/train/" # Source URL to download The Pile dataset from.
file_numbers: "0-29" # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True # True to preprocess the data from a jsonl file, False otherwise.
download_tokenizer_url: "https://huggingface.co/decapoda-research/llama-7b-hf/resolve/main/tokenizer.model"
tokenizer_library: "sentencepiece"
tokenizer_save_dir: ${data_dir}/llama
tokenizer_model: ${.tokenizer_save_dir}/llama_tokenizer.model
rm_downloaded: False # Extract script will remove downloaded zst after extraction
rm_extracted: False # Preprocess script will remove extracted files after preproc.
file_numbers sets the portion of the dataset to download. The value must be a series of numbers separated by hyphens ('-') or commas (','). A hyphen denotes an inclusive range, and commas separate individual entries.
For example, to download and prepare files 0, 3, 5, 6, and 7:
file_numbers: "0,3,5-7"
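To make the syntax concrete, here is a minimal sketch of how such a value could be expanded into individual file numbers. The function `parse_file_numbers` is hypothetical and written only for illustration; it is not the launcher's own parser, and it assumes that hyphenated ranges are inclusive, as the example above implies.

```python
from typing import List


def parse_file_numbers(spec: str) -> List[int]:
    """Expand a spec such as "0,3,5-7" into [0, 3, 5, 6, 7].

    Illustrative only: commas separate entries, and "a-b" is treated
    as the inclusive range a..b.
    """
    numbers: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas or whitespace
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


if __name__ == "__main__":
    print(parse_file_numbers("0,3,5-7"))
    print(len(parse_file_numbers("0-29")))
```

With this sketch, "0,3,5-7" expands to files 0, 3, 5, 6, and 7, and the default "0-29" expands to all 30 files.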