NVIDIA provides utilities to download and prepare the Pile dataset. This dataset comprises 22 smaller datasets, blended using the mix described in the paper "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" by Gao et al., mirrored here.
NVIDIA recommends storing the NeMo-Framework-Launcher repository and the datasets in a file system shared by all of the nodes during training. For Base Command Platform-based clusters, a shared workspace with read-write permissions is recommended.
The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each shard takes approximately one hour, assuming a download speed of 30 MB/s. Data preparation can be parallelized across up to 30 nodes, so that each node processes one shard.
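The timing figures above translate into a rough wall-clock estimate. The following is a minimal sketch, not part of the launcher: the function name estimate_wall_clock_hours is hypothetical, and it assumes that each node processes one shard at a time and that every shard takes the same one hour quoted above.

```python
import math

NUM_SHARDS = 30        # from the text: The Pile consists of 30 shards
HOURS_PER_SHARD = 1.0  # from the text: about one hour per shard at 30 MB/s


def estimate_wall_clock_hours(num_nodes: int) -> float:
    """Rough estimate, assuming each node handles one shard at a time."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    # Nodes beyond the shard count have nothing left to process.
    effective_nodes = min(num_nodes, NUM_SHARDS)
    # Shards are processed in rounds; the last round may be partially full.
    rounds = math.ceil(NUM_SHARDS / effective_nodes)
    return rounds * HOURS_PER_SHARD


for nodes in (1, 8, 30):
    print(nodes, "node(s):", estimate_wall_clock_hours(nodes), "hours")
```

Under these assumptions, adding nodes beyond 30 does not shorten the run, which is why 30 is the upper bound on useful parallelism.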
The Pile dataset is no longer available for download. Data preparation will be updated with a replacement dataset.
To run data preparation, update conf/config.yaml:
defaults:
  - data_preparation: nemotron/download_nemotron_pile

stages:
  - data_preparation
Execute the launcher pipeline:
python3 main.py
Configure the Dataset
You can find default configurations for data preparation in conf/data_preparation/nemotron/download_nemotron_pile.yaml.
To configure the dataset, edit the following settings:
run:
  name: download_nemotron_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.

dataset: pile
download_the_pile: True # Whether to download the pile dataset from the internet.
the_pile_url: "https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/" # Source URL to download The Pile dataset from.
file_numbers: "0-29" # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True # True to preprocess the data from a jsonl file, False otherwise.
download_tokenizer_url: null
tokenizer_library: "sentencepiece"
tokenizer_save_dir: ${data_dir}/nemotron
tokenizer_model: ${.tokenizer_save_dir}/nemotron_tokenizer.model
rm_downloaded: False # Extract script will remove downloaded zst after extraction
rm_extracted: False # Preprocess script will remove extracted files after preproc.
The file_numbers parameter sets the portion of the dataset to download. The value must be a series of file numbers or ranges, where a hyphen ('-') denotes an inclusive range and commas (',') separate the entries.
For example, to download and prepare files 0, 3, 5, 6, and 7, specify the following:
file_numbers: "0,3,5-7"
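To make the file_numbers syntax concrete, the following sketch expands such a string into an explicit list of shard indices. It is an illustration only, not the launcher's own parser: the function name parse_file_numbers is hypothetical, and treating ranges as inclusive is inferred from the example above, where "5-7" selects files 5, 6, and 7.

```python
from typing import List, Set


def parse_file_numbers(spec: str) -> List[int]:
    """Expand a string such as "0,3,5-7" into a sorted list of indices."""
    numbers: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A hyphen denotes an inclusive range, e.g. "5-7" -> 5, 6, 7.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            numbers.update(range(start, end + 1))
        else:
            numbers.add(int(part))
    return sorted(numbers)


print(parse_file_numbers("0,3,5-7"))    # [0, 3, 5, 6, 7]
print(len(parse_file_numbers("0-29")))  # 30
```

A helper like this can be useful for checking which shards a given setting selects before a long download job is submitted.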