Data Preparation

NVIDIA provides utilities to download and prepare the Pile dataset. This dataset comprises 22 smaller datasets, blended according to the mix described in the paper "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" by Gao et al.

NVIDIA recommends storing the NeMo-Framework-Launcher repository and the datasets in a file system shared by all nodes used during training. On Base Command Platform-based clusters, use a shared workspace with read-write permissions.

The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each shard takes approximately one hour, assuming a download speed of 30 MB/s. Data preparation can be parallelized across up to 30 nodes.
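The timing figures above can be turned into a rough planning estimate. The following is a minimal sketch, not part of the launcher: the function name `estimated_wall_time_hours` and its parameters are hypothetical, and it assumes shards are spread evenly across nodes and that each shard takes the same one hour quoted above.

```python
import math


def estimated_wall_time_hours(num_shards=30, num_nodes=1, hours_per_shard=1.0):
    """Rough wall-clock estimate for data preparation.

    Assumption (not from the launcher): every shard takes the same time and
    shards are distributed evenly, so the slowest node handles
    ceil(num_shards / nodes) shards one after another.
    """
    if num_nodes < 1 or num_shards < 1:
        raise ValueError("num_nodes and num_shards must be at least 1")
    # Nodes beyond the shard count cannot help: one shard is the unit of work.
    usable_nodes = min(num_nodes, num_shards)
    shards_on_busiest_node = math.ceil(num_shards / usable_nodes)
    return shards_on_busiest_node * hours_per_shard


if __name__ == "__main__":
    for nodes in (1, 4, 10, 30):
        print(f"{nodes:>2} node(s): ~{estimated_wall_time_hours(num_nodes=nodes):.0f} h")
```

Under these assumptions, a single node needs about 30 hours, while 30 nodes bring the estimate down to about one hour. Actual times depend on network bandwidth and storage throughput.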

Note

The Pile dataset is no longer available for download. Data preparation will be updated with a replacement dataset.

  1. To run data preparation, update conf/config.yaml:

    defaults:
      - data_preparation: nemotron/download_nemotron_pile

    stages:
      - data_preparation

  2. Execute the launcher pipeline: python3 main.py

Configure the Dataset

You can find default configurations for data preparation in conf/data_preparation/nemotron/download_nemotron_pile.yaml.

To configure the dataset, edit the following parameters:

    run:
      name: download_nemotron_pile
      results_dir: ${base_results_dir}/${.name}
      time_limit: "4:00:00"
      dependency: "singleton"
      node_array_size: 30
      array: ${..file_numbers}
      bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.

    dataset: pile
    download_the_pile: True # Whether to download the pile dataset from the internet.
    the_pile_url: "https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/" # Source URL to download The Pile dataset from.
    file_numbers: "0-29" # The pile dataset consists of 30 files (0-29), choose which ones to download.
    preprocess_data: True # True to preprocess the data from a jsonl file, False otherwise.
    download_tokenizer_url: null
    tokenizer_library: "sentencepiece"
    tokenizer_save_dir: ${data_dir}/nemotron
    tokenizer_model: ${.tokenizer_save_dir}/nemotron_tokenizer.model
    rm_downloaded: False # Extract script will remove downloaded zst after extraction
    rm_extracted: False # Preprocess script will remove extracted files after preproc.

The file_numbers parameter sets the portion of the dataset to download. The value must be a series of numbers separated by hyphens ('-') or commas (','), where a hyphen denotes an inclusive range. For example, to download and prepare files 0, 3, 5, 6, and 7, specify the following:

file_numbers: "0,3,5-7"
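
To make the format concrete, here is a minimal sketch of how such a specification could be expanded into individual file numbers. This is an illustration only and not the launcher's own parser: the function name `parse_file_numbers` is hypothetical, and the sketch assumes that hyphenated ranges are inclusive and that duplicates are collapsed.

```python
from typing import List


def parse_file_numbers(spec: str) -> List[int]:
    """Expand a spec such as "0,3,5-7" into a sorted list of file numbers.

    Hypothetical helper for illustration; the launcher's real parsing code
    may differ in details such as validation and ordering.
    """
    numbers = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "0,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            # Ranges are treated as inclusive on both ends.
            numbers.update(range(start, end + 1))
        else:
            numbers.add(int(part))
    return sorted(numbers)


if __name__ == "__main__":
    print(parse_file_numbers("0,3,5-7"))  # [0, 3, 5, 6, 7]
    print(len(parse_file_numbers("0-29")))  # 30
```

With the default value "0-29", the sketch yields all 30 file numbers, which matches the number of shards in the dataset.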

Last updated on Jun 19, 2024.