Data Preparation - NVIDIA Docs

The data_preparation configuration in conf/config.yaml specifies which file to use for data preparation. The default value is download_bert_pile, which refers to conf/data_preparation/download_bert_pile.yaml. This configuration downloads, extracts, and preprocesses the Pile dataset for BERT models. Modify it to perform different tasks and to choose where to store the datasets, vocab files, and other outputs.

To download a reduced portion of the dataset for testing, set the file_numbers configuration to download only one shard by changing its value from "0-29" to "0". The value must be a combination of numbers separated by dashes ("-") or commas (","). For example, file_numbers="0,3,5-7" downloads and prepares files 0, 3, 5, 6, and 7.
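The file_numbers syntax can be illustrated with a short Python sketch. The function name parse_file_numbers is hypothetical and is not the launcher's own parser; it only shows how a value built from commas and dashes expands into individual file indices.

```python
from typing import List


def parse_file_numbers(spec: str) -> List[int]:
    """Expand a spec such as "0,3,5-7" into a list of file indices.

    Hypothetical helper for illustration only, not part of the launcher.
    """
    numbers: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A dash denotes an inclusive range, e.g. "5-7" -> 5, 6, 7.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers


print(parse_file_numbers("0,3,5-7"))  # [0, 3, 5, 6, 7]
```

Under the same assumptions, "0-29" expands to all 30 file indices and "0" to a single file.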

Slurm

First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and the other job-related configurations in download_bert_pile.yaml. You can parallelize data preparation by using up to 30 nodes to download all 30 files in parallel.

Example

To run only the data preparation pipeline and not the training, evaluation or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - data_preparation

Then enter:

python3 main.py

Base Command Platform

To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. You can also override this configuration from the command line using Hydra.

By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the vocab and merge files in the same workspace for later use.

You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.
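To get a feel for how node count affects the work per node, the sketch below splits the 30 files across a given number of nodes in round-robin fashion. This is only an assumption made for illustration: the way the launcher actually assigns files to nodes is not described here, and split_files_across_nodes is a hypothetical helper.

```python
from typing import Dict, List


def split_files_across_nodes(file_numbers: List[int], num_nodes: int) -> Dict[int, List[int]]:
    """Assign file indices to nodes round-robin (illustrative assumption only)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    assignment: Dict[int, List[int]] = {node: [] for node in range(num_nodes)}
    for index, file_number in enumerate(file_numbers):
        assignment[index % num_nodes].append(file_number)
    return assignment


files = list(range(30))  # files 0-29
for nodes in (2, 10, 30):
    plan = split_files_across_nodes(files, nodes)
    busiest = max(len(assigned) for assigned in plan.values())
    print(f"{nodes} nodes: at most {busiest} file(s) per node")
```

With 30 nodes each node handles a single file, so adding nodes beyond 30 would bring no further benefit under this model.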

You can download the 700+ GB dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace’s permissions.

To run the data preparation pipeline for BERT models, enter:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_bert \
base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
data_preparation.vocab_save_dir=/mount/data/bpe data_preparation.merges_save_dir=/mount/data/bpe >> /results/data_bert_log.txt 2>&1

The command above assumes that you want to prepare the entire dataset (files 0-29), and that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_bert_log.txt, which you can download from NGC. You can add any other configuration required to modify the command’s behavior.
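When the list of overrides grows, it can help to assemble the command programmatically. The following is a minimal sketch, assuming a hypothetical helper named build_launcher_command; it simply joins key=value overrides and quotes each argument for the shell. Hydra list values such as the stages override are written with square brackets, which the quoting protects from shell globbing. The file_numbers value of "0" is chosen here to prepare a single shard for testing.

```python
import shlex
from typing import Dict


def build_launcher_command(main_script: str, overrides: Dict[str, str]) -> str:
    """Join Hydra-style key=value overrides into one shell-safe command line.

    Hypothetical helper for illustration; it does not run anything.
    """
    args = ["python3", main_script]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return " ".join(shlex.quote(arg) for arg in args)


overrides = {
    "stages": "[data_preparation]",
    "cluster_type": "bcp",
    "launcher_scripts_path": "/opt/NeMo-Megatron-Launcher/launcher_scripts",
    "data_dir": "/mount/data/the_pile_bert",
    "base_results_dir": "/mount/results",
    "data_preparation.file_numbers": "0",  # one shard only, for a test run
}

command = build_launcher_command(
    "/opt/NeMo-Megatron-Launcher/launcher_scripts/main.py", overrides
)
print(command)
```

The printed string can be pasted into a job definition, with output redirection added as needed.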

Common

Set the data preparation job’s configuration for BERT models in the YAML file:

run:
  name: download_bert_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and is about 2x faster.

dataset: pile
download_the_pile: True  # Whether to download the pile dataset from the internet.
the_pile_url: "https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/"  # Source URL to download The Pile dataset from.
file_numbers: "0-29"  # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True  # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt"  # URL to download the vocab from.
vocab_save_dir: ${data_dir}/bpe
tokenizer_type: BertWordPieceLowerCase
rm_downloaded: True # Extract script will remove downloaded zst after extraction
rm_extracted: True # Preprocess script will remove extracted files after preproc.