Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Data Preparation
The data_preparation configuration in conf/config.yaml specifies the file to use for data preparation configuration. To prepare the Pile dataset for T5 models, set the data_preparation configuration to t5/download_t5_pile. The corresponding configuration file is conf/data_preparation/t5/download_t5_pile.yaml.
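For example, the selection typically appears in the Hydra defaults list of conf/config.yaml; a minimal sketch (surrounding entries omitted, and the exact structure may differ between launcher releases):
defaults:
  - data_preparation: t5/download_t5_pile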
GPT models and T5 models use different tokenizer and vocab files. The default values are in the corresponding configuration files.
To download a reduced portion of the dataset for testing, set the file_numbers configuration to "0" to download only shard 0. The value must be a comma-separated combination of numbers representing individual shards and hyphenated ranges representing ranges of shards. For example, this setting downloads and prepares files 0, 3, 5, 6, and 7:
file_numbers="0,3,5-7"
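In conf/data_preparation/t5/download_t5_pile.yaml this corresponds to a single field; a minimal sketch with an illustrative value:
file_numbers: "0,3,5-7"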
Slurm
First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and the other job-related configurations in download_t5_pile.yaml for T5 models.
You can parallelize data preparation by using up to 30 nodes to download all 30 files in parallel.
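A minimal sketch of these settings, assuming the cluster is selected through the Hydra defaults list in conf/config.yaml (the node count and time limit shown are the defaults from download_t5_pile.yaml; actual values depend on your cluster):
In conf/config.yaml:
defaults:
  - cluster: bcm
cluster_type: bcm
In conf/data_preparation/t5/download_t5_pile.yaml:
nodes: 30
time_limit: "4:00:00"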
Example
To run only the data preparation pipeline and not the training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
  - data_preparation
Then enter:
python3 main.py
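The stage and data preparation configuration can also be passed as Hydra overrides on the command line instead of editing conf/config.yaml; a hedged sketch:
python3 main.py stages=[data_preparation] data_preparation=t5/download_t5_pile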
Base Command Platform
To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using Hydra.
By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the vocab and merge files in the same workspace for later use.
You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.
You can download the 700+ GB dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace’s permissions.
To run the data preparation pipeline for T5 models, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py data_preparation=t5/download_t5_pile \
stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
data_preparation.vocab_save_dir=/mount/data/bpe >> /results/data_t5_log.txt 2>&1
The command above assumes that you want to prepare the entire dataset (files 0-29), that you mounted the data workspace in /mount/data, and that you mounted the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_t5_log.txt, which you can download from NGC. Any other required configuration can be added to modify the command’s behavior.
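For example, a smaller test run that prepares only the first few shards could use the same command with a different file_numbers override; a hedged sketch (the shard range is illustrative):
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py data_preparation=t5/download_t5_pile \
stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results data_preparation.file_numbers='0-3' \
data_preparation.vocab_save_dir=/mount/data/bpe >> /results/data_t5_log.txt 2>&1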
Common
Set the data preparation job’s configuration for T5 models in the YAML file:
dataset: pile
download_the_pile: True # Whether to download the pile dataset from the internet.
the_pile_url: "https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/" # Source URL to download The Pile dataset from.
file_numbers: "0-29" # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt" # URL to download the vocab from.
download_merges_url: null
vocab_save_dir: ${data_dir}/bpe
merges_save_dir: ${data_dir}/bpe
tokenizer_type: BertWordPieceCase # T5 models use BertWordPieceCase tokenizer
log_dir: ${base_results_dir}/data_preparation/t5_pile_logs # Where to save the logs
rm_downloaded: True # Extract script will remove downloaded zst after extraction
rm_extracted: True # Preprocess script will remove extracted files after preproc.
nodes: 30
time_limit: "4:00:00"
bcp_preproc_npernode: 2 # Number of preprocessing processes per node on BCP; 2 should be safe to use and about 2x faster.
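Any of these fields can also be overridden from the command line with Hydra. For example, to re-run only the JSONL preprocessing step on shards that have already been downloaded and extracted (a hedged sketch, assuming the extracted files are still present in data_dir):
python3 main.py stages=[data_preparation] data_preparation=t5/download_t5_pile \
data_preparation.download_the_pile=False data_preparation.preprocess_data=True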