Data Preparation

The data_preparation configuration in conf/config.yaml specifies which file to use for data preparation. To prepare the mC4 dataset for mT5 models, set data_preparation to download_mc4. The corresponding configuration file is conf/data_preparation/download_mc4.yaml. mT5 models use the SentencePiece multilingual tokenizer.

To download a reduced portion of the dataset for testing, set the languages configuration to a single language, for example lv. The list of all 101 languages can be found in the mC4 dataset.

Data preparation can run on multiple nodes (the default is 20 nodes), which download and preprocess the language files in parallel.
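As a rough illustration of what spreading the work over nodes means, the sketch below assigns language codes to node indices in round-robin order. It is a minimal sketch under stated assumptions: the function name assign_round_robin is hypothetical, and the launcher's actual mapping (which works on files and may balance them differently) is not shown in this section.

```python
from collections import defaultdict


def assign_round_robin(items, num_nodes):
    """Assign each work item to a node index in round-robin order.

    Hypothetical helper for illustration only; it is not the launcher's
    own scheduling logic.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    assignment = defaultdict(list)
    for index, item in enumerate(items):
        assignment[index % num_nodes].append(item)
    return dict(assignment)


# The 24-language list used as the default in the configuration file.
languages = (
    "cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,"
    "no,pl,pt,ro,ru,sk,sv,zh"
).split(",")

# 20 nodes is the documented default for node_array_size.
plan = assign_round_robin(languages, num_nodes=20)
for node, langs in sorted(plan.items()):
    print(f"node {node:2d}: {', '.join(langs)}")
```

With 24 items and 20 nodes, the first four nodes receive two items each and the remaining nodes receive one, which shows why adding nodes beyond the number of work items brings no further speedup.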

Slurm

First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and other job-related configurations in download_mc4.yaml for mT5 models.

Example

To run only the data preparation pipeline and not the training, evaluation or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - data_preparation

Then enter:

python3 main.py

Base Command Platform

To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. You can also override this configuration from the command line using Hydra.

By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the tokenizer model file in the same workspace for later use.

The data preparation code must be launched in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.

Download the dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace’s permissions.

To run the data preparation pipeline for mT5 models, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py data_preparation=mt5/download_mc4 \
stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results data_preparation.languages=\'cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,no,pl,pt,ro,ru,sk,sv,zh\' \
data_preparation.run.node_array_size=20 data_preparation.run.workers_per_node=4 >> /results/data_mt5_log.txt 2>&1

The command above assumes that you want to prepare the mC4 dataset with 24 languages, that you mounted the data workspace in /mount/data, and that you mounted the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_mt5_log.txt, which you can download from NGC. You can add any other required configuration to modify the command’s behavior.
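The languages override in the command is wrapped in backslash-escaped quotes so that the quotes reach Hydra and the comma-separated value is read as one string. The sketch below builds that override from a Python list, which can help when scripting runs for different language subsets. The helper name languages_override is hypothetical and not part of the launcher.

```python
def languages_override(languages):
    """Build the Hydra override string for a list of mC4 language codes.

    Hypothetical helper for illustration; the launcher does not ship it.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    joined = ",".join(languages)
    # The backslashes keep the single quotes intact through the shell,
    # so Hydra receives 'cs,da,...' as a single quoted string.
    return f"data_preparation.languages=\\'{joined}\\'"


# The 24 languages used in the command above.
full_list = (
    "cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,"
    "no,pl,pt,ro,ru,sk,sv,zh"
).split(",")

print(len(full_list))
print(languages_override(full_list))

# A single-language override for a quick test download.
print(languages_override(["lv"]))
```

The printed strings can be pasted into the command in place of the existing data_preparation.languages argument.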

The full dataset may not fit into BCP workspaces. NVIDIA recommends using a smaller subset of languages (for example, cs,da,de,el,fr,hi has a total size of 1 TB).

Common

Set the data preparation job’s configuration for mT5 models in the YAML file:

run:
  name: download_mc4
  results_dir: ${base_results_dir}/${.name}
  time_limit: "24:00:00"
  dependency: "singleton"
  node_array_size: 20
  cpus_per_node: 256
  workers_per_node: 4 # Number of workers per node in preprocessing step.
dataset: mc4
download_mc4: True  # Whether to download the mC4 dataset from the internet.
preprocess_data: True  # True to preprocess the data from a json.gz file, False otherwise.
mc4_dir: ${data_dir}/mc4 # Path to (m)C4 dataset repo.
git_lfs_dir: ${.mc4_dir}/lfs # Path to store git lfs files.
download_vocab_url: https://storage.googleapis.com/t5-data/vocabs/mc4.250000.100extra/sentencepiece.vocab # URL to download the vocab from.
download_tokenizer_url: https://storage.googleapis.com/t5-data/vocabs/mc4.250000.100extra/sentencepiece.model # URL to download tokenizer from
vocab_save_dir: ${.mc4_dir}/bpe
tokenizer_save_dir: ${.mc4_dir}/bpe
tokenizer_model: ${.tokenizer_save_dir}/mt5_tokenizer.model
languages: cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,no,pl,pt,ro,ru,sk,sv,zh # language list in mC4 dataset to download and preprocess. Use `all` to download and preprocess all languages or specify language list as `en,es,ko,zh,...`
use_cleaned_english: True # whether to use cleaned version of english data
softlinks_dir: ${.mc4_dir}/softlinks # Path to languages soft links for preprocessing
preprocessed_dir: ${.mc4_dir}/preprocessed
max_split_size: 200 # (GB) Each split will be preprocessed individually. Tune this down to accommodate short wall time on clusters
download_worker_mapping: ${.mc4_dir}/download_mapping
preprocess_worker_mapping: ${.mc4_dir}/preprocess_mapping
rm_downloaded: False # If False, the script keeps the downloaded files after preprocessing
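Most paths in this file are interpolations relative to data_dir and mc4_dir, so changing data_dir moves everything at once. The sketch below resolves those interpolations by hand to show where files land for a given data_dir. It is an illustration only: the launcher resolves these values through its configuration system, and the function name resolve_mc4_paths is hypothetical.

```python
from pathlib import PurePosixPath


def resolve_mc4_paths(data_dir):
    """Mirror the path interpolations of the YAML file for a given data_dir.

    Hypothetical helper; it only reproduces the relationships visible in
    the configuration above and performs no file system access.
    """
    mc4_dir = PurePosixPath(data_dir) / "mc4"
    tokenizer_save_dir = mc4_dir / "bpe"
    return {
        "mc4_dir": str(mc4_dir),
        "git_lfs_dir": str(mc4_dir / "lfs"),
        "vocab_save_dir": str(mc4_dir / "bpe"),
        "tokenizer_save_dir": str(tokenizer_save_dir),
        "tokenizer_model": str(tokenizer_save_dir / "mt5_tokenizer.model"),
        "softlinks_dir": str(mc4_dir / "softlinks"),
        "preprocessed_dir": str(mc4_dir / "preprocessed"),
        "download_worker_mapping": str(mc4_dir / "download_mapping"),
        "preprocess_worker_mapping": str(mc4_dir / "preprocess_mapping"),
    }


# /mount/data is the data workspace mount point used in the BCP command.
paths = resolve_mc4_paths("/mount/data")
for key, value in paths.items():
    print(f"{key}: {value}")
```

For data_dir=/mount/data, the tokenizer model resolves to /mount/data/mc4/bpe/mt5_tokenizer.model, which is the file to keep in the workspace for later training jobs.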