Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Data Preparation
The data_preparation configuration in conf/config.yaml specifies which file to use for data preparation configuration. The data_preparation configuration must be set to download_mc4 to prepare the mC4 dataset for mT5 models. The configuration file can be found in conf/data_preparation/download_mc4.yaml. mT5 models use the SentencePiece multilingual tokenizer.
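For reference, selecting this configuration in conf/config.yaml might look like the following sketch. It assumes hydra's defaults-list mechanism and the mt5/download_mc4 config group path that also appears in the Base Command Platform command later in this section; check your launcher version for the exact group path:
defaults:
  - data_preparation: mt5/download_mc4  # point the data preparation stage at the download_mc4 configuration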
To download a reduced portion of the dataset for running tests, set the languages configuration to download only one of the languages, e.g. by changing it to lv. The list of all 101 languages can be found in the mC4 dataset.
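For example, the relevant key in conf/data_preparation/download_mc4.yaml (shown in full under Common below) would be changed to:
languages: lv # download and preprocess only Latvian as a quick test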
Data preparation can be parallelized across multiple nodes (the default is 20), so that all language files are downloaded and preprocessed in parallel.
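The node count and the number of preprocessing workers per node are controlled by the run settings in conf/data_preparation/download_mc4.yaml; a sketch of the relevant keys, with the defaults shown under Common below:
run:
  node_array_size: 20 # number of nodes used to download and preprocess the language files in parallel
  workers_per_node: 4 # number of workers per node in the preprocessing step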
Slurm
First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and other job-related configurations in download_mc4.yaml for mT5 models.
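With those settings, the relevant part of conf/config.yaml would look roughly like this sketch (assuming the top-level cluster and cluster_type keys described above):
cluster: bcm # read the cluster settings from conf/cluster/bcm.yaml
cluster_type: bcm # Slurm-based cluster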
Example
To run only the data preparation pipeline, and not the training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- data_preparation
Then enter:
python3 main.py
Base Command Platform
To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra. By default, the data preparation script downloads the data into the data/ directory.
NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the tokenizer model file in the same workspace for later use.
The data preparation code must be launched in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.
Download the dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace's permissions.
To run the data preparation pipeline for mT5 models, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py data_preparation=mt5/download_mc4 \
stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results data_preparation.languages=\'cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,no,pl,pt,ro,ru,sk,sv,zh\' \
data_preparation.run.node_array_size=20 data_preparation.run.workers_per_node=4 >> /results/data_mt5_log.txt 2>&1
The command above assumes that you want to prepare the mC4 dataset with 24 languages, and that you mounted the data workspace in /mount/data and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_mt5_log.txt, which you can download from NGC. Any other required configuration can be added to modify the command's behavior.
The full dataset may not fit into BCP workspaces. NVIDIA recommends using a smaller subset of languages (e.g. cs,da,de,el,fr,hi has a total size of 1 TB).
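For example, to prepare only that smaller subset, the language list in the command above can be overridden as follows (a sketch; the remaining overrides are unchanged from the full command):
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py data_preparation=mt5/download_mc4 \
stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results data_preparation.languages=\'cs,da,de,el,fr,hi\' \
data_preparation.run.node_array_size=20 data_preparation.run.workers_per_node=4 >> /results/data_mt5_log.txt 2>&1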
Common
Set the data preparation job's configuration for mT5 models in conf/data_preparation/download_mc4.yaml:
run:
  name: download_mc4
  results_dir: ${base_results_dir}/${.name}
  time_limit: "24:00:00"
  dependency: "singleton"
  node_array_size: 20
  cpus_per_node: 256
  workers_per_node: 4 # Number of workers per node in the preprocessing step.

dataset: mc4
download_mc4: True # Whether to download the mC4 dataset from the internet.
preprocess_data: True # True to preprocess the data from a json.gz file, False otherwise.
mc4_dir: ${data_dir}/mc4 # Path to (m)C4 dataset repo.
git_lfs_dir: ${.mc4_dir}/lfs # Path to store Git LFS files.
download_vocab_url: https://storage.googleapis.com/t5-data/vocabs/mc4.250000.100extra/sentencepiece.vocab # URL to download the vocab from.
download_tokenizer_url: https://storage.googleapis.com/t5-data/vocabs/mc4.250000.100extra/sentencepiece.model # URL to download the tokenizer from.
vocab_save_dir: ${.mc4_dir}/bpe
tokenizer_save_dir: ${.mc4_dir}/bpe
tokenizer_model: ${.tokenizer_save_dir}/mt5_tokenizer.model
languages: cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,no,pl,pt,ro,ru,sk,sv,zh # Languages in the mC4 dataset to download and preprocess. Use `all` to download and preprocess all languages, or specify a language list such as `en,es,ko,zh,...`.
use_cleaned_english: True # Whether to use the cleaned version of the English data.
softlinks_dir: ${.mc4_dir}/softlinks # Path to language soft links for preprocessing.
preprocessed_dir: ${.mc4_dir}/preprocessed
max_split_size: 200 # (GB) Each split is preprocessed individually. Tune this down to accommodate a short wall-time limit on clusters.
download_worker_mapping: ${.mc4_dir}/download_mapping
preprocess_worker_mapping: ${.mc4_dir}/preprocess_mapping
rm_downloaded: False # Whether to remove the downloaded files after preprocessing.
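Instead of editing the YAML file, any of these values can also be overridden from the command line with hydra. The following is a sketch that reuses keys shown above and assumes the mt5/download_mc4 config group path used in the Base Command Platform command; on Slurm the stages list can alternatively be set in conf/config.yaml as in the Example section:
python3 main.py stages=[data_preparation] \
data_preparation=mt5/download_mc4 \
data_preparation.run.node_array_size=30 \
data_preparation.languages=lv \
data_preparation.rm_downloaded=True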