NeMo Framework has everything needed to train every NeMo Framework model. This includes setting up the compute cluster, downloading data, and selecting model hyperparameters. The default configurations for each model and task are tested on a regular basis and every configuration can be modified in order to train on new datasets or test new model hyperparameters.
NeMo Framework uses a set of Docker containers executed on a Slurm cluster (using the pyxis plug-in) or a Base Command Platform cluster. The training container also includes conversion scripts. The inference container comprises the NVIDIA Triton Inference Server with the FasterTransformer backend installed.
Ensure that the high-speed file system is mounted on the job submission node(s) at the same path as on the compute nodes.
Slurm
The NeMo framework codebase is included as part of the training container. To copy it to a local directory in the cluster, you must extract it from the container. You can execute the following command to copy the code to a directory named /path/to/local/dir. The NeMo framework repository for Slurm has been verified on Base Command Manager as well as Slurm-based DeepOps clusters.
srun -p <partition> -N 1 --container-mounts=/path/to/local/dir:/workspace/mount_dir --container-image=<container_tag> bash -c "cp -r /opt/NeMo-Megatron-Launcher /workspace/mount_dir/"
Install the NeMo framework scripts’ dependencies on the head node of the cluster:
pip install -r requirements.txt
You can use venv to keep your head node environment in its original state for other Python projects. If your configuration lacks pip, you can install pip from get-pip.py with just python3. To install the dependencies into a virtual environment, run the following commands before the pip install above:
python3 -m venv venv
source ./venv/bin/activate
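For example, a minimal sketch of the full sequence on a head node that lacks pip; the get-pip.py URL is the standard PyPA bootstrap location, assumed here rather than taken from the launcher:
# Bootstrap pip if it is missing, then install the launcher dependencies in an isolated venv.
curl -sSLO https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
python3 -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt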
Base Command Platform
The nemo_megatron_launcher codebase is included as part of the training container. Before you start, set up the NGC CLI and configure it as described in the Base Command Platform User Guide.
This guide's examples mainly use two Base Command Platform workspaces, one for storing the training dataset and another for storing the results, checkpoints, and logs. Therefore, start by creating these workspaces. NVIDIA suggests that you give them names like nemo_megatron_data_ws and nemo_megatron_results_ws. The Base Command Platform User Guide explains how to create and work with Base Command Platform workspaces.
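A minimal sketch of creating the two workspaces, assuming the NGC CLI workspace commands and a placeholder ACE name:
# Create one workspace for the dataset and one for results (names follow the suggestion above).
ngc workspace create --name nemo_megatron_data_ws --ace <your_ace>
ngc workspace create --name nemo_megatron_results_ws --ace <your_ace>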
Kubernetes
Data preparation, base model training, evaluation, and conversion of GPT models are currently supported on vanilla Kubernetes (k8s) clusters. The launcher scripts generate a Helm chart for each task based on the config files and launch the job using the chart.
The following is required for running jobs on Kubernetes:
- One or more DGX A100s/H100s as worker nodes
- An NFS filesystem, accessible from all worker and controller nodes, where the data and launcher scripts will be stored
- A head/controller node which has access to the worker nodes, can run kubectl and helm to launch jobs, and can install Python dependencies
- Recent versions of the GPU, Network, and KubeFlow Operators installed
A secret key must be configured to allow Kubernetes to pull from the private registry. For example, if pulling the container directly from NGC, create a secret to authenticate with the private NGC registry, such as the following:
kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>
The launcher scripts need to be downloaded to the NFS filesystem that is connected to the worker nodes. They can either be copied from /opt/NeMo-Megatron-Launcher inside the training container or obtained by cloning this repository.
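For instance, a minimal sketch of extracting the code onto the NFS share from a node with Docker and the NFS path mounted; the NFS path and container tag are placeholders:
# Copy the launcher code out of the training container onto the NFS filesystem.
docker run --rm -v /path/to/nfs/dir:/workspace/mount_dir <container_tag> \
  bash -c "cp -r /opt/NeMo-Megatron-Launcher /workspace/mount_dir/"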
Install the NeMo Framework scripts' dependencies on the head node/controller of the cluster where jobs will be launched:
pip install -r requirements.txt
General Configuration
All config files referenced in this section and throughout are in launcher_scripts (except for the AutoConfigurator's config files, which are in auto_configurator).
The first configuration you must set is launcher_scripts_path in the file conf/config.yaml. This configuration must point to the absolute path where the nemo_megatron_launcher repository is stored in the file system.
Additionally, if you are using a Slurm-based cluster, the file conf/cluster/bcm.yaml has the configurations needed to set generic cluster-related information, such as the partition or account.
For Kubernetes clusters, the k8s configuration file (conf/cluster/k8s.yaml) has the necessary parameters to configure the scripts to run on the cluster, such as ib_resource_name, nfs_server, and pull_secret.
The NUMA mapping can also be configured from the file conf/config.yaml. The mapping should be automatic; the code will read the number of CPU cores available in your cluster and provide the best possible mapping to maximize performance. The mapping is enabled by default, but it can be disabled by setting enable: False in the numa_mapping section of conf/config.yaml. The type of mapping can also be configured using the same file. See the full configurations below:
numa_mapping:
enable: True # Set to False to disable all mapping (performance will suffer).
mode: unique_contiguous # One of: all, single, single_unique, unique_interleaved or unique_contiguous.
scope: node # Either node or socket.
cores: all_logical # Either all_logical or single_logical.
  balanced: True # Whether to assign an equal number of physical cores to each process.
min_cores: 1 # Minimum number of physical cores per process.
max_cores: 8 # Maximum number of physical cores per process. Can be null to use all available cores.
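For example, a minimal sketch of disabling the NUMA mapping for a single run with a hydra command-line override instead of editing conf/config.yaml:
# Disable NUMA mapping for this run only; all other settings keep their defaults.
python3 main.py numa_mapping.enable=False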
Slurm: The launcher_scripts_path will automatically be mounted to the container at the same path as in the local file system. Any additional directories that should be mounted must be specified using the container_mounts configuration. If a path contains the colon character (:), the code assumes both the source and destination paths are provided. Otherwise, the given path is mounted to the same path inside the container. The data_dir configuration can also be modified to point to where the dataset will be loaded from or saved. The base_results_dir can also be modified to point to where the results, checkpoints, and logs will be stored. These last two directories are automatically mounted into the container. The configurations cluster and cluster_type must be set to bcm for all of the tasks.
Base Command Platform: Set launcher_scripts_path to /opt/NeMo-Megatron-Launcher/launcher_scripts, the default location of the scripts in the container.
Kubernetes: The launcher_scripts_path parameter needs to be set to the NFS path where the NeMo-Megatron-Launcher code is located.
Modify data_dir to point to the location from which the dataset is loaded or to which it is saved, and base_results_dir to the directory in which results, checkpoints, and logs are to be stored. In the case of Base Command Platform, NVIDIA recommends that you point data_dir to one of the workspaces and base_results_dir to the other. Both must be mounted in read-write mode. The configuration cluster_type must be set to bcp for all of the tasks.
The file main.py is the top-level file that is executed to run the data preparation, training, conversion, fine-tuning, and evaluation pipelines. The file conf/config.yaml contains configurations that determine which of these pipelines is run. In Slurm-based clusters all of them can be set to True at the same time, and they will be executed in order. In Base Command Platform, though, only one of them may be set to True at a given time.
Settings for GPT Models: To use the default settings for GPT models, set conf/config.yaml as below:
stages:
- data_preparation
- training
- conversion
- evaluation
- export
Settings for T5 Models: To use the default settings for T5 models, set conf/config.yaml as below:
# default values:
cluster: bcm # Leave it as bcm even if using bcp. It will be ignored for bcp.
data_preparation: t5/download_t5_pile
training: t5/220m
conversion: t5/convert_t5
fine_tuning: t5/squad
evaluation: t5/squad
export: t5/export_t5
stages:
- data_preparation
- training
- conversion
- fine_tuning
- prompt_learning
- evaluation
- export
Settings for mT5 Models: To use the default settings for mT5 models, set conf/config.yaml as below:
# default values:
cluster: bcm # Leave it as bcm even if using bcp. It will be ignored for bcp.
data_preparation: mt5/download_mc4
training: mt5/390m
conversion: mt5/convert_mt5
fine_tuning: mt5/xquad
evaluation: mt5/xquad
export: mt5/export_mt5
stages:
- data_preparation
- training
- conversion
- fine_tuning
- prompt_learning
- evaluation
- export
Settings for BERT Models: To use the default settings for BERT models, set conf/config.yaml as below:
# default values:
cluster: bcm # Leave it as bcm even if using bcp. It will be ignored for bcp.
data_preparation: bert/download_bert_pile
training: bert/4b
stages:
- data_preparation
- training
To run the pipelines, execute:
python3 main.py
The entire repository uses hydra/omegaconf to handle job configuration using YAML files, so look at the documentation for those projects to learn more.
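For example, a minimal sketch of selecting the stages from the command line with a hydra override instead of editing conf/config.yaml:
# Run only data preparation and training for this invocation.
python3 main.py stages=[data_preparation,training]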
You must prepare several datasets for the NeMo framework to use, depending on the type of model you are using.
The Pile: NVIDIA provides utilities to download and prepare the Pile dataset. This dataset comprises 22 smaller datasets, blended by using the mix described in the paper The Pile: An 800GB Dataset of Diverse Text for Language Modeling, by Gao et al. The download scripts pull the data from a public mirror (see the_pile_url in the configuration below).
NVIDIA recommends that the NeMo-Megatron-Launcher repository and the datasets are stored in a file system shared by all of the nodes during training. A shared workspace with read-write permissions is recommended in the case of Base Command Platform-based clusters.
The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each shard takes approximately one hour assuming a 30 MB/sec download speed. The data preparation can be parallelized by using up to 30 nodes.
mC4: The Multilingual C4 (mC4) dataset has 101 languages and is generated from 71 Common Crawl dumps. NVIDIA provides utilities to download and prepare the mC4 dataset (allen-ai version). NVIDIA recommends that this dataset be stored in a file system shared by all of the nodes. A shared workspace with read-write permissions is recommended in the case of Base Command Platform-based clusters.
NVIDIA provides scripts that give you the option to download and preprocess any subset of the dataset’s 101 languages. A selection of 24 languages are included in the default language list. The raw size of the default language set is around 5 TB.
Parallelization is enabled in the downloading and preprocessing scripts. It provides a significant speed-up by automatically distributing and balancing the work on multi-node systems. Downloading and preprocessing the default language list takes approximately 7 hours, assuming a 30 MB/sec download speed and parallelization using 20 nodes.
The preprocessed dataset’s size is around 12 TB. NVIDIA recommends that you use a file system with more than 20 TB of free space to prepare the data.
NVIDIA currently does not support training with more than 25 languages. (See Known Issues.)
All config files for data preparation are in launcher_scripts. The configuration used for data preparation for the Pile dataset or mC4 dataset must be specified in the conf/config.yaml file, and data_preparation must be included in stages to run it.
Data Preparation for GPT Models
The data_preparation configuration in conf/config.yaml specifies the data preparation configuration file that will be used. Its default value is download_gpt3_pile, which corresponds to the file conf/data_preparation/download_gpt3_pile.yaml.
The configurations in the data preparation configuration file are used to download, extract, and preprocess the Pile dataset for GPT models. Modify these configurations to control data preparation tasks and to specify where to store the datasets, vocabulary, and merge files.
To download a reduced portion of the dataset to run tests, you can set the file_numbers configuration to download only one of the shards by changing "0-29" to "0". The value must be a combination of individual shard numbers and hyphenated ranges of shards, separated by commas (","). For example, this configuration value would download and prepare files 0, 3, 5, 6, and 7:
file_numbers="0,3,5-7"
Slurm
First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and other configurations related to the job in download_gpt3_pile.yaml for GPT models.
You can parallelize data preparation by using up to 30 nodes to download all 30 files in parallel.
Example
To run only the data preparation pipeline and not the training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- data_preparation
Then enter:
python3 main.py
Base Command Platform
To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra. By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the vocabulary and merge files in the same workspace for later use.
You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.
You can download the 700+ GB dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace's permissions.
To run the data preparation pipeline for GPT models, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
data_preparation.vocab_save_dir=/mount/data/bpe data_preparation.merges_save_dir=/mount/data/bpe >> /results/data_gpt3_log.txt 2>&1
The command above assumes that you want to prepare the entire dataset (files 0-29), that you mounted the data workspace in /mount/data, and that you mounted the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_gpt3_log.txt, which you can download from NGC. You can add any other configuration required to modify the command's behavior.
Kubernetes
To run data preparation on a Kubernetes cluster, set both the cluster and cluster_type parameters to k8s in conf/config.yaml. Additionally, set the launcher_scripts_path parameter to the location where the launcher scripts are located on the NFS filesystem. This must be the same path on all nodes in the cluster. Ensure the stages parameter is set to data_preparation and data_preparation in the defaults section points to the intended data preparation script.
The conf/cluster/k8s.yaml file also needs to be updated with the Kubernetes container registry secret if created earlier (pull_secret), the shm_size to determine how much local memory to put in each pod, and the NFS server and path to where the launcher scripts are saved. These can all be overridden from the command line using hydra as well.
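A minimal sketch of such command-line overrides is shown below; the values are hypothetical, and the exact key names should be checked against conf/cluster/k8s.yaml:
# Hypothetical values: select the k8s cluster config and override its registry secret, shared memory size, and NFS server.
python main.py cluster=k8s cluster_type=k8s \
  cluster.pull_secret=ngc-registry \
  cluster.shm_size=512Gi \
  cluster.nfs_server=10.0.0.10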
Once all of the config files are updated, the data preparation can be launched from the controller node with:
python main.py
This will generate and launch a job via Helm in the default namespace, which can be viewed with helm list or kubectl get pods. The logs can be followed with kubectl logs -f <pod-name> for the first pod deployed for the job.
Common
Set the data preparation job’s configuration for GPT models in the YAML file:
run:
  name: download_gpt3_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.
dataset: pile
download_the_pile: True # Whether to download the pile dataset from the internet.
the_pile_url: "https://mystic.the-eye.eu/public/AI/pile/train/" # Source URL to download The Pile dataset from.
file_numbers: "0-29" # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://huggingface.co/gpt2/resolve/main/vocab.json" # URL to download the vocab from.
download_merges_url: "https://huggingface.co/gpt2/resolve/main/merges.txt" # URL to download the merges from.
vocab_save_dir: ${data_dir}/bpe
merges_save_dir: ${data_dir}/bpe
tokenizer_type: GPT2BPETokenizer
rm_downloaded: True # Extract script will remove downloaded zst after extraction
rm_extracted: True # Preprocess script will remove extracted files after preproc.
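For example, a minimal sketch of overriding the defaults above to prepare only the first Pile shard for a quick test (Slurm, with the paths already set in conf/config.yaml):
# Download, extract, and preprocess only shard 0 instead of all 30 shards.
python3 main.py stages=[data_preparation] data_preparation.file_numbers='0'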
Data Preparation for T5 Models
The data_preparation configuration in conf/config.yaml specifies the file to use for data preparation configuration. The data_preparation configuration must be specified as t5/download_t5_pile to prepare the Pile dataset for T5 models. The configuration file is at conf/data_preparation/t5/download_t5_pile.yaml.
GPT models and T5 models use different tokenizer and vocab files. The default values are in the corresponding configuration files.
To download a reduced portion of the dataset for testing, set the file_numbers configuration to "0" to download only shard 0. The value must be a combination of numbers representing individual shards and hyphenated ranges representing ranges of shards, separated by commas (","). For example, this setting downloads and prepares files 0, 3, 5, 6, and 7:
file_numbers="0,3,5-7"
Slurm
First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and other configurations related to the job in download_t5_pile.yaml for T5 models.
You can parallelize data preparation by using up to 30 nodes to download all 30 files in parallel.
Example
To run only the data preparation pipeline and not the training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- data_preparation
Then enter:
python3 main.py
Base Command Platform
To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra. By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the vocab and merge files in the same workspace for later use.
You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.
You can download the 700+ GB dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace's permissions.
To run the data preparation pipeline for T5 models, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py data_preparation=t5/download_t5_pile \
stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
data_preparation.vocab_save_dir=/mount/data/bpe >> /results/data_t5_log.txt 2>&1
The command above assumes that you want to prepare the entire dataset (files 0-29), that you mounted the data workspace in /mount/data, and that you mounted the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_t5_log.txt, which you can download from NGC. Any other required configuration can be added to modify the command's behavior.
Common
Set the data preparation job’s configuration for T5 models in the YAML file:
dataset: pile
download_the_pile: True # Whether to download the pile dataset from the internet.
the_pile_url: "https://mystic.the-eye.eu/public/AI/pile/train/" # Source URL to download The Pile dataset from.
file_numbers: "0-29" # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt" # URL to download the vocab from.
download_merges_url: null
vocab_save_dir: ${data_dir}/bpe
merges_save_dir: ${data_dir}/bpe
tokenizer_type: BertWordPieceCase # T5 models use BertWordPieceCase tokenizer
log_dir: ${base_results_dir}/data_preparation/t5_pile_logs # Where to save the logs
rm_downloaded: True # Extract script will remove downloaded zst after extraction
rm_extracted: True # Preprocess script will remove extracted files after preproc.
nodes: 30
time_limit: "4:00:00"
bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.
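For example, a minimal sketch of selecting the T5 data preparation configuration and preparing only shard 0 for a quick test (Slurm, with the paths already set in conf/config.yaml):
# Use the T5 Pile configuration and limit the download to shard 0.
python3 main.py stages=[data_preparation] data_preparation=t5/download_t5_pile \
  data_preparation.file_numbers='0'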
Data Preparation for mT5 Models
The data_preparation configuration in conf/config.yaml specifies which file to use for data preparation configuration. The data_preparation configuration must be specified as mt5/download_mc4 to prepare the mC4 dataset for mT5 models. The configuration file can be found in conf/data_preparation/mt5/download_mc4.yaml. mT5 models use the SentencePiece multilingual tokenizer.
To download a reduced portion of the dataset to run tests, set the languages configuration to download only one of the languages by changing it to lv. The list of all 101 languages can be found in the mC4 dataset.
Parallelize data preparation by using multiple nodes (the default is 20 nodes) to download and preprocess all language files in parallel.
Slurm
First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and other configurations related to the job in download_mc4.yaml for mT5 models.
Example
To run only the data preparation pipeline and not the training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- data_preparation
Then enter:
python3 main.py
Base Command Platform
To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra. By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the tokenizer model file in the same workspace for later use.
The data preparation code must be launched in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.
Download the dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace's permissions.
To run the data preparation pipeline for mT5 models, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py data_preparation=mt5/download_mc4 \
stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results data_preparation.languages=\'cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,no,pl,pt,ro,ru,sk,sv,zh\' \
data_preparation.run.node_array_size=20 data_preparation.run.workers_per_node=4 >> /results/data_mt5_log.txt 2>&1
The command above assumes that you want to prepare the mC4 dataset with 24 languages, that you mounted the data workspace in /mount/data, and that you mounted the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_mt5_log.txt, which you can download from NGC. Any other required configuration can be added to modify the command's behavior.
The full dataset may not fit into BCP workspaces. NVIDIA recommends using a smaller subset of languages (e.g. cs,da,de,el,fr,hi has a total size of 1 TB).
Common
Set the data preparation job’s configuration for mT5 models in the YAML file:
run:
  name: download_mc4
  results_dir: ${base_results_dir}/${.name}
  time_limit: "24:00:00"
  dependency: "singleton"
  node_array_size: 20
  cpus_per_node: 256
  workers_per_node: 4 # Number of workers per node in preprocessing step.
dataset: mc4
download_mc4: True # Whether to download the mC4 dataset from the internet.
preprocess_data: True # True to preprocess the data from a json.gz file, False otherwise.
mc4_dir: ${data_dir}/mc4 # Path to (m)C4 dataset repo.
git_lfs_dir: ${.mc4_dir}/lfs # Path to store git lfs files.
download_vocab_url: https://storage.googleapis.com/t5-data/vocabs/mc4.250000.100extra/sentencepiece.vocab # URL to download the vocab from.
download_tokenizer_url: https://storage.googleapis.com/t5-data/vocabs/mc4.250000.100extra/sentencepiece.model # URL to download tokenizer from
vocab_save_dir: ${.mc4_dir}/bpe
tokenizer_save_dir: ${.mc4_dir}/bpe
tokenizer_model: ${.tokenizer_save_dir}/mt5_tokenizer.model
languages: cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,no,pl,pt,ro,ru,sk,sv,zh # language list in mC4 dataset to download and preprocess. Use `all` to download and preprocess all languages or specify language list as `en,es,ko,zh,...`
use_cleaned_english: True # whether to use cleaned version of english data
softlinks_dir: ${.mc4_dir}/softlinks # Path to languages soft links for preprocessing
preprocessed_dir: ${.mc4_dir}/preprocessed
max_split_size: 200 # (GB) Each split will be preprocessed individually. Tune this down to accommodate short wall time on clusters
download_worker_mapping: ${.mc4_dir}/download_mapping
preprocess_worker_mapping: ${.mc4_dir}/preprocess_mapping
rm_downloaded: False # Script will not remove downloaded after preprocessing
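For example, a minimal sketch of preparing only Latvian (lv) on two nodes as a quick test, overriding the defaults above (Slurm, with the paths already set in conf/config.yaml):
# Download and preprocess a single language on 2 nodes instead of the default 24 languages on 20 nodes.
python3 main.py stages=[data_preparation] data_preparation=mt5/download_mc4 \
  data_preparation.languages=lv data_preparation.run.node_array_size=2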
Data Preparation for BERT Models
The data_preparation configuration in conf/config.yaml specifies which file to use for data preparation configuration purposes. The default value is set to bert/download_bert_pile, which can be found in conf/data_preparation/bert/download_bert_pile.yaml. It is used to download, extract, and preprocess the Pile dataset for BERT models. Modify the configurations to perform different tasks and to decide where to store the datasets, vocab, etc.
To download a reduced portion of the dataset to run tests, you can set the file_numbers configuration to download only one of the shards by changing the value from "0-29" to "0". The value must be a combination of individual shard numbers and hyphenated ranges, separated by commas (","). For example, file_numbers="0,3,5-7" will download and prepare files 0, 3, 5, 6, and 7.
Slurm
First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and other configurations related to the job in download_bert_pile.yaml for BERT models. You can parallelize data preparation by using up to 30 nodes to download all 30 files in parallel.
Example
To run only the data preparation pipeline and not the training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- data_preparation
Then enter:
python3 main.py
Base Command Platform
To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra. By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the vocab and merge files in the same workspace for later use.
You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.
You can download the 700+ GB dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace's permissions.
To run the data preparation pipeline for BERT models, enter:
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_bert \
base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
data_preparation.vocab_save_dir=/mount/data/bpe data_preparation.merges_save_dir=/mount/data/bpe >> /results/data_bert_log.txt 2>&1
The command above assumes that you want to prepare the entire dataset (files 0-29), that you mounted the data workspace in /mount/data, and that you mounted the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_bert_log.txt, which you can download from NGC. You can add any other configuration required to modify the command's behavior.
Common
Set the data preparation job’s configuration for BERT models in the YAML file:
run:
  name: download_bert_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.
dataset: pile
download_the_pile: True # Whether to download the pile dataset from the internet.
the_pile_url: "https://mystic.the-eye.eu/public/AI/pile/train/" # Source URL to download The Pile dataset from.
file_numbers: "0-29" # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt" # URL to download the vocab from.
vocab_save_dir: ${data_dir}/bpe
tokenizer_type: BertWordPieceLowerCase
rm_downloaded: True # Extract script will remove downloaded zst after extraction
rm_extracted: True # Preprocess script will remove extracted files after preproc.
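For example, a minimal sketch of preparing only shard 0 for a quick BERT test, overriding the defaults above (Slurm, with the paths already set in conf/config.yaml):
# Limit the BERT Pile preparation to shard 0.
python3 main.py stages=[data_preparation] data_preparation=bert/download_bert_pile \
  data_preparation.file_numbers='0'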