Training NeMo Framework Models

NeMo Framework provides everything needed to train every NeMo Framework model, including setting up the compute cluster, downloading data, and selecting model hyperparameters. The default configurations for each model and task are tested on a regular basis, and every configuration can be modified to train on new datasets or to test new model hyperparameters.

NeMo Framework uses a set of Docker containers executed on a Slurm cluster (using the pyxis plug-in) or a Base Command Platform cluster. The training container also includes conversion scripts. The inference container comprises the NVIDIA Triton Inference Server with the FasterTransformer backend installed.

Note

Ensure that the high-speed file system is mounted on the job submission node(s) at the same path as on the compute nodes.

Slurm

The NeMo framework codebase is included as part of the training container. To copy it to a local directory on the cluster, you must extract it from the container. The following command copies the code to a directory named /path/to/local/dir. The NeMo framework repository for Slurm has been verified on Base Command Manager as well as Slurm-based DeepOps clusters.

srun -p <partition> -N 1 --container-mounts=/path/to/local/dir:/workspace/mount_dir --container-image=<container_tag> bash -c "cp -r /opt/NeMo-Megatron-Launcher /workspace/mount_dir/"

Install the NeMo framework scripts’ dependencies on the head node of the cluster:

pip install -r requirements.txt

You can use venv to keep your head node environment in its original state for other Python projects. If your configuration lacks pip, you can install it from get_pip.py using only python3. To install the dependencies in a virtual environment, run the following commands before the pip install above:

python3 -m venv venv
source ./venv/bin/activate

Base Command Platform

The nemo_megatron_launcher codebase is included as part of the training container. Before you start, set up the NGC CLI and configure it as described in Base Command Platform User Guide.

This guide’s examples mainly use two Base Command Platform workspaces, one for storing the training dataset, and another for storing the results, checkpoints, and logs. Therefore, start by creating these workspaces. NVIDIA suggests that you give them names like nemo_megatron_data_ws and nemo_megatron_results_ws. Base Command Platform User Guide explains how to create and work with Base Command Platform workspaces.

Kubernetes

Data preparation, base model training, evaluation, and conversion of GPT models are currently supported on vanilla Kubernetes (k8s) clusters. The launcher scripts generate a Helm chart for each task based on the config files and launch the job using the chart.

The following is required for running jobs on Kubernetes:
  • One or more DGX A100s/H100s as worker nodes

  • An NFS filesystem where the data and launcher scripts will be stored which is accessible on all worker and controller nodes

  • A head/controller node that has access to the worker nodes, can run kubectl and helm to launch jobs, and can install Python dependencies

  • Recent versions of the GPU, Network, and KubeFlow Operators installed

A secret key must be configured to allow Kubernetes to pull from the private registry. For example, if pulling the container directly from NGC, create a secret to authenticate with the private NGC registry, such as the following:

kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>

The launcher scripts need to be downloaded to the NFS filesystem that is connected to the worker nodes. They can either be copied from /opt/NeMo-Megatron-Launcher inside the training container or obtained by cloning this repository.

Install the NeMo Framework scripts dependencies on the head node/controller of the cluster where jobs will be launched:

pip install -r requirements.txt

General Configuration

All config files referenced in this section and throughout are in launcher_scripts (except for the AutoConfigurator’s config files which are in auto_configurator).

The first configuration you must set is launcher_scripts_path in the file conf/config.yaml. This configuration must point to the absolute path where the nemo_megatron_launcher repository is stored in the file system.

Additionally, if you are using a Slurm-based cluster, the file (conf/cluster/bcm.yaml) has the configurations needed to set generic cluster related information, such as the partition or account.

For Kubernetes clusters, the k8s configuration file (conf/cluster/k8s.yaml) has necessary parameters to configure the scripts to run on the cluster, such as ib_resource_name, nfs_server, and pull_secret.

The NUMA mapping can also be configured from the file conf/config.yaml. The mapping should be automatic; the code will read the number of CPU cores available in your cluster, and provide the best possible mapping, to maximize performance. The mapping is enabled by default, but it can be disabled by setting enable: False in the numa_mapping section of conf/config.yaml. The type of mapping can also be configured using the same file. See the full configurations below:

numa_mapping:
  enable: True  # Set to False to disable all mapping (performance will suffer).
  mode: unique_contiguous  # One of: all, single, single_unique, unique_interleaved or unique_contiguous.
  scope: node  # Either node or socket.
  cores: all_logical  # Either all_logical or single_logical.
  balanced: True  # Whether to assign an equal number of physical cores to each process.
  min_cores: 1  # Minimum number of physical cores per process.
  max_cores: 8  # Maximum number of physical cores per process. Can be null to use all available cores.

  • Slurm: The launcher_scripts_path will automatically be mounted to the container at the same path as in the local file system. Any additional directories that should be mounted must be specified using the container_mounts configuration. If the paths contain the colon character (:), the code will assume both the source and destination paths are provided. Otherwise, the given paths will be mounted to the same path inside the container. The data_dir configuration can also be modified to point to where the dataset will be loaded from or saved. The base_results_dir can also be modified to point to where the results, checkpoints and logs will be stored. These last two directories will be automatically mounted into the container. The configurations cluster and cluster_type must be set to bcm for all of the tasks.

  • Base Command Platform: Set launcher_scripts_path to /opt/NeMo-Megatron-Launcher/launcher_scripts, the default location of the scripts in the container.

  • Kubernetes: The launcher_scripts_path parameter needs to be set to the NFS path where the NeMo-Megatron-Launcher code is located.

    Modify data_dir to point to the location from which the dataset is loaded or to which it is saved, and base_results_dir to the directory in which results, checkpoints, and logs are to be stored. In the case of Base Command Platform, NVIDIA recommends that you point data_dir to one of the workspaces and base_results_dir to the other. Both must be mounted in read-write mode. The configuration cluster_type must be set to bcp for all of the tasks.

    The file main.py is the top-level file that is executed to run the data preparation, training, conversion, fine-tuning, and evaluation pipelines. The file conf/config.yaml contains configurations that determine which of these pipelines is run. In Slurm-based clusters all of them can be set to True at the same time, and they will be executed in order. In Base Command Platform, though, only one of them may be set to True at a given time.
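For Slurm-based clusters, the container_mounts rules described above can be illustrated with the following hypothetical fragment of conf/config.yaml (the paths are made up for the example):

```yaml
container_mounts:
  - /lustre/datasets:/datasets  # contains ':': mounted at /datasets inside the container
  - /lustre/tools               # no ':': mounted at /lustre/tools inside the container
```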

  • Settings for GPT Models: To use the default settings for GPT models, set config/config.yaml as below:

    stages:
      - data_preparation
      - training
      - conversion
      - evaluation
      - export

  • Settings for T5 Models: To use the default settings for T5 models, set config/config.yaml as below:

    # default values:
    cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
    data_preparation: t5/download_t5_pile
    training: t5/220m
    conversion: t5/convert_t5
    fine_tuning: t5/squad
    evaluation: t5/squad
    export: t5/export_t5

    stages:
      - data_preparation
      - training
      - conversion
      - fine_tuning
      - prompt_learning
      - evaluation
      - export

  • Settings for mT5 Models: To use the default settings for mT5 models, set config/config.yaml as below:

    # default values:
    cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
    data_preparation: mt5/download_mc4
    training: mt5/390m
    conversion: mt5/convert_mt5
    fine_tuning: mt5/xquad
    evaluation: mt5/xquad
    export: mt5/export_mt5

    stages:
      - data_preparation
      - training
      - conversion
      - fine_tuning
      - prompt_learning
      - evaluation
      - export

  • Settings for BERT Models: To use the default settings for Bert models, set config/config.yaml as below:

    # default values:
    cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
    data_preparation: bert/download_bert_pile
    training: bert/4b

    stages:
      - data_preparation
      - training

To run the pipelines, execute:

python3 main.py

The entire repository uses hydra/omegaconf to handle job configuration using YAML files, so look at the documentation for those projects to learn more.

You must prepare several datasets for the NeMo framework to use, depending on the type of model you are using.

  • The Pile: NVIDIA provides utilities to download and prepare the Pile dataset. This dataset comprises 22 smaller datasets, blended using the mix described in the paper The Pile: An 800GB Dataset of Diverse Text for Language Modeling by Gao et al., mirrored here.

    NVIDIA recommends that the NeMo-Megatron-Launcher repository and the datasets are stored in a file system shared by all of the nodes during training. A shared workspace with read-write permissions is recommended in the case of Base Command Platform-based clusters.

    The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each shard takes approximately one hour assuming a 30 MB/sec download speed. The data preparation can be parallelized by using up to 30 nodes.

  • mC4: The Multilingual C4 (mC4) dataset has 101 languages and is generated from 71 Common Crawl dumps. NVIDIA provides utilities to download and prepare the mC4 dataset (allen-ai version). NVIDIA recommends that this dataset be stored in a file system shared by all of the nodes. A shared workspace with read-write permissions is recommended in the case of Base Command Platform-based clusters.

    NVIDIA provides scripts that give you the option to download and preprocess any subset of the dataset’s 101 languages. A selection of 24 languages is included in the default language list. The raw size of the default language set is around 5 TB.

    Parallelization is enabled in the downloading and preprocessing scripts. It provides a significant speed-up by automatically distributing and balancing the work on multi-node systems. Downloading and preprocessing the default language list takes approximately 7 hours, assuming a 30 MB/sec download speed and parallelization using 20 nodes.

    The preprocessed dataset’s size is around 12 TB. NVIDIA recommends that you use a file system with more than 20 TB of free space to prepare the data.

    NVIDIA currently does not support training with more than 25 languages. (See Known Issues.)

All config files for data preparation are in launcher_scripts.

The configuration used for data preparation for the Pile dataset or mC4 dataset must be specified in the conf/config.yaml file and data_preparation must be included in stages to run it.
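Putting these two requirements together, the relevant entries of conf/config.yaml might look like the following sketch (surrounding entries are omitted):

```yaml
defaults:
  - data_preparation: download_gpt3_pile  # or, e.g., t5/download_t5_pile for T5 models

stages:
  - data_preparation
```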

Data Preparation for GPT Models

The data_preparation configuration in conf/config.yaml specifies the data preparation configuration file that will be used. Its default value is download_gpt3_pile, which corresponds to the file conf/data_preparation/download_gpt3_pile.yaml.

The configurations in the data preparation configuration file are used to download, extract, and preprocess the Pile dataset for GPT models. Modify these configurations to control data preparation tasks and to specify where to store the datasets, vocabulary, and merge files.

To download a reduced portion of the dataset to run tests, you can set the file_numbers configuration to download only one of the shards by changing "0-29" to "0". The value must be a combination of numbers and hyphenated ranges separated by commas (","). For example, this configuration value would download and prepare files 0, 3, 5, 6, and 7:

file_numbers="0,3,5-7"
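The accepted syntax can be made concrete with a small, hypothetical parser; the launcher's actual expansion logic may differ, but the behavior sketched here matches the description above:

```python
def expand_file_numbers(spec: str) -> list[int]:
    """Expand a shard specification such as "0,3,5-7" into [0, 3, 5, 6, 7]."""
    numbers = []
    for part in spec.split(","):
        if "-" in part:  # hyphenated range, inclusive on both ends
            start, end = part.split("-")
            numbers.extend(range(int(start), int(end) + 1))
        else:            # single shard number
            numbers.append(int(part))
    return numbers

print(expand_file_numbers("0,3,5-7"))  # [0, 3, 5, 6, 7]
print(expand_file_numbers("0-29"))     # all 30 shards, 0 through 29
```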

Slurm

First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and other configurations related to the job in download_gpt3_pile.yaml for GPT models.

You can parallelize data preparation by using up to 30 nodes to download all 30 files in parallel.

Example

To run only the data preparation pipeline and not the training, evaluation or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - data_preparation

Then enter:

python3 main.py

Base Command Platform

To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra.

By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the vocabulary and merge files in the same workspace for later use.

You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.

You can download the 700+ GB dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace’s permissions.

To run the data preparation pipeline for GPT models, enter:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=<data_preparation> \
    cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
    base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
    data_preparation.vocab_save_dir=/mount/data/bpe data_preparation.merges_save_dir=/mount/data/bpe >> /results/data_gpt3_log.txt 2>&1

The command above assumes that you want to prepare the entire dataset (files 0-29), and that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_gpt3_log.txt, which you can download from NGC. You can add any other configuration required to modify the command’s behavior.

Kubernetes

To run data preparation on a Kubernetes cluster, set both the cluster and cluster_type parameters to k8s in conf/config.yaml. Additionally, set the launcher_scripts_path parameter to the location of the launcher scripts on the NFS filesystem. This must be the same path on all nodes in the cluster. Ensure that the stages parameter is set to data_preparation and that data_preparation in the defaults section points to the intended data preparation script.

The conf/cluster/k8s.yaml file also needs to be updated with the Kubernetes container registry secret created earlier (pull_secret), the shm_size, which determines how much shared memory to allocate to each pod, and the NFS server and path where the launcher scripts are saved. These can all be overridden from the command line using hydra as well.

Once all of the config files are updated, the data preparation can be launched from the controller node with:

python main.py

This will generate and launch a job via Helm in the default namespace which can be viewed with helm show or kubectl get pods. The logs can be followed with kubectl logs <pod-name> for the first pod deployed for the job.

Common

Set the data preparation job’s configuration for GPT models in the YAML file:

run:
  name: download_gpt3_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2  # 2 should be safe to use and x2 times faster.

dataset: pile
download_the_pile: True  # Whether to download the pile dataset from the internet.
the_pile_url: "https://mystic.the-eye.eu/public/AI/pile/train/"  # Source URL to download The Pile dataset from.
file_numbers: "0-29"  # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True  # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://huggingface.co/gpt2/resolve/main/vocab.json"  # URL to download the vocab from.
download_merges_url: "https://huggingface.co/gpt2/resolve/main/merges.txt"  # URL to download the merges from.
vocab_save_dir: ${data_dir}/bpe
merges_save_dir: ${data_dir}/bpe
tokenizer_type: GPT2BPETokenizer
rm_downloaded: True  # Extract script will remove downloaded zst after extraction
rm_extracted: True  # Preprocess script will remove extracted files after preproc.

Data Preparation for T5 Models

The data_preparation configuration in conf/config.yaml specifies the file to use for data preparation configuration. The data_preparation configuration must be specified as t5/download_t5_pile for preparing the Pile dataset for T5 models. The configuration file is at conf/data_preparation/t5/download_t5_pile.yaml.

GPT models and T5 models use different tokenizer and vocab files. The default values are in the corresponding configuration files.

To download a reduced portion of the dataset for testing, set the file_numbers configuration to "0" to download only shard 0. The value must be a combination of numbers representing shards and hyphenated ranges representing ranges of shards, separated by commas (","). For example, this setting downloads and prepares files 0, 3, 5, 6, and 7:

file_numbers="0,3,5-7"

Slurm

First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and other configurations related to the job in download_t5_pile.yaml for T5 models.

You can parallelize data preparation by using up to 30 nodes to download all 30 files in parallel.

Example

To run only the data preparation pipeline and not the training, evaluation or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - data_preparation

Then enter:

python3 main.py

Base Command Platform

To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra.

By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the vocab and merge files in the same workspace for later use.

You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.

You can download the 700+ GB dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace’s permissions.

To run the data preparation pipeline for T5 models, enter:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py data_preparation=t5/download_t5_pile \
    stages=<data_preparation> \
    cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
    base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
    data_preparation.vocab_save_dir=/mount/data/bpe >> /results/data_t5_log.txt 2>&1

The command above assumes that you want to prepare the entire dataset (files 0-29), and that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_t5_log.txt, which you can download from NGC. Any other required configuration can be added to modify the command’s behavior.

Common

Set the data preparation job’s configuration for T5 models in the YAML file:

dataset: pile
download_the_pile: True  # Whether to download the pile dataset from the internet.
the_pile_url: "https://mystic.the-eye.eu/public/AI/pile/train/"  # Source URL to download The Pile dataset from.
file_numbers: "0-29"  # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True  # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt"  # URL to download the vocab from.
download_merges_url: null
vocab_save_dir: ${data_dir}/bpe
merges_save_dir: ${data_dir}/bpe
tokenizer_type: BertWordPieceCase  # T5 models use BertWordPieceCase tokenizer
log_dir: ${base_results_dir}/data_preparation/t5_pile_logs  # Where to save the logs
rm_downloaded: True  # Extract script will remove downloaded zst after extraction
rm_extracted: True  # Preprocess script will remove extracted files after preproc.
nodes: 30
time_limit: "4:00:00"
bcp_preproc_npernode: 2  # 2 should be safe to use and x2 times faster.

Data Preparation for mT5 Models

The data_preparation configuration in conf/config.yaml specifies which file to use for data preparation configuration. The data_preparation configuration must be specified as mt5/download_mc4 to prepare the mC4 dataset for mT5 models. The configuration file can be found in conf/data_preparation/mt5/download_mc4.yaml. mT5 models use the SentencePiece multilingual tokenizer.

To download a reduced portion of the dataset to run tests, set the languages configuration to download only one of the languages by changing it to lv. The list of all 101 languages can be found in the mC4 dataset.

Parallelize data preparation by using multiple nodes (the default is 20 nodes) to download and preprocess all language files in parallel.

Slurm

First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and other configurations related to the job in download_mc4.yaml for mT5 models.

Example

To run only the data preparation pipeline and not the training, evaluation or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - data_preparation

Then enter:

python3 main.py

Base Command Platform

To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra.

By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the tokenizer model file in the same workspace for later use.

The data preparation code must be launched in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.

Download the dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace’s permissions.

To run the data preparation pipeline for mT5 models, enter:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py data_preparation=mt5/download_mc4 \
    stages=<data_preparation> \
    cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
    base_results_dir=/mount/results data_preparation.languages=\'cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,no,pl,pt,ro,ru,sk,sv,zh\' \
    data_preparation.run.node_array_size=20 data_preparation.run.workers_per_node=4 >> /results/data_mt5_log.txt 2>&1

The command above assumes that you want to prepare the mC4 dataset with 24 languages, and that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_mt5_log.txt, which you can download from NGC. Any other required configuration can be added to modify the command’s behavior.

The full dataset may not fit into BCP workspaces. NVIDIA recommends using a smaller subset of languages (e.g. cs,da,de,el,fr,hi has total size 1TB). Any other configuration can also be added to the command to modify its behavior.

Common

Set the data preparation job’s configuration for mT5 models in the YAML file:

run:
  name: download_mc4
  results_dir: ${base_results_dir}/${.name}
  time_limit: "24:00:00"
  dependency: "singleton"
  node_array_size: 20
  cpus_per_node: 256
  workers_per_node: 4  # Number of workers per node in preprocessing step.

dataset: mc4
download_mc4: True  # Whether to download the mC4 dataset from the internet.
preprocess_data: True  # True to preprocess the data from a json.gz file, False otherwise.
mc4_dir: ${data_dir}/mc4  # Path to (m)C4 dataset repo.
git_lfs_dir: ${.mc4_dir}/lfs  # Path to store git lfs files.
download_vocab_url: https://storage.googleapis.com/t5-data/vocabs/mc4.250000.100extra/sentencepiece.vocab  # URL to download the vocab from.
download_tokenizer_url: https://storage.googleapis.com/t5-data/vocabs/mc4.250000.100extra/sentencepiece.model  # URL to download tokenizer from
vocab_save_dir: ${.mc4_dir}/bpe
tokenizer_save_dir: ${.mc4_dir}/bpe
tokenizer_model: ${.tokenizer_save_dir}/mt5_tokenizer.model
languages: cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,no,pl,pt,ro,ru,sk,sv,zh  # Language list in mC4 dataset to download and preprocess. Use `all` to download and preprocess all languages or specify language list as `en,es,ko,zh,...`
use_cleaned_english: True  # Whether to use cleaned version of english data
softlinks_dir: ${.mc4_dir}/softlinks  # Path to languages soft links for preprocessing
preprocessed_dir: ${.mc4_dir}/preprocessed
max_split_size: 200  # (GB) Each split will be preprocessed individually. Tune this down to accommodate short wall time on clusters
download_worker_mapping: ${.mc4_dir}/download_mapping
preprocess_worker_mapping: ${.mc4_dir}/preprocess_mapping
rm_downloaded: False  # Script will not remove downloaded after preprocessing

Data Preparation for BERT Models

The data_preparation configuration in conf/config.yaml specifies which file to use for data preparation. The default value is download_bert_pile, which corresponds to the file conf/data_preparation/download_bert_pile.yaml. It is used to download, extract, and preprocess the Pile dataset for BERT models. Modify these configurations to control data preparation tasks and to specify where to store the datasets, vocabulary files, and so on.

To download a reduced portion of the dataset to run tests, you can set the file_numbers configuration to download only one of the shards by changing the value from "0-29" to "0". The value must be a combination of numbers separated by hyphens ("-") or commas (","). For example, file_numbers="0,3,5-7" downloads and prepares files 0, 3, 5, 6, and 7.

Slurm

First, ensure that the cluster configuration in conf/cluster/bcm.yaml is correct. Set the cluster and cluster_type configurations in conf/config.yaml to bcm. Then make any required changes to time_limit and other configurations related to the job in download_bert_pile.yaml for BERT models. You can parallelize data preparation by using up to 30 nodes to download all 30 files in parallel.

Example

To run only the data preparation pipeline and not the training, evaluation or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - data_preparation

Then enter:

python3 main.py

Base Command Platform

To run the data preparation script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This configuration can be overridden from the command line using hydra.

By default, the data preparation script downloads the data into the data/ directory. NVIDIA recommends that you set the data_dir configuration to a workspace, making the data visible across multiple jobs later on. Store the vocab and merge files in the same workspace for later use.

You must launch the data preparation code in a multi-node job. To speed up dataset preparation, you can parallelize it to use between 2 and 30 nodes.

You can download the 700+ GB dataset once and share it with multiple users in the same ACE by setting the nemo_megatron_data_ws workspace’s permissions.

To run the data preparation pipeline for BERT models, enter:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=<data_preparation> \
    cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_bert \
    base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
    data_preparation.vocab_save_dir=/mount/data/bpe data_preparation.merges_save_dir=/mount/data/bpe >> /results/data_bert_log.txt 2>&1

The command above assumes that you want to prepare the entire dataset (files 0-29), and that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_bert_log.txt, which you can download from NGC. You can add any other configuration required to modify the command’s behavior.

Common

Set the data preparation job’s configuration for BERT models in the YAML file:

run:
  name: download_bert_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2  # 2 should be safe to use and x2 times faster.

dataset: pile
download_the_pile: True  # Whether to download the pile dataset from the internet.
the_pile_url: "https://mystic.the-eye.eu/public/AI/pile/train/"  # Source URL to download The Pile dataset from.
file_numbers: "0-29"  # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True  # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt"  # URL to download the vocab from.
vocab_save_dir: ${data_dir}/bpe
tokenizer_type: BertWordPieceLowerCase
rm_downloaded: True  # Extract script will remove downloaded zst after extraction
rm_extracted: True  # Preprocess script will remove extracted files after preproc.

© Copyright 2023, NVIDIA. Last updated on Dec 6, 2023.