Step #2: Download and Preprocess the Data

For this lab, all data is downloaded and processed in the workspace that was created in Step 1. Optionally, the data can be saved as a dataset object on Base Command Platform (BCP) and linked from there. For additional background on using datasets, refer to the BCP documentation on working with datasets.

Use BigNLP to Download and Prepare the Pile Dataset

The Pile consists of 22 smaller datasets. The download and preprocessing job can be launched from either the Base Command Platform web UI or the NGC batch CLI; refer to the running jobs documentation for details.

Regardless of whether the web UI or the CLI is used, a job submission requires at least the following specifications: name, org, team, ace, instance, result, image, total-runtime, and commandline. Optional specifications include workspace mounts, dataset mounts, and more. Multi-node jobs require two additional options: replicas and array-type. Refer to the online documentation for a comprehensive list of options, or use the CLI help:

ngc batch run --help
usage: ngc batch run [--ace <name>] [--array-type <type>] [--coscheduling]
           [--datasetid <id>] [--debug] [--entrypoint <entry>]
           [--format_type <fmt>] [--min-availability <num>]
           [--min-timeslice <t>] [--network <type>] [--org <name>]
           [--preempt <class>] [--replicas <num>] [--result <mntpt>]
           [--shell [CMD]] [--start-deadline <t>] [--team <name>]
           [--topology-constraint <specifier>] [--total-runtime <t>]
           [--use-image-entrypoint] [--waitend] [--waitrun] [-c <c>]
           [-d <desc>] [-f <file>] [-h] [-i <url>] [-in <type>] [-n <name>]
           [-p <port>] [-w <wkspce>]
. . .
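
The required specifications listed above can be checked before a job is submitted. The following is a minimal sketch, not part of the NGC CLI: missing_specs is a hypothetical helper that only recognizes the long option names (for example --name, not -n) and reports which required ones are absent from a planned argument list.

```shell
# Required long-form options, taken from the list of minimum specifications above.
REQUIRED_SPECS="name org team ace instance result image total-runtime commandline"

# missing_specs (hypothetical helper): prints the required options that are
# absent from the given arguments and returns non-zero if any are missing.
missing_specs() {
  missing=""
  for spec in $REQUIRED_SPECS; do
    found=0
    for arg in "$@"; do
      case "$arg" in
        "--$spec") found=1 ;;
      esac
    done
    [ "$found" -eq 1 ] || missing="$missing --$spec"
  done
  # Drop the leading space before printing the list.
  echo "${missing# }"
  [ -z "$missing" ]
}
```

Calling missing_specs with the same arguments intended for ngc batch run prints nothing when every required option is present. Short options such as -n or -i are reported as missing by this sketch, so it is only useful when the long forms are used consistently.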

The processing logic of a job is specified through commandline, that is, the command that launches the application. NeMo Megatron scripts are configured through Hydra, a Python framework for managing the configuration of complex applications. The following example shows the NeMo Megatron CLI with configuration options set to only download and preprocess the data to be used for training.

mkdir -p /mount_workspace/dataprep && \
python3 /opt/bignlp/bignlp-scripts/main.py \
  run_data_preparation=True \
  run_training=False \
  run_conversion=False \
  run_evaluation=False \
  cluster_type=bcp \
  bignlp_path=/opt/bignlp/bignlp-scripts \
  data_dir=/mount_workspace/dataprep/data/the_pile \
  base_results_dir=/mount_workspace/dataprep/results \
  data_preparation.file_numbers='0-29' \
  data_preparation.vocab_save_dir=/mount_workspace/dataprep/data/bpe \
  data_preparation.merges_save_dir=/mount_workspace/dataprep/data/bpe

The above command needs to be submitted using the NGC batch CLI. Alternatively, the commands can be placed in a script that is uploaded to a mounted workspace, in which case commandline becomes just the invocation of that script from the mounted workspace. Throughout this document, the NGC batch CLI approach is used. The following is the full NGC command using a workspace named jdoe_workspace. Unlike the example above, it writes to the data and results directories at the top level of the workspace rather than under a dataprep subdirectory. Job output is automatically stored in the results/data_preparation directory in the workspace, and the log output is also available in the web UI. Run the following command to download and prepare the dataset, replacing the org, team, ace, image, and workspace values to match your environment.

ngc batch run --name "bignlp-dataprep-create-dataset" \
  --org nv-launchpad-bc \
  --team nvbc-bignlp \
  --ace nv-launchpad-bc-iad1 \
  --instance dgxa100.80g.8.norm \
  --result /results \
  --array-type "PYTORCH" \
  --replicas "2" \
  --image "nvcr.io/nv-launchpad-bc/bignlp-training:22.02-py3" \
  --workspace jdoe_workspace:/mount_workspace:RW \
  --total-runtime 6h \
  --commandline "\
set -x && \
python3 /opt/bignlp/bignlp-scripts/main.py \
cluster_type=bcp \
run_data_preparation=True \
run_training=False \
run_conversion=False \
run_evaluation=False \
bignlp_path=/opt/bignlp/bignlp-scripts \
data_dir=/mount_workspace/data/the_pile \
base_results_dir=/mount_workspace/results \
data_preparation.file_numbers='0-29' \
data_preparation.vocab_save_dir=/mount_workspace/data/bpe \
data_preparation.merges_save_dir=/mount_workspace/data/bpe"
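
The script-based alternative mentioned earlier can be sketched as follows. This is an illustration under stated assumptions, not a script shipped with BigNLP: the file name run_dataprep.sh, the function names, and the WORKSPACE and BIGNLP_PATH variables are hypothetical, while the configuration values are copied from the command above.

```shell
#!/bin/sh
# run_dataprep.sh (hypothetical name): wraps the data preparation command so
# that commandline only has to invoke this script from the mounted workspace.

# Assumption: same mount point and scripts path as in the command above.
WORKSPACE="${WORKSPACE:-/mount_workspace}"
BIGNLP_PATH="${BIGNLP_PATH:-/opt/bignlp/bignlp-scripts}"

# Print the Hydra overrides, one per line.
dataprep_args() {
  cat <<EOF
cluster_type=bcp
run_data_preparation=True
run_training=False
run_conversion=False
run_evaluation=False
bignlp_path=$BIGNLP_PATH
data_dir=$WORKSPACE/data/the_pile
base_results_dir=$WORKSPACE/results
data_preparation.file_numbers=0-29
data_preparation.vocab_save_dir=$WORKSPACE/data/bpe
data_preparation.merges_save_dir=$WORKSPACE/data/bpe
EOF
}

# Launch main.py with the overrides; word splitting of the list is intentional
# because none of the values contain spaces.
run_dataprep() {
  set -x
  python3 "$BIGNLP_PATH/main.py" $(dataprep_args)
}

# Only launch when explicitly requested, so sourcing the file has no side effects.
if [ "${1:-}" = "run" ]; then
  run_dataprep
fi
```

Assuming the script is uploaded to the top level of the workspace, commandline would reduce to something like: sh /mount_workspace/run_dataprep.sh run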

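The job takes about 5.5 hours on two nodes, which fits within the 6h total-runtime set in the command above. Preprocessing speed scales with the number of nodes: doubling the node count can roughly double the processing speed.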
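
As a rough planning aid, the scaling statement above can be turned into a small estimate. This sketch assumes the 5.5 hour figure for two nodes and ideal linear scaling; estimate_hours is a hypothetical helper, and real run times also depend on factors such as download bandwidth and storage throughput, so the result is best treated as an optimistic lower bound when choosing total-runtime.

```shell
# Assumption: 5.5 hours on 2 nodes (from the text above) and ideal linear scaling.
# estimate_hours (hypothetical helper): prints the estimated hours for <nodes>.
estimate_hours() {
  # Accept only positive integers.
  case "$1" in
    ''|*[!0-9]*) echo "usage: estimate_hours <nodes>" >&2; return 1 ;;
  esac
  [ "$1" -gt 0 ] || { echo "usage: estimate_hours <nodes>" >&2; return 1; }
  # 5.5 h * 2 nodes = 11 node-hours of work, divided across the nodes.
  awk -v nodes="$1" 'BEGIN { printf "%.2f\n", 5.5 * 2 / nodes }'
}

# Example: estimate_hours 4   # prints 2.75
```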

The directory structure of data inside the workspace should look like this (abbreviated files list in the_pile subdirectory).

└── [4.0K]  data
    ├── [4.0K]  bpe
    │   ├── [446K]  merges.txt
    │   └── [1018K]  vocab.json
    └── [ 12K]  the_pile
        ├── [ 23G]  my-gpt3_00_text_document.bin
        ├── [134M]  my-gpt3_00_text_document.idx
        ├── [ 23G]  my-gpt3_01_text_document.bin
        ├── [134M]  my-gpt3_01_text_document.idx
        ├── .
        ├── .
        ├── .
        ├── [ 23G]  my-gpt3_28_text_document.bin
        ├── [134M]  my-gpt3_28_text_document.idx
        ├── [ 23G]  my-gpt3_29_text_document.bin
        └── [134M]  my-gpt3_29_text_document.idx

There should be 30 bin files and 30 idx files in the the_pile subdirectory. Each bin file is approximately 23 GB and each idx file is approximately 134 MB.
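
The file count described above can be verified with a short check. This is a minimal sketch: check_pile_files is a hypothetical helper, and the file name pattern is taken from the directory listing above. It only confirms that each expected file exists and does not check file sizes.

```shell
# check_pile_files (hypothetical helper): verifies that the bin and idx files
# for file numbers 00 through 29 exist in the given directory.
check_pile_files() {
  dir="$1"
  missing=0
  i=0
  while [ "$i" -le 29 ]; do
    n=$(printf '%02d' "$i")   # zero-pad to two digits, e.g. 7 -> 07
    for ext in bin idx; do
      f="$dir/my-gpt3_${n}_text_document.$ext"
      if [ ! -f "$f" ]; then
        echo "missing: $f"
        missing=$((missing + 1))
      fi
    done
    i=$((i + 1))
  done
  echo "missing files: $missing"
  # Succeed only when all 60 files are present.
  [ "$missing" -eq 0 ]
}

# Example, assuming the workspace is mounted at /mount_workspace:
#   check_pile_files /mount_workspace/data/the_pile
```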

Once the job is complete, the results can be viewed in the web UI. Information about the results can also be retrieved using the NGC CLI with the following command (replace <JOB ID> with the ID of the job whose results you want to inspect):

ngc result info \
  --org nv-launchpad-bc \
  --team nvbc-bignlp \
  --ace nv-launchpad-bc-iad1 \
  <JOB ID>

This command reports high-level information about the results, including their size, the total number of files, and when they were created.