Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

NeMo Framework Foundation Model Pre-training

Project Description

Learning Goals

This module focuses on launching a foundation model pretraining job on your infrastructure and producing the necessary training artifacts from successful runs. The objectives of this playbook are:

  • To understand and execute the workflow of pretraining foundation models (GPT-3, T5, mT5, BERT) using NeMo Framework

  • To successfully produce the training artifacts such as checkpoints, logs, and event files using NeMo Framework

NeMo Tools and Resources

Data

  • For this module, you can use the Pile dataset to launch model training. You do not need the entire dataset; downloading a few shards is sufficient for this task. The instructions for downloading the dataset are explained in the Project Instructions and Milestones section.

  • The dataset should have .bin and .idx files as well as the relevant tokenizer files (vocab and merges files) to successfully initialize the training job. This is accomplished as part of the data download steps outlined in the Project Instructions and Milestones section.

Requirements

Software

  • Access to the latest NeMo Framework NGC containers

  • DGX SuperPOD SW stack: Base Command Manager, Slurm (with pyxis plugin)

Hardware

  • Minimum: 2 DGX nodes (A100 or H100). For reference, a 5B model takes 9.8 days to train on 8 DGX A100s. The goal of this playbook is to launch and monitor the jobs successfully and not necessarily to train the model to convergence. It is therefore advised to have at least 2 nodes to establish multi-node communication required for training large language models.

  • Recommended: 10-20 nodes on DGX SuperPod or DGX Cloud

Project Instructions and Milestones

Select a model type (GPT-3, T5, mT5, or BERT) and a model size (such as 5B or 20B, based on your available compute), then perform the following steps. You can select the smallest model, GPT-3 126M, for this playbook if you would like.

Step 1: Download the NeMo Framework Training Container and Prepare the Environment

The whole solution uses a set of Docker containers executed on a Slurm cluster (using the pyxis plug-in) or a Base Command Platform cluster. For this playbook, we will use a Slurm cluster and the NeMo Framework training container.

Go to NeMo Framework Training | NVIDIA NGC to get the latest NeMo Framework training container.

The NeMo Framework codebase is included as part of the training container. To download the container and copy its contents to a local directory on the cluster, run the following command, entering the path to the local directory based on your setup and always selecting the latest container tag:

srun -p <partition> -N 1 --container-mounts=/path/to/local/dir:/workspace/mount_dir --container-image=<container_tag> bash -c "cp -r /opt/NeMo-Framework-Launcher/launcher_scripts /opt/NeMo-Framework-Launcher/auto_configurator /workspace/mount_dir/"

After that, install the NeMo Framework script dependencies on the head node of the cluster:

pip install -r requirements.txt

Step 2: Setting up cluster configuration in the launcher scripts

  • If you are using a Slurm-based cluster, the config file located at conf/cluster/bcm.yaml holds the generic cluster-related parameters, such as the partition and account settings.

  • Go to /launcher_scripts/conf/cluster/bcm.yaml

  • Modify the cluster configuration to match your system. Set the partition, account, job_name_prefix, and other arguments for your cluster; a hedged sketch follows this list.
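
For reference, here is a minimal sketch of what conf/cluster/bcm.yaml can look like. The exact field names and defaults vary between launcher releases, so treat this as illustrative and check the file shipped with your container:

    partition: your_partition          # Slurm partition to submit jobs to
    account: your_account              # Slurm account used for job accounting
    exclusive: True                    # request exclusive access to the allocated nodes
    gpus_per_node: 8                   # GPUs available per node on this partition
    job_name_prefix: "nemo-megatron-"  # prefix prepended to every submitted job name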

Step 3: Changing config.yaml to run different stages of LLM training such as data preparation, training, fine-tuning, evaluation, etc.

  • After setting the cluster config, the first parameter that must be set is the launcher_scripts_path parameter inside the conf/config.yaml file. This parameter must point to the absolute path where the launcher_scripts directory is stored on the file system.

  • The launcher_scripts_path parameter will automatically be mounted to the container at the same path as in the local file system. The data_dir parameter can also be modified to point to where the dataset will be loaded from or saved. The base_results_dir can also be modified to point to where the results, checkpoints and logs will be stored. These last two parameters will be automatically mounted into the container. The parameters cluster and cluster_type must be set to bcm for all the tasks.

  • Within conf/config.yaml, set the defaults at the top of the file corresponding to the model of choice. For example, if you would like to perform training and checkpoint conversion to .nemo (which is required for all the subsequent tasks such as fine-tuning, PEFT, etc.), the figure below shows example defaults for a 5B GPT-3 model. Please change these for your specific model choice.

[Figure: example defaults in conf/config.yaml for a 5B GPT-3 model (config_sample_0.png)]
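
Putting the above together, a minimal sketch of the relevant parts of conf/config.yaml for a GPT-3 5B run might look like the following. Key names and layout can differ between launcher releases (for example, cluster may appear inside the defaults list), so treat this as illustrative:

    defaults:
      - training: gpt3/5b                           # training config for the chosen model/size
      - data_preparation: gpt3/download_gpt3_pile   # data preparation config (see Step 4)

    stages:                  # stages to run, in order (see Steps 4 and 5)
      - data_preparation
      - training

    cluster: bcm             # cluster config to use
    cluster_type: bcm        # must be bcm for all tasks in this playbook
    launcher_scripts_path: /path/to/local/dir/launcher_scripts   # absolute path on the cluster
    data_dir: /path/to/data                # where the dataset is downloaded to or loaded from
    base_results_dir: /path/to/results     # where results, checkpoints, and logs are stored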

Step 4: Data Preparation

  • Within conf/config.yaml, in the stages section, specify the model development stages that you would like to run. To run data_preparation, you will need to specify it as such in the “stages” section as shown here:

[Figure: stages section of conf/config.yaml with data_preparation enabled (config_sample_1.png)]
  • After specifying “data_preparation” as a “stage” in config.yaml, as well as gpt3/download_gpt3_pile in the “defaults” section of config.yaml, modify the data preparation parameters in the /conf/data_preparation/gpt3/download_gpt3_pile.yaml file (a hedged sketch of this file appears at the end of this step).

  • Go to /launcher_scripts and run the following:

    python3 main.py
    

Output:

To check whether this was successful, go to the data directory (data_dir) specified in config.yaml. Depending on how many shards of the Pile were downloaded, you should expect to see files such as 00.jsonl.zst, 01.jsonl.zst, and so on, as well as a bpe directory containing merges.txt and vocab.json.
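
The data preparation run above is controlled by /conf/data_preparation/gpt3/download_gpt3_pile.yaml, referenced earlier in this step. A minimal sketch of the fields you are most likely to adjust is shown below; the exact field names, and the URLs for the GPT-2 vocab and merges files, vary by release, so check the file shipped with your container:

    run:
      name: download_gpt3_pile
      results_dir: ${base_results_dir}/download_gpt3_pile   # where preparation logs are written
      node_array_size: 2          # number of nodes used to download and extract shards in parallel
    file_numbers: "0-1"           # download only a couple of Pile shards for this playbook
    download_the_pile: True       # download the raw .jsonl.zst shards
    preprocess_data: True         # convert the shards into the .bin/.idx format used for training
    download_vocab_url: <URL of the GPT-2 vocab.json>    # placeholder; use the URL in your config
    download_merges_url: <URL of the GPT-2 merges.txt>   # placeholder; use the URL in your config
    rm_downloaded: True           # remove downloaded archives after extraction to save space
    rm_extracted: True            # remove extracted .jsonl files after preprocessing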

Step 5: Submitting a training job successfully

  • To initiate model training, go to conf/config.yaml, and within the stages section, uncomment “training” as shown in the screenshot below. You can add more stages to this section as your project demands. Each stage will be executed once the prior stage has successfully concluded.

[Figure: stages section of conf/config.yaml with training enabled (config_sample_2.png)]
  • After setting the stage, go to /conf/training, then to the directory for the model of your choice, and open the config file corresponding to the model size you are interested in. Make sure the same model and size are specified in the “defaults” section of config.yaml. For example, for the GPT-3 5B model, open /conf/training/gpt3/5b.yaml and modify it based on your project (a hedged sketch of commonly adjusted fields follows this step).

  • For each stage specified here, go to the /conf directory, open the subdirectory corresponding to that stage (for example, training and data_preparation in this case), and modify the relevant config files.

  • After setting the appropriate training YAML file, go to /launcher_scripts and run the following:

    python3 main.py
    

The expected output from this stage is explained in the next step.
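
As referenced in the bullets above, here is a minimal sketch of the kinds of fields that are commonly adjusted in /conf/training/gpt3/5b.yaml. Names and default values differ by model, size, and launcher release, so treat this as an illustration rather than a drop-in file:

    run:
      name: gpt3_5b
      results_dir: ${base_results_dir}/gpt3_5b   # logs, event files, and checkpoints land here
      time_limit: "1-00:00:00"                   # Slurm wall-clock limit per job
    trainer:
      num_nodes: 2         # match the number of DGX nodes you reserved
      devices: 8           # GPUs per node
      max_steps: 1000      # shorten for a smoke test; increase for a longer run
      precision: bf16      # bf16 is typical on A100/H100
    model:
      micro_batch_size: 4
      global_batch_size: 2048                # keep unchanged when resuming from a checkpoint
      tensor_model_parallel_size: 1          # must match any checkpoint you later restore from
      pipeline_model_parallel_size: 1        # must match any checkpoint you later restore from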

Step 6: Checking errors, log files and training artifacts

  • Once the training job is submitted, check for a successful run by monitoring the log files, error files, and checkpoints, all of which can be found in the base_results_dir specified in /conf/config.yaml.

  • For this sample run, we launched a GPT-3 5B parameter model. Accordingly, a gpt3_5b folder is created within the results folder. It contains the log and error files, which are identifiable by their file names.

  • Within gpt3_5b, go to /results/ to locate the TensorBoard event files, hyperparameter records, and logs for the training runs.

  • Within the gpt3_5b folder, go to /results/checkpoints/ to locate the checkpoints in the .ckpt format.

Step 7: Restoring a checkpoint and relaunching a training run

  • To launch a training run from an existing checkpoint, make sure the tensor parallel and pipeline parallel hyperparameters match the values used to produce that checkpoint. Also keep the global batch size unchanged to preserve the same training behavior.

  • To resume pretraining from an existing checkpoint, go to /conf/training and open the folder for your model of choice. In our example, we go to /gpt3/ and open the 5b.yaml file.

  • Inside 5b.yaml, look for the model section and specify the location of the checkpoint you would like to restore from, as shown in this figure (the hyperparameters shown may not correspond to the model of your choice); a hedged sketch follows this list.

[Figure: model section of the training config with the checkpoint restore location set (config_sample_3.png)]
  • After specifying the restore location, go back to /launcher_scripts and run python3 main.py to relaunch the training job.
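
As noted above, here is a minimal sketch of the model section of 5b.yaml with a restore path set, assuming the config exposes a resume_from_checkpoint field (the key name may differ in your release, so confirm it in your training config):

    model:
      # Path to a .ckpt file produced in Step 6. The parallelism settings and global
      # batch size must match the values used when this checkpoint was created.
      resume_from_checkpoint: ${base_results_dir}/gpt3_5b/results/checkpoints/<checkpoint_name>.ckpt
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      global_batch_size: 2048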

Project Deliverables

  • Submit a screenshot of the results folder showing the checkpoints, logs, error files, and TensorBoard event files.

  • Submit a screenshot of the loss curves as shown in the TensorBoard or Weights & Biases (W&B) summaries, based on your preference.

Project Evaluation

This project will be evaluated based on the following criteria:

  • Ability to successfully launch a chosen model type and model size on your cluster and get the training artifacts.

  • Ability to explain the workflow for all four model types: GPT-3, T5, mT5, and BERT

  • Ability to modify the config files appropriately based on the requirements of specific models

  • Ability to explain all the hyperparameters listed in all the config files that are modified to launch the different stages of LLM development

  • Ability to save and restore checkpoints, i.e., relaunch a training session from an existing checkpoint

  • Ability to explain performance of a training run based on the metrics shown in the log files

  • Ability to explain the debugging process in case of any errors

  • Ability to change the cluster configuration for different infrastructure settings