Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
NeMo Framework Foundation Model Pre-training
Project Description
Learning Goals
This module focuses on successfully launching a foundation model pretraining job on your infrastructure and obtaining the necessary training artifacts as the output of successful runs. The objectives of this playbook are:
To understand and execute the workflow of pretraining foundation models (GPT-3, T5, mT5, BERT) using NeMo Framework
To successfully produce the training artifacts such as checkpoints, logs, and event files using NeMo Framework
NeMo Tools and Resources
Data
For this module, we can use the Pile dataset to launch model training. You do not need the entire dataset; downloading a few shards is sufficient for this task. The instructions for downloading the dataset are explained in the Project Instructions and Milestones section.
The dataset should have .bin and .idx files as well as the relevant tokenizer files (vocab and merge files) to successfully initialize the training job. This is accomplished as part of the data download steps outlined in the Project Instructions and Milestones section.
Requirements
Software
Access to the latest NeMo Framework NGC containers
DGX SuperPOD SW stack: Base Command Manager, Slurm (with pyxis plugin)
Hardware
Minimum: 2 DGX nodes (A100 or H100). For reference, a 5B model takes 9.8 days to train on 8 DGX A100s. The goal of this playbook is to launch and monitor the jobs successfully and not necessarily to train the model to convergence. It is therefore advised to have at least 2 nodes to establish multi-node communication required for training large language models.
Recommended: 10-20 nodes on DGX SuperPOD or DGX Cloud
Project Instructions and Milestones
Select a model type (GPT-3, T5, mT5, or BERT) and a model size (such as 5B or 20B, based on your available compute) and proceed through the following steps. You can select the smallest model, GPT-3 126M, for this playbook if you would like.
Step 1: Download the NeMo Framework Training Container and Prepare the Environment
The whole solution uses a set of Docker containers executed on a Slurm cluster (using the pyxis plug-in) or a Base Command Platform cluster. For this playbook, we will use a Slurm cluster and the NeMo Framework training container.
Go to NeMo Framework Training | NVIDIA NGC to get the latest NeMo Framework training container.
The NeMo Framework codebase is included as part of the training container. To download the container and copy its contents to a local directory on the cluster, execute the following command (enter the path to the local directory based on your setup and always select the latest container tag):
srun -p <partition> -N 1 --container-mounts=/path/to/local/dir:/workspace/mount_dir --container-image=<container_tag> bash -c "cp -r /opt/NeMo-Framework-Launcher/launcher_scripts /opt/NeMo-Framework-Launcher/auto_configurator /workspace/mount_dir/"
After that, install the NeMo Framework scripts' dependencies on the head node of the cluster:
pip install -r requirements.txt
Step 2: Setting up cluster configuration in the launcher scripts
If using a Slurm-based cluster, the config file located at conf/cluster/bcm.yaml holds the generic cluster-related parameters, such as the partition or account. Go to /launcher_scripts/conf/cluster/bcm.yaml and modify the cluster configuration based on your system: set the partition, account, job_name_prefix, and other arguments for your cluster.
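As a reference, a minimal sketch of bcm.yaml is shown below; the key names and defaults are assumptions based on recent launcher releases, and the values are placeholders to replace with your own cluster settings.

partition: batch                   # Slurm partition to submit jobs to (placeholder)
account: my_account                # Slurm account (placeholder)
exclusive: True                    # request exclusive node access
gpus_per_node: 8
job_name_prefix: "nemo-playbook:"  # prefix added to every submitted job name (placeholder)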
Step 3: Changing config.yaml to run different stages of LLM training such as data preparation, training, fine-tuning, evaluation, etc.
After setting the cluster configuration, the first parameter that must be set is the launcher_scripts_path parameter inside the conf/config.yaml file. This parameter must point to the absolute path where the launcher_scripts directory is stored on the file system. The launcher_scripts_path directory is automatically mounted into the container at the same path as on the local file system. The data_dir parameter can also be modified to point to where the dataset will be loaded from or saved, and the base_results_dir parameter to where the results, checkpoints, and logs will be stored; these last two directories are also mounted into the container automatically. The cluster and cluster_type parameters must be set to bcm for all the tasks.

Within conf/config.yaml, set the defaults at the top of the file to correspond to the model of your choice. For example, if you would like to perform training and checkpoint conversion to .nemo (which is required for all subsequent tasks such as fine-tuning and PEFT), the sketch below shows possible defaults for the GPT-3 5B model. Please change these for your specific model choice.
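This sketch is illustrative only; the exact entries in the defaults list and their names depend on the launcher release shipped in your container, so verify them against the file before editing.

defaults:
  - _self_
  - cluster: bcm                               # uses conf/cluster/bcm.yaml
  - data_preparation: gpt3/download_gpt3_pile
  - training: gpt3/5b
  - conversion: gpt3/convert_gpt3              # checkpoint conversion to .nemo

launcher_scripts_path: /path/to/local/dir/launcher_scripts   # absolute path on the cluster file system
data_dir: ${launcher_scripts_path}/data
base_results_dir: ${launcher_scripts_path}/results
cluster_type: bcm                              # must be bcm for this playbook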
Step 4: Data Preparation
Within conf/config.yaml, in the stages section, specify the model development stages that you would like to run. To run data preparation, include data_preparation in the stages list, as in the sketch below.
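The snippet below is a minimal sketch of such a stages list; the commented-out entries are placeholders for stages you may enable later, and their availability depends on your launcher version.

stages:
  - data_preparation
  # - training
  # - conversion
  # - evaluation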
After specifying data_preparation in the stages section of config.yaml, as well as gpt3/download_gpt3_pile in the defaults section of config.yaml, modify the data preparation parameters in the /conf/data_preparation/gpt3/download_gpt3_pile.yaml file; a sketch of these parameters follows below.
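The sketch below illustrates the kind of parameters typically exposed in download_gpt3_pile.yaml; the key names and URLs are assumptions based on recent launcher releases and may differ in yours, so treat them as placeholders.

file_numbers: "0-1"              # download only a couple of Pile shards for this playbook
download_the_pile: True
preprocess_data: True            # produce the .bin/.idx files expected by training
download_vocab_url: https://huggingface.co/gpt2/resolve/main/vocab.json
download_merges_url: https://huggingface.co/gpt2/resolve/main/merges.txt
vocab_save_dir: ${data_dir}/bpe
merges_save_dir: ${data_dir}/bpe
rm_downloaded: False             # keep the downloaded archives
rm_extracted: False              # keep the extracted .jsonl files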
After setting the data preparation parameters, go to /launcher_scripts and run the following:
python3 main.py
Output:
To check whether this was successful, head over to the data directory (data_dir) specified in config.yaml. Depending on how many shards of the Pile were downloaded, you should expect to see files such as 00.jsonl.zst, 01.jsonl.zst, and so on, as well as a bpe directory containing merges.txt and vocab.json. If preprocessing is enabled, the corresponding .bin and .idx files will also appear here.
Step 5: Submitting a training job successfully
To initiate model training, go to conf/config.yaml, and within the stages section, uncomment training, as in the sketch below. You can add more stages to this section as your project demands; each stage will be executed once the prior stage has successfully concluded.
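A minimal sketch of a stages list that chains data preparation and training is shown below; stage names are assumed to match those shipped with your launcher release, and the stages execute in order.

stages:
  - data_preparation   # can be removed if the data has already been prepared
  - training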
After setting the stage, go to the /conf/training directory, open the folder for your model of choice, and navigate to the config file corresponding to the model size you are interested in. Make sure the same file is specified in the defaults section of the config.yaml file. For example, for the GPT-3 5B model, proceed to /conf/training/gpt3/5b.yaml and modify the YAML file based on your project. More generally, for each stage specified in config.yaml, go to the /conf directory, select the subdirectory corresponding to that stage (for example, training and data_preparation in this case), and modify the corresponding config files.
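For orientation, the sketch below shows the kind of fields typically adjusted in 5b.yaml; the values are placeholders sized for a short 2-node run, and the exact field names may differ between launcher releases.

run:
  name: gpt3_5b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "1-00:00:00"            # adjust to your cluster limits

trainer:
  num_nodes: 2                        # match the nodes available to you
  devices: 8                          # GPUs per node
  precision: bf16
  max_steps: 1000                     # short run; convergence is not the goal here

model:
  micro_batch_size: 4
  global_batch_size: 2048
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1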
After setting the appropriate training YAML file, go to /launcher_scripts and run the following:
python3 main.py
The expected output from this stage is explained in the next step.
Step 6: Checking errors, log files and training artifacts
Once the training job is successfully submitted, check for a successful run by monitoring the log files, error files, and checkpoints, all of which can be found in the base_results_dir specified in /conf/config.yaml.

For this sample run, we launched a GPT-3 5B parameter model. Accordingly, within the results folder, you will find a gpt3_5b folder. It contains the log files and error files, which are identifiable by their file names. Within gpt3_5b, go to /results/ to locate the TensorBoard event files, hyperparameter records, and logs for the training runs. Within the same gpt3_5b folder, go to /results/checkpoints/ to locate the checkpoints in the .ckpt format.
Step 7: Restoring a checkpoint and relaunching a training run
To launch a training run from an already existing checkpoint, make sure that the tensor parallel and pipeline parallel hyperparameters are the same as what was set to get this checkpoint. Also make sure that the global batch size remains unchanged to preserve the same training behavior.
To resume pretraining from an already existing checkpoint, go to /conf/training and open the folder for your model of choice. In our example, we go to /gpt3/ and open the 5b.yaml file. Inside 5b.yaml, look for the model section and specify the location of the checkpoint that you would like to restore from, as in the sketch below (the hyperparameters shown may not correspond to the model of your choice).
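A minimal sketch of the relevant model entries is shown below; the resume_from_checkpoint key name and the path layout are assumptions based on recent NeMo training configs, so verify them against your 5b.yaml.

model:
  # path to a .ckpt produced by the earlier run (placeholder path)
  resume_from_checkpoint: /path/to/base_results_dir/gpt3_5b/results/checkpoints/<checkpoint_name>.ckpt
  tensor_model_parallel_size: 1       # must match the run that produced the checkpoint
  pipeline_model_parallel_size: 1     # must match the run that produced the checkpoint
  global_batch_size: 2048             # keep unchanged to preserve training behavior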
After specifying the restore location, go back to /launcher_scripts and run python3 main.py to relaunch the training job.
Project Deliverables
Submit a screenshot of the results folder showing the checkpoints, logs, and error files, as well as the TensorBoard event files.
Submit a screenshot of the loss curves as shown in the TensorBoard summaries or the Weights & Biases (W&B) summaries, based on your preference.
Project Evaluation
This project will be evaluated based on the following criteria:
Ability to successfully launch a chosen model type and model size on your cluster and get the training artifacts.
Ability to explain the workflow for all 4 model types: GPT-3, T5, mT5, and BERT
Ability to modify the config files appropriately based on the requirements of specific models
Ability to explain all the hyperparameters listed in all the config files that are modified to launch the different stages of LLM development
Ability to save and restore checkpoints, i.e., relaunch a training session from an existing checkpoint
Ability to explain performance of a training run based on the metrics shown in the log files
Ability to explain the debugging process in case of any errors
Ability to change the cluster configuration for different infrastructure settings