Run NeMo Framework on DGX Cloud

DGX Cloud is a distributed, serverless platform for training and deploying AI models. Developers interact with Base Command Platform (BCP), which runs in front of DGX Cloud and provides a unified platform for managing code, datasets, and results, while also making it easy to monitor job status and performance.

NeMo Framework includes native support for BCP, allowing developers to quickly get started training and fine-tuning Large Language Models (LLMs). Before starting a new project, a few preparation steps need to be completed. This document serves as a guide for preparing a dataset and pre-training a foundation model with NeMo Framework on DGX Cloud.

In order to use the training container, you will need access to the NeMo Framework General Availability (GA). This requires registering for the GA at this link. You will receive an email confirmation once your application is approved.

Additionally, you will need access to an ACE in DGX Cloud with at least 4 nodes. You will also need at least 1.5TB of storage quota, the ability to push images to the org’s private container registry, permission to create workspaces, and a local computer with the latest version of the ngc cli and Docker installed.

Training a base LLM has three main stages: setup, data preparation, and model pre-training. The following sections walk through each step.

Setup

First, identify the org ID for the ACE you want to use. This will be unique to each cluster. The easiest way to find the org ID is to sign in to NGC and select the specified org. Then, start to create a new job on the job creation page and scroll down to the bottom of the page (ignore all of the fields for now) to see the sample CLI command. This command shows your org ID as in the image below (highlighted in the red box).

[Figure: Example Org ID on BCP]

Make a note of your org ID as this will be referenced throughout the remainder of the document as <ORG ID>.

Pushing the container

The NeMo Framework training container is available only to GA customers and is not included in new ACEs by default. The container needs to be pulled from the NeMo Framework private GA registry, tagged with the org ID, then pushed to the org’s private registry for the ACE to access it. The following steps walk through this process and should be run on the local computer with the ngc cli and Docker installed.

  1. An NGC authentication key is required and can be generated in the web UI. To create a private key, log in to NGC at ngc.nvidia.com in a web browser. After signing in, click your account name in the top-right corner of the page, then click the “Setup” button. On the Setup page, click the “Get API Key” button, then click “Generate API Key” in the top right to create a new API key. Note that this key is only shown once, so it is highly recommended to store it in a safe location; the same key can be reused on multiple devices.

  2. Authenticate with nvcr.io via Docker (the username must be exactly $oauthtoken; do not replace it with your own username):

    $ docker login nvcr.io
    Username: $oauthtoken
    Password: <Insert private NGC key here>

  3. Pull the latest training container (check for the latest available image tag and replace <tag> with it for all of the following commands):

    $ docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-training:<tag>

  4. Tag the training container with your <ORG ID> (replace <ORG ID> with your org ID found earlier):

    $ docker tag nvcr.io/ea-bignlp/ga-participants/nemofw-training:<tag> nvcr.io/<ORG ID>/nemofw-training:<tag>

  5. Push the container to the org’s private registry:

    $ docker push nvcr.io/<ORG ID>/nemofw-training:<tag>

You should now see this container in the org’s private registry in the web UI at https://registry.ngc.nvidia.com/containers.

[Figure: Container after being pushed to ACE]
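
If you prefer the command line, the image may also be visible via the ngc cli. This is a minimal sketch only; it assumes the registry image subcommand is available in your installed CLI version and that the CLI is configured with your API key and org:

# The pushed nemofw-training image should appear in the listing
ngc registry image list --org <ORG ID>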

Creating a workspace

A workspace is needed to store dynamic code and intermediate data. NeMo Framework requires the launcher repository to be copied to a workspace and mounted during jobs. The workspace will contain the launching code, model configurations, datasets, and checkpoints.

To create a workspace, run the following command from a local computer with the ngc cli installed (the key values are explained below):

ngc workspace create --name nemo-framework \
    --org <ORG ID> \
    --team team-name \
    --ace nv-dgxc-ace1

  • nemo-framework: This is the name of the workspace. You are welcome to replace the workspace name here but nemo-framework will be used for the workspace name for the remainder of the document.

  • <ORG ID>: Replace with your org ID captured earlier.

  • team-name: If using a team within your org, specify the team name here or no-team if not using a team.

  • nv-dgxc-ace1: Replace with the name of the ACE you will be running on.

To verify the workspace has been created successfully, run ngc workspace list | grep nemo-framework (replacing the workspace name as needed); the output should include the newly created workspace.

Mounting the workspace locally

Once the workspace has been created, you can mount it locally on your machine. This is required to update config files and view checkpoints. To mount the workspace, run the following command:

ngc workspace mount --org <ORG ID> \
    --team team-name \
    --ace nv-dgxc-ace1 \
    --mode RW \
    nemo-framework \
    local-mount-directory

  • local-mount-directory: Replace this with the path to mount the workspace locally on your computer. The workspace will be mounted in this directory.

After mounting the workspace, cd into the directory and view the contents. At this point it should be empty since no files have been written to it yet.
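
For example, assuming the workspace was mounted at local-mount-directory as shown above:

cd local-mount-directory
ls -la    # should show an empty directory until jobs write data to the workspace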

Data Preparation

Pre-training a GPT-3 model requires a text-based dataset to be downloaded and pre-processed so that NeMo Framework can ingest the data optimally. The Pile is often used as the dataset for pre-training models. NeMo Framework contains helper scripts to download and preprocess the dataset. The following steps outline how to download and pre-process the dataset on DGX Cloud, with an explanation of key points after:

ngc batch run \
    --name "gpt3-dataprep-create-dataset" \
    --org <ORG ID> \
    --team team-name \
    --ace nv-dgxc-ace1 \
    --instance dgxa100.80g.8.norm \
    --image "nvcr.io/<ORG ID>/nemofw-training:<tag>" \
    --result /results \
    --workspace nemo-framework:/mount_workspace:RW \
    --total-runtime 10h \
    --array-type "PYTORCH" \
    --replicas "2" \
    --commandline "\
set -x && \
mkdir -p /mount_workspace/data/bpe && \
wget https://huggingface.co/gpt2/resolve/main/vocab.json -O /mount_workspace/data/bpe/vocab.json && \
wget https://huggingface.co/gpt2/resolve/main/merges.txt -O /mount_workspace/data/bpe/merges.txt && \
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py \
    cluster_type=bcp \
    stages=[data_preparation] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
    data_dir=/mount_workspace/data \
    base_results_dir=/mount_workspace/results \
    data_preparation.run.node_array_size=\${NGC_ARRAY_SIZE} \
    data_preparation.the_pile_url=https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/ \
    data_preparation.file_numbers='0-29' \
    data_preparation.rm_downloaded=True \
    data_preparation.rm_extracted=True \
    data_preparation.tokenizer_type=GPT2BPETokenizer"

  • ngc batch run: This command indicates a job should be run on DGX Cloud with the specific parameters/command as indicated by the flags. After the command is submitted, job status will be shown in the terminal and the job will be entered into the queue.

  • Instance: This indicates which type of node(s) should be allocated for the particular job. In this case, the dgxa100.80g.8.norm instance was chosen. This is a DGX A100 system with 8x 80GB A100 GPUs. At present, this is the only option for multi-node jobs on DGX Cloud. For single-node jobs, fewer GPU instances can be allocated if desired.

  • Image: This is the container image that was pushed earlier in the guide. This must match the full image name and tag that was pushed earlier and this image must exist in the private registry for the org.

  • Workspace: This is the name of the workspace that was created earlier and includes the location to mount the workspace (/mount_workspace in this case) inside the container for the job. This workspace will be used throughout the jobs in this guide so all code, datasets, and checkpoints will be consistent.

  • Array Type: This indicates how multi-node jobs should be launched. In this case, PYTORCH is chosen since NeMo Framework expects jobs to be launched with PyTorch distributed modules. For multi-node jobs that don’t use PyTorch distributed, the MPI array type is also available.

  • Replicas: This is the number of nodes to run on. In this case, the job will run on 2 nodes. If more nodes are available, this number can be increased to speed up the data preparation process. For example, if 4 nodes are available in the cluster, replace this value with “4” and the preprocessing step will run roughly twice as fast.

After executing the command above, a job will be submitted to the queue in your ACE and will start once resources become available. Once resources are allocated, the job will launch on the desired number of nodes, each running the specified NeMo Framework container. The Pile dataset will be downloaded from the internet and preprocessed with the GPT2 BPE tokenizer from HuggingFace, with the work divided across the nodes. The dataset will be saved to /mount_workspace/data inside the container, which maps to data/ inside the workspace that was created.

For two nodes, the job will typically take around 8 hours to complete. As mentioned above, if additional nodes were requested while launching the job, this time will be shorter.

Progress can be viewed in the NGC web UI in the Jobs tab from the left menu of the Base Command section. The job will likely be one of the first ones listed. Once the job has a status of Finished Success, you may proceed to training the model.
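
If you prefer the command line, job status can also be checked with the ngc cli. The following is a minimal sketch, assuming the batch subcommands shown are available in your installed CLI version; replace <job id> with the ID reported when the job was submitted:

# List recent jobs in the org (the data preparation job should appear near the top)
ngc batch list --org <ORG ID>

# Show detailed status for a specific job
ngc batch info <job id> --org <ORG ID>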

To verify the data was downloaded and preprocessed correctly, run du -sh data/* inside your locally mounted workspace. The output should include 30 .bin files with names similar to my-gpt3_nn_text_document.bin (where nn is a 2-digit number ranging from 00 to 29), each around 24GB in size, as well as a corresponding .idx file for each name at about 134MB.
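
As an additional quick check, the following commands (run from inside the locally mounted workspace, and assuming the file names follow the my-gpt3_nn_text_document pattern described above) count the preprocessed shards; each should report 30:

ls data/my-gpt3_*_text_document.bin | wc -l    # expect 30 .bin files of roughly 24GB each
ls data/my-gpt3_*_text_document.idx | wc -l    # expect 30 .idx files of roughly 134MB each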

Model Training

Once the data has been prepared, it is time to train a model. NeMo Framework contains many predefined configuration files for various models, including the 5 billion parameter GPT-3 model. This section demonstrates how to initiate training of a 5B model on a variable number of nodes:

ngc batch run \
    --name "gpt3-training-5b-bf16" \
    --org <ORG ID> \
    --team team-name \
    --ace nv-dgxc-ace1 \
    --instance dgxa100.80g.8.norm \
    --image "nvcr.io/<ORG ID>/nemofw-training:<tag>" \
    --result /results \
    --workspace nemo-framework:/mount_workspace:RW \
    --total-runtime 5D \
    --replicas 4 \
    --array-type PYTORCH \
    --commandline "\
set -x && \
python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py \
    cluster_type=bcp \
    stages=[training] \
    training=gpt3/5b \
    training_config=gpt3/5b \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
    data_dir=/mount_workspace/data \
    base_results_dir=/mount_workspace/results \
    training.run.time_limit=\"5-00:00:00\" \
    training.trainer.max_time=\"4:23:30:00\" \
    training.trainer.num_nodes=\${NGC_ARRAY_SIZE} \
    training.model.tokenizer.vocab_file=/mount_workspace/data/bpe/vocab.json \
    training.model.tokenizer.merge_file=/mount_workspace/data/bpe/merges.txt \
    > >(tee -a /results/train_log.log) \
    2> >(tee -a /results/train_stderr.log >&2) && \
rsync -P -rvh /mount_workspace/results /results"

  • Total Runtime: BCP will automatically kill any jobs that surpass a predefined time limit. As specified, this job will be terminated automatically after 5 days. If a different time limit is desired (for example, if a job should only run for 24 hours), specify it here.

  • Replicas: As with the data preparation, this value should be updated depending on the desired number of nodes to run on. Note that while the number can be tweaked, it must be compatible with the global batch size (GBS), micro batch size (MBS), tensor parallelism (TP), and pipeline parallelism (PP): specifically, GBS % (MBS * num GPUs / (PP * TP)) == 0, where num GPUs is the total number of GPUs across all replicas. For the 5B model, the default GBS is 2048, MBS is 4, TP is 1, and PP is 1. Therefore, a replica value of 4 is valid since 2048 % (4 * (8 * 4) / (1 * 1)) == 0. See the sketch after this list for a quick way to check other replica counts.

  • training.run.time_limit: Once this time limit is reached, NeMo Framework will automatically shut itself down. This differs from the Total Runtime above in that it is enforced by NeMo Framework rather than BCP and should result in a clean exit, whereas BCP will kill the job regardless of its state. This can be set lower than the Total Runtime above if a clean exit is preferred.

  • training.trainer.max_time: Once this limit is reached, NeMo Framework will finish the current forward pass if necessary and begin saving a checkpoint. It is recommended to set this 30-60 minutes before the overall time limit to provide extra wiggle room for saving a checkpoint that reflects the final state of the model. Once the checkpoint is saved, the training process will exit cleanly.
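
To sanity-check a different replica count before submitting a job, the batch-size constraint above can be evaluated with shell arithmetic. Below is a minimal sketch using the default 5B values from this guide (GBS=2048, MBS=4, TP=1, PP=1) and 8 GPUs per dgxa100.80g.8.norm instance; a result of 0 means the replica count is valid:

REPLICAS=4                       # desired number of nodes
GBS=2048; MBS=4; TP=1; PP=1      # defaults for the 5B GPT-3 configuration
NUM_GPUS=$((8 * REPLICAS))       # 8 GPUs per node
# Valid when GBS % (MBS * NUM_GPUS / (TP * PP)) == 0
echo $((GBS % (MBS * NUM_GPUS / (TP * PP))))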

As mentioned, the training process exercises many of the hardware components involved in the cluster. Once the job begins, you can view the telemetry in the BCP web UI. The following screenshot is an example of the first 24 hours of a GPT-3 5B training session running on 8 nodes.

[Figure: Telemetry after pre-training a 5B GPT-3 model for 24 hours]

Training progress can be viewed in the Log section of the job page, which displays the percentage completed as well as the estimated time remaining to train the model. Note that the job may terminate before the model finishes training, depending on the time limits specified. This is still valuable, as the models should have a deterministic loss curve and can be compared for accuracy; if the job ran long enough, the results can also be used to certify that the hardware is still healthy.

By default, a checkpoint will be saved every 2000 global steps based on the current model state. For the 5B model, each checkpoint will be approximately 59GB. The checkpoints can be viewed in the workspace mounted locally:

$ du -sh results/gpt3_5b/results/checkpoints/*
59G     results/gpt3_5b/results/checkpoints/megatron_gpt--val_loss=1.86-step=12000-consumed_samples=24571904.0.ckpt
59G     results/gpt3_5b/results/checkpoints/megatron_gpt--val_loss=1.86-step=12000-consumed_samples=24571904.0-last.ckpt
59G     results/gpt3_5b/results/checkpoints/megatron_gpt--val_loss=1.88-step=10000-consumed_samples=20475904.0.ckpt
59G     results/gpt3_5b/results/checkpoints/megatron_gpt--val_loss=1.92-step=8000-consumed_samples=16379904.0.ckpt
59G     results/gpt3_5b/results/checkpoints/megatron_gpt--val_loss=1.97-step=6000-consumed_samples=12285952.0.ckpt
59G     results/gpt3_5b/results/checkpoints/megatron_gpt--val_loss=2.05-step=4000-consumed_samples=8189952.0.ckpt
59G     results/gpt3_5b/results/checkpoints/megatron_gpt--val_loss=2.26-step=2000-consumed_samples=4093952.0.ckpt

The output above shows 6 checkpoints, plus an additional copy of the latest checkpoint with -last.ckpt in the name. In this case, the validation loss improved with every checkpoint, indicating the model is progressing as expected. By default, up to 10 checkpoints with the lowest val_loss scores will be kept; once that limit is reached, checkpoints with lower scores replace those with higher ones. After every validation pass, the latest checkpoint is always overwritten based on the current model state, regardless of the val_loss score. As such, in some scenarios the -last.ckpt file might not have the lowest val_loss score if the loss did not decrease.

With the base dataset pre-processed in the workspace and pre-training completed for the base model, any additional fine-tuning and deployment steps can be done by following the same pattern as above.

Each stage will involve launching a job with ngc batch run, monitoring the job status in the web UI, and verifying results in the web UI or via the mounted workspace.

Nearly every job will follow a template similar to the following:

ngc batch run \
    --name "model-stage-name" \
    --org <ORG ID> \
    --team team-name \
    --ace nv-dgxc-ace1 \
    --instance dgxa100.80g.8.norm \
    --image "nvcr.io/<ORG ID>/nemofw-training:<tag>" \
    --result /results \
    --workspace nemo-framework:/mount_workspace:RW \
    --total-runtime 5D \
    --replicas <num replicas> \
    --array-type PYTORCH \
    --commandline "<Command here>"

Depending on the job being run, you might need fewer replicas than what was required for pre-training.

The commandline will also be formulaic, following this template:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py \
    cluster_type=bcp \
    stages=[<Enter stage>] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
    data_dir=/mount_workspace/data \
    base_results_dir=/mount_workspace/results \
    <Enter stage>_config=<Enter stage config file> ...

The Python command invokes the launcher script inside the container when the job is launched. Enter the desired stage in stages=[] and specify the config file to use in <Enter stage>_config=<Enter stage config file>, such as evaluation_config=gpt3/evaluate_all for running evaluation. You can manually override any additional settings from the config files as shown in the earlier examples; otherwise, the default values will be used.
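
For instance, an evaluation job might use a commandline along the lines of the sketch below. This is only an illustration built from the template above and the evaluation_config=gpt3/evaluate_all example; verify the exact stage and config override names against the configuration files shipped in the container before running:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py \
    cluster_type=bcp \
    stages=[evaluation] \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
    data_dir=/mount_workspace/data \
    base_results_dir=/mount_workspace/results \
    evaluation_config=gpt3/evaluate_all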
