2. NeMo Framework on DGX Cloud
2.1. Overview
This guide provides a basic starting point for using NVIDIA’s framework for pre-training, fine-tuning, and deploying Large Language Models (LLMs), called NeMo Framework.
The document walks through using a DGX Cloud Slurm cluster as a user to launch a simple pretraining job, targeting synthetic data to minimize dependencies for initial use.
2.2. Prerequisites
To follow this document, the following items are assumed to be true:
The user has a valid NGC API key, which can be generated by following the steps in the NGC documentation. Save this key for future steps during cluster setup.
A DGX Cloud Slurm cluster is provisioned and the user has access to launching and running jobs on the cluster (administrator permissions not necessary). More on cluster-specific requirements can be found later in the document.
The user has access to at least two A100 or H100-based compute nodes on the cluster.
The user has read/write access to at least 100GB of shared storage.
The user can install additional Python packages via pip on the login node (available by running module add python39 after logging in).
The user has the Slurm module configured as part of their account (typically configured by an admin at user creation time, but available by running module add slurm after logging in).
2.3. Setting Up Your Cluster Workspace
Before testing can begin, a few steps need to be taken to properly configure the user workspace for interacting with NeMo Framework. The following sections assume you are connected to a login node via SSH as the user you intend to run jobs as.
2.3.1. Authenticating with NGC
In order to pull the NeMo FW training container from NGC, the previously noted NGC API key needs to be added to a configuration file in the DGX Cloud Slurm cluster. Authorization can be provided by following the appropriate section in the DGX Cloud User Guide if not already completed.
2.3.2. Pulling the NeMo Framework repository
The NeMo Framework, which is used to launch data prep and training jobs, is available on GitHub. The repository can be pulled directly from GitHub onto the login node. The location where the repository is cloned needs to be on a Lustre filesystem, which will be accessible on all compute nodes.
This location is cluster dependent and should have been provided to you during onboarding to the cluster. Ask cluster admins for this information if not available.
Note: if your cluster was set up according to the DGX Cloud Admin Guide, your shared storage will be located at /lustre/fs0/scratch/<user-name>.
Navigate to your user’s Lustre filesystem directory. Next, clone the repository from GitHub. A git reset
command is used to ensure a match with the code tested as part of this guide.
cd /lustre/fs0/scratch/<user-name>
git clone https://github.com/nvidia/nemo-megatron-launcher
cd nemo-megatron-launcher
git reset --hard 51df3f36f5bc51b7bbdc3f540a43b01cdc28c8be
2.3.3. Configuring NeMo Framework
With the repository cloned, the Python dependencies used to launch the script need to be installed.
To do so, we need to load the python39 and slurm modules if not already loaded. DGX Cloud comes pre-configured with several modules relating to various applications such as Python, Slurm, and OpenMPI. To load the python39 and slurm modules, run:
module add python39 slurm
You can also run the following, which will add these modules to your user profile automatically during future logins to the cluster.
module initadd python39 slurm
Install the Python dependencies for NeMo FW with the following command. This assumes you are in the nemo-megatron-launcher directory that was cloned in a previous step.
pip3 install -r requirements.txt
NeMo Framework has a series of config files which are used to tailor training, fine-tuning, data prep, and more for your specific needs.
The config files are located at launcher_scripts/conf inside the nemo-megatron-launcher directory.
The main config file is launcher_scripts/conf/config.yaml, which contains high-level configuration settings that will be used for all stages of NeMo FW.
Open the config.yaml file and make the following edits:
Line 6: Change gpt3/5b to gpt3/7b_improved.
Line 32: Uncomment training. This indicates we want to run the training stage.
Line 33: Comment out conversion by adding # to the beginning of the line, following the pattern of the other lines in this section. We do not want to run model conversion to convert the distributed checkpoint to the .nemo format at this time; it is not needed for our performance and validation purposes.
Line 44: Replace the ??? with the full path to the nemo-megatron-launcher/launcher_scripts directory. Again, this will be cluster dependent but must match the location where the repository was cloned in the previous section. As an example, if your repository was cloned to /lustre/fs0/scratch/<user-name>/nemo-megatron-launcher, you would enter /lustre/fs0/scratch/<user-name>/nemo-megatron-launcher/launcher_scripts. The path must end with launcher_scripts.
Line 48: Replace null with /cm/shared.
In the env_vars section starting on line 56, set the following values, which DGX Cloud clusters need in order to use the compute network for optimal performance:

NCCL_TOPO_FILE: /cm/shared/etc/ndv4-topo.xml
UCX_IB_PCI_RELAXED_ORDERING: null
NCCL_IB_PCI_RELAXED_ORDERING: 1
NCCL_IB_TIMEOUT: null
NCCL_DEBUG: null
NCCL_PROTO: LL,LL128,Simple
TRANSFORMERS_OFFLINE: 0
TORCH_NCCL_AVOID_RECORD_STREAMS: 1
NCCL_NVLS_ENABLE: 0
NVTE_DP_AMAX_REDUCE_INTERVAL: 0
NVTE_ASYNC_AMAX_REDUCTION: 1
NVTE_FUSED_ATTN: 0
HYDRA_FULL_ERROR: 1
OMPI_MCA_coll_hcoll_enable: 0
UCX_TLS: rc
UCX_NET_DEVICES: mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
CUDA_DEVICE_ORDER: PCI_BUS_ID
NCCL_SOCKET_IFNAME: eth0
NCCL_ALGO: Tree,Ring,CollnetDirect,CollnetChain,NVLS
MELLANOX_VISIBLE_DEVICES: all
PMIX_MCA_gds: hash
PMIX_MCA_psec: native
Regardless of the cluster-specific settings to enable high-speed compute networking, set TRANSFORMERS_OFFLINE to 0 instead of 1. This will allow tokenizers to be downloaded from the internet if not found locally, which is expected to be the case on new clusters.
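Taken together, the edits above should leave the relevant parts of config.yaml looking roughly like this sketch (abbreviated and illustrative, not the full file; the null-to-/cm/shared change on line 48 is omitted here because its key depends on the pinned commit, so follow the line numbers above in your checkout):

```yaml
defaults:
  - training: gpt3/7b_improved   # line 6: was gpt3/5b

stages:
  - training                     # line 32: uncommented to run the training stage
  # - conversion                 # line 33: commented out to skip checkpoint conversion

launcher_scripts_path: /lustre/fs0/scratch/<user-name>/nemo-megatron-launcher/launcher_scripts  # line 44
```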
In addition to the main config file, the cluster config might need to be updated; whether it does depends on how the cluster is configured and whether there are any custom partitions or account names that need to be used. A DGX Cloud Slurm cluster with default configurations will not require modifications to this file.
Open the cluster config file at launcher_scripts/conf/cluster/bcm.yaml.
If the Slurm cluster has a custom non-default partition or account that jobs need to run on, specify those in the file on the account and partition lines.
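For example, on a cluster with a dedicated Slurm account and partition, the relevant lines in bcm.yaml might look like the following (the account and partition names are placeholders; the rest of the file is omitted):

```yaml
# launcher_scripts/conf/cluster/bcm.yaml (excerpt; values are illustrative)
partition: defq          # Slurm partition jobs should run on
account: my-account      # Slurm account to charge jobs to, if your cluster requires one
```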
2.4. Running a Training Job on Slurm Using Synthetic Data
Similar to the core config file, there are a few training-specific config changes that need to be made for this sample synthetic-data-based training job.
2.4.1. Configuring the Training Job
Next, we need to update some of the settings for the 7b_improved model.
Open the model's config file at launcher_scripts/conf/training/gpt3/7b_improved.yaml.
Note that the launcher_scripts/conf/training/gpt3 directory contains all of the default configurations for the various model sizes that NVIDIA has validated.
Make the following changes to the 7b_improved.yaml config file:
Line 8: Update the time_limit value to 0-02:00:00. This will end the test run after two hours.
Line 12: Update this to the number of nodes you want to test on. For example, if you have 4 nodes to run this job on, set this value to 4.
Line 20: Set the maximum number of steps to 2000. This is the number of steps the training will run for. The higher the number of steps, the more stable performance will be, though it will take longer to reach higher steps.
Line 21: Set the max_time value to 00:01:30:00. This reserves 30 minutes within the overall training run to cleanly end the run and write a checkpoint to shared storage.
Line 23: Set the val_check_interval to 240. This is the number of steps the training job will take before running a validation pass. This number must be less than or equal to the maximum number of steps listed above.
Line 169: Change the data_impl value from mmap to mock. This ensures synthetic data is generated and used.
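The edits above correspond to the following sketch of 7b_improved.yaml (abbreviated; the key names follow the standard NeMo launcher training config layout, but exact line positions may drift between releases, so follow the line numbers above in your checkout):

```yaml
run:
  time_limit: "0-02:00:00"    # line 8: end the test run after two hours
trainer:
  num_nodes: 4                # line 12: number of nodes to test on (example: 4)
  max_steps: 2000             # line 20: total training steps
  max_time: "00:01:30:00"     # line 21: reserve 30 minutes to end cleanly and checkpoint
  val_check_interval: 240     # line 23: steps between validation passes
model:
  data:
    data_impl: mock           # line 169: generate and use synthetic data
```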
2.4.2. Running the Training Job
After updating the training config file, it’s time to launch the training job.
Navigate to the nemo-megatron-launcher/launcher_scripts directory and run:
python3 main.py
This will queue up a training job for the 7b_improved model once resources are available.
2.4.3. Monitoring the Training Job
Once the training job is submitted, it can be viewed in the queue with squeue. The output should look similar to the following:

squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  181      defq nemo-meg demo-use  R       0:01      4 gpu[001-004]
The training logs of each job associated with a specific model can be found at nemo-megatron-launcher/launcher_scripts/results/<model name>/log-nemo-megatron-<model name>_NNN, where NNN is the job number. There will be both a .out file and a .err file for the job's stdout and stderr, respectively.
For the sample we are running, <model name> will be gpt_7b_improved.
To view live progress, the output of these files can be followed with tail -f <filename>, which will display updates in the terminal as they are written to the file.
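For example, the stdout log could be followed as shown below. The job number (181), the results path, and the placeholder log line are stand-ins so the commands can be run anywhere; on the cluster, the launcher creates the real paths under results/ for you.

```shell
# Create a stand-in results directory and log file (hypothetical job 181).
mkdir -p results/gpt_7b_improved
LOG=results/gpt_7b_improved/log-nemo-megatron-gpt_7b_improved_181.out
echo "Epoch 0: training step output would appear here" > "$LOG"

tail -n 10 "$LOG"   # print the most recent lines once
# tail -f "$LOG"    # follow live updates; press Ctrl+C to stop
```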
2.4.4. Interpreting Training Performance
While the model is training, progress will be updated at the bottom of the log files. Open the log files and you should see a line that looks similar to this at the end:
Epoch 0: : 2%|▏ | 180/10050 [40:46<37:16:04, v_num=m9yd, reduced_train_loss=5.640, global_step=179.0, consumed_samples=92160.0, train_step_timing in s=13.40]
This is the training progress bar and status information. Breaking down the line we have:
Epoch 0: This indicates we are in the first epoch of the training dataset.
2%: The job is 2% through the maximum number of steps we wanted to train for.
40:46<37:16:04: Training has run for 40 minutes and 46 seconds so far and it is expected to take another 37 hours to finish.
v_num=m9yd: This is the version identifier the experiment logger assigned to this run, used to distinguish runs of the same model. It is not a training metric.
reduced_train_loss=5.640: This is the latest training loss. Ideally this should decrease over time while training.
global_step=179.0: This is the number of optimizer steps completed so far. Steps are zero-indexed, so step 179 corresponds to the 180 steps shown in the progress count.
consumed_samples=92160.0: This is the number of samples in the dataset that have been processed so far. If multiplied by the sequence length, typically 2048, it equals the total number of tokens the model has been trained on so far.
train_step_timing in s=13.40: This is the key information for determining training throughput. This indicates it takes 13.40 seconds for every step in the training pass. This can be used to measure and compare performance across clusters and models.
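The consumed-token arithmetic described above can be checked directly; both numbers here come from the sample progress line, with 2048 as the typical sequence length:

```shell
# tokens trained on so far = consumed_samples * sequence_length
echo $((92160 * 2048))   # prints 188743680
```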
To collect the performance measurement, find the train_step_timing in s value at the end of the training log for each model. This value is used to determine the overall performance and can be labeled seconds/iteration. The inverse, iterations/second, can also be useful.
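A small helper along these lines can pull the latest step time out of a log and report both figures. The log filename and its contents below are stand-ins so the commands are runnable anywhere; on the cluster, point LOG at the real .out file. The grep pattern matches the progress-line format shown above.

```shell
# Write a stand-in log line mimicking the progress-bar format.
LOG=sample_training.log
echo 'Epoch 0: ... train_step_timing in s=13.40]' > "$LOG"

# Extract the most recent step time (seconds/iteration).
STEP_TIME=$(grep -o 'train_step_timing in s=[0-9.]*' "$LOG" | tail -n 1 | cut -d= -f2)
echo "seconds/iteration: $STEP_TIME"

# And its inverse (iterations/second).
awk -v t="$STEP_TIME" 'BEGIN { printf "iterations/second: %.4f\n", 1/t }'
```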