2. NeMo Framework on DGX Cloud
2.1. Overview
This guide provides a basic starting point for using NVIDIA’s framework for pre-training, fine-tuning, and deploying Large Language Models (LLMs), called NeMo Framework.
The document walks through using a DGX Cloud Slurm cluster as a user to launch a simple pretraining job, targeting synthetic data to minimize dependencies for initial use.
2.2. Prerequisites
To follow this document, the following items are assumed to be true:
The user has a valid NGC API key, which can be generated by following the steps in the NGC documentation. Save this key for future steps during cluster setup.
A DGX Cloud Slurm cluster is provisioned and the user has access to launching and running jobs on the cluster (administrator permissions not necessary). More on cluster-specific requirements can be found later in the document.
The user has access to at least two A100 or H100-based compute nodes on the cluster.
The user has read/write access to at least 100GB of shared storage.
The user can install additional Python packages via pip on the login node (available by running module add python39 after logging in).
The user has the Slurm module configured as part of their account (typically configured by an admin at user creation time, but available by running module add slurm after logging in).
2.3. Setting Up Your Cluster Workspace
Before testing can begin, a few steps need to be taken to properly configure the user workspace for interacting with NeMo Framework. The following sections assume you are connected to a login node via SSH as the user you intend to run jobs as.
2.3.1. Authenticating with NGC
In order to pull the NeMo FW training container from NGC, the previously noted NGC API key needs to be added to a configuration file in the DGX Cloud Slurm cluster. Authorization can be provided by following the appropriate section in the DGX Cloud User Guide if not already completed.
2.3.2. Pulling the NeMo Framework repository
The NeMo Framework, which is used to launch data prep and training jobs, is available on GitHub. The repository can be pulled directly from GitHub onto the login node. The location where the repository is cloned needs to be on a Lustre filesystem, which will be accessible on all compute nodes.
This location is cluster dependent and should have been provided to you during onboarding to the cluster. Ask cluster admins for this information if not available.
Note: if your cluster was set up according to the DGX Cloud Admin Guide, your shared storage will be located at /lustre/fs0/scratch/<user-name>.
Navigate to your user’s Lustre filesystem directory. Next, clone the repository from GitHub. A git reset
command is used to ensure a match with the code tested as part of this guide.
cd /lustre/fs0/scratch/<user-name>
git clone https://github.com/nvidia/nemo-megatron-launcher
cd nemo-megatron-launcher
git reset --hard 51df3f36f5bc51b7bbdc3f540a43b01cdc28c8be
2.3.3. Configuring NeMo Framework
With the repository cloned, the Python dependencies used to launch the script need to be installed.
To do so, we need to load the python39 and slurm modules if not already loaded. DGX Cloud comes pre-configured with several modules relating to various applications such as Python, Slurm, and OpenMPI. To load the python39 and slurm modules, run:
module add python39 slurm
You can also run the following, which will add these modules to your user profile automatically during future logins to the cluster.
module initadd python39 slurm
Install the Python dependencies for NeMo FW with the following command. This assumes you are in the nemo-megatron-launcher directory that was cloned in a previous step.
pip3 install -r requirements.txt
NeMo Framework has a series of config files which are used to tailor training, fine-tuning, data prep, and more for your specific needs.
The config files are located at launcher_scripts/conf inside the nemo-megatron-launcher directory.
The main config file is launcher_scripts/conf/config.yaml, which contains high-level configuration settings that will be used for all stages of NeMo FW.
Open the config.yaml file and make the following edits:
Line 6: Change gpt3/5b to gpt3/7b_improved.
Line 32: Uncomment training. This indicates we want to run the training stage.
Line 33: Comment out conversion by adding # to the beginning of the line, following the pattern of the other lines in this section. We do not want to run model conversion to convert the distributed checkpoint to the .nemo format at this time; it is not needed for our performance and validation purposes.
Line 44: Replace the ??? with the full path to the nemo-megatron-launcher/launcher_scripts directory. Again, this will be cluster dependent but must match the location where the repository was cloned in the previous section. As an example, if your repository was cloned to /lustre/fs0/scratch/<user-name>/nemo-megatron-launcher, you would enter /lustre/fs0/scratch/<user-name>/nemo-megatron-launcher/launcher_scripts. The path must end with launcher_scripts.
Line 48: Replace null with /cm/shared.
In the env_vars section starting on line 56, set the following values, which DGX Cloud clusters need in order to use the compute network for optimal performance:

NCCL_TOPO_FILE: /cm/shared/etc/ndv4-topo.xml
UCX_IB_PCI_RELAXED_ORDERING: null
NCCL_IB_PCI_RELAXED_ORDERING: 1
NCCL_IB_TIMEOUT: null
NCCL_DEBUG: null
NCCL_PROTO: LL,LL128,Simple
TRANSFORMERS_OFFLINE: 0
TORCH_NCCL_AVOID_RECORD_STREAMS: 1
NCCL_NVLS_ENABLE: 0
NVTE_DP_AMAX_REDUCE_INTERVAL: 0
NVTE_ASYNC_AMAX_REDUCTION: 1
NVTE_FUSED_ATTN: 0
HYDRA_FULL_ERROR: 1
OMPI_MCA_coll_hcoll_enable: 0
UCX_TLS: rc
UCX_NET_DEVICES: mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
CUDA_DEVICE_ORDER: PCI_BUS_ID
NCCL_SOCKET_IFNAME: eth0
NCCL_ALGO: Tree,Ring,CollnetDirect,CollnetChain,NVLS
MELLANOX_VISIBLE_DEVICES: all
PMIX_MCA_gds: hash
PMIX_MCA_psec: native
Regardless of the cluster-specific settings to enable high-speed compute networking, set TRANSFORMERS_OFFLINE to 0 instead of 1. This will allow tokenizers to be downloaded from the internet if not found locally, which is expected to be the case on new clusters.
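Taken together, the edits above should leave the relevant parts of config.yaml looking roughly like this sketch (abbreviated and illustrative, not the full file; the null-to-/cm/shared change on line 48 is omitted here because its key depends on the pinned commit, so follow the line numbers above in your checkout):

```yaml
defaults:
  - training: gpt3/7b_improved   # line 6: was gpt3/5b

stages:
  - training                     # line 32: uncommented to run the training stage
  # - conversion                 # line 33: commented out to skip checkpoint conversion

launcher_scripts_path: /lustre/fs0/scratch/<user-name>/nemo-megatron-launcher/launcher_scripts  # line 44
```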
In addition to the main config file, the cluster config might need to be updated; whether it does depends on how the cluster is configured and whether there are any custom partitions or account names that need to be used. A DGX Cloud Slurm cluster with default configurations will not require modifications to this file.
Open the cluster config file at launcher_scripts/conf/cluster/bcm.yaml.
If the Slurm cluster has a custom non-default partition or account that jobs need to run on, specify those in the file on the account and partition lines.
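For example, on a cluster with a dedicated Slurm account and partition, the relevant lines in bcm.yaml might look like the following (the account and partition names are placeholders; the rest of the file is omitted):

```yaml
# launcher_scripts/conf/cluster/bcm.yaml (excerpt; values are illustrative)
partition: defq          # Slurm partition jobs should run on
account: my-account      # Slurm account to charge jobs to, if your cluster requires one
```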
2.4. Running a Training Job on Slurm Using Synthetic Data
Similar to the core config file, there are a few training-specific config changes that need to be made for this sample synthetic-data-based training job.
2.4.1. Configuring the Training Job
Next, we need to update some of the settings for the 7b_improved model.
Open the model's config file at launcher_scripts/conf/training/gpt3/7b_improved.yaml.
Note that the launcher_scripts/conf/training/gpt3 directory contains all of the default configurations for the various model sizes that NVIDIA has validated.
Make the following changes to the 7b_improved.yaml config file:
Line 8: Update the time_limit value to 0-02:00:00. This will end the test run after two hours.
Line 12: Update this to the number of nodes you want to test on. For example, if you have 4 nodes to run this job on, set this value to 4.
Line 20: Set the maximum number of steps to 2000. This is the number of steps the training will run for. The higher the number of steps, the more stable performance will be, though it will take longer to reach higher steps.
Line 21: Set the max_time value to 00:01:30:00. This reserves 30 minutes within the overall training run to cleanly end the run and write a checkpoint to shared storage.
Line 23: Set the val_check_interval to 240. This is the number of steps the training job will take before running a validation pass. This number must be less than or equal to the maximum number of steps listed above.
Line 169: Change the data_impl value from mmap to mock. This ensures synthetic data is generated and used.
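The edits above correspond to the following sketch of 7b_improved.yaml (abbreviated; the key names follow the standard NeMo launcher training config layout, but exact line positions may drift between releases, so follow the line numbers above in your checkout):

```yaml
run:
  time_limit: "0-02:00:00"    # line 8: end the test run after two hours
trainer:
  num_nodes: 4                # line 12: number of nodes to test on (example: 4)
  max_steps: 2000             # line 20: total training steps
  max_time: "00:01:30:00"     # line 21: reserve 30 minutes to end cleanly and checkpoint
  val_check_interval: 240     # line 23: steps between validation passes
model:
  data:
    data_impl: mock           # line 169: generate and use synthetic data
```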
2.4.2. Running the Training Job
After updating the training config file, it’s time to launch the training job.
Navigate to the nemo-megatron-launcher/launcher_scripts directory and run:
python3 main.py
This will queue up a training job for the 7b_improved model once resources are available.
2.4.3. Monitoring the Training Job
Once the training job is submitted, it can be viewed in the queue with squeue. The output should look similar to the following:

squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  181      defq nemo-meg demo-use  R       0:01      4 gpu[001-004]
The training logs of each job associated with a specific model can be found at nemo-megatron-launcher/launcher_scripts/results/<model name>/log-nemo-megatron-<model name>_NNN, where NNN is the job number. There will be both a .out file and a .err file for the job's stdout and stderr, respectively.
For the sample we are running, <model name> will be gpt_7b_improved.
To view live progress, the output of these files can be followed with tail -f <filename>, which will display updates in the terminal as they are written to the file.
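For example, the stdout log could be followed as shown below. The job number (181), the results path, and the placeholder log line are stand-ins so the commands can be run anywhere; on the cluster, the launcher creates the real paths under results/ for you.

```shell
# Create a stand-in results directory and log file (hypothetical job 181).
mkdir -p results/gpt_7b_improved
LOG=results/gpt_7b_improved/log-nemo-megatron-gpt_7b_improved_181.out
echo "Epoch 0: training step output would appear here" > "$LOG"

tail -n 10 "$LOG"   # print the most recent lines once
# tail -f "$LOG"    # follow live updates; press Ctrl+C to stop
```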
2.4.4. Interpreting Training Performance
While the model is training, progress will be updated at the bottom of the log files. Open the log files and you should see a line that looks similar to this at the end:
Epoch 0: : 2%|▏ | 180/10050 [40:46<37:16:04, v_num=m9yd, reduced_train_loss=5.640, global_step=179.0, consumed_samples=92160.0, train_step_timing in s=13.40]
This is the training progress bar and status information. Breaking down the line we have:
Epoch 0: This indicates we are in the first epoch of the training dataset.
2%: The job is 2% through the maximum number of steps we wanted to train for.
40:46<37:16:04: Training has run for 40 minutes and 46 seconds so far and it is expected to take another 37 hours to finish.
v_num=m9yd: This is the version identifier the experiment logger assigned to this run, used to distinguish runs of the same model. It is not a training metric.
reduced_train_loss=5.640: This is the latest training loss. Ideally this should decrease over time while training.
global_step=179.0: This is the number of optimizer steps completed so far. Steps are zero-indexed, so step 179 corresponds to the 180 steps shown in the progress count.
consumed_samples=92160.0: This is the number of samples in the dataset that have been processed so far. If multiplied by the sequence length, typically 2048, it equals the total number of tokens the model has been trained on so far.
train_step_timing in s=13.40: This is the key information for determining training throughput. This indicates it takes 13.40 seconds for every step in the training pass. This can be used to measure and compare performance across clusters and models.
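The consumed-token arithmetic described above can be checked directly; both numbers here come from the sample progress line, with 2048 as the typical sequence length:

```shell
# tokens trained on so far = consumed_samples * sequence_length
echo $((92160 * 2048))   # prints 188743680
```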
To collect the performance measurement, find the train_step_timing in s value at the end of the training log for each model. This value is used to determine the overall performance and can be labeled seconds/iteration. The inverse, iterations/second, can also be useful.
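A small helper along these lines can pull the latest step time out of a log and report both figures. The log filename and its contents below are stand-ins so the commands are runnable anywhere; on the cluster, point LOG at the real .out file. The grep pattern matches the progress-line format shown above.

```shell
# Write a stand-in log line mimicking the progress-bar format.
LOG=sample_training.log
echo 'Epoch 0: ... train_step_timing in s=13.40]' > "$LOG"

# Extract the most recent step time (seconds/iteration).
STEP_TIME=$(grep -o 'train_step_timing in s=[0-9.]*' "$LOG" | tail -n 1 | cut -d= -f2)
echo "seconds/iteration: $STEP_TIME"

# And its inverse (iterations/second).
awk -v t="$STEP_TIME" 'BEGIN { printf "iterations/second: %.4f\n", 1/t }'
```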