Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Run NeMo Framework on Kubernetes
NeMo Framework supports DGX A100 and H100-based Kubernetes (K8s) clusters with compute networking. Currently, we support NeMo stages such as data preparation, base model pre-training, PEFT, and NeMo Aligner for GPT-based models.
This document explains how to set up your K8s cluster and your local environment. It also provides examples of different use cases.
Prerequisites
This playbook downloads and installs additional third-party, open-source software projects. Before using them, review the license terms associated with these open-source projects.
Software Component Versions
This section identifies the software component versions and services that were validated in this project. While it is possible that other versions may function properly, we have not tested them and cannot guarantee complete compatibility or optimal performance.
We recommend that you verify the alternative versions for your specific use cases. To ensure broader compatibility, we continue to expand our testing to include new versions as they are released.
Software Component | Version
---|---
Kubernetes/kubectl CLI | v1.26.7
Helm | v3.13.2
Kubeflow/Training-Operator | v1.7.0
GPU Operator | v24.6.2
Argo Operator/argo CLI | v3.5.4
Network Operator | v23.5.0
Local Setup
Use the Kubernetes client CLI (kubectl) to access the cluster.
# Linux installation (stable)
mkdir -p ~/bin
KUBECTL_VERSION=stable # change if needed
if [[ $KUBECTL_VERSION == stable ]]; then
    curl -L "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" -o ~/bin/kubectl
else
    curl -L https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl -o ~/bin/kubectl
fi
chmod +x ~/bin/kubectl
# Add this to your shell rc if not there already, e.g., .bashrc
export PATH="${HOME}/bin:${PATH}"
For more information about installation, see https://kubernetes.io/docs/tasks/tools.
After downloading the client, you need to set up and configure your kubeconfig to access your cluster. Work with your cluster admin to set up your kubeconfig.
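Once your kubeconfig is in place, a quick sanity check with standard kubectl commands confirms that you can reach the cluster (node names and counts will vary by cluster):
kubectl config current-context
kubectl get nodes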
Note
If you use a Bright Cluster Manager (BCM) based K8s cluster, you do not need to install kubectl since it is already installed.
Use the Argo client CLI (argo) to monitor and access logs of the submitted workflows.
# Linux installation
mkdir -p ~/bin
ARGO_VERSION=v3.5.4 # change if needed
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/${ARGO_VERSION}/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
mv ./argo-linux-amd64 ~/bin/argo
# Add this to your shell rc if not there already, e.g., .bashrc
export PATH="${HOME}/bin:${PATH}"
For more information about installation, see https://github.com/argoproj/argo-workflows/releases.
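After installation, you can verify that the argo binary is on your PATH and can reach the cluster. These are standard argo CLI commands; replace <...> with the namespace your admin assigned:
argo version
argo list -n <...>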
Cluster Setup
Important
This document assumes that a Kubernetes cluster has been provisioned with DGX A100 or DGX H100 worker nodes and, optionally, IB network interfaces. Some situations may require you to add or modify cluster-scoped resources. Work with your organization and cluster admin if you need to set them up.
Cluster Setup Tools
Setting up the cluster may require the following CLI tools. You do not need them to submit NeMo workloads once the cluster has been configured.
Helm
helm is required to install the Kubernetes Operators needed to run the NeMo Framework on Kubernetes.
Note
If your cluster is already configured with the necessary operators, skip the helm installation step and the remaining steps in the Cluster Setup. Proceed to Storage Options to complete the next step.
Note
Adding cluster-level resources is usually done by your cluster admin.
# Linux installation
mkdir -p ~/bin
HELM_VERSION=v3.13.2
tmp_helm_dir=$(mktemp -d 2>/dev/null || mktemp -d -t 'helm')
wget -qO- https://get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz | tar -xz -C $tmp_helm_dir --strip-components=1
cp $tmp_helm_dir/helm ~/bin/helm
rm -rf $tmp_helm_dir
# Add this to your shell rc if not there already, e.g., .bashrc
export PATH="${HOME}/bin:${PATH}"
Note
Helm has a version-skew policy with Kubernetes. Ensure that you install a version that is compatible with your K8s version.
For more installation information, see https://helm.sh/docs/intro/install.
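A quick check confirms that the installed helm version matches the one listed in Software Component Versions:
helm version --short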
Cluster Operators
This section lists the required Kubernetes Operators and provides example installation instructions.
The GPU Operator manages the NVIDIA software components needed to provision GPUs to pods.
VERSION=v24.6.2
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
--version $VERSION \
nvidia/gpu-operator --set driver.enabled=false
Note
The version of the GPU Operator you install may depend on your Kubernetes version. Verify that your Kubernetes version and operating system are supported in the GPU operator support matrix.
For more information about installation, see https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html.
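After the chart is installed, you can verify that the operator pods are healthy and that GPUs are advertised as allocatable resources. These are standard kubectl commands; replace <...> with one of your worker node names:
kubectl get pods -n gpu-operator
kubectl describe node <...> | grep nvidia.com/gpu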
The Argo Operator provides the Workflow Custom Resource to enable control flow of steps on a K8s cluster.
VERSION=v3.5.4
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/${VERSION}/quick-start-minimal.yaml
kubectl apply -f argo-rbac.yaml
For more information about installation, see https://argo-workflows.readthedocs.io/en/latest/quick-start/#install-argo-workflows.
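You can confirm that the Argo workflow controller and server came up in the argo namespace:
kubectl get pods -n argo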
Warning
This playbook assumes that the Argo Operator can create resources in your namespace under the default ServiceAccount. To grant the default ServiceAccount the necessary permissions for Argo workflows in a specific namespace, create the following Role and RoleBinding. Perform these steps in each namespace where you intend to run your applications.
MY_NAMESPACE=<...>
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: $MY_NAMESPACE
  name: all-resources-manager
rules:
- apiGroups: ["*"] # This specifies all API groups
  resources: ["*"] # This grants access to all resources
  verbs: ["*"] # This specifies the actions the role can perform
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: all-resources-manager-binding
  namespace: $MY_NAMESPACE
subjects:
- kind: ServiceAccount
  name: default # Name of the Service Account
  namespace: $MY_NAMESPACE
roleRef:
  kind: Role
  name: all-resources-manager
  apiGroup: rbac.authorization.k8s.io
EOF
NOTE: This action gives the default ServiceAccount elevated privileges in that namespace, which is not recommended for production clusters. When setting up Role-Based Access Control (RBAC) in your organization's Kubernetes cluster, the cluster admin decides how to configure permissions. Work with your cluster admin to find the RBAC setup that works for your use case. For more information, refer to the Argo docs for their recommendations.
The Kubeflow Training Operator provides multi-node training on Kubernetes through Custom Resources such as PyTorchJob and MPIJob.
VERSION=v1.7.0
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=$VERSION"
For more information about installation, see https://github.com/kubeflow/training-operator.
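You can verify that the Training Operator is running and that its custom resources (such as PyTorchJob and MPIJob) are registered:
kubectl get pods -n kubeflow
kubectl get crd | grep kubeflow.org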
The Network Operator provides an easy way to install, configure, and manage the lifecycle of the NVIDIA Mellanox networking components in the cluster. It is only required if you have IB network interfaces.
VERSION=v23.5.0
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install network-operator nvidia/network-operator \
-n nvidia-network-operator \
--create-namespace \
--version ${VERSION} \
--wait
For more information about installation, see Vanilla Kubernetes Cluster.
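As with the other operators, you can check that the network operator pods are running before proceeding:
kubectl get pods -n nvidia-network-operator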
Storage Options
Note
Setting up storage in a Kubernetes cluster is usually the responsibility of the cluster admin. Work with your cluster admin to understand the storage options that are available to you.
Currently, the following volume types are supported:
- PersistentVolumeClaims (PVC)
Typically, in production or development clusters, StorageClasses allow users to allocate storage for their workloads without needing to configure the specific storage type. For example, an admin can set up NFS storage through a StorageClass.
The examples in this playbook use PersistentVolumeClaims, but you can use NFS or HostPath with any of these examples.
- NFS
Easy to set up on Cloud Service Providers (CSP).
To use NFS storage directly in these examples, you need the IP address of the server.
- HostPath
Suitable for use with single-node clusters or clusters where the path mounted is available on all the worker nodes.
If there is a different storage option you’d like to see supported, open an issue on NeMo-Framework-Launcher.
Set up PVC Storage
PersistentVolumeClaims (PVCs) store your data, checkpoints, and results. The remaining steps in this playbook assume that your PVC is in place.
The following example shows how to create a dynamic PVC from a StorageClass that was set up by your cluster admin. Replace STORAGE_CLASS=<...> with the name of your StorageClass.
This example requests 150Gi of space. Adjust this number for your workloads, but keep in mind that not all storage provisioners support volume resizing.
STORAGE_CLASS=<...>
PVC_NAME=nemo-workspace
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${PVC_NAME}
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ${STORAGE_CLASS}
  resources:
    requests:
      # Requesting enough storage for a few experiments
      storage: 150Gi
EOF
Note
The storage class must support ReadWriteMany because multiple pods may need to access the PVC to perform concurrent read and write operations.
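Once the claim has been created, you can confirm that it has bound to a volume with a standard kubectl check (the claim name matches the PVC_NAME used above):
kubectl get pvc nemo-workspace
# The STATUS column should show "Bound" once the volume has been provisioned.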
Set up PVC Busybox Helper Pod
Inspecting the PVC and copying data to and from it is made easier with a busybox helper pod. The following example creates a pod that mounts the PVC so it can be used to copy data to and from it.
PVC_NAME=nemo-workspace
MOUNT_PATH=/nemo-workspace
kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nemo-workspace-busybox
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: workspace
      mountPath: ${MOUNT_PATH}
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: ${PVC_NAME}
EOF
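Once the pod is Running, a quick check with standard kubectl commands confirms that the PVC is mounted at the expected path:
kubectl get pod nemo-workspace-busybox
kubectl exec nemo-workspace-busybox -- ls /nemo-workspace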
The pod uses minimal resources when idle, so it can be left running. If it is no longer required, remove it:
kubectl delete pod nemo-workspace-busybox
Set up Docker Secrets
To set up Docker secrets, create a secret on the K8s cluster to authenticate with the NGC private registry. If you have not done so already, get an NGC key from ngc.nvidia.com.
Create a secret on the K8s cluster and replace <NGC KEY HERE> with your NGC secret key. If your key contains any special characters, wrap it in single quotes (') so that K8s can parse it correctly.
kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>
The name of this secret (ngc-registry) is the default used by the launcher. To configure a different pull secret, add the following override to your launcher invocation:
python3 main.py \
...
cluster=k8s_v2 \
cluster.pull_secret=<my-other-pull-secret> \
...
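Before submitting any jobs, you can confirm that the secret exists in your namespace with a standard kubectl command:
kubectl get secret ngc-registry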
NeMo Stages
To get started, clone the NeMo-Framework-Launcher repository onto the same node where the kubectl and argo binaries are located. You will use it to access your K8s cluster.
git clone https://github.com/NVIDIA/NeMo-Framework-Launcher
Set up a virtual environment:
cd NeMo-Framework-Launcher
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Navigate to the launcher_scripts directory:
cd launcher_scripts
The following example downloads and preprocesses The Pile dataset onto the nemo-workspace PVC using CPU resources:
PVC_NAME=nemo-workspace # Must already exist
MOUNT_PATH=/nemo-workspace # Path within the container
PYTHONPATH=$PWD python main.py \
launcher_scripts_path=$PWD \
data_dir=$MOUNT_PATH/pile \
\
cluster=k8s_v2 \
cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
cluster.volumes.workspace.mount_path=$MOUNT_PATH \
\
'stages=[data_preparation]' \
data_preparation=gpt3/download_gpt3_pile \
data_preparation.file_numbers="0-29" \
data_preparation.run.node_array_size=30 \
data_preparation.run.bcp_preproc_npernode=2 \
\
env_vars.TRANSFORMERS_OFFLINE=0
# Should see the following message from stdout:
# workflow.argoproj.io/pile-prep-ldj45 created
- launcher_scripts_path=$PWD: Must be set for all launcher commands.
- data_dir=$MOUNT_PATH/pile: Container location of the downloaded and preprocessed data. Ensure the backing volume has enough storage. If this path is not present in one of the cluster.volumes.*.mount_path values, an error will be raised.
- cluster=k8s_v2: Selects the Kubernetes launcher type.
- cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME: Name of the previously created PVC.
- cluster.volumes.workspace.mount_path=$MOUNT_PATH: Mounts the workspace volume into all containers at this path.
- 'stages=[data_preparation]': Selects only the data preparation stage.
- data_preparation=gpt3/download_gpt3_pile: Selects the scripts and procedure for data preparation.
- data_preparation.file_numbers="0-29": Set 0-29 to something lower, such as 0-1, if you don't need the full dataset or would like to iterate quickly.
- data_preparation.run.node_array_size=30: Controls the number of workers. Set it to the same value as the number of file_numbers.
- data_preparation.run.bcp_preproc_npernode=2: Controls how many processes run per worker. (Note: While the name suggests BCP, this is a platform-agnostic config option.)
To explore other data_preparation.* options, refer to the config source.
Depending on the number of nodes requested, this operation can take a few hours to download all 30 shards of The Pile dataset, extract each shard, and then preprocess the extracted files.
After submitting the job, you can monitor the overall progress:
argo watch @latest
# Or you can reference the workflow directly
argo watch pile-prep-ldj45
This action opens a full-screen status that updates periodically.
Name: pile-prep-ldj45
Namespace: frameworks
ServiceAccount: unset (will run with the default ServiceAccount)
Status: Running
Conditions:
PodRunning False
Created: Mon Mar 18 11:53:29 -0700 (5 minutes ago)
Started: Mon Mar 18 11:53:29 -0700 (5 minutes ago)
Duration: 5 minutes 57 seconds
Progress: 2/3
ResourcesDuration: 9s*(1 cpu),9s*(100Mi memory)
STEP TEMPLATE PODNAME DURATION MESSAGE
● pile-prep-ldj45 data-steps
├───✔ download-merges download-tokenizer pile-prep-ldj45-download-tokenizer-1159852991 5s
├───✔ download-vocab download-tokenizer pile-prep-ldj45-download-tokenizer-3211248646 6s
└───● mpijob pile-prep- pile-prep-ldj45-pile-prep--227527442 5m
The STEP named mpijob may take a while because it is downloading, extracting, and preprocessing the data. You can monitor the progress of the MPIJob and inspect its logs by using the following command, which tracks the logs of the launcher containers:
argo logs @latest -f -c mpi-launcher
# Or you can reference the workflow directly
argo logs pile-prep-ldj45 -f -c mpi-launcher
Once the job is finished, you will see Status: Succeeded in argo watch @latest.
Finally, use the nemo-workspace-busybox pod to view the final processed files. Here is an example output assuming file_numbers="0-1".
kubectl exec nemo-workspace-busybox -- ls -lh /nemo-workspace/pile
# total 37G
# drwxrwxrwx 2 root root 2 Mar 18 18:53 bpe
# -rw-r--r-- 1 root root 18.3G Mar 18 19:25 my-gpt3_00_text_document.bin
# -rw-r--r-- 1 root root 112.5M Mar 18 19:25 my-gpt3_00_text_document.idx
# -rw-r--r-- 1 root root 18.3G Mar 18 19:24 my-gpt3_01_text_document.bin
# -rw-r--r-- 1 root root 112.6M Mar 18 19:24 my-gpt3_01_text_document.idx
To inspect the logs of the other steps:
# To inspect logs of `download-merges`
argo logs @latest -f pile-prep-ldj45-download-tokenizer-1159852991
# To inspect logs of `download-vocab`
argo logs @latest -f pile-prep-ldj45-download-tokenizer-3211248646
Afterwards, clean up the workflow:
argo delete pile-prep-ldj45
# Or if you are sure this is the latest workflow, then you can run
argo delete @latest
The following example shows how to pre-train GPT-3 1B with The Pile dataset. Go to the “Data Prep” tab for information on how to generate The Pile dataset used in this training.
PVC_NAME=nemo-workspace # Must already exist
MOUNT_PATH=/nemo-workspace # Path within the container
PYTHONPATH=$PWD python main.py \
launcher_scripts_path=$PWD \
data_dir=$MOUNT_PATH/pile \
\
cluster=k8s_v2 \
cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
cluster.volumes.workspace.mount_path=$MOUNT_PATH \
\
'stages=[training]' \
training=gpt3/1b_improved \
"training.exp_manager.explicit_log_dir=$MOUNT_PATH/\${training.run.name}/training_\${training.run.name}/results" \
\
"training.model.data.data_prefix=[0.5,$MOUNT_PATH/pile/my-gpt3_00_text_document,0.5,$MOUNT_PATH/pile/my-gpt3_01_text_document]" \
training.trainer.num_nodes=8 \
training.trainer.devices=8 \
training.trainer.max_steps=300000 \
training.trainer.val_check_interval=2000 \
training.model.global_batch_size=512
# Should see the following message from stdout:
# workflow.argoproj.io/training-tmlzw created
- launcher_scripts_path=$PWD: Must be set for all launcher commands.
- data_dir=$MOUNT_PATH/pile: Container location of the previously downloaded and preprocessed data. If this path is not present in one of the cluster.volumes.*.mount_path values, an error will be raised.
- cluster=k8s_v2: Selects the Kubernetes launcher type.
- cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME: Name of the previously created PVC.
- cluster.volumes.workspace.mount_path=$MOUNT_PATH: Mounts the workspace volume into all containers at this path.
- 'stages=[training]': Selects only the training stage.
- training=gpt3/1b_improved: Specifies the model type and size you want to train. Explore other configs here.
- "training.exp_manager.explicit_log_dir=$MOUNT_PATH/\${training.run.name}/training_\${training.run.name}/results": Controls where logs and checkpoints go. If this path is not prefixed by one of the cluster.volumes.*.mount_path values, an error will be raised.
- "training.model.data.data_prefix=[0.5,$MOUNT_PATH/pile/my-gpt3_00_text_document,0.5,$MOUNT_PATH/pile/my-gpt3_01_text_document]": List of proportions and files the training job should use. Note that the file path prefixes must all be present in one of the cluster.volumes.*.mount_path values, or an error will be raised. In this example we only use two files.
- training.trainer.num_nodes=8: Controls the number of workers to use. (Note: The K8s scheduler may assign two workers to the same node if devices<=4 and you have 8 GPUs per node.)
- training.trainer.devices=8: Controls how many GPUs per worker to use.
- training.trainer.max_steps=300000: Controls the maximum number of training steps.
- training.trainer.val_check_interval=2000: Sets the validation interval. Should be less than or equal to, and a multiple of, max_steps.
- training.model.global_batch_size=512: Controls the global batch size. You may need to change it if using fewer num_nodes and devices.
Tip
You can append -c job to your python main.py invocation to view the job config before submitting it. It's also helpful to view the defaults of a training job before specifying your own overrides.
After submitting the job, you can monitor the overall progress:
argo watch @latest
# Or you can reference the workflow directly
argo watch training-tmlzw
This action opens a full-screen status that updates periodically.
Name: training-tmlzw
Namespace: frameworks
ServiceAccount: unset (will run with the default ServiceAccount)
Status: Running
Conditions:
PodRunning True
Created: Tue Mar 19 09:07:41 -0700 (1 minute ago)
Started: Tue Mar 19 09:07:41 -0700 (1 minute ago)
Duration: 1 minute 50 seconds
Progress: 0/1
STEP TEMPLATE PODNAME DURATION MESSAGE
● training-tmlzw training-steps
└───● pytorchjob training- training-tmlzw-training--1520170578 1m
The STEP named pytorchjob may take a while. You can monitor the progress of the PyTorchJob and inspect its logs by using the following command, which tracks the logs of the worker containers:
# See the logs of all the workers
argo logs @latest -f -c pytorch
# See the logs of second worker
argo logs @latest -f -c pytorch -l training.kubeflow.org/replica-index=1
# Or you can reference the workflow directly
argo logs training-tmlzw -f -c pytorch
Once the job is finished, you will see Status: Succeeded in argo watch @latest.
Finally, use the nemo-workspace-busybox pod to view the final checkpoints, logs, and results:
kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/gpt_1b_improved/training_gpt_1b_improved/results
# total 114K
# drwxr-xr-x 4 root root 9 Mar 19 16:11 .
# drwxr-xr-x 3 root root 1 Mar 19 16:08 ..
# drwxr-xr-x 4 root root 2 Mar 19 16:15 checkpoints
# -rw-r--r-- 1 root root 106 Mar 19 16:11 cmd-args.log
# -rw-r--r-- 1 root root 11.3K Mar 19 16:13 events.out.tfevents.1710864691.training-hlwd8-worker-0.94.0
# -rw-r--r-- 1 root root 54 Mar 19 16:11 git-info.log
# -rw-r--r-- 1 root root 3.1K Mar 19 16:11 hparams.yaml
# -rw-r--r-- 1 root root 1.0K Mar 19 16:15 lightning_logs.txt
# -rw-r--r-- 1 root root 17.6K Mar 19 16:12 nemo_error_log.txt
# -rw-r--r-- 1 root root 33.0K Mar 19 16:13 nemo_log_globalrank-0_localrank-0.txt
# drwxr-xr-x 2 root root 9 Mar 19 16:11 run_0
Afterwards, clean up the workflow:
argo delete training-tmlzw
# Or if you are sure this is the latest workflow, then you can run
argo delete @latest
The following example shows how to apply the LoRA PEFT method to the LLaMA2 7B model. The hyperparameters and configuration assume four NVIDIA A100 Tensor Core GPUs with 80GB of high-bandwidth memory each (4x A100-80G).
Note
To download checkpoints, use git-lfs. Go to https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing for installation instructions applicable to your environment.
Additionally, the LLaMA checkpoints require you to seek approval from Meta. After obtaining access, you can clone them using your Hugging Face username and access token as the password. Create your access token here.
First, download the Hugging Face checkpoint:
# Replace <...> with a path on your local machine
LOCAL_WORKSPACE=<...>
# https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Second, convert the checkpoint to .nemo format:
NEMO_IMAGE=nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.11
docker run --rm -it -v $LOCAL_WORKSPACE:/workspace $NEMO_IMAGE \
python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
--input_name_or_path /workspace/Llama-2-7b-chat-hf \
--output_path /workspace/llama2_7b.nemo
To see other prepared NeMo models, go to:
Third, copy it into the nemo-workspace PVC. In this step, use kubectl exec instead of kubectl cp, since .nemo files are not preserved correctly over kubectl cp.
( cd $LOCAL_WORKSPACE; tar cf - llama2_7b.nemo | kubectl exec -i nemo-workspace-busybox -- tar xf - -C /nemo-workspace )
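To confirm the copy succeeded, list the checkpoint from the helper pod (the paths match the earlier busybox example):
kubectl exec nemo-workspace-busybox -- ls -lh /nemo-workspace/llama2_7b.nemo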
In the final step, submit the training job. By default, this example also downloads the SQuAD dataset into the PVC.
PVC_NAME=nemo-workspace # Must already exist
MOUNT_PATH=/nemo-workspace # Path within the container
PYTHONPATH=$PWD python main.py \
launcher_scripts_path=$PWD \
data_dir=$MOUNT_PATH \
\
cluster=k8s_v2 \
cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
cluster.volumes.workspace.mount_path=$MOUNT_PATH \
\
stages=[peft] \
peft=llama/squad \
peft.run.name=llama-7b-peft-lora \
peft.exp_manager.explicit_log_dir="$MOUNT_PATH/\${peft.run.model_train_name}/peft_\${peft.run.name}/results" \
peft.trainer.num_nodes=1 \
peft.trainer.devices=4 \
peft.trainer.max_epochs=null \
peft.trainer.max_steps=2000 \
peft.model.global_batch_size=128 \
peft.model.micro_batch_size=1 \
peft.model.restore_from_path=$MOUNT_PATH/llama2_7b.nemo
# Should see the following message from stdout:
# workflow.argoproj.io/peft-qqkxh created
Note
If you use an image earlier than 23.11, add this additional flag: peft.model.megatron_amp_O2=false.
- launcher_scripts_path=$PWD: Must be set for all launcher commands.
- data_dir=$MOUNT_PATH: Container location where SQuAD is downloaded and referenced. If this path is not present in one of the cluster.volumes.*.mount_path values, an error will be raised.
- cluster=k8s_v2: Selects the Kubernetes launcher type.
- cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME: Name of the previously created PVC.
- cluster.volumes.workspace.mount_path=$MOUNT_PATH: Mounts the workspace volume into all containers at this path.
- 'stages=[peft]': Selects only the PEFT stage.
- peft=llama/squad: Specifies the model type and size you want to train. Explore other configs here.
- peft.run.name=llama-7b-peft-lora: Name of this run; it can be anything. It is used as part of the local result dir path and the remote explicit_log_dir path.
- peft.exp_manager.explicit_log_dir="$MOUNT_PATH/\${peft.run.model_train_name}/peft_\${peft.run.name}/results": Controls where logs and checkpoints go. If this path is not prefixed by one of the cluster.volumes.*.mount_path values, an error will be raised.
- peft.trainer.num_nodes=1: Controls the number of workers to use. (Note: The K8s scheduler may assign two workers to the same node if devices<=4 and you have 8 GPUs per node.)
- peft.trainer.devices=4: Controls how many GPUs per worker to use.
- peft.trainer.max_epochs=null: Rely on max_steps instead.
- peft.trainer.max_steps=2000: Controls the maximum number of training steps.
- peft.model.global_batch_size=128: Controls the global batch size. You may need to change it if using fewer num_nodes and devices.
- peft.model.micro_batch_size=1: Controls the micro batch size.
- peft.model.restore_from_path=$MOUNT_PATH/llama2_7b.nemo: Path to the pre-trained NeMo LLaMA2 model. If this path is not prefixed by one of the cluster.volumes.*.mount_path values, an error will be raised.
Tip
You can append -c job to your python main.py ... invocation to view the job config before submitting it. It's also helpful to view the defaults of a training job before specifying your own overrides.
After submitting the job, you can monitor the overall progress:
argo watch @latest
# Or you can reference the workflow directly
argo watch peft-qqkxh
This action opens a full-screen status that updates periodically.
Name: peft-qqkxh
Namespace: frameworks
ServiceAccount: unset (will run with the default ServiceAccount)
Status: Running
Conditions:
PodRunning False
Created: Mon Mar 18 17:21:13 -0700 (5 minutes ago)
Started: Mon Mar 18 17:21:13 -0700 (5 minutes ago)
Duration: 5 minutes 8 seconds
Progress: 1/2
ResourcesDuration: 19s*(1 cpu),19s*(100Mi memory)
STEP TEMPLATE PODNAME DURATION MESSAGE
● peft-qqkxh peft-steps
├───✔ download-squad download-squad peft-qqkxh-download-squad-1305096357 12s
└───● pytorchjob peft- peft-qqkxh-peft--105064823 4m
The STEP named pytorchjob may take a while. You can monitor the progress of the PyTorchJob and inspect its logs by using the following command, which tracks the logs of the worker containers:
# See the logs of all the workers
argo logs @latest -f -c pytorch
# See the logs of second worker
argo logs @latest -f -c pytorch -l training.kubeflow.org/replica-index=1
# Or you can reference the workflow directly
argo logs peft-qqkxh -f -c pytorch
Once the job is finished, you will see Status: Succeeded in argo watch @latest.
Finally, use the nemo-workspace-busybox pod to view the final checkpoints, logs, and results:
kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/llama2_7b/peft_llama-7b-peft-lora/results
# total 280K
# drwxr-xr-x 31 root root 36 Mar 19 01:04 .
# drwxr-xr-x 3 root root 1 Mar 19 00:22 ..
# drwxr-xr-x 2 root root 3 Mar 19 03:52 checkpoints
# -rw-r--r-- 1 root root 113 Mar 19 00:39 cmd-args.log
# -rw-r--r-- 1 root root 153.6K Mar 19 03:51 events.out.tfevents.1710809120.peft-86b72-worker-0.4232.0
# -rw-r--r-- 1 root root 54 Mar 19 00:39 git-info.log
# -rw-r--r-- 1 root root 6.0K Mar 19 00:45 hparams.yaml
# -rw-r--r-- 1 root root 3.2K Mar 19 03:52 lightning_logs.txt
# -rw-r--r-- 1 root root 20.6K Mar 19 00:45 nemo_error_log.txt
# -rw-r--r-- 1 root root 40.7K Mar 19 03:52 nemo_log_globalrank-0_localrank-0.txt
# drwxr-xr-x 2 root root 2 Mar 19 00:22 run_0
To inspect the logs from the other steps:
# To inspect logs of `download-squad`
argo logs @latest -f peft-qqkxh-download-squad-1305096357
Afterwards, clean up the workflow:
argo delete peft-qqkxh
# Or if you are sure this is the latest workflow, then you can run
argo delete @latest
The following example shows how to train a Reward Model by fine-tuning GPT-3 2B with the Anthropic-HH dataset.
Note
To learn more about Reward Model training, visit the Model Alignment by RLHF documentation.
First, download the Anthropic-HH dataset onto the PVC. In this step, you download it locally and then upload it to the PVC through the nemo-workspace-busybox pod:
# Replace <...> with a path on your local machine
LOCAL_WORKSPACE=<...>
NEMO_IMAGE=nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.11
docker run --rm -it -v $LOCAL_WORKSPACE/anthropic-rm-data:/workspace $NEMO_IMAGE \
python /opt/NeMo-Framework-Launcher/launcher_scripts/nemo_launcher/collections/dataprep_scripts/anthropichh_dataprep/download_and_process.py \
--output-dir /workspace
# Then upload to the PVC
( cd $LOCAL_WORKSPACE; tar cf - anthropic-rm-data | kubectl exec -i nemo-workspace-busybox -- tar xf - -C /nemo-workspace )
Second, check that the data was uploaded:
kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/anthropic-rm-data
# total 428M
# drwxr-xr-x 2 root root 4 Mar 19 18:11 .
# drwxr-xr-x 7 root root 6 Mar 19 18:11 ..
# -rw-r--r-- 1 root root 16.1M Mar 19 18:10 test_comparisons.jsonl
# -rw-r--r-- 1 root root 5.5M Mar 19 18:10 test_prompts.jsonl
# -rw-r--r-- 1 root root 301.2M Mar 19 18:10 train_comparisons.jsonl
# -rw-r--r-- 1 root root 101.9M Mar 19 18:10 train_prompts.jsonl
Next, download the pre-trained GPT 2B NeMo checkpoint:
cd $LOCAL_WORKSPACE
wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo
mkdir 2b_model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C 2b_model_checkpoint
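As a quick sanity check, you can list the extracted checkpoint directory before converting it (exact contents vary by checkpoint version):
ls -lh 2b_model_checkpoint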
Note
For more information about pretrained models, see Prerequisite: Obtaining a pretrained model.
After downloading the checkpoint, convert it to the Megatron Core format and then upload it to the PVC:
# Converts checkpoint to $LOCAL_WORKSPACE/2b_mcore_gpt.nemo
docker run --rm -it \
-v $LOCAL_WORKSPACE/2b_model_checkpoint:/inputs \
-v $LOCAL_WORKSPACE:/outputs \
nvcr.io/nvidia/nemo:24.07 bash -exc "
git -C /opt/NeMo checkout 438db620bffdf4e2d4cef6368d0e86be2a02b7c3
python /opt/NeMo/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py \
--in-folder /inputs --out-file /outputs/2b_mcore_gpt.nemo --cpu-only"
# Validate the local file
ls -lah $LOCAL_WORKSPACE/2b_mcore_gpt.nemo
# -rw-r--r-- 1 root root 4.3G Mar 19 14:08 /tmp/2b_mcore_gpt.nemo
# Then upload to the PVC
( cd $LOCAL_WORKSPACE; tar cf - 2b_mcore_gpt.nemo | kubectl exec -i nemo-workspace-busybox -- tar xf - -C /nemo-workspace )
After the checkpoint uploads, launch the fine-tuning job:
PVC_NAME=nemo-workspace # Must already exist
MOUNT_PATH=/nemo-workspace # Path within the container
PYTHONPATH=$PWD python main.py \
launcher_scripts_path=$PWD \
container=nvcr.io/nvidia/nemo:24.07 \
\
cluster=k8s_v2 \
cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
cluster.volumes.workspace.mount_path=$MOUNT_PATH \
\
'stages=[rlhf_rm]' \
rlhf_rm=gpt3/2b_rm \
"rlhf_rm.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_rm.run.name}/rlhf_rm_\${rlhf_rm.run.name}/results" \
"rlhf_rm.model.data.data_prefix={train: [$MOUNT_PATH/anthropic-rm-data/train_comparisons.jsonl], validation: [$MOUNT_PATH/anthropic-rm-data/test_comparisons.jsonl], test: [$MOUNT_PATH/anthropic-rm-data/test_comparisons.jsonl]}" \
rlhf_rm.model.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/2b_mcore_gpt.nemo \
rlhf_rm.trainer.num_nodes=8 \
rlhf_rm.trainer.devices=8 \
rlhf_rm.trainer.rm.max_epochs=1 \
rlhf_rm.trainer.rm.max_steps=-1 \
rlhf_rm.trainer.rm.val_check_interval=100 \
rlhf_rm.model.global_batch_size=64
# Should see the following message from stdout:
# workflow.argoproj.io/rlhf-rm-d65f4 created
- launcher_scripts_path=$PWD: Must be set for all launcher commands.
- container=nvcr.io/nvidia/nemo:24.07: The docker image used for training. Visit here to choose other framework containers.
- cluster=k8s_v2: Selects the Kubernetes launcher type.
- cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME: Name of the previously created PVC.
- cluster.volumes.workspace.mount_path=$MOUNT_PATH: Mounts the workspace volume into all containers at this path.
- 'stages=[rlhf_rm]': Selects only the RLHF Reward Model stage.
- rlhf_rm=gpt3/2b_rm: Specifies the model type and size you want to train. Explore other configs here.
- "rlhf_rm.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_rm.run.name}/rlhf_rm_\${rlhf_rm.run.name}/results": Controls where logs and checkpoints go. If this path is not prefixed by one of the cluster.volumes.*.mount_path values, an error will be raised.
- "rlhf_rm.model.data.data_prefix={train: [$MOUNT_PATH/anthropic-rm-data/train_comparisons.jsonl], validation: [$MOUNT_PATH/anthropic-rm-data/test_comparisons.jsonl], test: [$MOUNT_PATH/anthropic-rm-data/test_comparisons.jsonl]}": Train/validation/test splits.
- rlhf_rm.model.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/2b_mcore_gpt.nemo: Path to the pre-trained 2B model. If this path is not prefixed by one of the cluster.volumes.*.mount_path values, an error will be raised.
- rlhf_rm.trainer.num_nodes=8: Controls the number of workers to use. (Note: The K8s scheduler may assign two workers to the same node if devices<=4 and you have 8 GPUs per node.)
- rlhf_rm.trainer.devices=8: Controls how many GPUs per worker to use.
- rlhf_rm.trainer.rm.max_epochs=1: Sets the maximum number of epochs to train.
- rlhf_rm.trainer.rm.max_steps=-1: Set to -1 to go through the entire dataset.
- rlhf_rm.trainer.rm.val_check_interval=100: Sets the validation interval.
- rlhf_rm.model.global_batch_size=64: Controls the global batch size. You may need to change it if using fewer num_nodes and devices.
Tip
You can append -c job to your python main.py ... invocation to view the job config before submitting it. It's also helpful to view the defaults of a training job before specifying your own overrides.
After submitting the job, you can monitor the overall progress:
argo watch @latest
# Or you can reference the workflow directly
argo watch rlhf-rm-d65f4
This action opens a full-screen status that updates periodically.
Name: rlhf-rm-d65f4
Namespace: frameworks
ServiceAccount: unset (will run with the default ServiceAccount)
Status: Running
Conditions:
PodRunning True
Created: Tue Mar 19 14:35:22 -0700 (53 seconds ago)
Started: Tue Mar 19 14:35:22 -0700 (53 seconds ago)
Duration: 53 seconds
Progress: 0/1
STEP TEMPLATE PODNAME DURATION MESSAGE
● rlhf-rm-d65f4 rlhf-rm-steps
└───● pytorchjob rlhf-rm- rlhf-rm-d65f4-rlhf-rm--853469757 53s
The STEP named pytorchjob may take a while. You can monitor the progress of the PyTorchJob and inspect its logs by using the following command, which tracks the logs of the worker containers:
# See the logs of all the workers
argo logs @latest -f -c pytorch
# See the logs of second worker
argo logs @latest -f -c pytorch -l training.kubeflow.org/replica-index=1
# Or you can reference the workflow directly
argo logs rlhf-rm-d65f4 -f -c pytorch
Once the job is finished, you will see Status: Succeeded in argo watch @latest.
Finally, use the nemo-workspace-busybox pod to view the final checkpoints, logs, and results:
kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/rlhf_rm_2b/rlhf_rm_rlhf_rm_2b/results
# total 109K
# drwxr-xr-x 10 root root 14 Mar 19 21:50 .
# drwxr-xr-x 3 root root 1 Mar 19 21:35 ..
# drwxr-xr-x 4 root root 3 Mar 19 21:50 checkpoints
# -rw-r--r-- 1 root root 104 Mar 19 21:45 cmd-args.log
# -rw-r--r-- 1 root root 18.9K Mar 19 21:47 events.out.tfevents.1710884746.rlhf-rm-bc6sw-worker-0.94.0
# -rw-r--r-- 1 root root 4.9K Mar 19 21:50 hparams.yaml
# -rw-r--r-- 1 root root 987 Mar 19 21:47 lightning_logs.txt
# -rw-r--r-- 1 root root 18.9K Mar 19 21:45 nemo_error_log.txt
# -rw-r--r-- 1 root root 29.0K Mar 19 21:45 nemo_log_globalrank-0_localrank-0.txt
# drwxr-xr-x 2 root root 5 Mar 19 21:36 run_0
Afterwards, clean up the workflow:
argo delete rlhf-rm-d65f4
# Or if you are sure this is the latest workflow, then you can run
argo delete @latest
The following example shows PPO training with GPT 2B and the Anthropic-HH dataset. The RLHF Reward Model stage is a prerequisite, since a Reward Model checkpoint is required to initialize both the actor and the critic. It is also assumed that the Anthropic-HH dataset has been prepared and uploaded to the PVC.
Note
To learn more about PPO training, see Model Alignment by RLHF documentation.
PVC_NAME=nemo-workspace # Must already exist
MOUNT_PATH=/nemo-workspace # Path within the container
PYTHONPATH=$PWD python main.py \
launcher_scripts_path=$PWD \
container=nvcr.io/nvidia/nemo:24.07 \
\
cluster=k8s_v2 \
cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
cluster.volumes.workspace.mount_path=$MOUNT_PATH \
\
stages=[rlhf_ppo] \
rlhf_ppo=gpt3/2b_ppo \
rlhf_ppo.critic.trainer.num_nodes=1 \
rlhf_ppo.critic.trainer.devices=8 \
rlhf_ppo.critic.model.global_batch_size=64 \
rlhf_ppo.critic.model.tensor_model_parallel_size=1 \
rlhf_ppo.critic.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/rlhf_rm_2b/rlhf_rm_rlhf_rm_2b/results/checkpoints/megatron_gpt.nemo \
rlhf_ppo.critic.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_ppo.run.name}/critic_results \
rlhf_ppo.actor.trainer.num_nodes=1 \
rlhf_ppo.actor.trainer.devices=8 \
rlhf_ppo.actor.trainer.ppo.max_steps=-1 \
rlhf_ppo.actor.model.global_batch_size=64 \
rlhf_ppo.actor.model.tensor_model_parallel_size=1 \
rlhf_ppo.actor.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/rlhf_rm_2b/rlhf_rm_rlhf_rm_2b/results/checkpoints/megatron_gpt.nemo \
rlhf_ppo.actor.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_ppo.run.name}/actor_results \
rlhf_ppo.actor.model.data.data_prefix="{train: [$MOUNT_PATH/anthropic-rm-data/train_prompts.jsonl], validation: [$MOUNT_PATH/anthropic-rm-data/test_prompts.jsonl], test: [$MOUNT_PATH/anthropic-rm-data/test_prompts.jsonl]}"
# Should see the following message from stdout:
# workflow.argoproj.io/rlhf-ppo-np8vr created
- launcher_scripts_path=$PWD: Must be set for all launcher commands.
- container=nvcr.io/nvidia/nemo:24.07: The docker image used for training. Go here to choose other framework containers.
- cluster=k8s_v2: Selects the Kubernetes launcher type.
- cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME: Name of the previously created PVC.
- cluster.volumes.workspace.mount_path=$MOUNT_PATH: Mounts the workspace volume into all containers at this path.
- 'stages=[rlhf_ppo]': Selects only the RLHF PPO stage.
- rlhf_ppo=gpt3/2b_ppo: Specifies the model type and size you want to train. Explore other configs here.
- rlhf_ppo.critic.trainer.num_nodes=1: Controls the number of critic workers to use. (Note: The K8s scheduler may assign two workers to the same node if devices<=4 and you have 8 GPUs per node.)
- rlhf_ppo.critic.trainer.devices=8: Controls how many GPUs per critic worker to use.
- rlhf_ppo.critic.model.global_batch_size=64: Controls the critic global batch size. You may need to change this value if using fewer num_nodes and devices.
- rlhf_ppo.critic.model.tensor_model_parallel_size=1: Sets the tensor model parallel size for the critic.
- rlhf_ppo.critic.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/rlhf_rm_2b/rlhf_rm_rlhf_rm_2b/results/checkpoints/megatron_gpt.nemo: Path to the pre-trained 2B Reward Model. If this path is not prefixed by one of the cluster.volumes.*.mount_path values, an error will be raised.
- rlhf_ppo.critic.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_ppo.run.name}/critic_results: Controls where critic logs and checkpoints go. If this path is not prefixed by one of the cluster.volumes.*.mount_path values, an error will be raised.
- rlhf_ppo.actor.trainer.num_nodes=1: Controls the number of actor workers to use. (Note: The K8s scheduler may assign two workers to the same node if devices<=4 and you have 8 GPUs per node.)
- rlhf_ppo.actor.trainer.devices=8: Controls how many GPUs per actor worker to use.
- rlhf_ppo.actor.trainer.ppo.max_steps=-1: Maximum number of PPO steps (-1 to go through the whole train set).
- rlhf_ppo.actor.model.global_batch_size=64: Controls the actor global batch size. You may need to change it if using fewer num_nodes and devices.
- rlhf_ppo.actor.model.tensor_model_parallel_size=1: Sets the tensor model parallel size for the actor.
- rlhf_ppo.actor.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/rlhf_rm_2b/rlhf_rm_rlhf_rm_2b/results/checkpoints/megatron_gpt.nemo: Path to the pre-trained 2B Reward Model. If this path is not prefixed by one of the cluster.volumes.*.mount_path values, an error will be raised.
- rlhf_ppo.actor.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_ppo.run.name}/actor_results: Controls where actor logs and checkpoints go. If this path is not prefixed by one of the cluster.volumes.*.mount_path values, an error will be raised.
- rlhf_ppo.actor.model.data.data_prefix="{train: [$MOUNT_PATH/anthropic-rm-data/train_prompts.jsonl], validation: [$MOUNT_PATH/anthropic-rm-data/test_prompts.jsonl], test: [$MOUNT_PATH/anthropic-rm-data/test_prompts.jsonl]}": Train/validation/test splits.
Tip
You can append -c job to your python main.py ... invocation to view the job config before submitting it. It's also helpful to view the defaults of a training job before specifying your own overrides.
After submitting the job, you can monitor the overall progress:
argo watch @latest
# Or you can reference the workflow directly
argo watch rlhf-ppo-np8vr
This action opens a full-screen status that updates periodically.
Name: rlhf-ppo-np8vr
Namespace: frameworks
ServiceAccount: unset (will run with the default ServiceAccount)
Status: Running
Conditions:
PodRunning False
Created: Tue Mar 19 15:34:48 -0700 (2 minutes ago)
Started: Tue Mar 19 15:34:48 -0700 (2 minutes ago)
Duration: 2 minutes 57 seconds
Progress: 1/2
ResourcesDuration: 1s*(100Mi memory),1s*(1 cpu)
STEP TEMPLATE PODNAME DURATION MESSAGE
● rlhf-ppo-np8vr rlhf-ppo-steps
├─✔ critic- critic- rlhf-ppo-np8vr-critic--2352952381 3s
├─✔ actor- actor- rlhf-ppo-np8vr-actor--892161450 2m
└─✔ delete-pytorchjob delete-pytorchjob rlhf-ppo-np8vr-delete-pytorchjob-1202150566 2s
The STEP named actor- may take a while. You can monitor the progress of the PyTorchJob and inspect its logs by using the following command, which tracks the logs of the worker containers.
# See the logs of all the workers (actor & critic)
argo logs @latest -f -c pytorch
# See the logs of second worker (actor & critic)
argo logs @latest -f -c pytorch -l training.kubeflow.org/replica-index=1
# Or you can reference the workflow directly (actor & critic)
argo logs rlhf-ppo-np8vr -f -c pytorch
# See the logs of the second worker of the actor job (pod name from `argo watch ...`)
argo logs @latest rlhf-ppo-np8vr-actor--3744553305 -f -c pytorch -l training.kubeflow.org/replica-index=1
Once the actor is finished, the critic PyTorchJob will be deleted, and you will see Status: Succeeded in argo watch @latest.
Note
Once the actor finishes, the critic job is deleted. This means the critic logs can no longer be queried with kubectl logs or argo logs, but you can still find them on the PVC under the following path:
kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/rlhf_2b_actor_2b_critic/critic_results
Finally, use the nemo-workspace-busybox pod to view the final checkpoints, logs, and results:
kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/rlhf_2b_actor_2b_critic/{actor,critic}_results
# /nemo-workspace/rlhf_2b_actor_2b_critic/actor_results:
# total 257K
# drwxr-xr-x 18 root root 22 Mar 20 01:56 .
# drwxr-xr-x 4 root root 2 Mar 19 23:19 ..
# drwxr-xr-x 4 root root 3 Mar 20 01:56 checkpoints
# -rw-r--r-- 1 root root 217 Mar 19 23:30 cmd-args.log
# -rw-r--r-- 1 root root 132.3K Mar 20 01:53 events.out.tfevents.1710891065.actor-7jxt7-worker-0.94.0
# -rw-r--r-- 1 root root 5.7K Mar 20 01:56 hparams.yaml
# -rw-r--r-- 1 root root 1.0K Mar 20 01:53 lightning_logs.txt
# -rw-r--r-- 1 root root 20.2K Mar 19 23:31 nemo_error_log.txt
# -rw-r--r-- 1 root root 33.8K Mar 19 23:31 nemo_log_globalrank-0_localrank-0.txt
# drwxr-xr-x 2 root root 5 Mar 19 23:20 run_0
#
# /nemo-workspace/rlhf_2b_actor_2b_critic/critic_results:
# total 163K
# drwxr-xr-x 54 root root 57 Mar 20 01:53 .
# drwxr-xr-x 4 root root 2 Mar 19 23:19 ..
# drwxr-xr-x 3 root root 1 Mar 20 01:53 checkpoints
# -rw-r--r-- 1 root root 124 Mar 19 23:30 cmd-args.log
# -rw-r--r-- 1 root root 62.6K Mar 20 00:39 events.out.tfevents.1710891057.critic-j2tdp-worker-0.94.0
# -rw-r--r-- 1 root root 771 Mar 19 23:30 lightning_logs.txt
# -rw-r--r-- 1 root root 19.1K Mar 19 23:33 nemo_error_log.txt
# -rw-r--r-- 1 root root 28.5K Mar 20 01:53 nemo_log_globalrank-0_localrank-0.txt
# drwxr-xr-x 2 root root 5 Mar 19 22:36 run_0
Afterwards, clean up the workflow:
argo delete rlhf-ppo-np8vr
# Or if you are sure this is the latest workflow, then you can run
argo delete @latest
Warning
- There is a known issue where the critic may not terminate and release resources because the actor has reached its backoff limit. To work around this issue, delete the workflow with argo delete rlhf-ppo-np8vr or argo delete @latest.
- There is a known issue where the actor job may hang after the training loop has finished. Before deleting the PPO workflow, check that the final checkpoint and results are as expected with kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/rlhf_2b_actor_2b_critic/actor_results, and then delete the workflow with argo delete rlhf-ppo-np8vr or argo delete @latest.
Advanced Use Cases
Download Data from PVC
To download data from your PVC, use the nemo-workspace-busybox pod created earlier:
# Replace <...> with a path on your local machine
LOCAL_WORKSPACE=<...>
# Tar will fail if LOCAL_WORKSPACE doesn't exist
mkdir -p $LOCAL_WORKSPACE
# Copy file in PVC at /nemo-workspace/foobar.txt to local file-system at $LOCAL_WORKSPACE/nemo-workspace/foobar.txt
kubectl exec nemo-workspace-busybox -- tar cf - /nemo-workspace/foobar.txt | tar xf - -C $LOCAL_WORKSPACE
# Copy directory in PVC /nemo-workspace/fizzbuzz to local file-system at $LOCAL_WORKSPACE/fizzbuzz
kubectl exec nemo-workspace-busybox -- tar cf - /nemo-workspace/fizzbuzz | tar xf - -C $LOCAL_WORKSPACE
Multiple Storage Volumes
The examples used in this playbook assume that one PVC holds all of the data. If your cluster setup has data distributed over different volumes, you can add them to the configuration. The launcher will then mount them into all the child pods.
# Here is an example that attaches an additional PVC with name "nemo-data" to the stage's pods.
# The choice of `my-data-pvc` is arbitrary and is used to identify this volume when constructing
# the workflow yaml manifest.
PYTHONPATH=$PWD python3 main.py \
... \
+cluster.volumes.my-data-pvc.mount_path=/mnt/data \
+cluster.volumes.my-data-pvc.persistent_volume_claim=nemo-data
Configure the Container
Go here for the full list of NeMo Framework containers.
To configure the launcher with a different container, add the following:
PYTHONPATH=$PWD python3 main.py \
... \
container=<other-container>
Use IB Interfaces
If you deployed the NVIDIA Network Operator with IB devices, you can configure your workloads to use them. The following example shows how to request one nvidia.com/resibp12s0 resource and two nvidia.com/resibp186s0 resources.
PYTHONPATH=$PWD python3 main.py \
... \
cluster=k8s_v2 \
cluster.ib_interfaces.annotation=\'ibp12s0,ibp186s0\' \
+cluster.ib_interfaces.resources="{nvidia.com/resibp12s0: 1, nvidia.com/resibp186s0: 2}"
Debugging Tips
Add Linux Capabilities
In certain scenarios, you may want to add Linux capabilities to the pods for debugging purposes.
To add Linux capabilities to the launched MPIJob and PyTorchJob pods:
PYTHONPATH=$PWD python3 main.py \
... \
cluster.capabilities=[IPC_LOCK,SYS_PTRACE]
Perform a Dry Run of the Submission
To dry run the submission, set the environment variable NEMO_LAUNCHER_DEBUG=1, which effectively replaces kubectl create with kubectl create --dry-run=client:
# Dry-run
NEMO_LAUNCHER_DEBUG=1 python3 main.py \
...
# Remove NEMO_LAUNCHER_DEBUG to actually submit
python3 main.py \
...
Perform a Dry Run of the Launcher Configuration
To dry run the hydra configuration of the launcher:
# Original Command that submits to cluster
PYTHONPATH=$PWD python3 main.py \
training.model.global_batch_size=3
# (-c job): Prints yaml configuration without submitting
PYTHONPATH=$PWD python3 main.py \
training.model.global_batch_size=3 \
-c job
# (-c job -p <pkg>): Prints yaml configuration of just the cluster hydra package
PYTHONPATH=$PWD python3 main.py \
training.model.global_batch_size=3 \
-c job -p cluster
Note
Perform a Dry Run of the Launcher Configuration is helpful if you want to double check that your overrides provided on the command line are set correctly. Perform a Dry Run of the Submission is helpful if you want to inspect that the manifest to be submitted to the Kubernetes cluster is as expected.
Inspect the Argo Workflow Yaml
After running python main.py, you can inspect the Argo Workflow yaml manifest that was submitted.
# Assuming current working directory is NeMo-Framework-Launcher/launcher_scripts
ls results/<run-name>/<experiment-name>/<stage>.yaml
# The exact subdirectory structure may differ; however, the yaml manifest will always contain
# `kind: Workflow`. Use the following command to find them:
fgrep -r 'kind: Workflow' results/
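If you want to validate a manifest before resubmitting it, the argo CLI installed in the Local Setup section includes a lint subcommand. This is a sketch; substitute the manifest path found above:
# Validate the generated Workflow manifest against the Argo schema
argo lint results/<run-name>/<experiment-name>/<stage>.yaml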