Run NeMo Framework on Kubernetes

NeMo Framework supports DGX A100 and H100-based Kubernetes (K8s) clusters with compute networking. Currently, we support NeMo stages such as data preparation, base model pre-training, PEFT, and NeMo Aligner for GPT-based models.

This document explains how to set up your K8s cluster and your local environment. It also provides examples of different use cases.

Prerequisites

This playbook downloads and installs additional third-party, open-source software projects. Before using them, review the license terms associated with these open-source projects.

Software Component Versions

This section identifies the software component versions and services that were validated in this project. While it is possible that other versions may function properly, we have not tested them and cannot guarantee complete compatibility or optimal performance.

We recommend that you verify the alternative versions for your specific use cases. To ensure broader compatibility, we continue to expand our testing to include new versions as they are released.

Software Component            Version
Kubernetes/kubectl CLI        v1.26.7
Helm                          v3.13.2
Kubeflow/Training-Operator    v1.7.0
GPU Operator                  v22.9.2
Argo Operator/argo CLI        v3.5.4
Network Operator              v23.5.0

Local Setup

Use the Kubernetes client CLI (kubectl) to access the cluster.

# Linux installation (stable)
mkdir -p ~/bin

KUBECTL_VERSION=stable  # change if needed
if [[ $KUBECTL_VERSION == stable ]]; then
  curl -L "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" -o ~/bin/kubectl
else
  curl -L https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl -o ~/bin/kubectl
fi
chmod +x ~/bin/kubectl

# Add this to your shell rc if not there already, e.g., .bashrc
export PATH="${HOME}/bin:${PATH}"

For more information about installation, see https://kubernetes.io/docs/tasks/tools.

After downloading the client, you need to set up and configure your kubeconfig to access your cluster. Work with your cluster admin to set up your kubeconfig.
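
Once your kubeconfig is configured, a quick sanity check confirms that the client can reach the cluster (the context and node names depend on your environment):

# Show the context kubectl is currently using
kubectl config current-context

# List the cluster nodes to confirm connectivity
kubectl get nodes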

Note

If you use a Bright Cluster Manager (BCM) based K8s cluster, you do not need to install kubectl since it is already installed.

Use the Argo client CLI (argo) to monitor and access logs of the submitted workflows.

# Linux installation
mkdir -p ~/bin

ARGO_VERSION=v3.5.4  # change if needed
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/${ARGO_VERSION}/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
mv ./argo-linux-amd64 ~/bin/argo

# Add this to your shell rc if not there already, e.g., .bashrc
export PATH="${HOME}/bin:${PATH}"

For more information about installation, see https://github.com/argoproj/argo-workflows/releases.
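
To confirm that the argo client is installed and on your PATH:

argo version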

Cluster Setup

Important

This document assumes that a Kubernetes cluster has been provisioned with DGX A100 or DGX H100 worker nodes and, optionally, InfiniBand (IB) network interfaces. Some situations may require you to add or modify cluster-scoped resources. Work with your organization and cluster admin if you need to set them up.

Cluster Setup Tools

The following CLI tools may be needed when setting up the cluster. They are not required for submitting NeMo workloads once the cluster has been configured.

helm

helm is required to install the Kubernetes Operators that enable the NeMo Framework to run on Kubernetes.

Note

If your cluster is already configured with the necessary operators, skip the helm installation and the remaining Cluster Setup steps, and proceed to Storage Options.

Note

Adding cluster-level resources is usually done by your cluster admin.

# Linux installation
mkdir -p ~/bin

HELM_VERSION=v3.13.2
tmp_helm_dir=$(mktemp -d 2>/dev/null || mktemp -d -t 'helm')
mkdir -p $tmp_helm_dir
wget -qO- https://get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz | tar -xz -C $tmp_helm_dir --strip-components=1
cp $tmp_helm_dir/helm ~/bin/helm
rm -rf $tmp_helm_dir

# Add this to your shell rc if not there already, e.g., .bashrc
export PATH="${HOME}/bin:${PATH}"
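
A quick check that the expected client version is now on your PATH:

helm version --short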

Note

Helm has a version skew with Kubernetes. Ensure that you install the correct version that matches your K8s version.

For more information about installation, see https://helm.sh/docs/intro/install.

Cluster Operators

This section lists the required Kubernetes Operators and provides example installation instructions.

The GPU Operator manages the NVIDIA software components needed to provision GPUs to pods.

VERSION=v22.9.2
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  --version $VERSION \
  nvidia/gpu-operator --set driver.enabled=false

Note

The version of the GPU Operator you install may depend on your Kubernetes version. Verify that your Kubernetes version and operating system are supported in the GPU operator support matrix.

For more information about installation, see https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html.
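
Before moving on, you may want to confirm that the GPU Operator components are healthy and that GPUs are advertised to the scheduler (pod names vary by version):

kubectl get pods -n gpu-operator

# Worker nodes should report nvidia.com/gpu under Capacity/Allocatable
kubectl describe nodes | grep -i 'nvidia.com/gpu'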

The Argo Operator provides the Workflow Custom Resource to enable control flow of steps on a K8s cluster.

VERSION=v3.5.4
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/${VERSION}/quick-start-minimal.yaml
kubectl apply -f argo-rbac.yaml

For more information about installation, see https://argo-workflows.readthedocs.io/en/latest/quick-start/#install-argo-workflows.
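
You can verify that the Argo workflow controller pods are up before granting any RBAC permissions:

kubectl get pods -n argo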

Warning

This playbook assumes that the Argo Operator can create resources in your namespace under the default ServiceAccount. To grant the default ServiceAccount the necessary permissions for Argo workflows in a specific namespace, create the following Role and RoleBinding. Perform these steps in each namespace where you intend to run your applications.

MY_NAMESPACE=<...>
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: $MY_NAMESPACE
  name: all-resources-manager
rules:
- apiGroups: ["*"] # This specifies all API groups
  resources: ["*"] # This grants access to all resources
  verbs: ["*"] # This specifies the actions the role can perform

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: all-resources-manager-binding
  namespace: $MY_NAMESPACE
subjects:
- kind: ServiceAccount
  name: default # Name of the Service Account
  namespace: $MY_NAMESPACE
roleRef:
  kind: Role
  name: all-resources-manager
  apiGroup: rbac.authorization.k8s.io
EOF

NOTE: This action gives the default ServiceAccount elevated privileges in that namespace and is not recommended for production clusters. When setting up Role-Based Access Control (RBAC) in your organization’s Kubernetes cluster, the cluster admin decides how to configure permissions. Work with your cluster admin to find the RBAC setup that works for your use case. For more information, refer to the Argo docs for their recommendations.

The Kubeflow Training Operator provides Kubernetes Custom Resources, such as PyTorchJob and MPIJob, for multi-node training.

VERSION=v1.7.0
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=$VERSION"

For more information about installation, see https://github.com/kubeflow/training-operator.
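
To verify that the Training Operator is running and that its Custom Resource Definitions are registered:

kubectl get pods -n kubeflow
kubectl get crd pytorchjobs.kubeflow.org mpijobs.kubeflow.org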

The Network Operator provides an easy way to install, configure, and manage the lifecycle of the NVIDIA networking software components, such as the Mellanox drivers and device plugins. It is only required if you have IB network interfaces.

VERSION=v23.5.0

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install network-operator nvidia/network-operator \
  -n nvidia-network-operator \
  --create-namespace \
  --version ${VERSION} \
  --wait

For more information about installation, see Vanilla Kubernetes Cluster.
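
As with the other operators, you can check that the Network Operator pods come up cleanly:

kubectl get pods -n nvidia-network-operator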

Storage Options

Note

Setting up storage in a Kubernetes cluster is usually the responsibility of the cluster admin. Work with your cluster admin to understand the storage options that are available to you.

Currently, the following volume types are supported:

  • PersistentVolumeClaims (PVC)
    • Typically, in production or development clusters, StorageClasses allow users to allocate storage for their workloads without needing to configure the specific storage type. For example, an admin can set up NFS storage through a StorageClass.

    • The examples in this playbook use PersistentVolumeClaims, but you can use NFS or HostPath with any of these examples.

  • NFS
    • Easy to set up on Cloud Service Providers (CSPs).

    • To use NFS storage directly in these examples, you need the IP address of the server (an example PersistentVolume manifest is shown below).

  • HostPath
    • Suitable for single-node clusters or clusters where the mounted path is available on all worker nodes.

If there is a different storage option you’d like to see supported, open an issue on NeMo-Framework-Launcher.
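
For reference, the following sketch shows how a cluster admin might expose an NFS export as a statically provisioned PersistentVolume. The server address, export path, and capacity are placeholders for your environment:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nemo-workspace-nfs
spec:
  capacity:
    storage: 150Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: <NFS SERVER IP>
    path: /exports/nemo-workspace
EOF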

Set up PVC Storage

PersistentVolumeClaims (PVCs) store your data, checkpoints, and results. The rest of this playbook assumes that such a PVC is already in place.

The following example shows how to create a dynamic PVC from a StorageClass that was set up by your cluster admin. Replace STORAGE_CLASS=<...> with the name of your StorageClass.

This example requests 150Gi of space. Adjust this number for your workloads, but keep in mind that not all storage provisioners support volume resizing.

STORAGE_CLASS=<...>
PVC_NAME=nemo-workspace

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${PVC_NAME}
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ${STORAGE_CLASS}
  resources:
    requests:
      # Requesting enough storage for a few experiments
      storage: 150Gi
EOF

Note

The storage class must support ReadWriteMany because multiple pods may need to access the PVC to perform concurrent read and write operations.
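
After applying the manifest, confirm that the claim binds successfully; with dynamic provisioning this can take a few moments:

kubectl get pvc nemo-workspace
# The STATUS column should eventually show Bound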

Set up PVC Busybox Helper Pod

A busybox helper pod makes it easier to inspect the PVC and copy data to and from it. The examples that follow assume this pod has been created.

PVC_NAME=nemo-workspace
MOUNT_PATH=/nemo-workspace

kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nemo-workspace-busybox
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: workspace
      mountPath: ${MOUNT_PATH}
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: ${PVC_NAME}
EOF
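
Once the pod reports Running, a quick check confirms that the PVC is mounted at the expected path (the output depends on your storage backend):

kubectl wait --for=condition=Ready pod/nemo-workspace-busybox --timeout=120s
kubectl exec nemo-workspace-busybox -- df -h /nemo-workspace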

The pod uses minimal resources when idle, but if it is no longer required, you can remove it:

kubectl delete pod nemo-workspace-busybox

Set up Docker Secrets

To set up Docker secrets, create a secret on the K8s cluster to authenticate with the NGC private registry. If you have not done so already, get an NGC key from ngc.nvidia.com.

Create the secret on the K8s cluster, replacing <NGC KEY HERE> with your NGC key. If your key contains any special characters, wrap it in single quotes (') so it is parsed correctly.

kubectl create secret docker-registry ngc-registry --docker-server=nvcr.io --docker-username=\$oauthtoken --docker-password=<NGC KEY HERE>
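
You can confirm that the secret was created with the expected type:

kubectl get secret ngc-registry
# The TYPE column should show kubernetes.io/dockerconfigjson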

The name of this secret (ngc-registry) is the default used by the launcher. To configure a different pull secret, add the following commands to your launcher invocation:

python3 main.py \
   ...
   cluster=k8s_v2 \
   cluster.pull_secret=<my-other-pull-secret> \
   ...

NeMo Stages

To get started, clone the NeMo-Framework-Launcher repository onto the same node where the kubectl and argo binaries are installed. You will use it to access your K8s cluster.

git clone https://github.com/NVIDIA/NeMo-Framework-Launcher

Set up a virtual environment:

cd NeMo-Framework-Launcher
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Navigate to the launcher_scripts directory:

cd launcher_scripts

The following example downloads and preprocesses The Pile dataset onto the nemo-workspace PVC using CPU resources:

PVC_NAME=nemo-workspace     # Must already exist
MOUNT_PATH=/nemo-workspace  # Path within the container

PYTHONPATH=$PWD python main.py \
    launcher_scripts_path=$PWD \
    data_dir=$MOUNT_PATH/pile \
    \
    cluster=k8s_v2 \
    cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
    cluster.volumes.workspace.mount_path=$MOUNT_PATH \
    \
    'stages=[data_preparation]' \
    data_preparation=gpt3/download_gpt3_pile \
    data_preparation.file_numbers="0-29" \
    data_preparation.run.node_array_size=30 \
    data_preparation.run.bcp_preproc_npernode=2 \
    \
    env_vars.TRANSFORMERS_OFFLINE=0

# Should see the following message from stdout:
# workflow.argoproj.io/pile-prep-ldj45 created
  • launcher_scripts_path=$PWD: Must be set for all launcher commands.

  • data_dir=$MOUNT_PATH/pile: Container location of downloaded and preprocessed data. Ensure the volume backing has enough storage. If this path is not present in one of the cluster.volumes.*.mount_path, an error will be raised.

  • cluster=k8s_v2: Selects Kubernetes launcher type.

  • cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME: Name of previously created PVC.

  • cluster.volumes.workspace.mount_path=$MOUNT_PATH: Mounts the workspace volume into all containers at this path.

  • 'stages=[data_preparation]': Selects only the data preparation stage.

  • data_preparation=gpt3/download_gpt3_pile: Selects the scripts and procedure for data preparation.

  • data_preparation.file_numbers="0-29": Set 0-29 to something lower, such as 0-1 if you don’t need the full dataset (or would like to iterate quickly).

  • data_preparation.run.node_array_size=30: Controls the number of workers. Set it to match the number of files specified by file_numbers.

  • data_preparation.run.bcp_preproc_npernode=2: Controls how many processes run per worker. (Note: although the name suggests BCP, this is a platform-agnostic config option.)

To explore other data_preparation.* options, refer to the config source.

Depending on the number of nodes requested, this operation can take a few hours to download all 30 shards of The Pile dataset, extract each shard, and then preprocess the extracted files.

After submitting the job, you can monitor the overall progress:

argo watch @latest
# Or you can reference the workflow directly
argo watch pile-prep-ldj45

This action opens a full-screen status that updates periodically.

Name:                pile-prep-ldj45
Namespace:           frameworks
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:
 PodRunning          False
Created:             Mon Mar 18 11:53:29 -0700 (5 minutes ago)
Started:             Mon Mar 18 11:53:29 -0700 (5 minutes ago)
Duration:            5 minutes 57 seconds
Progress:            2/3
ResourcesDuration:   9s*(1 cpu),9s*(100Mi memory)

STEP                    TEMPLATE            PODNAME                                        DURATION  MESSAGE
 ● pile-prep-ldj45      data-steps
 ├───✔ download-merges  download-tokenizer  pile-prep-ldj45-download-tokenizer-1159852991  5s
 ├───✔ download-vocab   download-tokenizer  pile-prep-ldj45-download-tokenizer-3211248646  6s
 └───● mpijob           pile-prep-          pile-prep-ldj45-pile-prep--227527442           5m

The STEP named mpijob may take a while because it is downloading, extracting, and preprocessing the data. You can monitor the progress of the MPIJob and inspect its logs by using the following command, which tracks the logs of the launcher containers:

argo logs @latest -f -c mpi-launcher
# Or you can reference the workflow directly
argo logs pile-prep-ldj45 -f -c mpi-launcher

Once the job is finished, you will see Status: Succeeded in argo watch @latest.

Finally, use the nemo-workspace-busybox pod to view the final processed files. Here is an example output assuming file_numbers="0-1".

kubectl exec nemo-workspace-busybox -- ls -lh /nemo-workspace/pile
# total 37G
# drwxrwxrwx    2 root     root           2 Mar 18 18:53 bpe
# -rw-r--r--    1 root     root       18.3G Mar 18 19:25 my-gpt3_00_text_document.bin
# -rw-r--r--    1 root     root      112.5M Mar 18 19:25 my-gpt3_00_text_document.idx
# -rw-r--r--    1 root     root       18.3G Mar 18 19:24 my-gpt3_01_text_document.bin
# -rw-r--r--    1 root     root      112.6M Mar 18 19:24 my-gpt3_01_text_document.idx

To inspect the logs of the other steps:

# To inspect logs of `download-merges`
argo logs @latest -f pile-prep-ldj45-download-tokenizer-1159852991
# To inspect logs of `download-vocab`
argo logs @latest -f pile-prep-ldj45-download-tokenizer-3211248646

Afterwards, clean up the workflow:

argo delete pile-prep-ldj45
# Or if you are sure this is the latest workflow, then you can run
argo delete @latest

The following example shows how to pre-train GPT-3 1B with The Pile dataset. Refer to the data preparation example above for information on how to generate The Pile dataset used in this training.

PVC_NAME=nemo-workspace     # Must already exist
MOUNT_PATH=/nemo-workspace  # Path within the container

PYTHONPATH=$PWD python main.py \
    launcher_scripts_path=$PWD \
    data_dir=$MOUNT_PATH/pile \
    \
    cluster=k8s_v2 \
    cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
    cluster.volumes.workspace.mount_path=$MOUNT_PATH \
    \
    'stages=[training]' \
    training=gpt3/1b_improved \
    "training.exp_manager.explicit_log_dir=$MOUNT_PATH/\${training.run.name}/training_\${training.run.name}/results" \
    \
    "training.model.data.data_prefix=[0.5,$MOUNT_PATH/pile/my-gpt3_00_text_document,0.5,$MOUNT_PATH/pile/my-gpt3_01_text_document]" \
    training.trainer.num_nodes=8 \
    training.trainer.devices=8 \
    training.trainer.max_steps=300000 \
    training.trainer.val_check_interval=2000 \
    training.model.global_batch_size=512

# Should see the following message from stdout:
# workflow.argoproj.io/training-tmlzw created
  • launcher_scripts_path=$PWD: Must be set for all launcher commands.

  • data_dir=$MOUNT_PATH/pile: Container location of previously downloaded and preprocessed data. If this path is not present in one of the cluster.volumes.*.mount_path, an error will be raised.

  • cluster=k8s_v2: Selects Kubernetes launcher type.

  • cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME: Name of previously created PVC.

  • cluster.volumes.workspace.mount_path=$MOUNT_PATH: Mounts the workspace volume into all containers at this path.

  • 'stages=[training]': Selects only the training stage.

  • training=gpt3/1b_improved: Specifies the model type and size you want to train. Explore other configs here.

  • "training.exp_manager.explicit_log_dir=/$MOUNT_PATH/\${training.run.name}/training_\${training.run.name}/results": Controls where logs and checkpoints go. If this path is not prefixed by one of cluster.volumes.*.mount_path, an error will be raised.

  • "training.model.data.data_prefix=[0.5,/$MOUNT_PATH/pile/my-gpt3_00_text_document,0.5,/$MOUNT_PATH/pile/my-gpt3_01_text_document]": List of proportions and files the training job should use. Note that the file path prefixes should all be present in one of the cluster.volumes.*.mount_path or an error will be raised. In this example we only use two.

  • training.trainer.num_nodes=8: Controls the number of workers to use. (Note: the K8s scheduler may assign two workers to the same node if devices<=4 and you have 8 GPUs per node.)

  • training.trainer.devices=8: Controls how many GPUs per worker to use.

  • training.trainer.max_steps=300000: Controls the max training steps.

  • training.trainer.val_check_interval=2000: Sets the validation interval. It should be less than or equal to max_steps (ideally, max_steps is a multiple of it).

  • training.model.global_batch_size=512: Controls the global batch size. You may need to change this value if you use fewer num_nodes and devices.

Tip

You can append -c job to your python main.py invocation to view the job config before submitting it. It’s also helpful to view the defaults of a training job before specifying your own overrides.

After submitting the job, you can monitor the overall progress:

argo watch @latest
# Or you can reference the workflow directly
argo watch training-tmlzw

This action opens a full-screen status that updates periodically.

Name:                training-tmlzw
Namespace:           frameworks
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:
 PodRunning          True
Created:             Tue Mar 19 09:07:41 -0700 (1 minute ago)
Started:             Tue Mar 19 09:07:41 -0700 (1 minute ago)
Duration:            1 minute 50 seconds
Progress:            0/1

STEP               TEMPLATE        PODNAME                              DURATION  MESSAGE
 ● training-tmlzw  training-steps
 └───● pytorchjob  training-       training-tmlzw-training--1520170578  1m

The STEP named pytorchjob may take a while. You can monitor the progress of the PyTorchJob and inspect its logs by using the following command, which tracks the logs of the worker containers:

# See the logs of all the workers
argo logs @latest -f -c pytorch
# See the logs of second worker
argo logs @latest -f -c pytorch -l training.kubeflow.org/replica-index=1
# Or you can reference the workflow directly
argo logs training-tmlzw -f -c pytorch

Once the job is finished, you will see Status: Succeeded in argo watch @latest.

Finally, use the nemo-workspace-busybox pod to view the final checkpoints, logs, and results:

kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/gpt_1b_improved/training_gpt_1b_improved/results
# total 114K
# drwxr-xr-x    4 root     root           9 Mar 19 16:11 .
# drwxr-xr-x    3 root     root           1 Mar 19 16:08 ..
# drwxr-xr-x    4 root     root           2 Mar 19 16:15 checkpoints
# -rw-r--r--    1 root     root         106 Mar 19 16:11 cmd-args.log
# -rw-r--r--    1 root     root       11.3K Mar 19 16:13 events.out.tfevents.1710864691.training-hlwd8-worker-0.94.0
# -rw-r--r--    1 root     root          54 Mar 19 16:11 git-info.log
# -rw-r--r--    1 root     root        3.1K Mar 19 16:11 hparams.yaml
# -rw-r--r--    1 root     root        1.0K Mar 19 16:15 lightning_logs.txt
# -rw-r--r--    1 root     root       17.6K Mar 19 16:12 nemo_error_log.txt
# -rw-r--r--    1 root     root       33.0K Mar 19 16:13 nemo_log_globalrank-0_localrank-0.txt
# drwxr-xr-x    2 root     root           9 Mar 19 16:11 run_0

Afterwards, clean up the workflow:

argo delete training-tmlzw
# Or if you are sure this is the latest workflow, then you can run
argo delete @latest

The following example shows how to apply the LoRA PEFT method to the LLaMA2 7B model. The hyperparameters and configuration assume four NVIDIA A100 Tensor Core GPUs with 80 GB of memory each (4x A100-80G).

Note

To download checkpoints, use git-lfs. Go to https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing for installation instructions applicable for your environment.

Additionally, the LLaMA checkpoints require you to seek approval from Meta. After obtaining access, you can clone them using your Hugging Face username and access token as the password. Create your access token here.

First, download the Hugging Face checkpoint:

# Replace <...> with a path on your local machine
LOCAL_WORKSPACE=<...>
# https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

Second, convert the checkpoint to .nemo format:

NEMO_IMAGE=nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.11
docker run --rm -it -v $LOCAL_WORKSPACE:/workspace $NEMO_IMAGE \
    python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
    --in-file /workspace/Llama-2-7b-chat-hf \
    --out-file /workspace/llama2_7b.nemo

To see other prepared NeMo models, go to:

Third, copy it into the nemo-workspace PVC. In this step, use kubectl exec instead of kubectl cp since .nemo is not preserved correctly over kubectl cp.

( cd $LOCAL_WORKSPACE; tar cf - llama2_7b.nemo | kubectl exec -i nemo-workspace-busybox -- tar xf - -C /nemo-workspace )

In the final step, submit the training job. By default, this example also downloads the SQuAD dataset into the PVC.

PVC_NAME=nemo-workspace     # Must already exist
MOUNT_PATH=/nemo-workspace  # Path within the container

PYTHONPATH=$PWD python main.py \
    launcher_scripts_path=$PWD \
    data_dir=$MOUNT_PATH \
    \
    cluster=k8s_v2 \
    cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
    cluster.volumes.workspace.mount_path=$MOUNT_PATH \
    \
    stages=[peft] \
    peft=llama/squad \
    peft.run.name=llama-7b-peft-lora \
    peft.exp_manager.explicit_log_dir="$MOUNT_PATH/\${peft.run.model_train_name}/peft_\${peft.run.name}/results" \
    peft.trainer.num_nodes=1 \
    peft.trainer.devices=4 \
    peft.trainer.max_epochs=null \
    peft.trainer.max_steps=2000 \
    peft.model.global_batch_size=128 \
    peft.model.micro_batch_size=1 \
    peft.model.restore_from_path=$MOUNT_PATH/llama2_7b.nemo

    # Should see the following message from stdout:
    # workflow.argoproj.io/peft-qqkxh created

Note

If you use an image earlier than 23.11, add the additional flag peft.model.megatron_amp_O2=false.

  • launcher_scripts_path=$PWD: Must be set for all launcher commands.

  • data_dir=$MOUNT_PATH: Container location of where squad is downloaded and referenced. If this path is not present in one of the cluster.volumes.*.mount_path, an error will be raised.

  • cluster=k8s_v2: Selects Kubernetes launcher type.

  • cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME: Name of previously created PVC.

  • cluster.volumes.workspace.mount_path=$MOUNT_PATH: Mounts the workspace volume into all containers at this path.

  • 'stages=[peft]': Selects only the PEFT stage.

  • peft=llama/squad: Specifies the model type and size you want to train. Explore other configs here.

  • peft.run.name=llama-7b-peft-lora: Name of this run; it can be anything. Used as part of the local result dir path and the remote explicit_log_dir path.

  • peft.exp_manager.explicit_log_dir="$MOUNT_PATH/\${peft.run.model_train_name}/peft_\${peft.run.name}/results": Controls where logs and checkpoints go. If this path is not prefixed by one of cluster.volumes.*.mount_path, an error will be raised.

  • peft.trainer.num_nodes=1: Controls the number of workers to use. (Note: the K8s scheduler may assign two workers to the same node if devices<=4 and you have 8 GPUs per node.)

  • peft.trainer.devices=4: Controls how many GPUs per worker to use.

  • peft.trainer.max_epochs=null: Rely on max_steps instead.

  • peft.trainer.max_steps=2000: Controls the max training steps.

  • peft.model.global_batch_size=128: Controls the global batch size. You may need to change this value if you use fewer num_nodes and devices.

  • peft.model.micro_batch_size=1: Controls the micro batch size.

  • peft.model.restore_from_path=$MOUNT_PATH/llama2_7b.nemo: Path to the pre-trained NeMo LLaMA2 model. If this path is not prefixed by one of cluster.volumes.*.mount_path, an error will be raised.

Tip

You can append -c job to your python main.py ... invocation to view the job config before submitting it. It’s also helpful to view the defaults of a training job before specifying your own overrides.

After submitting the job, you can monitor the overall progress:

argo watch @latest
# Or you can reference the workflow directly
argo watch peft-qqkxh

This action opens a full-screen status that updates periodically.

Name:                peft-qqkxh
Namespace:           frameworks
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:
 PodRunning          False
Created:             Mon Mar 18 17:21:13 -0700 (5 minutes ago)
Started:             Mon Mar 18 17:21:13 -0700 (5 minutes ago)
Duration:            5 minutes 8 seconds
Progress:            1/2
ResourcesDuration:   19s*(1 cpu),19s*(100Mi memory)

STEP                   TEMPLATE        PODNAME                               DURATION  MESSAGE
 ● peft-qqkxh          peft-steps
 ├───✔ download-squad  download-squad  peft-qqkxh-download-squad-1305096357  12s
 └───● pytorchjob      peft-           peft-qqkxh-peft--105064823            4m

The STEP named pytorchjob may take a while. You can monitor the progress of the PyTorchJob and inspect its logs by using the following command, which tracks the logs of the worker containers:

# See the logs of all the workers
argo logs @latest -f -c pytorch
# See the logs of second worker
argo logs @latest -f -c pytorch -l training.kubeflow.org/replica-index=1
# Or you can reference the workflow directly
argo logs peft-qqkxh -f -c pytorch

Once the job is finished, you will see Status: Succeeded in argo watch @latest.

Finally, use the nemo-workspace-busybox pod to view the final checkpoints, logs, and results:

kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/llama2_7b/peft_llama-7b-peft-lora/results
# total 280K
# drwxr-xr-x   31 root     root          36 Mar 19 01:04 .
# drwxr-xr-x    3 root     root           1 Mar 19 00:22 ..
# drwxr-xr-x    2 root     root           3 Mar 19 03:52 checkpoints
# -rw-r--r--    1 root     root         113 Mar 19 00:39 cmd-args.log
# -rw-r--r--    1 root     root      153.6K Mar 19 03:51 events.out.tfevents.1710809120.peft-86b72-worker-0.4232.0
# -rw-r--r--    1 root     root          54 Mar 19 00:39 git-info.log
# -rw-r--r--    1 root     root        6.0K Mar 19 00:45 hparams.yaml
# -rw-r--r--    1 root     root        3.2K Mar 19 03:52 lightning_logs.txt
# -rw-r--r--    1 root     root       20.6K Mar 19 00:45 nemo_error_log.txt
# -rw-r--r--    1 root     root       40.7K Mar 19 03:52 nemo_log_globalrank-0_localrank-0.txt
# drwxr-xr-x    2 root     root           2 Mar 19 00:22 run_0

To inspect the logs from the other steps:

# To inspect logs of `download-squad`
argo logs @latest -f peft-qqkxh-download-squad-1305096357

Afterwards, clean up the workflow:

argo delete peft-qqkxh
# Or if you are sure this is the latest workflow, then you can run
argo delete @latest

The following example shows how to fine-tune GPT-3 2B with the Anthropic-HH dataset.

Note

To learn more about Reward Model training, visit the Model Alignment by RLHF documentation.

First, download the Anthropic-HH dataset onto the PVC. In this step, you download it locally and then upload it to the PVC through the nemo-workspace-busybox pod:

# Replace <...> with a path on your local machine
LOCAL_WORKSPACE=<...>
NEMO_IMAGE=nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.11
docker run --rm -it -v $LOCAL_WORKSPACE/anthropic-rm-data:/workspace $NEMO_IMAGE \
    python /opt/NeMo-Framework-Launcher/launcher_scripts/nemo_launcher/collections/dataprep_scripts/anthropichh_dataprep/download_and_process.py \
    --output-dir /workspace

# Then upload to the PVC
( cd $LOCAL_WORKSPACE; tar cf - anthropic-rm-data | kubectl exec -i nemo-workspace-busybox -- tar xf - -C /nemo-workspace )

Second, check that the data was uploaded:

kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/anthropic-rm-data
# total 428M
# drwxr-xr-x    2 root     root           4 Mar 19 18:11 .
# drwxr-xr-x    7 root     root           6 Mar 19 18:11 ..
# -rw-r--r--    1 root     root       16.1M Mar 19 18:10 test_comparisons.jsonl
# -rw-r--r--    1 root     root        5.5M Mar 19 18:10 test_prompts.jsonl
# -rw-r--r--    1 root     root      301.2M Mar 19 18:10 train_comparisons.jsonl
# -rw-r--r--    1 root     root      101.9M Mar 19 18:10 train_prompts.jsonl

Next, download the pre-trained GPT 2B NeMo checkpoint:

cd $LOCAL_WORKSPACE
wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo
mkdir 2b_model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C 2b_model_checkpoint

Note

For more information about pretrained models, see Prerequisite: Obtaining a pretrained model.

After downloading the checkpoint, convert it to the Megatron Core format and then upload it to the PVC:

# Converts checkpoint to $LOCAL_WORKSPACE/2b_mcore_gpt.nemo
docker run --rm -it \
    -v $LOCAL_WORKSPACE/2b_model_checkpoint:/inputs \
    -v $LOCAL_WORKSPACE:/outputs \
    nvcr.io/nvidia/nemo:24.05 bash -exc "
    git -C /opt/NeMo checkout 438db620bffdf4e2d4cef6368d0e86be2a02b7c3
    python /opt/NeMo/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py \
        --in-folder /inputs --out-file /outputs/2b_mcore_gpt.nemo --cpu-only"

# Validate the local file
ls -lah $LOCAL_WORKSPACE/2b_mcore_gpt.nemo
# -rw-r--r-- 1 root root 4.3G Mar 19 14:08 /tmp/2b_mcore_gpt.nemo

# Then upload to the PVC
( cd $LOCAL_WORKSPACE; tar cf - 2b_mcore_gpt.nemo | kubectl exec -i nemo-workspace-busybox -- tar xf - -C /nemo-workspace )

After the checkpoint uploads, launch the fine-tuning job:

PVC_NAME=nemo-workspace     # Must already exist
MOUNT_PATH=/nemo-workspace  # Path within the container

PYTHONPATH=$PWD python main.py \
    launcher_scripts_path=$PWD \
    container=nvcr.io/nvidia/nemo:24.05 \
    \
    cluster=k8s_v2 \
    cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
    cluster.volumes.workspace.mount_path=$MOUNT_PATH \
    \
    'stages=[rlhf_rm]' \
    rlhf_rm=gpt3/2b_rm \
    "rlhf_rm.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_rm.run.name}/rlhf_rm_\${rlhf_rm.run.name}/results" \
    "rlhf_rm.model.data.data_prefix={train: [$MOUNT_PATH/anthropic-rm-data/train_comparisons.jsonl], validation: [$MOUNT_PATH/anthropic-rm-data/test_comparisons.jsonl], test: [$MOUNT_PATH/anthropic-rm-data/test_comparisons.jsonl]}" \
    rlhf_rm.model.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/2b_mcore_gpt.nemo \
    rlhf_rm.trainer.num_nodes=8 \
    rlhf_rm.trainer.devices=8 \
    rlhf_rm.trainer.rm.max_epochs=1 \
    rlhf_rm.trainer.rm.max_steps=-1 \
    rlhf_rm.trainer.rm.val_check_interval=100 \
    rlhf_rm.model.global_batch_size=64

# Should see the following message from stdout:
# workflow.argoproj.io/rlhf-rm-d65f4 created
  • launcher_scripts_path=$PWD: Must be set for all launcher commands.

  • container=nvcr.io/nvidia/nemo:24.05: The docker image used for training. Visit here to choose other framework containers.

  • cluster=k8s_v2: Selects Kubernetes launcher type.

  • cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME: Name of previously created PVC.

  • cluster.volumes.workspace.mount_path=$MOUNT_PATH: Mounts the workspace volume into all containers at this path.

  • 'stages=[rlhf_rm]': Selects only the RLHF Reward Model stage.

  • rlhf_rm=gpt3/2b_rm: Specifies the model type and size you want to train. Explore other configs here.

  • "rlhf_rm.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_rm.run.name}/rlhf_rm_\${rlhf_rm.run.name}/results": Controls where logs and checkpoints go. If this path is not prefixed by one of cluster.volumes.*.mount_path, an error will be raised.

  • "rlhf_rm.model.data.data_prefix={train: [$MOUNT_PATH/anthropic-rm-data/train_comparisons.jsonl], validation: [$MOUNT_PATH/anthropic-rm-data/test_comparisons.jsonl], test: [$MOUNT_PATH/anthropic-rm-data/test_comparisons.jsonl]}": Train/Validation/Test splits.

  • rlhf_rm.model.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/2b_mcore_gpt.nemo: Path to pre-trained 2B model. If this path is not prefixed by one of cluster.volumes.*.mount_path, an error will be raised.

  • rlhf_rm.trainer.num_nodes=8: Controls the number of workers to use. (Note: the K8s scheduler may assign two workers to the same node if devices<=4 and you have 8 GPUs per node.)

  • rlhf_rm.trainer.devices=8: Controls how many GPUs per worker to use.

  • rlhf_rm.trainer.rm.max_epochs=1: Sets the maximum epochs to train.

  • rlhf_rm.trainer.rm.max_steps=-1: Set to -1 to go through the whole dataset.

  • rlhf_rm.trainer.rm.val_check_interval=100: Sets the validation interval.

  • rlhf_rm.model.global_batch_size=64: Controls the global batch size. You may need to change this value if you use fewer num_nodes and devices.

Tip

You can append -c job to your python main.py ... invocation to view the job config before submitting. It’s also helpful to view the defaults of a training job before specifying your own overrides.

After submitting the job, you can monitor the overall progress:

argo watch @latest
# Or you can reference the workflow directly
argo watch rlhf-rm-d65f4

This action opens a full-screen status that updates periodically.

Name:                rlhf-rm-d65f4
Namespace:           frameworks
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:
 PodRunning          True
Created:             Tue Mar 19 14:35:22 -0700 (53 seconds ago)
Started:             Tue Mar 19 14:35:22 -0700 (53 seconds ago)
Duration:            53 seconds
Progress:            0/1

STEP               TEMPLATE       PODNAME                           DURATION  MESSAGE
 ● rlhf-rm-d65f4   rlhf-rm-steps
 └───● pytorchjob  rlhf-rm-       rlhf-rm-d65f4-rlhf-rm--853469757  53s

The STEP named pytorchjob may take a while. You can monitor the progress of the PyTorchJob and inspect its logs by using the following command, which tracks the logs of the worker containers:

# See the logs of all the workers
argo logs @latest -f -c pytorch
# See the logs of second worker
argo logs @latest -f -c pytorch -l training.kubeflow.org/replica-index=1
# Or you can reference the workflow directly
argo logs rlhf-rm-d65f4 -f -c pytorch

Once the job is finished, you will see Status: Succeeded in argo watch @latest.

Finally, use the nemo-workspace-busybox pod to view the final checkpoints, logs, and results:

kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/rlhf_rm_2b/rlhf_rm_rlhf_rm_2b/results
# total 109K
# drwxr-xr-x   10 root     root          14 Mar 19 21:50 .
# drwxr-xr-x    3 root     root           1 Mar 19 21:35 ..
# drwxr-xr-x    4 root     root           3 Mar 19 21:50 checkpoints
# -rw-r--r--    1 root     root         104 Mar 19 21:45 cmd-args.log
# -rw-r--r--    1 root     root       18.9K Mar 19 21:47 events.out.tfevents.1710884746.rlhf-rm-bc6sw-worker-0.94.0
# -rw-r--r--    1 root     root        4.9K Mar 19 21:50 hparams.yaml
# -rw-r--r--    1 root     root         987 Mar 19 21:47 lightning_logs.txt
# -rw-r--r--    1 root     root       18.9K Mar 19 21:45 nemo_error_log.txt
# -rw-r--r--    1 root     root       29.0K Mar 19 21:45 nemo_log_globalrank-0_localrank-0.txt
# drwxr-xr-x    2 root     root           5 Mar 19 21:36 run_0

Afterwards, clean up the workflow:

argo delete rlhf-rm-d65f4
# Or if you are sure this is the latest workflow, then you can run
argo delete @latest

The following example shows PPO training with GPT 2B and the Anthropic-HH dataset. The RLHF Reward Model stage is a prerequisite, since a Reward Model checkpoint is required to initialize both the actor and the critic. It is also assumed that the Anthropic-HH dataset has been prepared and uploaded to the PVC.

Note

To learn more about PPO training, see Model Alignment by RLHF documentation.

PVC_NAME=nemo-workspace     # Must already exist
MOUNT_PATH=/nemo-workspace  # Path within the container

PYTHONPATH=$PWD python main.py \
    launcher_scripts_path=$PWD \
    container=nvcr.io/nvidia/nemo:24.05 \
    \
    cluster=k8s_v2 \
    cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
    cluster.volumes.workspace.mount_path=$MOUNT_PATH \
    \
    stages=[rlhf_ppo] \
    rlhf_ppo=gpt3/2b_ppo \
    rlhf_ppo.critic.trainer.num_nodes=1 \
    rlhf_ppo.critic.trainer.devices=8 \
    rlhf_ppo.critic.model.global_batch_size=64 \
    rlhf_ppo.critic.model.tensor_model_parallel_size=1 \
    rlhf_ppo.critic.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/rlhf_rm_2b/rlhf_rm_rlhf_rm_2b/results/checkpoints/megatron_gpt.nemo \
    rlhf_ppo.critic.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_ppo.run.name}/critic_results \
    rlhf_ppo.actor.trainer.num_nodes=1 \
    rlhf_ppo.actor.trainer.devices=8 \
    rlhf_ppo.actor.trainer.ppo.max_steps=-1 \
    rlhf_ppo.actor.model.global_batch_size=64 \
    rlhf_ppo.actor.model.tensor_model_parallel_size=1 \
    rlhf_ppo.actor.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/rlhf_rm_2b/rlhf_rm_rlhf_rm_2b/results/checkpoints/megatron_gpt.nemo \
    rlhf_ppo.actor.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_ppo.run.name}/actor_results \
    rlhf_ppo.actor.model.data.data_prefix="{train: [$MOUNT_PATH/anthropic-rm-data/train_prompts.jsonl], validation: [$MOUNT_PATH/anthropic-rm-data/test_prompts.jsonl], test: [$MOUNT_PATH/anthropic-rm-data/test_prompts.jsonl]}"

# Should see the following message from stdout:
# workflow.argoproj.io/rlhf-ppo-np8vr created
  • launcher_scripts_path=$PWD: Must be set for all launcher commands.

  • container=nvcr.io/nvidia/nemo:24.05: The docker image used for training. Visit here to choose other framework containers.

  • cluster=k8s_v2: Selects Kubernetes launcher type.

  • cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME: Name of previously created PVC.

  • cluster.volumes.workspace.mount_path=$MOUNT_PATH: Mounts the workspace volume into all containers at this path.

  • 'stages=[rlhf_ppo]': Selects only the RLHF PPO stage.

  • rlhf_ppo=gpt3/2b_ppo: Specifies the model type and size you want to train. Explore other configs here.

  • rlhf_ppo.critic.trainer.num_nodes=1: Controls the number of workers to use. (Note: the K8s scheduler may assign two workers to the same node if devices<=4 and you have 8 GPUs per node.)

  • rlhf_ppo.critic.trainer.devices=8: Controls how many GPUs per worker to use.

  • rlhf_ppo.critic.model.global_batch_size=64: Controls the global batch size. You may need to change this value if you use fewer num_nodes and devices.

  • rlhf_ppo.critic.model.tensor_model_parallel_size=1: Sets the tensor model parallel size for the critic.

  • rlhf_ppo.critic.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/rlhf_rm_2b/rlhf_rm_rlhf_rm_2b/results/checkpoints/megatron_gpt.nemo: Path to a pre-trained 2B Reward Model. If this path is not prefixed by one of cluster.volumes.*.mount_path, an error will be raised.

  • rlhf_ppo.critic.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_ppo.run.name}/critic_results: Controls where logs and checkpoints go. If this path is not prefixed by one of cluster.volumes.*.mount_path, an error will be raised.

  • rlhf_ppo.actor.trainer.num_nodes=1: Controls the number of workers to use. (Note: the K8s scheduler may assign two workers to the same node if devices<=4 and you have 8 GPUs per node.)

  • rlhf_ppo.actor.trainer.devices=8: Controls how many GPUs per worker to use.

  • rlhf_ppo.actor.trainer.ppo.max_steps=-1: Max PPO steps (-1 to go through the whole train set).

  • rlhf_ppo.actor.model.global_batch_size=64: Controls the global batch size. You may need to change this value if you use fewer num_nodes and devices.

  • rlhf_ppo.actor.model.tensor_model_parallel_size=1: Sets the tensor model parallel size for the actor.

  • rlhf_ppo.actor.pretrained_checkpoint.restore_from_path=$MOUNT_PATH/rlhf_rm_2b/rlhf_rm_rlhf_rm_2b/results/checkpoints/megatron_gpt.nemo: Path to a pre-trained 2B Reward Model. If this path is not prefixed by one of cluster.volumes.*.mount_path, an error will be raised.

  • rlhf_ppo.actor.exp_manager.explicit_log_dir=$MOUNT_PATH/\${rlhf_ppo.run.name}/actor_results: Controls where logs and checkpoints go. If this path is not prefixed by one of cluster.volumes.*.mount_path, an error will be raised.

  • rlhf_ppo.actor.model.data.data_prefix="{train: [$MOUNT_PATH/anthropic-rm-data/train_prompts.jsonl], validation: [$MOUNT_PATH/anthropic-rm-data/test_prompts.jsonl], test: [$MOUNT_PATH/anthropic-rm-data/test_prompts.jsonl]}": Train/Validation/Test splits.

Tip

You can append -c job to your python main.py ... invocation to view the job config before submitting it. It’s also helpful to view the defaults of a training job before specifying your own overrides.

After submitting the job, you can monitor the overall progress:

argo watch @latest
# Or you can reference the workflow directly
argo watch rlhf-ppo-np8vr

This action opens a full-screen status that updates periodically.

Name:                rlhf-ppo-np8vr
Namespace:           frameworks
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:
 PodRunning          False
Created:             Tue Mar 19 15:34:48 -0700 (2 minutes ago)
Started:             Tue Mar 19 15:34:48 -0700 (2 minutes ago)
Duration:            2 minutes 57 seconds
Progress:            1/2
ResourcesDuration:   1s*(100Mi memory),1s*(1 cpu)

STEP                    TEMPLATE           PODNAME                                      DURATION  MESSAGE
 ● rlhf-ppo-np8vr       rlhf-ppo-steps
 ├─✔ critic-            critic-            rlhf-ppo-np8vr-critic--2352952381            3s
 ├─✔ actor-             actor-             rlhf-ppo-np8vr-actor--892161450              2m
 └─✔ delete-pytorchjob  delete-pytorchjob  rlhf-ppo-np8vr-delete-pytorchjob-1202150566  2s

The STEP named actor- may take a while. You can monitor the progress of the PyTorchJob and inspect its logs by using the following command, which tracks the logs of the worker containers.

# See the logs of all the workers (actor & critic)
argo logs @latest -f -c pytorch
# See the logs of second worker (actor & critic)
argo logs @latest -f -c pytorch -l training.kubeflow.org/replica-index=1
# Or you can reference the workflow directly (actor & critic)
argo logs rlhf-ppo-np8vr -f -c pytorch
# See the logs of the second worker of the actor job (pod name from `argo watch ...`)
argo logs @latest rlhf-ppo-np8vr-actor--3744553305 -f -c pytorch -l training.kubeflow.org/replica-index=1

Once the actor is finished, the critic PyTorchJob is deleted and you will see Status: Succeeded in argo watch @latest.

Note

Once the actor finishes, the critic job is deleted. This means the critic logs cannot be queried with kubectl logs or argo logs, but you can still find the logs on the PVC under the path:

kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/rlhf_2b_actor_2b_critic/critic_results

Finally, use the nemo-workspace-busybox pod to view the final checkpoints, logs, and results:

kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/rlhf_2b_actor_2b_critic/{actor,critic}_results
# /nemo-workspace/rlhf_2b_actor_2b_critic/actor_results:
# total 257K
# drwxr-xr-x   18 root     root          22 Mar 20 01:56 .
# drwxr-xr-x    4 root     root           2 Mar 19 23:19 ..
# drwxr-xr-x    4 root     root           3 Mar 20 01:56 checkpoints
# -rw-r--r--    1 root     root         217 Mar 19 23:30 cmd-args.log
# -rw-r--r--    1 root     root      132.3K Mar 20 01:53 events.out.tfevents.1710891065.actor-7jxt7-worker-0.94.0
# -rw-r--r--    1 root     root        5.7K Mar 20 01:56 hparams.yaml
# -rw-r--r--    1 root     root        1.0K Mar 20 01:53 lightning_logs.txt
# -rw-r--r--    1 root     root       20.2K Mar 19 23:31 nemo_error_log.txt
# -rw-r--r--    1 root     root       33.8K Mar 19 23:31 nemo_log_globalrank-0_localrank-0.txt
# drwxr-xr-x    2 root     root           5 Mar 19 23:20 run_0
#
# /nemo-workspace/rlhf_2b_actor_2b_critic/critic_results:
# total 163K
# drwxr-xr-x   54 root     root          57 Mar 20 01:53 .
# drwxr-xr-x    4 root     root           2 Mar 19 23:19 ..
# drwxr-xr-x    3 root     root           1 Mar 20 01:53 checkpoints
# -rw-r--r--    1 root     root         124 Mar 19 23:30 cmd-args.log
# -rw-r--r--    1 root     root       62.6K Mar 20 00:39 events.out.tfevents.1710891057.critic-j2tdp-worker-0.94.0
# -rw-r--r--    1 root     root         771 Mar 19 23:30 lightning_logs.txt
# -rw-r--r--    1 root     root       19.1K Mar 19 23:33 nemo_error_log.txt
# -rw-r--r--    1 root     root       28.5K Mar 20 01:53 nemo_log_globalrank-0_localrank-0.txt
# drwxr-xr-x    2 root     root           5 Mar 19 22:36 run_0

Afterwards, clean up the workflow:

argo delete rlhf-ppo-np8vr
# Or if you are sure this is the latest workflow, then you can run
argo delete @latest

Warning

  • There is a known issue where the critic may not terminate and release resources because the actor has reached its backoff limit. To work around this issue, delete the workflow via argo delete rlhf-ppo-np8vr or argo delete @latest.

  • There is a known issue where the actor job may hang after the training loop is finished. Before deleting the PPO Workflow, check that the final checkpoint and results are as expected with kubectl exec nemo-workspace-busybox -- ls -lah /nemo-workspace/rlhf_2b_actor_2b_critic/actor_results, and then delete the workflow with argo delete rlhf-ppo-np8vr or argo delete @latest.

Advanced Use Cases

Download Data from PVC

To download data from your PVC, use the nemo-workspace-busybox pod created earlier:

# Replace <...> with a path on your local machine
LOCAL_WORKSPACE=<...>

# Tar will fail if LOCAL_WORKSPACE doesn't exist
mkdir -p $LOCAL_WORKSPACE

# Copy file in PVC at /nemo-workspace/foobar.txt to local file-system at $LOCAL_WORKSPACE/nemo-workspace/foobar.txt
kubectl exec nemo-workspace-busybox -- tar cf - /nemo-workspace/foobar.txt | tar xf - -C $LOCAL_WORKSPACE

# Copy directory in PVC /nemo-workspace/fizzbuzz to local file-system at $LOCAL_WORKSPACE/fizzbuzz
kubectl exec nemo-workspace-busybox -- tar cf - /nemo-workspace/fizzbuzz | tar xf - -C $LOCAL_WORKSPACE
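
The reverse direction (local file system into the PVC) follows the same tar-over-exec pattern used for the checkpoint uploads earlier in this playbook; my-data below is a hypothetical local directory:

# Copy local directory $LOCAL_WORKSPACE/my-data into the PVC at /nemo-workspace/my-data
( cd $LOCAL_WORKSPACE; tar cf - my-data | kubectl exec -i nemo-workspace-busybox -- tar xf - -C /nemo-workspace )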

Multiple Storage Volumes

The examples used in this playbook assume that one PVC holds all of the data. If your cluster setup has data distributed over different volumes, you can add them to the configuration. The launcher will then mount them into all the child pods.

# Here is an example that attaches an additional PVC with name "nemo-data" to the stage's pods.
# The choice of `my-data-pvc` is arbitrary and is used to identify this volume when constructing
# the workflow yaml manifest.
PYTHONPATH=$PWD python3 main.py \
    ... \
    +cluster.volumes.my-data-pvc.mount_path=/mnt/data \
    +cluster.volumes.my-data-pvc.persistent_volume_claim=nemo-data

Configure the Container

Go to here for the full list of NeMo Framework containers.

To configure the launcher with a different container, add the following:

PYTHONPATH=$PWD python3 main.py \
    ... \
    container=<other-container>

Use IB Interfaces

If you deployed the NVIDIA Network Operator with IB devices, you can configure your workloads to use them. The following example shows how to request one nvidia.com/resibp12s0 resource and two nvidia.com/resibp186s0 resources.

PYTHONPATH=$PWD python3 main.py \
    ... \
    cluster=k8s_v2 \
    cluster.ib_interfaces.annotation=\'ibp12s0,ibp186s0\' \
    +cluster.ib_interfaces.resources="{nvidia.com/resibp12s0: 1, nvidia.com/resibp186s0: 2}"

Debugging Tips

Add Linux Capabilities

In certain scenarios, you may want to add Linux capabilities to the pods for debugging purposes.

To add Linux capabilities to the launched MPIJob and PyTorchJob pods:

PYTHONPATH=$PWD python3 main.py \
    ... \
    cluster.capabilities=[IPC_LOCK,SYS_PTRACE]

Perform a Dry Run of the Submission

To perform a dry run of the submission, set the environment variable NEMO_LAUNCHER_DEBUG=1, which effectively replaces kubectl create with kubectl create --dry-run=client:

# Dry-run
NEMO_LAUNCHER_DEBUG=1 python3 main.py \
    ...

# Remove NEMO_LAUNCHER_DEBUG to actually submit
python3 main.py \
    ...

Perform a Dry Run of the Launcher Configuration

To perform a dry run of the Hydra configuration of the launcher:

# Original Command that submits to cluster
PYTHONPATH=$PWD python3 main.py \
    training.model.global_batch_size=3

# (-c job): Prints yaml configuration without submitting
PYTHONPATH=$PWD python3 main.py \
    training.model.global_batch_size=3 \
    -c job

# (-c job -p <pkg>): Prints yaml configuration of just the cluster hydra package
PYTHONPATH=$PWD python3 main.py \
    training.model.global_batch_size=3 \
    -c job -p cluster

Note

Perform a Dry Run of the Launcher Configuration is helpful if you want to double check that your overrides provided on the command line are set correctly. Perform a Dry Run of the Submission is helpful if you want to inspect that the manifest to be submitted to the Kubernetes cluster is as expected.

Inspect the Argo Workflow Yaml

After running python main.py, you can inspect the Argo Workflow yaml manifest that was previously submitted.

# Assuming current working directory is NeMo-Framework-Launcher/launcher_scripts
ls results/<run-name>/<experiment-name>/<stage>.yaml

# The exact subdirectory structure may differ; however, the yaml manifest will always contain
# `kind: Workflow`. Use the following command to find them:
fgrep -r 'kind: Workflow' results/