6. Workload Examples

In this section, we give instructions for running several Workloads on your Run:ai on DGX Cloud cluster. The examples are not exhaustive, but can be adapted for your own workloads.

The following images are used in the workload examples and have been tested and validated as functional. Any other images are untested and may require adjustments to environment variables or additional debugging to function correctly.

  • nvcr.io/nvidia/nemo:24.05.01

  • nvcr.io/nvidia/nemo:24.05

  • nvcr.io/nvidia/nemo:24.07

  • nvcr.io/nvidia/pytorch:24.06-py3

  • nvcr.io/nvidia/pytorch:24.07-py3

  • nvcr.io/nvidia/pytorch:24.08-py3

  • nvcr.io/nvidia/rapidsai/notebooks:24.08-cuda11.8-py3.9

6.1. Interactive NeMo Workload Job

In this example, we step through the process of creating an interactive workload using the NeMo container from NGC. Interactive workloads in Run:ai are called Workspaces. In this particular example, we will run a Jupyter notebook to fine-tune an LLM (Llama3-8B Instruct) using LoRA against the PubMedQA dataset.

6.1.1. Prerequisites and Requirements

The following are required before running the interactive NeMo job:

  • You must have accepted an invitation to your NGC org and added your NGC credentials to Run:ai. See details here.

  • You must have the user role L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.

  • Your user must be able to access a project and department.

  • At least one A100 or newer GPU with 80 GB of memory.

  • You must have access to a compute resource with at least one GPU created in your scope that you can use.

  • You must create a Hugging Face account and agree to the Meta Llama 3 Community License Agreement while signed in to your Hugging Face account. You must then generate a Hugging Face read access token in your account settings. This token is required to access the Llama3 model in the Jupyter Notebook.

6.1.2. Creating the Data Source

To make it easier to reuse code and checkpoints in future jobs, a persistent volume claim (PVC) is created as a data source. The PVC can be mounted in jobs and persists after the job completes, allowing any generated data to be reused.

  1. To create a new PVC, go to the Data Sources page. Click New Data Source and then PVC to open the PVC creation form.

  2. On the new form, set the desired scope.

Important

PVC Data Sources created at the cluster level do not replicate data across projects or namespaces. Each project or namespace will be provisioned as a separate PVC replica with different underlying PVs; therefore, the data in each PVC is not replicated.

  3. Give the PVC a memorable name like nemo-lora-checkpoints and add a description if desired.

  4. For the data options, select a new PVC storage class that suits your needs according to the PVC recommendations here. In this example, zonal-rwx is sufficient. To allow all nodes to read and write from/to the PVC, select Read-write by many nodes for the access mode. Enter 10 TB for the size to ensure we have plenty of capacity for future jobs. Select Filesystem as the volume mode. Lastly, set the Container path to /checkpoints, where the PVC will be mounted inside containers. This completed section should look as shown below.

    Data source options.
  5. Once the form is complete, create the data source using the button at the bottom of the page.

Note

The first time a job is launched using a new PVC, it will take longer to start as the storage gets provisioned only once the first claim to the PVC is made.

6.1.3. Creating the Workspace

  1. With the PVC created, we can configure the workspace to tell Run:ai how the job should be launched. To do so, go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Workspace. You will be taken to the New workspace creation page.

  2. Select the desired project to run your job in.

  3. Leave the Template pane of the form set to Start from scratch.

  4. Enter a descriptive name for your Workspace, such as nemo-pubmedqa-lora. Click Continue. After a few seconds the Environment pane of the creation process will appear.

  5. The environment is a template with common settings that can be reused for similar jobs. For example, an environment might be based on a PyTorch container and expose Jupyter as a tool, so anyone who needs to run interactive PyTorch jobs in a Jupyter notebook can use the same template instead of launching unique jobs and re-configuring the settings each time they want to experiment. In this example, a new environment will be created based on the NeMo container, exposing Jupyter as a tool to make the template easier to reuse in the future.

  6. To create a new environment for the NeMo job, click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.

  7. In the Environment creation form, enter a name for the environment, such as “nemo-jupyter” and optionally add a description.

  8. Under Image URL, put nvcr.io/nvidia/nemo:24.07. This pulls the NeMo container from NGC. If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check if there are updates to the image.

  9. Under the Workload architecture & type pane, select Standard and Workspace if they are not selected already. These options are likely already selected and grayed out.

  10. Click the Tools pane to open the tools menu, and click +Tool. This list lets you add tools to the workspace that make it easier to connect to its resources, such as Jupyter for connecting to external Jupyter lab sessions, VS Code for connecting to VS Code servers inside the job, and more.

  11. Since we want to run an interactive Jupyter notebook, select Jupyter in the Select tool list. The Connection type drop-down should show External URL, Auto generate and Container port “8888”.

  12. Click the Runtime settings pane, then click to expand the commands and arguments pane. This section allows us to override the default container run settings as needed. Since we want to launch a Jupyter lab session, enter jupyter-lab as the command, and --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --ServerApp.allow_remote_access=true --allow-root --port=8888 --no-browser for the arguments. This tells Run:ai how to connect to the Jupyter lab session in the UI.

  13. For the directory, specify /opt/NeMo as the directory to launch the job in. This will launch the Jupyter lab session inside the /opt/NeMo directory inside the container.

  14. Click Create Environment in the bottom right of the page. You should see a pop-up saying that your environment has been created. The page will refresh, and you will be taken back to the top of the Environment pane of the New workspace creation page. The environment you just created should now be selected.

  15. Go to the Compute resource pane and select a single-GPU compute resource for your environment. While it is possible to use more GPUs for LoRA, most tasks can be completed in a relatively short time with a single Ampere or newer GPU with 80 GB of memory.

  16. In the Data Sources section, select your nemo-lora-checkpoints PVC.

  17. Go to the bottom of the page and click Create Workspace. This will add your job to the queue and it will automatically get scheduled once cluster resources are available.

  18. After creating the workspace, you will be taken to the workload overview page, where you can view the status of your workload. Your interactive workload is ready once the status reads “Running”. This may take a while if the container you chose has not previously been pulled to the cluster and the PVC needs to be provisioned for the first time.

  19. Once the status is “Running”, you can launch the Jupyter lab session by selecting the checkbox next to your workload and clicking CONNECT → Jupyter on the top menu bar. Jupyter lab will open in a new window.

6.1.4. Running the Notebook

Once connected to the Jupyter lab session, navigate to the example notebook by opening the tutorials/llm/llama-3 directory inside the /opt/NeMo folder that the container automatically opens to. Inside this directory are a couple of notebooks, including llama3-lora-nemofw.ipynb, which is the one we will be following. Double-click that file to open the notebook on the main screen.

The notebook walks through what each step does, but at a high level, it does the following:

  • Downloads the Llama3-8B Instruct model from Hugging Face

  • Converts the model to the .nemo format to be easily ingested by the NeMo toolkit

  • Downloads and preprocesses the PubMedQA dataset

  • Fine-tunes the Llama3-8B Instruct model using the LoRA PEFT technique against the PubMedQA dataset

  • Deploys the model for local inference

  • Evaluates the model against the test dataset to determine the overall accuracy of the fine-tuned model

The notebook can be stepped through cell-by-cell or run all at once, and the results viewed later. The only important note is that cell 2 requires you to add your Hugging Face access token, as listed in the requirements section in the cell text, in order to authenticate with Hugging Face to download the model.
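If you prefer to handle authentication explicitly, the token is typically supplied in a notebook cell along the lines of the following minimal sketch (the huggingface_hub package is a NeMo dependency and is assumed to be available; the token string is a placeholder):

# Minimal sketch: authenticate with Hugging Face before downloading the model.
# Replace the placeholder with your own read access token.
from huggingface_hub import login

login(token="hf_your_read_access_token_here")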

As the notebook runs, you can navigate back to the Run:ai workloads page, where your job now shows metrics based on your resource usage. The metrics tab shows the GPU and CPU compute and memory usage during the job. This is a great resource for determining how your job is performing.

6.1.5. Persisting the Model to Storage

As the PVC was added to the workspace and is mounted at /checkpoints inside the container, any files or directories that are saved to /checkpoints inside the container will be saved in the PVC and available in future jobs.

  1. To save the generated model to the PVC, open a terminal tab inside the Jupyter lab session and navigate to the notebook directory with:

    cd /opt/NeMo/tutorials/llm/llama-3
    
  2. The model is saved at results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo. It can be verified with:

    $ du -sh results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo
    21M     results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo
    
  3. To copy the file to the PVC, run:

    cp results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo /checkpoints
    
  4. Once the checkpoint is copied, it will be available in all future workloads that use that same PVC unless the file is intentionally removed.

6.1.6. Cleaning up the Environment

After the job finishes, it can be deleted to free up resources for other jobs. Note that certain jobs do not self-terminate and should be monitored to ensure resources are not idle for completed jobs.

6.2. RAPIDS and Polars Workspace

In this example, we step through the process of creating another Workspace, which uses the RAPIDS container from NGC. We will launch a Jupyter environment to experiment with the new integration between RAPIDS and Polars.

6.2.1. Prerequisites and Requirements

The following are required before running the RAPIDS workspace:

  • You must have the user role L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.

  • Your user must be able to access a project and department.

  • At least one idle GPU within your DGX Cloud cluster.

  • You must have access to a compute resource with at least one GPU created in your scope that you can use.

6.2.2. Creating the Workspace

  1. Go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Workspace. You will be taken to the New workspace creation page.

  2. Select the desired project to run your job in.

  3. Leave the Template pane of the form set to Start from scratch.

  4. Enter a descriptive name for your Workspace, such as rapids-polars. Click Continue. After a few seconds the Environment pane of the creation process will appear.

  5. To create a new environment for the RAPIDS workspace, click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.

  6. In the Environment creation form, enter a name for the environment, such as “rapids-jupyter” and optionally add a description.

  7. Under Image URL, put nvcr.io/nvidia/rapidsai/notebooks:24.08-cuda11.8-py3.9. If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check if there are updates to the image.

  8. Under the Workload architecture & type pane, select Standard and Workspace if they are not selected already. These options are likely already selected and grayed out.

  9. Click the Tools pane to open the tools menu, and click +Tool. This list allows you to add additional tools to the workspace to make it easier to connect to their resources, such as Jupyter for connecting to external Jupyter lab sessions, VS Code for connecting to VS Code servers inside the workspace, and more.

  10. Since we want to run a Jupyter notebook, select Jupyter in the Select tool list. The Connection type drop-down should show External URL, Auto generate and Container port “8888”.

  11. Click the Runtime settings pane, then click to expand the commands and arguments pane. This section allows us to override the default container run settings as needed. Since we want to launch a Jupyter lab session, enter jupyter-lab as the command, and --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --ServerApp.allow_remote_access=true --allow-root --port=8888 --no-browser for the arguments. This tells Run:ai how to connect to the Jupyter lab session in the UI.

  12. Do not modify any other settings.

  13. Click Create Environment in the bottom right of the page. You should see a pop-up saying that your environment has been created. The page will refresh, and you will be taken back to the top of the Environment pane of the New workspace creation page. The environment you just created should now be selected.

  14. Go to the Compute resource pane and select a single GPU compute resource for your environment.

  15. Go to the bottom of the page and click Create Workspace. This will add your job to the queue and it will automatically get scheduled once cluster resources are available.

  16. After creating the workspace, you will be taken to the workload overview page, where you can view the status of your workload. Your workspace is ready once the status reads “Running”. This may take a while if the container you chose has not previously been pulled to the cluster.

  17. Once the status is “Running”, you can launch the Jupyter lab session by selecting the checkbox next to your workload, and clicking CONNECT → Jupyter on the top menu bar. Jupyter lab will open in a new window.

6.2.3. Downloading the Polars Notebook

Once connected to the Jupyter lab session, you can navigate to the integrated RAPIDS notebook samples within the notebooks directory and experiment within your workspace.

To get started downloading the GPU-accelerated Polars example notebook, open a terminal interface by clicking the Terminal box.

In the resulting terminal, run the following command to download the new notebook.

wget https://raw.githubusercontent.com/rapidsai-community/showcase/refs/heads/main/accelerated_data_processing_examples/polars_gpu_engine_demo.ipynb

6.2.4. Running the Polars Notebook

The notebook provides an introduction to what Polars is and where it is best used, along with code to:

  • Install Polars with GPU Engine, along with other notebook dependencies

  • Download a sample Kaggle dataset

  • Initialize Polars to use the GPU in DGX Cloud

  • Run various sample analyses to provide performance comparisons between Polars with CPU only and Polars with GPU

To open the notebook, double-click on the file named polars_gpu_engine_demo.ipynb when it becomes visible in the Jupyter file navigation pane.

The notebook can be stepped through cell-by-cell or run all at once, and the results can be viewed later.
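For reference, the core pattern the notebook demonstrates is running the same lazy query on both the CPU and GPU engines. A minimal sketch, assuming Polars is installed with GPU support (for example via pip install polars[gpu], which the notebook takes care of) and using trivial placeholder data:

import polars as pl

# Trivial placeholder data; the notebook uses a Kaggle dataset instead.
lf = pl.LazyFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
query = lf.group_by("key").agg(pl.col("value").sum())

cpu_result = query.collect()              # default CPU engine
gpu_result = query.collect(engine="gpu")  # GPU engine backed by RAPIDS cuDF
print(cpu_result)
print(gpu_result)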

As the notebook is running, you can navigate back to the Run:ai workloads page, where your job now shows metrics based on your resource usage. The metrics tab shows the GPU and CPU compute and memory usage during the job. This is a great resource for determining how your job is performing. Within the Jupyter notebook page you are using, you can also use an integrated GPU performance visualization plugin from the RAPIDS NGC container (it is the tab on the left side of the page with a GPU icon on it).

6.2.5. Cleaning up the Environment

After you are finished with the workload, it can be stopped or deleted to free up resources for other workloads. Note that certain workloads do not self-terminate; monitor them so resources are not left sitting idle after a workload has completed.

6.3. Download Data From S3 in a Training Workload

In this example, we will create a new Workload to download data from a private S3 bucket. We will use a Training workload due to the noninteractive nature of this example. We will configure the workload with the credentials needed to access the S3 bucket and execute the commands necessary to install and run the s5cmd tool to download the data from that bucket to a PVC.

6.3.1. Prerequisites and Requirements

The following are required before running the data download Training workload:

  • You must have the user role L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.

  • Your user must be able to access a project and department.

  • You must have access to a compute resource in your scope that you can use—it can be CPU-only, which is ideal to avoid using GPU resources when it is unnecessary.

  • You must have a private S3 bucket with data in it.

  • You must have the necessary permissions to download that data from S3 via access key ID and secret access key.

6.3.2. Creating the Credentials

To securely insert the necessary keys into the workload, we will create a Credential for the access key ID and the secret access key, allowing access to our S3 bucket.

  1. To create a new Credential, go to the Credentials page. Click + NEW CREDENTIALS and then Access key to open the access key creation form.

  2. On the form, set the desired scope. The scope should be as narrow as possible—in this case, set the scope to match the targeted project for your workload.

  3. Give the access key a descriptive name, such as aws-access-keys, and add a description if desired.

  4. In the Secret pane, select New secret. Set the Access key field to your AWS access key ID, and the Access secret field to your AWS secret access key.

  5. Click CREATE CREDENTIALS to save the new Credential.

6.3.3. Creating the Data Source

We will create a PVC Data Source to allow the resulting dataset to be shared with future workloads.

  1. To create a new PVC, go to the Data Sources page. Click New Data Source and then PVC to open the PVC creation form.

  2. On the new form, set the desired scope.

Important

PVC Data Sources created at the cluster level do not replicate data across projects or namespaces. Each project or namespace will be provisioned as a separate PVC replica with different underlying PVs; therefore, the data in each PVC is not replicated.

  3. Give the PVC a memorable name like s3-dataset and add a description if desired.

  4. For the data options, select a new PVC storage class that suits your needs according to the PVC recommended storage classes. In this example, zonal-rwx is sufficient. To allow all nodes to read and write from/to the PVC, select Read-write by many nodes for the access mode. Enter a capacity that will fit your target dataset—we will specify 10 TB for this example. Select Filesystem as the volume mode. Lastly, set the Container path to /dataset, where the PVC will be mounted inside containers.

  5. Once the form is complete, click the CREATE DATA SOURCE button at the bottom of the page.

Note

The first time a job is launched using a new PVC, it will take longer to start as the storage gets provisioned only once the first claim to the PVC is made.

6.3.4. Creating the Training Workload

  1. Go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Training. You will be taken to the New training creation page.

  2. Select the desired project to run your job in.

  3. Leave the Template pane of the form set to Start from scratch.

  4. Enter a descriptive name for your workload, such as s3-download. Click Continue. After a few seconds, the Environment pane of the creation process will appear.

  5. To create a new environment for the workload, click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.

  6. In the Environment creation form, enter a name for the environment, such as s3-download-env, and optionally add a description.

  7. Under Image URL, put nvcr.io/nvidia/pytorch:24.08-py3. If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check for updates to the image.

  8. Under the Workload architecture & type pane, select Standard and Training if they are not selected already. These options are likely already selected and grayed out.

  9. Do not modify any other settings, including the Runtime settings section—we will fill that in for the specific workload, keeping this environment multipurpose for future workloads.

  10. Click Create Environment in the bottom right of the page. You should see a pop-up saying that your environment has been created. The page will refresh, and you will be taken back to the top of the Environment pane of the New training creation page. The environment you just created should now be selected.

  11. Open the Runtime settings section for this training workload. We will execute a sequence of several commands within this workload. Set the Command field to bash -c, and the Arguments field to "wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz; tar -xf s5cmd_2.2.2_Linux-64bit.tar.gz; ./s5cmd cp s3://your-s3-bucket-name-here/* /dataset"

  12. To embed the Credentials into the workload, click the + ENVIRONMENT VARIABLE button twice.

  13. For the first entry’s Name field, enter AWS_ACCESS_KEY_ID, set the Source to Credentials, set the Credentials name field to aws-access-keys, and set the Secret Key field to AccessKeyId.

  14. For the second entry’s Name field, enter AWS_SECRET_ACCESS_KEY, set the Source to Credentials, set the Credentials name field to aws-access-keys, and set the Secret Key field to SecretKey.

  15. Go to the Compute resource pane and select your desired compute resource—ideally a CPU-only resource to avoid consuming GPU resources.

  16. In the Data Sources section, select your s3-dataset PVC.

  17. Go to the bottom of the page and click CREATE TRAINING. This will add your workload to the queue and it will automatically get scheduled once cluster resources are available.

  18. After creating the training workload, you will be taken to the workload overview page, where you can view the status of your workload. Your training workload is running once the status reads “Running”. This may take a while if the container you chose has not previously been pulled to the cluster.

  19. Once the status is “Completed”, you can review the outcome by clicking the workload and selecting the SHOW DETAILS button on the right side of the blue banner. Select the LOGS tab in the workload’s detail pane—it should indicate that data is being copied to your new PVC from your S3 bucket.

6.3.5. Cleaning up the Environment

After the job finishes, it can be deleted to free up resources for other jobs. Note that certain jobs do not self-terminate and should be monitored to ensure resources are not idle for completed jobs.

6.4. Using Your Data From Google Cloud Storage

There are two ways to use your data residing in Google or other clouds:

  1. Copy the data into the DGX Cloud environment, from Google Cloud Storage (GCS) or other cloud, onto the shared store NFS filesystems.

  2. Access the data directly from services within the cloud provider. We recommend running this job as a pure Kubernetes job at this time.

The following examples cover both scenarios:

6.4.1. Example: Using GCS to copy data to PVC within Run:ai

  1. In GCP, create a custom GCP role with the required permissions.

    The following is a minimal set of permissions, which may be created by cloning the Storage Object Viewer role and adding the storage.buckets.get permission.

    storage.buckets.get
    storage.buckets.getIamPolicy
    storage.buckets.listEffectiveTags
    storage.buckets.listTagBindings
    storage.managedFolders.create
    storage.managedFolders.get
    storage.managedFolders.list
    storage.objects.list
    
  2. Create a GCP service account (SA).

  3. Assign the custom role to the service account on the target bucket(s).

  4. Create a service account key for the SA created and download the .json key file. Protect the key file as it contains your private key.

  5. Create a Kubernetes secret with the SA key in your Run:ai project (namespace) using the following command.

    Note

    You must have your kubeconfig set to the Run:ai researcher configuration provided by DGX Cloud.

    kubectl create secret generic google-gcs-creds --from-file=key=projectID-deadbeef.json  -n runai-<Run:ai projectName>
    
  6. Within Run:ai, create a data source as a PVC using the following sequence:

    1. Specify the scope, name, and an optional description for the PVC.

    2. In the Data Mount section, select New PVC. Choose a storage class and access mode, and always select Filesystem as the volume mode. Then, set a default mount path within any containers.

    Note

    The Block Volume mode is unsupported. Selecting it may lead to errors when launching a workload that uses the resulting PVC.

  7. Use the kubectl get pvc command to determine the actual name of the PVC created by the Run:ai console. The actual name will be similar to name-for-pvc-<cluster-identifier>-<digits>. Use this for the next step.

  8. Create a Kubernetes manifest YAML utilizing the Google Cloud command line to copy the data from GCS into the PVC. The example below uses the secret you created and the PVC as mounted volumes to the container. The project ID and bucket name are passed to the gcloud storage rsync command via environment variables.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gcp-gcs-copy-2-pv-podd
      namespace: runai-<Run:ai projectName>
    spec:
      volumes:
        - name: pv-storage
          persistentVolumeClaim:
            claimName: name-for-pvc
        - name: gcp-sa-credential
          secret:
            secretName: google-gcs-creds
      restartPolicy: OnFailure
      schedulerName: runai-scheduler
      containers:
        - name: gcp-cli-container
          image: google/cloud-sdk
          imagePullPolicy: IfNotPresent
          env:
            - name: PROJECTNAME
              value: projectID
            - name: BUCKET
              value: bucket-name
          volumeMounts:
            - mountPath: "/data/shared"
              name: pv-storage
            - name: gcp-sa-credential
              mountPath: /etc/gcp_cred
              readOnly: true
          command: ["sh","-c"]
          args:
            - gcloud auth activate-service-account --key-file=/etc/gcp_cred/key;
              gcloud storage rsync -r gs://${BUCKET} /data/shared/ --project=${PROJECTNAME};
    
  9. Submit the Kubernetes manifest for execution in the DGX Cloud environment using the following command: kubectl apply -f manifest.yaml -n runai-<Run:ai ProjectName>.

  10. Monitor the job for completion using kubectl get pod gcp-gcs-copy-2-pv-podd.

You can now use the PVC in your Run:ai jobs.
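As a quick sanity check in a later workload, something like the following minimal sketch can list what was copied, assuming the PVC is mounted at the same /data/shared path used in the manifest above:

import os

# Walk the mounted PVC and print each copied file with its size.
for root, dirs, files in os.walk("/data/shared"):
    for name in files:
        path = os.path.join(root, name)
        print(path, os.path.getsize(path), "bytes")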

6.4.2. Example: Using data directly within Google Cloud Services

Run:ai training jobs can directly access data in Google Cloud storage. This section gives an example of how to do this in a Python job, and the small configuration steps needed to make it happen.

The following Python code demonstrates how to read all text files from a bucket into a list of strings. While this approach may not be practical for AI training workloads, it will help illustrate the necessary configurations.

from google.cloud import storage
import argparse
import os

def get_bucket_name():
    parser = argparse.ArgumentParser(description="Read text files from a Google Cloud Storage bucket.")
    parser.add_argument("bucket_name", help="The name of the bucket to read from.", nargs='?')
    args = parser.parse_args()

    if args.bucket_name:
        return args.bucket_name
    elif "BUCKET" in os.environ:
        return os.environ["BUCKET"]
    else:
        print("Error: Bucket name not provided. Please provide it as a command line argument or set the BUCKET environment variable.")
        return None

def read_bucket_files(bucket_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    file_contents = []
    blobs = bucket.list_blobs()
    for blob in blobs:
        if blob.name.endswith(".txt"):
            # Download the file content
            content = blob.download_as_string().decode("utf-8")
            file_contents.append(content)

    return file_contents

if __name__ == "__main__":
    bucket_name = get_bucket_name()
    if bucket_name:
        file_contents = read_bucket_files(bucket_name)
    else:
        exit(1)

    for content in file_contents:
        print(content)

To execute the above code in a containerized training job on Kubernetes, you will need to do the following:

  1. Include the Google Cloud Storage Python libraries into the container.

  2. Inject a secret into the container, allowing proper authorization to the GCS API.

To install the google-cloud-storage Python library, use pip to install it into the container image. For example, use the following Dockerfile:

FROM python:3.12

RUN pip install google-cloud-storage
RUN mkdir /app
COPY read.py /app/

ENTRYPOINT ["python"]
CMD ["/app/read.py"]

To inject the secret into the container, use a Kubernetes secret mounted as a file with the environment variable GOOGLE_APPLICATION_CREDENTIALS pointed at the file. The example below uses the same secret as the one created above.

apiVersion: v1
kind: Pod
metadata:
  name: gcs-reader
spec:
  containers:
    - name: gcs-reader
      image: nvcr.io/<orgID>/gcs-read:latest
      imagePullPolicy: IfNotPresent
      command: ["python", "/app/read.py"]
      env:
        - name: BUCKET
          value: bucket-name
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /etc/gcp_cred/key
      volumeMounts:
        - name: gcs-creds
          mountPath: /etc/gcp_cred
          readOnly: true
  imagePullSecrets:
  - name: nvcr.io-creds
  volumes:
    - name: gcs-creds
      secret:
        secretName: google-gcs-creds
  restartPolicy: OnFailure

Note

Currently, there isn’t a method to mount Kubernetes secrets as files in Run:ai jobs; Run:ai supports only environment variables. Since the Google libraries require a token file (which should be a secret), you will need workarounds to use them with Run:ai jobs. The recommended workarounds are:

  • only use Kubernetes native jobs for data manipulation, or

  • copy the GCP service account key into shared storage and point the GOOGLE_APPLICATION_CREDENTIALS environmental variable at that shared storage location.
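As an illustration of the second workaround, a minimal sketch is shown below. It assumes the key file was copied to shared storage and that GOOGLE_APPLICATION_CREDENTIALS is set as a workload environment variable pointing at that copy (the path and bucket name are placeholders):

from google.cloud import storage

# google-auth reads the service account key named by GOOGLE_APPLICATION_CREDENTIALS,
# e.g. /checkpoints/keys/gcp-sa-key.json on the shared PVC (placeholder path).
client = storage.Client()
for blob in client.bucket("bucket-name").list_blobs():
    print(blob.name)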

6.4.3. Example: Reading data from Google BigQuery

You can use data from Google BigQuery directly in your training jobs on Run:ai. This example shows the Python script and the small configuration code needed.

In GCP, the BigQuery Data Viewer role contains the necessary permissions and may be assigned at the table, dataset, or project level. Following the principle of least privilege, NVIDIA recommends assigning the role at the table level, unless you need to read multiple tables from the dataset or project. The following is a Python code example for a BigQuery reader.

import argparse
import os
from google.cloud import bigquery

def read_bigquery_table(project, dataset, table):
    client = bigquery.Client()
    table_ref = client.dataset(dataset).table(table)
    table = client.get_table(table_ref)

    rows = client.list_rows(table)
    return rows

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Read a BigQuery table.')
    parser.add_argument('--project', help='The project ID.')
    parser.add_argument('--dataset', help='The dataset ID.')
    parser.add_argument('--table', help='The table ID.')

    args = parser.parse_args()

    project = args.project or os.environ.get('PROJECTNAME')
    dataset = args.dataset or os.environ.get('DATASET')
    table = args.table or os.environ.get('TABLENAME')

    if not project or not dataset or not table:
        print("Error: Missing project, dataset, or table name.")
        exit(1)

    rows = read_bigquery_table(project, dataset, table)

    for row in rows:
        print(row)

Include the above code in a container using the following Dockerfile:

FROM python:3.12

RUN pip install google-cloud-bigquery
RUN mkdir /app
COPY read-bq.py /app/

ENTRYPOINT ["python"]
CMD ["/app/read-bq.py"]

Finally, you can start a job/pod with the following Kubernetes manifest:

apiVersion: v1
kind: Pod
metadata:
  name: read-bq
spec:
  containers:
    - name: read-bq
      image: nvcr.io/<orgID>/read-from-bq:latest
      imagePullPolicy: Always
      command: ["python", "/app/read-bq.py"]
      env:
        - name: PROJECTNAME
          value: project
        - name: DATASET
          value: austin_training
        - name: TABLENAME
          value: example
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /etc/gcp_cred/key
      volumeMounts:
        - name: gcs-creds
          mountPath: /etc/gcp_cred
          readOnly: true
  imagePullSecrets:
  - name: nvcr.io-creds
  volumes:
    - name: gcs-creds
      secret:
        secretName: google-gcs-creds
  restartPolicy: OnFailure

6.4.4. Alternative to using Service Account Keys

Service account keys are powerful credentials and can present a security risk if they are not managed correctly. Workload Identity Federation eliminates the maintenance and security burden associated with service account keys. With Workload Identity Federation, you can use Identity and Access Management (IAM) to grant external identities IAM roles and direct access to Google Cloud resources. You can also grant access through service account impersonation.

As part of environment onboarding, NVIDIA will provide you with the OIDC issuer URL for the environment. Navigate to your organization’s Google Cloud Console, then go to IAM & Admin > Workload Identity Federation. Click the CREATE POOL button to proceed.

Create a GCP Workload Identity Federation Pool.

In the subsequent workflow wizard, provide a name and an optional description for the pool, and select OpenID Connect (OIDC) as the provider.

Create a GCP Workload Identity Pool page 1. Create a GCP Workload Identity Pool page 2.

After selecting the provider type, enter “dgx-cloud” as the provider name and paste the Issuer URL provided with your environment into the Issuer (URL) field. Under Audiences, select Default audience.

Create a GCP Workload Identity Pool page 2 (Cont ...).

Lastly, add a mapping from google.subject to assertion.sub in the Configure provider attributes step.

Create a GCP Workload Identity Pool page 3.

Once complete, the Workload Identity Pool will confirm that it can communicate with the “dgx-cloud” Kubernetes cluster. You should see a green checkmark in the upper left corner near the provider you created, as shown below.

Create a GCP Workload Identity Pool success.

Next, click CONNECTED SERVICE ACCOUNTS in the upper right corner and select the DOWNLOAD CONFIG button. When the Configure your application dialog comes up, select the provider you created, paste the path /var/run/service-account/token into the OIDC ID token path field, leave the format as “text”, and then click DOWNLOAD CONFIG to download the required configuration file.

Create a GCP download config button. Create a GCP download config dialog.

After saving the configuration file, click DISMISS. On the main Identity Federation Pool page, locate and copy the value in the IAM Principal field. It will look similar to:

principal://iam.googleapis.com/projects/1091797687346/locations/global/workloadIdentityPools/nvidia-dgx-cloud-run-ai-env/subject/SUBJECT_ATTRIBUTE_VALUE

where SUBJECT_ATTRIBUTE_VALUE represents the Kubernetes identifier.

For example, if the service account is default in the Run:ai project runai-projectname and we substitute system:serviceaccount:runai-projectname:default as the SUBJECT_ATTRIBUTE_VALUE, the overall IAM Principal will be:

principal://iam.googleapis.com/projects/1068792787346/locations/global/workloadIdentityPools/nvidia-dgx-cloud-run-ai-env/subject/system:serviceaccount:runai-projectname:default

This is the IAM Principal to use (within GCP IAM) to grant access to workloads in that Run:ai project as shown below:

gcloud projects add-iam-policy-binding projects/PROJECT_ID \
  --role=roles/container.clusterViewer \
  --member=principal://iam.googleapis.com/projects/1068792787346/locations/global/workloadIdentityPools/nvidia-dgx-cloud-run-ai-env/subject/system:serviceaccount:runai-projectname:default \
  --condition=None

Or with the Cloud Console UI:

GCP grant access to bucket.

Once IAM permissions have been granted, create a ConfigMap on the “dgx-cloud” Kubernetes cluster from the configuration file you downloaded, using the following command.

kubectl create configmap google-wi --from-file clientLibraryConfig-dgx-cloud.json --namespace runai-projectname

Once the ConfigMap has been created, each workload should be augmented with three items: a volume projecting the SA token, a volume mapping the ConfigMap, and an environment variable pointing to the ConfigMap projection. The following example manifest shows a simple container with the three requirements.

Note

The audience property in the serviceAccountToken below may be copied from the audience field of the clientLibraryConfig file (be sure to add the https: prefix).

apiVersion: v1
kind: Pod
metadata:
  name: oidc-example
spec:
  containers:
  - name: example
    image: google/cloud-sdk
    command: ["/bin/sh", "-c", "gcloud auth login --cred-file $GOOGLE_APPLICATION_CREDENTIALS && sleep infinity"]
    volumeMounts:
    - name: token
      mountPath: "/var/run/service-account"
      readOnly: true
    - name: workload-identity-credential-configuration
      mountPath: "/etc/workload-identity"
      readOnly: true
    env:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: "/etc/workload-identity/clientLibraryConfig-dgx-cloud.json"
  volumes:
  - name: token
    projected:
      sources:
      - serviceAccountToken:
          audience: https://iam.googleapis.com/projects/<Your PROJECT_ID>/locations/global/workloadIdentityPools/<Your WORKLOAD_POOL>/providers/<Your WORKLOAD_PROVIDER>
          expirationSeconds: 3600
          path: token
  - name: workload-identity-credential-configuration
    configMap:
      name: google-wi

To confirm the configuration, run kubectl exec -it oidc-example -n runai-projectname -- bash to access the container, then run gcloud config list to verify the configuration.

The GCP resources configured in IAM should be accessible using either gcloud or gsutil commands.
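Google Cloud client libraries should pick up the same Workload Identity Federation configuration automatically, since GOOGLE_APPLICATION_CREDENTIALS points at the credential configuration file. A minimal sketch (the project ID and bucket name are placeholders):

from google.cloud import storage

# The client authenticates via the external-account (Workload Identity
# Federation) configuration file named by GOOGLE_APPLICATION_CREDENTIALS.
client = storage.Client(project="your-gcp-project-id")
for blob in client.bucket("bucket-name").list_blobs():
    print(blob.name)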

6.5. Using Your Data From Amazon S3

There are two ways to use your data residing in AWS or other clouds:

  1. Copy the data into the DGX Cloud environment, from Amazon S3 or other cloud storage, onto the shared store NFS filesystems.

  2. Access the data directly from services within the cloud provider. We recommend running this job as a pure Kubernetes job at this time.

The following examples cover both scenarios:

6.5.1. Example: Using S3 to copy data to PVC within Run:ai

  1. In AWS, create a custom IAM policy with the required permissions.

    The following is a minimal set of permissions; s3:PutObject is required only if you also need to write data back to the bucket.

    s3:GetObject
    s3:ListBucket
    s3:GetBucketPolicy
    s3:PutObject
    
  2. Create an AWS IAM user or role.

  3. Attach the custom policy to the IAM user or role to allow access to the target S3 bucket(s).

  4. Create an access key for the IAM user and download the .csv key file. Protect the key file as it contains your private credentials.

  5. Create a Kubernetes secret with the IAM access keys in your Run:ai project (namespace) using the following command.

    Note

    You must have your kubeconfig set to the Run:ai researcher configuration provided by DGX Cloud.

    kubectl create secret generic aws-s3-creds --from-literal=aws_access_key_id=<ACCESS_KEY_ID> --from-literal=aws_secret_access_key=<SECRET_ACCESS_KEY> -n runai-<Run:ai projectName>
    
  6. Within Run:ai, create a data source as a PVC using the following sequence:

    1. Specify the scope, name, and an optional description for the PVC.

    2. In the Data Mount section, select New PVC. Choose a storage class and access mode, and always select Filesystem as the volume mode. Then, set a default mount path within any containers.

    Note

    The Block Volume mode is unsupported. Selecting it may lead to errors when launching a workload that uses the resulting PVC.

  7. Use the kubectl get pvc command to determine the actual name of the PVC created by the Run:ai console. The actual name will be similar to name-for-pvc-<cluster-identifier>-<digits>. Use this for the next step.

  8. Create a Kubernetes manifest YAML utilizing the AWS CLI to copy the data from S3 into the PVC. The example below uses the secret you created and the PVC as mounted volumes to the container. The bucket name and credentials are passed to an aws s3 sync command via environment variables.

    apiVersion: v1
    kind: Pod
    metadata:
      name: aws-s3-copy-2-pv-pod
      namespace: runai-<Run:ai projectName>
    spec:
      volumes:
        - name: pv-storage
          persistentVolumeClaim:
            claimName: name-for-pvc
        - name: aws-credential
          secret:
            secretName: aws-s3-creds
      restartPolicy: OnFailure
      schedulerName: runai-scheduler
      containers:
        - name: aws-cli-container
          image: amazon/aws-cli
          imagePullPolicy: IfNotPresent
          env:
            - name: BUCKET
              value: <bucket-name>
            # The AWS CLI reads its credentials from these environment
            # variables, populated from the aws-s3-creds secret created earlier.
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-s3-creds
                  key: aws_access_key_id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-s3-creds
                  key: aws_secret_access_key
          volumeMounts:
            - mountPath: "/data/shared"
              name: pv-storage
            - name: aws-credential
              mountPath: /root/.aws/
              readOnly: true
          command: ["sh","-c"]
          args:
            - aws s3 sync s3://${BUCKET} /data/shared/;
    
  9. Submit the Kubernetes manifest for execution in the DGX Cloud environment using the following command: kubectl apply -f manifest.yaml -n runai-<Run:ai ProjectName>.

  10. Monitor the job for completion using kubectl get pod aws-s3-copy-2-pv-pod.

You can now use the PVC in your Run:ai jobs.

6.5.2. Example: Using data directly within AWS Services

Run:ai training jobs can directly access data in Amazon S3. This section gives an example of how to do this in a Python job, and the small configuration steps needed to make it happen.

The following Python code demonstrates how to read all text files from a bucket into a list of strings. While this approach may not be practical for AI training workloads, it will help illustrate the necessary configurations.

import boto3
import argparse
import os

def get_bucket_name():
    parser = argparse.ArgumentParser(description="Read text files from an AWS S3 bucket.")
    parser.add_argument("bucket_name", help="The name of the bucket to read from.", nargs='?')
    args = parser.parse_args()

    if args.bucket_name:
        return args.bucket_name
    elif "BUCKET" in os.environ:
        return os.environ["BUCKET"]
    else:
        print("Error: Bucket name not provided. Please provide it as a command line argument or set the BUCKET environment variable.")
        return None

def read_bucket_files(bucket_name):
    s3_client = boto3.client('s3')
    file_contents = []
    response = s3_client.list_objects_v2(Bucket=bucket_name)
    for obj in response.get('Contents', []):
        if obj['Key'].endswith(".txt"):
            file_obj = s3_client.get_object(Bucket=bucket_name, Key=obj['Key'])
            content = file_obj['Body'].read().decode('utf-8')
            file_contents.append(content)
    return file_contents

if __name__ == "__main__":
    bucket_name = get_bucket_name()
    if bucket_name:
        file_contents = read_bucket_files(bucket_name)
    else:
        exit(1)

    for content in file_contents:
        print(content)

To execute the above code in a containerized training job on Kubernetes, you will need to do the following:

  1. Include the AWS SDK for Python (Boto3) into the container.

  2. Inject a secret into the container, allowing proper authorization to the S3 API.

To install the boto3 Python library, use pip to install it into the container image. For example, use the following Dockerfile:

FROM python:3.12

RUN pip install boto3
RUN mkdir /app
COPY read.py /app/

ENTRYPOINT ["python"]
CMD ["/app/read.py"]

To inject the secret into the container, expose it through the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. The example below uses the same secret as the one created above and also mounts it under /root/.aws/.

apiVersion: v1
kind: Pod
metadata:
  name: s3-reader
spec:
  containers:
    - name: s3-reader
      image: nvcr.io/<orgID>/s3-read:latest
      imagePullPolicy: IfNotPresent
      command: ["python", "/app/read.py"]
      env:
        - name: BUCKET
          value: <bucket-name>
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-s3-creds
              key: aws_access_key_id
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-s3-creds
              key: aws_secret_access_key
      volumeMounts:
        - name: aws-credentials
          mountPath: /root/.aws/
          readOnly: true
  imagePullSecrets:
  - name: nvcr.io-creds
  volumes:
    - name: aws-credentials
      secret:
        secretName: aws-s3-creds
  restartPolicy: OnFailure

Note

Currently, there isn’t a method to mount Kubernetes secrets as files in Run:ai jobs. Run:ai supports only environment variables. Since the AWS SDK for Python requires access keys (which should be stored as secrets), you will need workarounds to use them with Run:ai jobs. The recommended workarounds are:

  • only use Kubernetes native jobs for data manipulation, or

  • copy the AWS credentials file into shared storage and point the AWS_SHARED_CREDENTIALS_FILE environmental variable at that shared storage location.
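For the second workaround, the job code itself can stay unchanged; boto3 picks the credentials up from the environment. A minimal sketch, assuming AWS_SHARED_CREDENTIALS_FILE is set as a workload environment variable pointing at a standard INI-format credentials file copied to shared storage (path and bucket name are placeholders):

import boto3

# boto3 reads the credentials file named by AWS_SHARED_CREDENTIALS_FILE,
# e.g. /checkpoints/keys/aws_credentials on the shared PVC (placeholder path).
s3 = boto3.client("s3")
for obj in s3.list_objects_v2(Bucket="bucket-name").get("Contents", []):
    print(obj["Key"])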

6.6. Running Visual Studio Code Inside a Workload

In this example, we step through the process of creating a Workspace via the Run:ai Command-Line Interface (CLI) that runs a VS Code instance, and exposing the necessary port to our local workstation in order to access the VS Code web interface.

6.6.1. Prerequisites and Requirements

The following are required before running the VS Code workspace:

  • You must have the user role L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.

  • Your user must be able to access a project and department.

  • You must have installed the Run:ai CLI tool, as detailed in the Advanced documentation.

6.6.2. Creating the Workspace via CLI on a local workstation

  1. Open a terminal on your workstation (or other system that has the Run:ai CLI installed and configured on it).

    Note

    If the system you have installed the Run:ai CLI on is a remote system, you can SSH to that system and create a local port forward, connecting the port Run:ai will use with the same port on your local workstation. For example, ssh -L 8080:localhost:8080 the-ip-or-hostname-of-your-remote-system.

  2. Make note of your target project.

  3. Enter the following command and arguments into your terminal (inserting your target project as the argument to the --project flag) and hit enter.

    runai submit vscode --project your-project-name-here --image linuxserver/code-server:4.92.2 --environment PUID=0 --environment PGID=0 --interactive
    
  4. (Optional) Validate that the workload is running with the following command.

    runai describe job vscode --project your-project-name-here
    
  5. Forward a port to your local workstation through the Run:ai CLI with the following command. The first port in the --port argument is arbitrary; it can be changed to any unused port on your system instead of 8080.

    runai port-forward vscode --project your-project-name-here --port 8080:8443
    
  6. To access the VS Code web interface, navigate to http://localhost:8080.

6.6.3. Cleaning up the Environment

After you are finished with the workload, it can be stopped or deleted to free up resources for other workloads. To delete the workload from the CLI, use the following command.

runai delete job vscode --project your-project-name-here

6.7. Using WandB with a Workspace

In this example, we will create an interactive Workload with a Weights and Biases (WandB) integration for experiment tracking. For a training Workload, a similar process to creating and adding the credential and environment variable can be followed. Once the API Key is connected to your Workload, your WandB run should automatically connect to your account.

6.7.1. Prerequisites and Requirements

The following are required before using the WandB integration with an interactive Workload:

  • You must have the user role of L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.

  • Your user must be able to access a project and department.

  • You must have access to a compute resource in your scope that you can use.

  • You must have access to a Weights and Biases account and have an API key.

6.7.2. Creating the Credentials

To securely insert the necessary key into the workload, we will create a Credential that stores the WandB API key, allowing access to WandB.

  1. To create a new Credential, go to the Credentials page. Click + NEW CREDENTIALS and then Access key to open the access key creation form.

  2. On the form, set the desired scope. The scope should be as narrow as possible—in this case, set the scope to match the targeted project for your workload.

  3. Give the access key a descriptive name, such as wandb-access-key, and add a description if desired.

  4. In the Secret pane, select New secret. Set the Access key field to WANDB_API_KEY, and the Access secret field to your WandB API Key (this can be found in your profile settings on the WandB site).

  5. Click CREATE CREDENTIALS to save the new Credential.

6.7.3. Creating the Workspace

  1. Go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Workspace. You will be taken to the New workspace creation page.

  2. Select the desired project to run your job in.

  3. Leave the Template pane of the form set to Start from scratch.

  4. Enter a descriptive name for your workload, such as wandb-training. Click Continue. After a few seconds, the Environment pane of the creation process will appear.

  5. To create a new environment for the workload, click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.

  6. In the Environment creation form, enter a name for the environment, such as wandb-training-env, and optionally add a description.

  7. Under Image URL, put nvcr.io/nvidia/pytorch:24.08-py3. If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check if there are updates to the image.

  8. Under the Workload architecture & type pane, select Standard and Workspace if they are not selected already. These options are likely already selected and grayed out.

  9. Click the Tools pane to open the tools menu, and click +Tool. This list allows you to add additional tools to the workspace to make it easier to connect to their resources, such as Jupyter for connecting to external Jupyter lab sessions, VS Code for connecting to VS Code servers inside the workspace, and more.

  10. Since we want to run a Jupyter notebook, select Jupyter in the Select tool list. The Connection type drop-down should show External URL, Auto generate, and Container port “8888”.

  11. Click the Runtime settings pane, then click to expand the commands and arguments pane. This section allows us to override the default container run settings as needed. Since we want to launch a Jupyter lab session, enter jupyter-lab as the command, and --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --ServerApp.allow_remote_access=true --allow-root --port=8888 --no-browser for the arguments. This tells Run:ai how to connect to the Jupyter lab session in the UI.

  12. Do not modify any other settings.

  13. Click Create Environment in the bottom right of the page. You should see a pop-up saying that your environment has been created. The page will refresh, and you will be taken back to the top of the Environment pane of the New workspace creation page. The environment you just created should now be selected.


  14. To embed the Credentials into the workload, click the + ENVIRONMENT VARIABLE button.

  15. For the Name field, enter WANDB_API_KEY, set the Source to Credentials, set the Credentials name field to wandb-access-key, and set the Secret Key field to SecretKey.

    Note

    If your WandB project is connected to a specific WandB team, an additional environment variable may be needed. For the Name field, enter WANDB_ENTITY, set the Source to Custom, and set the Value field to the name of your team.

  16. Go to the Compute resource pane and select your desired compute resource. For this example, a single GPU resource is sufficient.

  17. Go to the bottom of the page and click CREATE WORKSPACE. This will add your workload to the queue and it will automatically get scheduled once cluster resources are available.

  18. After creating the Workspace, you will be taken to the workload overview page, where you can view the status of your workload. Your workspace is ready once the status reads “Running”. This may take a while if the container you chose has not previously been pulled to the cluster.

  20. Once the status is “Running”, you can launch the Jupyter lab session by selecting the checkbox next to your workload and clicking CONNECT → Jupyter on the top menu bar. Jupyter lab will open in a new window.

6.7.4. Connecting to WandB in your Jupyter Notebook

  1. The PyTorch container does not include the WandB client by default. Open a terminal tab in your Jupyter lab session and install it by running the following:

    pip install wandb
    
  2. Open a new notebook and run the following code to test your WandB connection:

import wandb
import random

# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    project="my-awesome-project",

    # track hyperparameters and run metadata
    config={
    "learning_rate": 0.02,
    "architecture": "CNN",
    "dataset": "CIFAR-100",
    "epochs": 10,
    }
)

# simulate training
epochs = 10
offset = random.random() / 5
for epoch in range(2, epochs):
    acc = 1 - 2 ** -epoch - random.random() / epoch - offset
    loss = 2 ** -epoch + random.random() / epoch + offset

    # log metrics to wandb
    wandb.log({"acc": acc, "loss": loss})

# [optional] finish the wandb run, necessary in notebooks
wandb.finish()

After running this code, you should see a new project called my-awesome-project in your WandB account.

For your own WandB experiments, the API key embedded as an environment variable automates the login process, so your code will connect to your account without any manual steps.
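If you want to confirm the credential was injected correctly before starting a longer experiment, a minimal check such as the following sketch can be run in a notebook cell; wandb.login() reads the WANDB_API_KEY environment variable automatically when no key is passed explicitly.

import os
import wandb

# Confirm the credential injected by Run:ai is visible to the notebook.
assert os.environ.get("WANDB_API_KEY"), "WANDB_API_KEY is not set in this workspace"

# wandb.login() picks up WANDB_API_KEY automatically, so no interactive prompt appears.
wandb.login()
print("Logged in to WandB as:", wandb.Api().default_entity)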

6.7.5. Cleaning up the Environment

After the workload has completed, it can be deleted from the Workloads page, or left present as a reference for a future data download workload.

6.8. Distributed PyTorch Training Job

In this example, we demonstrate how to run a multi-node training job using the PyTorch training operator from Kubeflow.

6.8.1. Requirements

For running a Distributed PyTorch training job, a custom Docker container needs to be built. This is necessary to get relevant code into a container that can be run and shared for all future jobs.

  1. To build a custom Docker container, the Docker runtime needs to be installed on a local machine with the same CPU architecture as the DGX Cloud hosts (that is, if the DGX Cloud hosts have AMD64-based CPUs, the container needs to be built on an AMD64 machine; if the hosts are ARM-based, it needs to be built on an ARM machine). To install Docker locally, follow the Docker Engine Install guide.

  2. Additionally, you need to authenticate with your NGC private registry using your NGC key created during the Credentials section above. To authenticate with NGC, run the following commands (note, the username must be exactly “$oauthtoken”):

$ docker login nvcr.io
Username: $oauthtoken
Password: <NGC Key here>

6.8.2. Creating a Custom Docker Container

  1. On your local machine where Docker is installed, create and navigate to a directory to save the Dockerfile, such as pytorch-distributed:

    $ mkdir -p pytorch-distributed
    $ cd pytorch-distributed
    
  2. In the new directory, open a new file named run.sh and copy the following contents to the file:

    #!/bin/bash

    torchrun multinode.py --batch_size 32 100 25
    cp snapshot.pt /checkpoints
    

    This is a very simple script that uses torchrun to launch a distributed training job and copies the generated checkpoint to the /checkpoints directory inside the container so it can be used again later.

  3. Save and close the file. Next, open another new file named Dockerfile and copy the following contents to the file:

    FROM nvcr.io/nvidia/pytorch:24.07-py3

    WORKDIR /runai-distributed
    RUN git clone https://github.com/pytorch/examples

    WORKDIR /runai-distributed/examples/distributed/ddp-tutorial-series
    COPY run.sh .
    

    This Dockerfile uses the 24.07 PyTorch container hosted on NGC as a base, clones the official PyTorch examples repository inside the container, and copies the run.sh file created previously into the container. Save and close the file.

  4. Once both files have been saved locally, build a container with the following command, replacing <ORG ID> with the ID of your private registry on NGC:

    docker build -t nvcr.io/<ORG ID>/pytorch-ddp-example:24.07-py3 .
    

    This will build the custom container locally.

  5. Once the build has finished, push the image to the NGC private registry with:

    docker push nvcr.io/<ORG ID>/pytorch-ddp-example:24.07-py3
    

    The custom container will be available in your private NGC registry and can be used immediately for jobs.

6.8.3. Creating a Data Source

To make it easier to reuse code and checkpoints in future jobs, a data source is created, specifically a PVC. The PVC can be mounted in jobs and will persist after the job completes so any data created can be reused.

  1. To create a new PVC, go to the Data Sources page. Click New Data Source then PVC to open the PVC creation form.

  2. On the new form, set the desired scope.

Important

PVC Data Sources created at the cluster level do not replicate data across projects or namespaces. Each project or namespace will be provisioned as a separate PVC replica with different underlying PVs; therefore, the data in each PVC is not replicated.

  3. Give the PVC a memorable name like distributed-pytorch-checkpoints and add a description if desired.

  4. For the data options, select a new PVC storage class that suits your needs according to the PVC recommendations here. In this example, zonal-rwx is sufficient. To allow all nodes to read and write from/to the PVC, select Read-write by many nodes for the access mode. Enter 10 TB for the size to ensure we have plenty of capacity for future jobs. Select Filesystem as the volume mode. Lastly, set the Container path to /checkpoints, which is where the PVC will be mounted inside containers. This completed section should look as shown below.

    Data source options.
  5. Once the form is complete, create the data source using the button at the bottom of the page.

Note

The first time a job is launched using a new PVC, it will take longer to start as the storage gets provisioned only once the first claim to the PVC is made.

6.8.4. Creating the Workload via the UI

  1. To create the training job, navigate to the Workloads page and click New Workload > Training.

  2. On the project creation form page, select the desired project to run the job in.

  3. Select Distributed for the workload architecture to run a multi-node job. This adds a drop-down menu which lists the available frameworks for running distributed jobs. Since we are running a PyTorch job, select the PyTorch framework.

  4. The distributed training configuration has an option for either Workers & master or Workers only.

    Note

    In this documentation, we will use the term ‘primary’ instead of ‘master’ to align with modern terminology. Please note that the UI and commands may still refer to it as ‘master’.

  5. A primary node is typically required when running multi-node jobs that need to coordinate with one another, such as a job that does an all_reduce operation. The primary node can either be identical to the workers and run the training process or be a lightweight pod that only does coordination. If a job doesn’t need coordination between processes, the Workers only option can be used. For our example, we want the pods to coordinate with each other. Select the Workers & master option.

  6. Ensure the Start from scratch option is selected for the template.

  7. Give the job a name like “distributed-training-pytorch” and click Continue.

  8. In the new environment form that opens, select New environment. In the new form, provide a name for the environment such as “distributed-pytorch-env” and optionally add a description.

  9. In the Image URL, provide the name of the image tag that was specified during the container build in the Creating a Custom Docker Container section above, such as nvcr.io/<ORG ID>/pytorch-ddp-example:24.07-py3. This will use the container that was pushed to your private NGC org for the job.

  10. If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check if there are updates to the image.

  11. Most of the remaining settings on the form can be kept as default, with the exception of the runtime settings. The runtime settings specify the directory the pods will open in and the command to launch inside the containers, if something other than the default container values is desired.

  12. In our example, we want to launch the multinode.py script with torchrun, which runs a multi-process application where each process has its own unique rank. The PyTorch Training Operator coordinates with torchrun to automatically set the RANK, LOCAL_RANK, and WORLD_SIZE environment variables, among others, based on the total number of GPUs allocated for a job (a minimal sketch of how a training script consumes these variables is shown after this list). In the Commands and Arguments field, enter torchrun multinode.py --batch_size 32 100 25 for the command. This will run the distributed training job on all allocated GPUs with a batch size of 32 for 100 total epochs, saving a checkpoint every 25 epochs.

  13. For the directory field, put /runai-distributed/examples/distributed/ddp-tutorial-series. This is the location of the scripts that were saved inside of the container we built and is the path we want the pod to open to inside the container when it starts.

  14. After selecting the Create Environment button, you will be returned to the job creation page with the new environment selected. In the compute resource pane, select the desired number of workers. Note that since we are using a primary and worker configuration where we want the primary to participate in the training, we will specify one fewer worker node than the total number of nodes expected for the job. In other words, if we want to run an 8 node job where the primary will be coordinating processes as well as training, we will specify 7 workers and 1 primary (which is the next form). In our example we will train with 2 total nodes. As a result, select 1 worker in the form.

  15. For the compute resource, we want to train on a full DGX node with all available GPU resources. Select the compute resource in your project with a GPU capacity of 8.

  16. In the Data Sources pane, select the name of the PVC that was created in the previous section. The name of the PVC in this example is “distributed-pytorch-checkpoints”. Click Continue to navigate to the primary configuration page.

  17. As mentioned previously, the primary pod can have a different configuration from the worker pods for distributed PyTorch jobs. For our example, we want to use the same setup for workers and the primary so ensure the Allow different setup for the master radio button is deselected.

  18. Once finished, click Create Training to queue the training job.
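For reference, the following is a minimal sketch (not part of the tutorial repository) of how a training script consumes the environment variables injected by torchrun and the PyTorch Training Operator; multinode.py in the PyTorch examples repository follows a similar pattern.

import os
import torch
import torch.distributed as dist

# torchrun and the PyTorch Training Operator inject these variables into every process.
rank = int(os.environ["RANK"])              # global rank across all pods
local_rank = int(os.environ["LOCAL_RANK"])  # GPU index within the current pod
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes in the job

# NCCL is the standard backend for multi-node, multi-GPU training.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

print(f"Running as global rank {rank} of {world_size} on local GPU {local_rank}")

# ... build the model, wrap it in torch.nn.parallel.DistributedDataParallel, train ...

dist.destroy_process_group()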

6.8.5. Monitoring the Job

After the job is submitted, the status can be viewed in the Workloads page. Find the job name in the list, such as “distributed-training-pytorch” per the previous example. Clicking on the job will show a form with additional information about the job, including event history, metrics, and logs.

The event history tab shows the current status of the job, such as Creating, Initializing, Pending, Running, and so on. Additionally, it shows a log with timestamps of all of the states over time including any output messages when applicable. This is useful to see when a job is running or if it failed for any reason.

The metrics tab contains several graphs showing pod resource usage, including GPU, CPU, and memory utilization. This helps determine how much a job is stressing the hardware resources, or whether the application is under-utilizing them. At the top of the tab, there is a dropdown to select metrics for a specific pod or for an aggregate of all pods for finer granularity.

Lastly, the logs tab shows the live STDOUT for each of the pods, including the aggregate output. In our example application, there will be output showing the training progress, including the epoch, batch size, and steps on a per-GPU basis. The logs are a good place to view training progress or possibly see if there are any errors in the process.

6.8.6. Getting the Checkpoint

The end of the run.sh script we created copies the latest generated checkpoint to the PVC attached to the job. Any job that uses this same PVC will now be able to load the checkpoint at /checkpoints/snapshot.pt inside a job. Similarly, any data can be saved to the PVC in a job at the specified filesystem path. This is helpful while generating checkpoints in long-running training jobs to continue progress in future jobs or change hyperparameters between jobs.
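As an illustration, a follow-up job that mounts the same PVC at /checkpoints could resume from the snapshot along the lines of the sketch below. The key names in the commented lines are hypothetical; inspect the printed keys to see what the tutorial script actually stores.

import torch

# Path inside the container where the PVC from this example is mounted.
SNAPSHOT_PATH = "/checkpoints/snapshot.pt"

snapshot = torch.load(SNAPSHOT_PATH, map_location="cpu")
print("Keys stored in the snapshot:", list(snapshot.keys()))

# model.load_state_dict(snapshot["MODEL_STATE"])  # hypothetical key names; match them to
# start_epoch = snapshot["EPOCHS_RUN"]            # the keys printed above before using them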

6.8.7. Cleaning up the Environment

After the job finishes, it can be deleted to free up resources for other jobs. Note that certain jobs do not self-terminate, and jobs should be monitored to ensure resources are not sitting idle for completed jobs.

6.9. Using Zero Quota Projects for Lower Priority Workloads

In this example, we walk through the process of creating a zero quota project for lower priority workloads, ensuring that they will always be preemptible.

This example is intended for cluster administrators managing the usage of the cluster, and demonstrates that certain workloads and projects can be prioritized over others for mixed usage of the cluster. In this manner, lower priority workloads can be preempted when higher priority workloads need to be run, and resumed once higher priority workloads are completed.

Once the project and quota configuration is completed by the cluster administrator, users can then run workloads in their assigned projects that fit the lower or higher priority use cases, and their workloads will be scheduled and preempted as configured.

6.9.1. Requirements

  • Access to the default department or a department that has enough quota to use the entire cluster’s resources. For more information on configuring departments and projects, see Departments.

  • Ability to create projects and configure quotas. For more information on required user roles, see Cluster Users and Managing Users.

    Note

    Only users with the role of Application Administrator, Department Administrator, Editor, or Research Manager can create projects.

  • No workloads are currently running in the cluster (if other workloads are running, they may interfere with the scheduler behavior).

    Note

    The Run:ai scheduler provides more configuration capabilities, such as over-quota priority, that can influence the behavior of workloads on the cluster (such as if other workloads are also running). For more information about the Run:ai scheduler, refer to the Run:ai documentation.

6.9.2. Creating the Maximum Quota Project via the UI

In this section, we will create a project with enough quota to use the entire cluster’s compute resources.

Note

This is a setup step that only needs to be executed once. The maximum quota project will persist unless removed.

  1. Navigate to the Projects page and click the + NEW PROJECT button.

  2. In the Scope field, ensure the default department is selected (or an alternative department with the ability to use all of the cluster’s available resources).

  3. Enter a name like full-quota-project in the Project name field.

  4. Under Quota management, set the GPU devices field to use all GPUs available to the department, and set the CPUs (Cores) and CPU memory fields to Unlimited.

  5. Change no other fields, and click the CREATE PROJECT button.

6.9.3. Creating the Zero Quota Project via the UI

In this section, we will create a project with zero quota. Workloads running in this project will still be able to use compute resources, but will always be over quota and subject to preemption.

Note

This is a setup step that only needs to be executed once. The zero quota project will persist unless removed.

  1. Navigate to the Projects page and click the + NEW PROJECT button.

  2. In the Scope field, ensure the same department is selected as the one used in the previous section.

  3. Enter a name like zero-quota-project in the Project name field.

  4. Under Quota management, set the GPU devices, CPUs (Cores), and CPU memory fields to 0.

  5. Change no other fields, and click the CREATE PROJECT button.

6.9.4. Creating the Lower Priority Interactive Workload via the UI

In this section, we will create and start a workspace in the zero-quota project.

Note

The same scheduling behaviors exist when using a training workload in a zero quota project. Some of the exact steps and environments used will differ.

  1. Navigate to the Workloads page and click the + NEW WORKLOAD button. Select the Workspace option.

  2. Select the zero-quota-project option under the Projects section.

  3. Select Start from scratch under the Template section.

  4. Enter a name like zero-quota-workspace under the Workspace name section and click the CONTINUE button.

  5. Select the jupyter-lab option under the Environment section.

  6. Select the h100-1g option under the Compute resource section.

  7. Open the General section and toggle the Allow the workspace to exceed the project's quota switch to on (it will appear blue when on). Otherwise, the workspace will not start.

    Note

    You can set a policy for workspaces to enable this option by default. To do so, set the spec.preemptible.rules.required: true parameter in an interactive workspace policy, following the instructions in the Workspaces Policy documentation.

    More details about applying policies via the CLI can be found in the Run:ai Former Policies documentation.

  8. Click the CREATE WORKSPACE button.

  9. The UI will automatically return to the Workloads page - observe the zero-quota-workspace workload enter a Running state.

6.9.5. Creating the Higher Priority Training Workload via the UI

In this section, we will create and start a training workload in the full quota project. This training workload will preempt the workload initiated in the zero quota project.

  1. Navigate to the Workloads page and click the + NEW WORKLOAD button. Select the Training option.

  2. Select the full-quota-project option under the Projects section.

  3. Select the Distributed option under the Workload architecture section.

  4. Select the PyTorch option when the Set the framework for distributed trainings dropdown menu appears in the Workload architectures section.

  5. Select the Workers only option when the Set the distributed training configuration options appear in the Workload architectures section.

  6. Select Start from scratch under the Template section.

  7. Enter a name like full-quota-training under the Training name section and click the CONTINUE button.

  8. Click the + NEW ENVIRONMENT button under the Environment section.

  9. In the next screen, enter a name like simple-pytorch-environment under the Environment name & description section.

  10. Set the Image URL field to nvcr.io/nvidia/pytorch:24.07-py3 under the Image section.

  11. Click the CREATE ENVIRONMENT button.

  12. Upon being returned to the previous screen, ensure that the simple-pytorch-environment that we just created is selected in the Environment section.

  13. Click the Runtime settings section.

  14. Click the + COMMAND & ARGUMENTS button.

  15. Enter sleep 5m under the Command field.

  16. In the Set the number of workers for your training field, choose a number that uses all available GPU nodes in the cluster. For example, in a 32 GPU cluster, that number will be 4 (8 GPUs per node, 8*4=32).

  17. Select the h100-8g option under the Compute resource section.

  18. Click the CREATE TRAINING button.

  19. The UI will automatically return to the Workloads page - observe the zero-quota-workspace workload entering a Pending state first, followed by the full-quota-training workload entering a Running state.

  20. If the full-quota-training workload is completed or is stopped, the zero-quota-workspace workload will resume if there are free resources available in the cluster.

6.9.6. Cleaning up the Environment

All test workloads and projects used in this example can be safely deleted.

6.10. End-to-end NeMo Framework Workflow

This workflow provides a full end-to-end example of preparing a dataset, training a foundation model based on Mixtral-8x7B, and deploying the model for non-production inference using NeMo Framework. This guide will be split into sub-sections to describe each part in detail.

6.10.1. Requirements

The following is a list of requirements to follow this complete workflow:

  • A local machine with the Run:ai CLI installed, following the instructions at this link.

  • A Hugging Face account with an API token (setup steps in the following section).

  • A Weights and Biases account with an API token (setup steps in the following section).

  • A Run:ai cluster with at least 4x A100 or newer GPU nodes.

  • A Run:ai project identified for use along with the corresponding Kubernetes namespace for that project (by default, the project name with a runai- prefix).

  • Argo Workflows set up and installed in the identified Run:ai namespace, as documented here.

  • A Run:ai user account with at least L1 Researcher privileges.

6.10.2. Initial setup

This guide uses two external services to simplify the LLM development process: Hugging Face and Weights & Biases.

Hugging Face contains resources for many of the most popular language models and datasets in the community. We can leverage these resources while training the model to minimize deployment steps and be consistent with community model assumptions.

This workflow walks through training a Mixtral-8x7B model from scratch. The dataset we use needs to be tokenized using a custom tokenizer. Luckily, MistralAI, the company that produced the Mixtral models, published their tokenizer for the Mixtral models on Hugging Face. In order to use the tokenizer, we need to create a Hugging Face account and accept the Mixtral-8x7B-v0.1 license on their model repository page. The following walks through that process.

6.10.2.1. Hugging Face Account Creation

If you don’t have a Hugging Face account already, create one by going to https://huggingface.co/join and signing up with your corporate email account.

Once your account is set up, go to https://huggingface.co/settings/tokens while logged in to create a personal access token. Create a new token with Read access and give it a memorable name. Save the generated token in a safe place, as it won’t be viewable again for security reasons.

6.10.2.2. Accepting Mixtral-8x7B License

As mentioned earlier, this example uses the official Mixtral-8x7B tokenizer available on Hugging Face, which requires agreeing to their license on their model page. To do so, navigate to https://huggingface.co/mistralai/Mixtral-8x7B-v0.1 while logged in. Read the privacy policy at the top of the model card page, then click the Agree and access repository button towards the top of the page to accept the license. Now, you can download resources from this repository using your personal access token.

6.10.2.3. Creating Weights & Biases Account

Weights & Biases is a tool that allows developers to easily track experiments for AI applications. NeMo Framework natively supports logging many values such as training loss, learning rate, and gradient norm as well as resource utilization with Weights & Biases. Weights & Biases is highly recommended for tracking NeMo Framework jobs.

To get started with Weights & Biases, navigate to https://wandb.ai in a web browser and click the Sign Up button in the top right to create a free account. Once logged in, go to https://wandb.ai/settings and go to the bottom to create a new API key. This API key will be used while launching workflows to automatically log to Weights & Biases.

6.10.2.4. Create a PVC

Training an LLM requires a lot of data, including pre-training datasets, multiple checkpoints, long log files, configs, and scripts. These files typically need to be read from all nodes, so we need shared storage that all pods can access concurrently. For this, we can use a PVC that will store all of our training resources.

Note

It is strongly recommended to allocate as much storage as practical for a PVC. With GCP clusters, the larger the PVC size, the faster the read and writes will be, enabling large checkpoints to be saved more quickly during training, reducing the overall training time. If possible, allocate the entire cluster storage capacity in a single PVC and share that PVC amongst all workloads with unique subdirectories for each workload to take advantage of the highest possible storage performance.

To create a PVC, go to the Data Sources tab in the Run:ai web UI and follow these steps:

  1. Click the New Data Source button at the top of the page followed by PVC.

  2. Select the scope that corresponds with the project you will be training the model in. For example, if your project is named default, then select the default project scope.

  3. Enter a memorable name for the PVC, such as nemo-workspace and optionally, give it a description.

  4. For the data mount, select New PVC.

  5. Select the zonal-rwx storage class and Read-write by many nodes access mode.

  6. For the claim size, enter at least 10 TB. If training a larger model and using a larger dataset, it might be necessary to request more storage capacity.

  7. Enter /nemo-workspace for the container path. This will mount the PVC at /nemo-workspace inside all pods that attach this PVC.

  8. Click Create Data Source once you have finished filling out the form to create the PVC.

6.10.3. Create Training Helper with JupyterLab

We will be creating a workload that sets up JupyterLab to save and modify helper scripts, as well as to monitor the data prep and training processes. The process will be very similar to the setup in the Interactive NeMo Workload Job guide.

  1. Go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Workspace. You will be taken to the New workspace creation page.

  2. Select the desired project to run your job in.

  3. Leave the Template pane of the form set to Start from scratch.

  4. Enter a descriptive name for your Workspace, such as nemo-training-helper. Click Continue. After a few seconds, the Environment pane of the creation process will appear.

  5. Click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.

  6. In the Environment creation form, enter a name for the environment, such as “nemo-jupyter” and optionally add a description.

  7. Under Image URL, put nvcr.io/nvidia/nemo:24.07. This pulls the latest NeMo container from NGC as of the time of writing.

  8. Under the Workload architecture & type pane, select Standard and Workspace if they are not selected already.

  9. Click the Tools pane to open the tools menu, and click +Tool. Select Jupyter in the Select tool list. The Connection type drop-down should show External URL, Auto generate, and Container port 8888.

  10. Click the Runtime settings pane, then click to expand the commands and arguments pane. Enter jupyter-lab as the command, and --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --ServerApp.allow_remote_access=true --allow-root --port=8888 --no-browser for the arguments.

  11. For the directory, specify /nemo-workspace as the directory to launch the job in.

  12. Click Create Environment in the bottom right of the page. The environment you just created should now be selected.

  13. Go to the Compute resource pane and select a CPU-only compute resource for your environment.

  14. Select the nemo-workspace PVC created in the previous section under the Data sources form.

  15. Go to the bottom of the page and click Create Workspace.

  16. After creating the workspace, you will be taken to the workload overview page, where you can view the status of your workload. Your interactive workload is ready once the status reads “Running”.

  17. Once the status is “Running”, you can launch the JupyterLab session by selecting the checkbox next to your workload and clicking CONNECT → Jupyter on the top menu bar. Jupyter lab will open in a new window.

With the JupyterLab session running, we can open a terminal in the browser to interact with the PVC. It is helpful to keep this workspace running throughout the entire project in case files need to be inspected.

6.10.4. Data Preparation

The NeMo Framework supports processing custom text-based datasets for pre-training new models. The data preprocessor requires datasets to be cleansed, excluding any sensitive or improperly formatted data that is unsuitable for use during pre-training. Each file in the dataset must be in .json or, ideally, .jsonl format. Datasets can be downloaded from external sources or uploaded directly to the PVC.
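For reference, each line of a .jsonl file is a standalone JSON object, and the Megatron preprocessing script used later in this example tokenizes the text field of each record by default. The following is a minimal sketch of writing a correctly formatted record; the file name is only illustrative.

import json

# One training document per line; keys other than "text" (such as metadata) are ignored
# by the default preprocessing settings.
record = {"text": "NVIDIA DGX Cloud provides managed GPU clusters for AI training."}

with open("/nemo-workspace/my-custom-dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")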

The following example walks through downloading, extracting, concatenating, and preprocessing the SlimPajama dataset, which includes a large corpus of text from several domains and has been deduplicated and cleaned, making it a great candidate for pre-training LLMs. While the remainder of the document will be based on the SlimPajama dataset, this general process can be followed for most custom datasets and will provide guidance on how to adapt as needed.

6.10.4.1. Script Setup

We will leverage four different scripts to prepare the SlimPajama dataset for pre-training a Mixtral-8x7B-based LLM. These scripts will be saved in the PVC that was created during the initial setup step. The scripts are as follows:

Download

The first script downloads the entire SlimPajama-627B training dataset from Hugging Face to the mounted PVC. The dataset is spread across nearly 60,000 individual shards, all needing to be downloaded independently. To make the process faster, the job leverages PyTorch distributed communication to spread the download equally amongst all workers in the cluster. Using the JupyterLab session created previously, save the following file in the PVC at /nemo-workspace/download.py.

Note

The dataset is evenly divided amongst ten chunks on Hugging Face, each being its own subdirectory of files. The download.py script below has a CHUNKS = 10 variable at the top of the file to download all ten chunks. If desired, this value can be reduced to only download the first N chunks of the dataset. This is useful for quick workload validations that don’t rely on a complete dataset. The remainder of this document assumes all ten chunks are downloaded, but the steps will still work if using fewer.

import os
import requests
import time
import torch

CHUNKS = 10
SHARDS = 6000

torch.distributed.init_process_group()

wrank = int(os.environ.get('RANK', 0))
wsize = int(os.environ.get('WORLD_SIZE', 0))

def download(url, filename, retry=False):
    if os.path.exists(filename):
        return

    response = requests.get(url)

    # In case of getting rate-limited, wait 3 seconds and retry the
    # download once.
    if response.status_code == 429 and not retry:
        time.sleep(3)
        download(url, filename, retry=True)

    if response.status_code != 200:
        return

    with open(filename, 'wb') as fn:
        fn.write(response.content)

def split_shards(wsize):
    shards = []
    shards_to_download = list(range(SHARDS))

    for shard in range(wsize):
        idx_start = (shard * SHARDS) // wsize
        idx_end = ((shard + 1) * SHARDS) // wsize
        shards.append(shards_to_download[idx_start:idx_end])
    return shards

for chunk in range(1, CHUNKS + 1):
    shards_to_download = split_shards(wsize)

    for shard in shards_to_download[wrank]:
        filename = f'example_train_chunk{chunk}_shard{shard}.jsonl.zst'
        url = f'https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/train/chunk{chunk}/example_train_{shard}.jsonl.zst'
        download(url, filename)

# Block individual processes from exiting until all ranks are finished
# with their work to start the next step at the same time
torch.distributed.barrier()

Extract

The individual dataset shards are compressed in the Zstandard or .zst format and must be decompressed. The following script divides the downloaded files amongst all ranks and decompresses each shard before removing the compressed downloads to keep the PVC clean. Using the JupyterLab session, save the script in the PVC as /nemo-workspace/extract.py.

import os
import requests
import subprocess
import torch
from glob import glob

torch.distributed.init_process_group()

wrank = int(os.environ.get('RANK', 0))
wsize = int(os.environ.get('WORLD_SIZE', 0))

def split_shards(wsize, dataset):
    shards = []

    for shard in range(wsize):
        idx_start = (shard * len(dataset)) // wsize
        idx_end = ((shard + 1) * len(dataset)) // wsize
        shards.append(dataset[idx_start:idx_end])
    return shards

dataset = glob('example_train*')
shards_to_extract = split_shards(wsize, dataset)

for shard in shards_to_extract[wrank]:
    subprocess.run([f"unzstd --rm {shard}"], shell=True)

# Block individual processes from exiting until all ranks are finished
# with their work to start the next step at the same time
torch.distributed.barrier()

Concatenate

Given the SlimPajama dataset contains nearly 60,000 files, it is helpful to concatenate them into fewer, larger files. Processing a smaller number of large files is much faster than processing a large number of small files, which in turn eliminates potential data bottlenecks during the pre-training stage.

The following script takes 1,200 individual shards at a time and combines them into one large file, repeating for the entire dataset. Each rank concatenates a unique subsection of the dataset and deletes the individual shards in the end. Using the JupyterLab session, save the script in the PVC as /nemo-workspace/concat.sh.

Note

The script combines 1,200 individual shards by default into a single file. For the complete dataset, this will yield 50 larger combined files representing the data, each being approximately 51 GB in size. To change how many shards are used in each file, increase or decrease the shards_per_file variable below. A larger number will result in fewer files that are larger in size. A smaller number will result in more files that are smaller in size.

#!/bin/bash
shards_per_file=1200
num_files=`find -name 'example_train_chunk*.jsonl' | wc -l`
files=(example_train_chunk*.jsonl)
rank=$RANK
world_size=$WORLD_SIZE

# Find the ceiling of the result
shards=$(((num_files+shards_per_file-1)/shards_per_file ))

echo "Creating ${shards} combined chunks comprising ${shards_per_file} files each"

for ((i=0; i<$shards; i++)); do
  if (( (( $i - $rank )) % $world_size )) ; then
    continue
  fi
  file_start=$((i*shards_per_file))

  if [[ $(((i+1)*shards_per_file)) -ge ${#files[@]} ]]; then
    file_stop=$((${#files[@]}-1))
  else
    file_stop=$(((i+1)*shards_per_file))
  fi

  echo "  Building chunk $i with files $file_start to $file_stop"
  cat ${files[@]:$file_start:$shards_per_file} > slim_pajama_${i}.jsonl
  rm ${files[@]:$file_start:$shards_per_file}
done

Preprocess

Once all of the files have been concatenated, it is time to preprocess the dataset. The preprocessing phase tokenizes each dataset file using the Mixtral-8x7B tokenizer which is downloaded from Hugging Face and creates .bin and .idx files for each concatenated file. As with the other scripts, this one divides the work amongst all available workers to speed up preprocessing. Using the JupyterLab session, save the following script in the PVC as /nemo-workspace/preprocess.py.

Note

As mentioned, this script uses the Mixtral-8x7B tokenizer, as the intent is to use this data for pre-training a Mixtral-8x7B model. However, the tokenizer can be swapped out for a different one available on Hugging Face if pre-training a different model is desired. For example, the Llama3.1-8B tokenizer from Meta can be used instead by replacing both references to mistralai/Mixtral-8x7B-v0.1 in the script with the repo ID of the Llama3.1-8B model, meta-llama/Meta-Llama-3.1-8B, and updating the filename and path of the tokenizer in that model repo, which is filename=original/tokenizer.model. Be sure to accept any applicable licenses on the model repository page.

import os
import requests
import subprocess
import torch
from datetime import timedelta
from glob import glob

from huggingface_hub import hf_hub_download

# Wait for all processes to be finished with the previous step
# before starting preprocessing
torch.distributed.init_process_group(backend='gloo')
torch.distributed.monitored_barrier(timeout=timedelta(hours=4))

wrank = int(os.environ.get('RANK', 0))
wsize = int(os.environ.get('WORLD_SIZE', 1))

def split_shards(wsize, dataset):
    shards = []

    for shard in range(wsize):
        idx_start = (shard * len(dataset)) // wsize
        idx_end = ((shard + 1) * len(dataset)) // wsize
        shards.append(dataset[idx_start:idx_end])
    return shards

dataset = sorted(glob('slim_pajama*jsonl'))
shards_to_extract = split_shards(wsize, dataset)

if wrank == 0:
    # Download a specific file from a repository
    hf_hub_download(
        repo_id="mistralai/Mixtral-8x7B-v0.1",
        filename="tokenizer.model",
        local_dir="/nemo-workspace/tokenizers/mixtral-8x7b"
    )

for num, shard in enumerate(shards_to_extract[wrank]):
    shard_num = wrank + (num * wsize)  # Counter for which file is processed
    command = (
        "python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py "
        f"--input {shard} "
        f"--output-prefix /nemo-workspace/mixtral-slim-pajama-{shard_num} "
        f"--dataset-impl mmap "
        f"--tokenizer-type mistralai/Mixtral-8x7B-v0.1 "
        f"--tokenizer-library huggingface "
        f"--tokenizer-model /nemo-workspace/tokenizers/mixtral-8x7b/tokenizer.model "
        f"--workers 80"
    )
    subprocess.run([command], shell=True)

6.10.4.2. Launching Data Preparation

Note

In this documentation, we will use the term ‘primary’ instead of ‘master’ to align with modern terminology. Please note that the UI and commands may still refer to it as ‘master’.

Once all the scripts are saved in the PVC, it is time to launch the preprocessing job.

First, determine the total number of workers for the job. The greater the number of workers, the quicker the entire data preparation process will finish. The preprocessing step requires a GPU so you are limited by the number of GPUs available in your cluster. Assuming the resources aren’t needed for other purposes, it is recommended to create as many workers as there are GPUs available in the cluster for fastest end-to-end preparation time. Make a note of the desired number of workers, as this will be required in a few steps below.

Note

The example here will use single-GPU compute types for each worker. Run:ai limits the amount of CPU resources available in compute types that are smaller than a full node (that is, single GPU or two GPUs). Given this phase is primarily a CPU-bound process, the lower CPU resources will be a bottleneck. If desired, a larger compute type with more GPUs can be used to take advantage of additional CPU resources for each worker. The tradeoff is that with larger compute types, there will be fewer available resources, resulting in fewer workers. Both paths are valid, but if your cluster has a large number of GPU resources, it is recommended to use larger compute types with multiple GPUs. More on this later.

  1. Go to the Run:ai web UI in a browser and open the Workloads page.

  2. Click the blue New Workload button followed by Training to start the creation process for a distributed training job.

  3. In the new form that opens, select the desired project to run the job in.

  4. In the workload architecture box, select the Distributed radio button, set the framework to PyTorch in the drop-down list, and select Workers & master for the workload configuration.

  5. Select Start from scratch for the template if not already selected.

  6. Give the job a name such as slim-pajama-data-prep-mixtral and hit the continue button.

  7. On the new page, select +New Environment to create a new environment with our container. This opens another form for creating the environment.

  8. In the environment creation page, enter a name for the environment, such as nemo-2407, and optionally add a description.

  9. For the image URL, enter nvcr.io/nvidia/nemo:24.07, the latest training container at the time of writing. As newer containers are released, the tag can be updated to reflect the latest version.

  10. In the Runtime settings section, add a new Environment Variable with the name LD_LIBRARY_PATH and the following value which is the expected path for loading CUDA libraries in the NeMo 24.07 container.

    /usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/tensorrt/lib:/usr/local/cuda/lib64:/usr/local/tensorrt/lib
    
  11. Once finished setting up the environment, click the Create Environment button at the bottom of the page which will take you back to the worker setup form.

  12. In the worker setup form, ensure the newly created environment is selected for the environment.

  13. Expand the Runtime settings option and add a new command. For the command, put bash -c in the command box and the following code in the arguments field. Replace <HUGGING FACE TOKEN HERE> with your Hugging Face token created earlier and all three instances of <NUM WORKERS> with the desired number of workers determined at the beginning of this step. Be sure to include the single quotes at the start and end of the arguments field.

    'huggingface-cli login --token <HUGGING FACE TOKEN HERE> && apt update && apt install -y zstd && cd /nemo-workspace && torchrun --nnodes=<NUM WORKERS> --nproc-per-node=1 download.py && torchrun --nnodes=<NUM WORKERS> --nproc-per-node=1 extract.py && bash concat.sh && torchrun --nnodes=<NUM WORKERS> --nproc-per-node=1 preprocess.py'
    
  14. Confirm that the LD_LIBRARY_PATH environment variable is still set in the Environment Variable field, and set it again if not.

  15. Select the number of workers for your job. Note that since we used Workers & master for our distributed architecture, we will enter one fewer worker in this field than the total we determined at the start of this section. This is because the primary pod will coordinate all of the processes but still process its share of the dataset, so it will be the last worker on the list. In other words, if the desire is to use 16 total workers, put 15 for the number of workers on this form.

  16. Select a compute resource for the job. It is recommended to use a single GPU compute resource for all of the workers but as stated in the note at the top of this section, compute resources with more than one GPU per worker can be used to leverage more CPU resources in each pod.

  17. In the Data sources section select the PVC that was created earlier in this example. This will mount the PVC at /nemo-workspace inside all of the pods. Click the Continue button at the bottom of the page to open the primary configuration form.

  18. In several distributed PyTorch applications, it is common to use a different setup for the primary resource to coordinate distributed communication. A typical example is setting up the primary as an etcd server that acts as the PyTorch rendezvous agent. For our purposes, we want the primary to be just another worker and do everything that the workers do. Ensure the Allow different setup for the master toggle is deselected so the master uses an identical configuration to the workers. This will copy all of the settings from the previous workers’ form to the primary form. Click Create Training to launch the dataset preparation workflow.

After creating the data preparation job, a pod for each worker and primary will be scheduled and started once resources become available on the cluster. The process can be monitored by viewing the logs in the Run:ai UI and by connecting to the helper JupyterLab terminal and viewing the data in the PVC. The /nemo-workspace will evolve throughout the process with the following changes at the end of each stage:

  • After downloading, there will be 59,166 compressed data shards named example_train_chunkX_shardY.jsonl.zst where X is the chunk number from 1-10 and Y is the individual shard number within that chunk. Each file is approximately 15 MB in size.

  • After extraction, there will be 59,166 unzipped data shards named example_train_chunkX_shardY.jsonl and all of the compressed .zst files will be removed. Each file is approximately 44 MB in size.

  • After concatenation, there will be 50 large, combined files named slim_pajama_N.jsonl where N ranges from 0-49. Each file will be approximately 51 GB in size. It is normal for the last file to be smaller in size as it doesn’t contain an even 1,200 shards. All of the individual example_train* files will be removed.

  • After preprocessing, there will be 50 .bin files and 50 .idx files named mixtral-slim-pajama-N_text_document, where N corresponds to the combined data file number. Each .bin file should be approximately 26 GB in size and .idx files should be 229 MB.
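To spot-check these stages from the nemo-training-helper JupyterLab terminal, a quick file count like the following sketch can confirm whether a stage has fully completed (paths assume the layout used in this example):

from glob import glob

# Expected counts for the full dataset: 59,166 / 59,166 / 50 / 50.
print("compressed shards  :", len(glob("/nemo-workspace/example_train_chunk*.jsonl.zst")))
print("extracted shards   :", len(glob("/nemo-workspace/example_train_chunk*.jsonl")))
print("concatenated files :", len(glob("/nemo-workspace/slim_pajama_*.jsonl")))
print("preprocessed files :", len(glob("/nemo-workspace/mixtral-slim-pajama-*_text_document.bin")))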

Once all 50 files have been preprocessed, it is time to begin pre-training the model.

6.10.5. Pre-Training

NeMo Framework contains many predefined configuration files for various models, including the Mixtral-8x7B model. This section will demonstrate how to initiate training a Mixtral-8x7B model on Run:ai using the preprocessed SlimPajama dataset.

Pre-training is the most compute-intensive phase of the LLM training process as the model is typically trained for hundreds of billions to trillions of tokens while it learns the vocabulary and word pairings of the underlying dataset. Depending on the size of the dataset and model as well as the amount of compute resources available to train the model, this process can take anywhere from several days to a few months to finish. Therefore it is strongly recommended to leverage as much of your available compute power as possible for pre-training the model.

6.10.5.1. Setup

Some minor setup is required prior to launching the job. First, we need to get the path and weight to each dataset file so NeMo Framework knows which files to use for pre-training. The NeMo container has a script to generate this information.

To generate the data, open up the nemo-training-helper Jupyter session in the browser again that was used during data prep setup. In a terminal session, run the following code:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/nemo_launcher/collections/auto_blend.py model_type=gpt preprocessed_dir=/nemo-workspace

This will output a list of the full paths to each item in the dataset as well as the proportional weight of each file based on the file size. The weight indicates how much of the dataset should be sampled from each file, giving the larger files a greater portion of the sample size. The script automatically does this for the most efficient sampling of all files. The list should look similar to the following list (truncated for brevity):

[0.020294,'/nemo-workspace/mixtral-slim-pajama-6_text_document',0.020406,'/nemo-workspace/mixtral-slim-pajama-27_text_document',0.020262,'/nemo-workspace/mixtral-slim-pajama-45_text_document',0.020367,'/nemo-workspace/mixtral-slim-pajama-14_text_document',0.020332,'/nemo-workspace/mixtral-slim-pajama-36_text_document',...]

Copy the generated output and save it for later, as it will be used while launching the training job.
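Conceptually, each weight is simply a file's share of the total corpus size. The sketch below illustrates the idea (it is not the actual auto_blend.py implementation) and produces a list in the same alternating weight/path format:

import os
from glob import glob

# Weight each preprocessed file by its share of the total corpus size so larger
# files are sampled proportionally more often during pre-training.
bins = sorted(glob("/nemo-workspace/mixtral-slim-pajama-*_text_document.bin"))
total = sum(os.path.getsize(b) for b in bins)

blend = []
for path in bins:
    blend.append(round(os.path.getsize(path) / total, 6))
    blend.append(path[:-len(".bin")])  # NeMo expects the file prefix without the extension

print(blend)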

Next, we need to download the launcher repository and set up our environment. Do so with the following:

  1. Clone the launcher repository on a machine with kubectl access to the Run:ai cluster using:

    git clone https://github.com/NVIDIA/nemo-framework-launcher
    cd nemo-framework-launcher
    
  2. Install the Python dependencies (a virtual or conda environment is recommended):

    pip3 install -r requirements.txt
    cd launcher_scripts
    
  3. Find the name of the PVC in kubernetes with kubectl get pvc. In the example below, the name of the PVC according to kubernetes is nemo-workspace-project-u2l6h. Save the name for later as it will be used to specify which PVC to use.

    $ kubectl get pvc
    NAME                           STATUS  VOLUME                                     CAPACITY  ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
    nemo-workspace-project-u2l6h   Bound   pvc-a184eba2-2e16-4530-a2d8-7f0839df27d0   50Ti      RWX            zonal-rwx      <unset>                 37d
    

Now the training job can be launched. NeMo Framework has default configs for each of the supported models available on GitHub. The specific config we will use as the baseline for our job can be found here.

NeMo Framework uses Hydra to specify all of the settings used for training jobs which are shown in the linked config file above. These settings can be overridden from the command line as needed. The following command will launch the pre-training job with the explanation of each flag below:

PVC_NAME=<INSERT PVC NAME HERE>  # Replace with your PVC name captured earlier
MOUNT_PATH=/nemo-workspace
python3 main.py \
  launcher_scripts_path=$PWD \
  data_dir=/$MOUNT_PATH \
  cluster=k8s_v2 \
  "+env_vars={WANDB_API_KEY: <INSERT WANDB API KEY HERE>, HF_TOKEN: <INSERT HF TOKEN HERE>}" \
  ~cluster.volumes.dshm \
  cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
  cluster.volumes.workspace.mount_path=$MOUNT_PATH \
  cluster.scheduler=runai-scheduler \
  cluster.service_account=argo \
  cluster.namespace=runai-demo-project \
  cluster.pull_secret=null \
  'stages=[training]' \
  training=mixtral/mixtral_8x7b \
  training.exp_manager.create_wandb_logger=true \
  "training.exp_manager.explicit_log_dir=$MOUNT_PATH/\${training.run.name}/training_\${training.run.name}/results" \
  training.trainer.num_nodes=8 \
  training.trainer.max_steps=300000 \
  training.model.context_parallel_size=2 \
  training.model.sequence_parallel=false \
  training.model.data.data_prefix=<INSERT DATA PATHS AND WEIGHTS HERE> \
  training.model.encoder_seq_length=8192 \
  training.model.data.seq_length=8192 \
  training.model.ub_tp_comm_overlap=false

The flag explanations are as follows:

  • PVC_NAME: Specify the name of the PVC captured earlier.

  • MOUNT_PATH: This is the path that the PVC will be mounted inside the container. Keep this as /nemo-workspace.

  • data_dir=/$MOUNT_PATH: This is the location of the dataset inside the PVC.

  • cluster=k8s_v2: This indicates the job will run on Kubernetes.

  • "+env_vars={WANDB_API_KEY: <INSERT WANDB API KEY HERE>, HF_TOKEN: <INSERT HF TOKEN HERE>}": Replace <INSERT WANDB API KEY HERE> with your Weights and Biases API key created during the setup phase. This will allow the job to authenticate with the W&B servers. Similarly, replace <INSERT HF TOKEN HERE> with your Hugging Face token created earlier to authenticate with Hugging Face.

  • ~cluster.volumes.dshm: DGX Cloud on Run:ai automatically allocates shared memory in each pod and it will cause a conflict if NeMo Framework tries to allocate this memory as well. To prevent this, we need to remove the dshm line in the config so it isn’t allocated. The ~ at the beginning of the line removes the key.

  • cluster.volumes.workspace.persistent_volume_claim.claim_name: This is the name of the PVC to attach to the job.

  • cluster.volumes.workspace.mount_path: This tells NeMo Framework where the PVC will be mounted inside the container.

  • cluster.scheduler=runai-scheduler: By default, the job will try to use the default Kubernetes scheduler, but we want to use the Run:ai scheduler instead.

  • cluster.service_account=argo: Running Argo workflows on the cluster requires a service account. By default, the argo service account should be used. Replace this with the name of your service account as needed.

  • cluster.namespace=runai-demo-project: This is the kubernetes namespace the job will run in. The namespace will typically map to the Project name in Run:ai with runai- prepended to it. For example, if the Project is named demo-project, the namespace will typically be runai-demo-project. You can list all namespaces with kubectl get ns.

  • cluster.pull_secret=null: Run:ai also automatically injects the Docker container registry secret into jobs so this field can be kept empty.

  • 'stages=[training]': Specify that we want to run the training phase of NeMo Framework.

  • training=mixtral/mixtral_8x7b: This indicates we want to train a Mixtral-8x7B model. To train a different model, specify one of the configs listed in the repository here.

  • training.exp_manager.create_wandb_logger=true: Use Weights and Biases for logging all metrics.

  • "training.exp_manager.explicit_log_dir=$MOUNT_PATH/\${training.run.name}/training_\${training.run.name}/results": This is the location to save all of the training results for the job. This will be saved in the PVC for future use.

  • training.trainer.num_nodes=8: Specify the number of nodes to run on. As mentioned in the beginning of the section, it is highly recommended to use as many nodes as possible.

  • training.trainer.max_steps=300000: This is the total number of steps we want to train our model for. Given our dataset, SlimPajama, has 627 billion tokens, we want to train for at least one epoch. Each step represents a forward pass of the transformer network. In the forward pass, the number of tokens trained equals the global batch size times the sequence length. Given the default batch size of the Mixtral-8x7B model is 256 and we are using a sequence length of 8K tokens, each forward pass trains on 8,192 x 256 = 2,097,152 tokens. So, to train on all 627 billion tokens of the dataset, we will run for 627,000,000,000 / 2,097,152 ≈ 298,976 steps, which we round up to 300,000 total steps (a small worked check of this arithmetic is shown after this list).

  • training.model.context_parallel_size=2: This adds a level of parallelism for the context to reduce the memory footprint on each GPU. As context windows get larger the overall memory requirements become larger. Context parallelism makes it easier to fit large context windows in GPU memory.

  • training.model.sequence_parallel=false: Disable sequence parallelism in favor of other parallelism techniques.

  • training.model.data.data_prefix: This is where you specify the weights and paths of the data that was generated from the python3 command earlier.

  • training.model.encoder_seq_length=8192: Specify the sequence length for the model. This is also referred to as the context length and represents the number of input and output tokens that the model can handle. Recent research suggests that doing the bulk of the training with a smaller context length of around 8K tokens and increasing the context length after fully pre-training the model offers the best stability.

  • training.model.data.seq_length=8192: As with the previous flag, specify an 8K context length.

  • training.model.ub_tp_comm_overlap=false: DGX Cloud does not support overlapping communications and this should be disabled.

    Note

    The global batch size (GBS) is dependent on the number of GPUs in the job, the micro batch size (MBS), tensor parallelism (TP), and pipeline parallelism (PP) sizes. Specifically, GBS % ((MBS * num GPUs) / (PP * TP)) must equal 0. For example, the Mixtral-8x7B model has a default GBS of 256, MBS of 1, TP of 1, and PP of 4. Given we are using 8 nodes with 8 GPUs each, this results in 256 % ((1 * (8 * 8)) / (1 * 4)) == 0 and is a valid config. If changing the node or parallelism sizes, be sure the global batch size still satisfies this equation.
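The following short sketch works through the arithmetic above using the values from this example; only the numbers already quoted in this section are assumed.

# Worked check of the step count and global batch size arithmetic used above.
seq_length = 8192                   # training.model.encoder_seq_length / data.seq_length
global_batch_size = 256             # default GBS for the Mixtral-8x7B config
micro_batch_size, tp, pp = 1, 1, 4  # defaults quoted in the note above
num_gpus = 8 * 8                    # 8 nodes x 8 GPUs per node
dataset_tokens = 627_000_000_000    # SlimPajama-627B

tokens_per_step = seq_length * global_batch_size
print(tokens_per_step)                   # 2,097,152 tokens trained per step
print(dataset_tokens / tokens_per_step)  # ~298,976 steps, rounded up to 300,000

# GBS must be divisible by (MBS * num GPUs) / (PP * TP): 256 % 16 == 0 here.
assert global_batch_size % ((micro_batch_size * num_gpus) / (pp * tp)) == 0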

After running the main.py launch command above, the job will be scheduled with Run:ai and launched once resources become available. The job will appear in the Run:ai Workloads page after submission. The following images show the workload details after it has been running for a few days.

Mixtral event history
Mixtral resource usage

NeMo Framework is fully integrated with Weights and Biases and logs multiple metrics that can be viewable on their website. If the W&B key was provided in the command, a new W&B project will automatically be created and metrics will be uploaded there. Viewing logs on W&B is recommended as the best path to monitor training progress.

To view your charts, navigate to https://wandb.ai. You should see a link to the newly created project on your home page. Clicking the link will take you to your project dashboard which should look similar to the following. Note that the figure below includes training results for two different runs where the second run is a continuation of the first.

W&B pre-training charts

Two of the most important charts to monitor during pre-training are the reduced_train_loss and val_loss charts which show how the model is learning over time. In general, these charts should have an exponential decay shape.

The job will take around four weeks to complete on 8 nodes. Since NeMo Framework pre-training scales linearly, doubling the number of nodes should halve the amount of time required to pre-train the model.

While the model trains, a checkpoint will be saved every 2,000 steps in the PVC. Per the command above, the checkpoints will be saved in the /nemo-workspace/mixtral_8x7b/training_mixtral_8x7b/results/checkpoints directory. Only the 10 checkpoints with the best val_loss values as well as the latest checkpoint will be saved.

After the job finishes, a .nemo file will be saved in the checkpoint directory. This represents the pre-trained model and can be used for several downstream tasks including fine-tuning and inference.
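
To confirm that checkpoints are being written, you can open a terminal in any workload that mounts the same PVC and list the results directory. This is a minimal sketch using the paths from the training command above; the exact checkpoint file names will vary by run.

ls -lh /nemo-workspace/mixtral_8x7b/training_mixtral_8x7b/results/checkpoints

# After training completes, the exported model is available as a .nemo file
ls /nemo-workspace/mixtral_8x7b/training_mixtral_8x7b/results/checkpoints/*.nemo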

6.10.6. Inference Deployment

Now that we have finished pre-training a base model, we can deploy it for inference and send requests to the deployed model to do quick human evaluations.

Warning

This section is NOT intended for production inference deployments. The purpose of this section is to provide a quick way for engineers, QA teams, and other internal stakeholders to evaluate the model with user-generated prompts and inform decisions on the model’s readiness. A production deployment would include load balancing, auto-scaling, optimized inference code, a full API, and more.

To deploy the model for inference, navigate to the Workloads page and click the + New Workload > Inference button and follow these steps:

  1. In the new form that opens, select the desired project to run the job in.

  2. Enter a name for the inference deployment, such as mixtral-8x7b-base-model-deploy and click the Continue button.

  3. Create a new environment by clicking the + New Environment button.

  4. In the environment creation page, enter a name for the environment such as nemo-2409-inference and optionally add a description.

  5. For the image URL, enter nvcr.io/nvidia/nemo:24.09, which is the latest training container at the time of writing. As newer containers are released, the tag can be updated to reflect the latest version.

  6. In the Endpoint section, ensure HTTP is selected for the protocol. Enter 8080 for the container port. When we specify the command later on, we designate port 8080 to listen to requests. If a model is deployed on a different port, specify that here.

  7. Once finished setting up the environment, click the Create Environment button at the bottom of the page which will take you back to the worker setup form.

  8. Ensure the newly-created nemo-2409-inference environment is selected in the Environment section.

  9. In the Runtime settings section for the environment, set the command to:

    bash -c
    

    And the arguments to:

    'cd /opt/NeMo && git checkout main && git pull && pip install lightning && python3 scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /nemo-workspace/mixtral_8x7b/training_mixtral_8x7b/results/checkpoints/megatron_mixtral.nemo --tensor_parallelism_size 4 --start_rest_service True --triton_model_name mixtral-8x7b --model_type mixtral --max_input_len 4096 --max_output_len 8192'
    

    Note, if deploying a different model, the settings above can be changed to fit your model. These settings are as follows:

    • --nemo_checkpoint: The path to your .nemo file to deploy.

    • --tensor_parallelism_size: The number of GPUs to deploy on. Larger models will require additional GPUs. In general, for every 1 billion parameters the model has, you will need 2 GB of total GPU memory. Increasing the tensor parallelism size splits the required memory amongst GPUs (a rough estimate is sketched after this list).

    • --triton_model_name: What the model should be deployed as with Triton. This will be used while sending requests.

    • --model_type: The model type, such as mixtral or llama.

    • --max_input_len: The maximum number of tokens to allow in the input prompt.

    • --max_output_len: The maximum number of output tokens to generate for responses.
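
    As a rough, back-of-the-envelope example of the memory rule of thumb above, the following sketch assumes Mixtral-8x7B has approximately 47 billion total parameters (an assumption for illustration) and shows how tensor parallelism splits the requirement across GPUs.

    PARAMS_B=47    # approximate total parameters in billions (assumption)
    TP=4           # tensor parallelism size used in this example
    TOTAL_MEM_GB=$(( PARAMS_B * 2 ))   # ~2 GB per billion parameters
    echo "~${TOTAL_MEM_GB} GB total, ~$(( TOTAL_MEM_GB / TP )) GB per GPU at TP=${TP}"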

  10. Add two environment variables in the Runtime settings section:

    1. Add your Hugging Face token with HF_TOKEN as the key and your token as the value.

    2. Specify the port to listen to Triton requests with key TRITON_PORT and value 8000.

  11. In the Compute resource section, select the compute type that includes four GPUs, as the model needs four GPUs to fit in GPU memory. If using a different model, the number of GPUs should match the tensor parallelism size.

  12. In the Data sources section select the PVC that was created earlier in this example. This will mount the PVC at /nemo-workspace inside the pod. Click the Create Inference button at the bottom of the page to create the deployment.

Back in the Workloads page you will see the newly created inference workload. It will take some time for the model to be converted to TRT-LLM engines before it transitions to the Running state. Once the deployment is running, it can start to handle requests.

6.10.6.1. Sending Requests to Deployed Model

The easiest way to send a request to the deployed model is via curl in a terminal from an IP address in the cluster’s allowlist. The basic request structure is as follows:

curl -X POST https://X.X.X.X/v1/chat/completions/ \
  -H 'content-type: application/json' \
  -H 'accept: application/json' \
  -d '{"prompt": "Write me a short story about a baby dragon that learns to fly", "model": "<trt-model-name>", "max-tokens": 2048, "top_p": 0, "top_k": 0.9, "temperature": 1.0}'

To find the URL, run kubectl get ksvc in a terminal with the kubeconfig for the cluster configured. Find the knative service that corresponds to the deployment. The URL will be in the second column. In the following example, the URL would be https://mixtral-8x7b-base-model-deploy-runai-demo-project.inference.<cluster>.ai.

$ kubectl get ksvc
NAME                             URL                                                                                LATESTCREATED                          LATESTREADY                            READY   REASON
mixtral-8x7b-base-model-deploy   https://mixtral-8x7b-base-model-deploy-runai-demo-project.inference.<cluster>.ai   mixtral-8x7b-base-model-deploy-00001   mixtral-8x7b-base-model-deploy-00001   True

In the above curl command, replace X.X.X.X with the service URL or IP address found in the previous step. Additionally, replace Write me a short story about a baby dragon that learns to fly with your prompt of choice and <trt-model-name> with the name of the Triton model specified during deployment (mixtral-8x7b in this example). This command will generate up to 2048 tokens, but this can be changed as needed depending on the prompt.
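
For example, using the URL returned by kubectl get ksvc above and the Triton model name from the deployment command (mixtral-8x7b), a filled-in request might look like the following. The <cluster> hostname is a placeholder for your environment.

curl -X POST https://mixtral-8x7b-base-model-deploy-runai-demo-project.inference.<cluster>.ai/v1/chat/completions/ \
  -H 'content-type: application/json' \
  -H 'accept: application/json' \
  -d '{"prompt": "Write me a short story about a baby dragon that learns to fly", "model": "mixtral-8x7b", "max-tokens": 2048, "top_p": 0, "top_k": 0.9, "temperature": 1.0}'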

After submitting the command, it will be passed to the deployed model which will generate a response to the prompt.

The response should look similar to the following (response truncated - actual responses will vary):

{"output":"and having adventures.\nAsked by: Dayanida (6 years, 4 months ago)\nEdit: I am drawing it with Paint Tool SAI and Photoshop CS3.\nUpdated to try and get better.\nAnswered by: Rebecca (12 years, 5 months ago)\nWrite me a story about an adventure in the land of Wandreon where you can choose your own adventure..."}

The model’s response will be in the output key and will follow directly after the last token in the prompt. For example, combining the end of the input prompt and the start of the response would be “…that learns to fly and having adventures…”

6.10.6.2. Cleaning Up

When the deployment is no longer needed, it can be stopped to free up additional compute resources.

To stop the job, go to the Workloads page in Run:ai, select the mixtral-8x7b-base-model-deploy job, and click the Delete button towards the top-left of the panel.

6.11. Using BioNeMo Framework for ESM-2nv Data Preprocessing and Model Training

This tutorial provides an example use case of training a large language model using the BioNeMo framework. The tutorial will give users experience in:

  1. Preprocessing the UniRef50 and UniRef90 data for ESM-2nv

  2. Pretraining and continuing training from a checkpoint for ESM-2nv

  3. Launching interactive and training workloads on Run:ai on DGX Cloud

For more information on the NVIDIA BioNeMo framework and the ESM-2nv model, refer to the BioNeMo documentation.

6.11.1. Prerequisites and Requirements

The following are required before running the BioNeMo workloads in this tutorial:

  • You must have accepted an invitation to your NGC org and added your NGC credentials to Run:ai. Refer to Accessing Your NGC Org.

  • You must have the user role of L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.

  • Your user must be able to access a project and department.

  • At least one A100 or newer GPU with 80 GB of memory.

  • You must have access to a compute resource with at least one GPU created in your scope that you can use.

6.11.2. Creating the Data Source

We will create a PVC Data Source to share the resulting dataset with future workloads.

  1. To create a new PVC, go to the Data Sources page. Click New Data Source and then PVC to open the PVC creation form.

  2. On the new form, set the desired scope.

Important

PVC Data Sources created at the cluster level do not replicate data across projects or namespaces. Each project or namespace will be provisioned as a separate PVC replica with different underlying PVs; therefore, the data in each PVC is not replicated.

  1. Give the PVC a memorable name like esm2-workspace and add a description if desired.

  2. For the data options, select a new PVC storage class that suits your needs according to the PVC recommended storage classes. In this example, zonal-rwx is sufficient. To allow all nodes to read and write from/to the PVC, select Read-write by many nodes for the access mode. Enter a capacity that will fit your target dataset—we will specify 10 TB for this example. Select Filesystem as the volume mode. Lastly, set the Container path to /esm2-workspace, where the PVC will be mounted inside containers.

  3. Once the form is complete, click the CREATE DATA SOURCE button at the bottom of the page.

Note

The first time a job is launched using a new PVC, it will take longer to start as the storage gets provisioned only once the first claim to the PVC is made.

6.11.3. Creating an Interactive Workspace for Preprocessing

  1. With the PVC created, we can configure the workspace to tell Run:ai how the job should be launched. To do so, go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Workspace. You will be taken to the New workspace creation page.

  2. Select the desired project to run your job in.

  3. Leave the Template pane of the form set to Start from scratch.

  4. Enter a descriptive name for your Workspace, such as esm2-preprocessing. Click Continue. After a few seconds the Environment pane of the creation process will appear.

  5. The environment is a template with common settings that can be reused for similar jobs. For example, an environment might be based on a PyTorch container and expose Jupyter as a tool, so that anyone who needs to run interactive PyTorch jobs in a Jupyter notebook can use the same template instead of launching unique jobs and re-configuring the settings each time they want to experiment. In this example, a new environment will be created based on the BioNeMo container and will expose Jupyter as a tool to make it easier to reuse the template in the future.

  6. To create a new environment for the BioNeMo job, click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.

  7. In the Environment creation form, enter a name for the environment, such as “bionemo-jupyter” and optionally add a description.

  8. Under Image URL, put nvcr.io/nvidia/clara/bionemo-framework:1.9. This pulls the BioNeMo container from NGC.

    Note

    If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check for updates to the image.

  9. Under the Workload architecture & type pane, select Standard and Workspace if they are not selected already. These options are likely already selected and grayed out.

  10. Click the Tools pane to open the tools menu, and click +Tool. This list allows you to add tools to the workspace that make it easier to connect to the running workload, such as Jupyter for connecting to external Jupyter lab sessions, VS Code for connecting to VS Code servers inside the job, and more.

  11. Since we want to run an interactive Jupyter notebook, select Jupyter in the Select tool list. The Connection type drop-down should show External URL, Auto generate, and Container port “8888”.

  12. Click the Runtime settings pane, then click to expand the commands and arguments pane. This section allows us to override the default container run settings as needed. Since we want to launch a Jupyter lab session, enter jupyter-lab as the command, and --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --ServerApp.allow_remote_access=true --allow-root --port=8888 --no-browser for the arguments. This tells Run:ai how to connect to the Jupyter lab session in the UI.

  13. Click Create Environment in the bottom right of the page. You should see a pop-up saying that your environment has been created. The page will refresh, and you will be taken back to the top of the Environment pane of the New workspace creation page. The environment you just created should now be selected.

  14. Go to the Compute resource pane and select a single-GPU compute resource for your environment.

  15. In the Data Sources section, select your esm2-workspace PVC.

  16. Go to the bottom of the page and click Create Workspace. This will add your job to the queue and it will automatically get scheduled once cluster resources are available.

  17. After creating the workspace, you will be taken to the workload overview page, where you can view the status of your workload. Your interactive workload is ready once the status reads “Running”. This may take a while if the container you chose has not previously been pulled to the cluster and the PVC needs to be provisioned for the first time.

  18. Once the status is “Running”, you can launch the Jupyter lab session by selecting the checkbox next to your workload and clicking CONNECT → Jupyter on the top menu bar. Jupyter lab will open in a new window.

6.11.4. Using Workspace to Preprocess Data for ESM-2nv

To briefly showcase the model training capabilities of the BioNeMo Framework, we will use the UniRef50 and UniRef90 datasets to provide a diverse yet non-redundant set of protein sequences. By using both, the model can learn from a wide range of sequence variants while avoiding redundancy. This helps capture diverse features and patterns relevant to protein function and structure prediction while preventing overfitting and improving generalization. For demo purposes, a portion of the sample datasets is located in ${BIONEMO_HOME}/examples/tests/test_data/uniref202104_esm2_qc.

  1. Open a terminal tab in your Jupyter lab session.

  2. The data is stored in a zip file, so run the following commands to extract the raw FASTA files and a cluster mapping file into your PVC. You will also save your PVC datapath as DATASET_DIR for future steps.

    export ZIP_FILE=${BIONEMO_HOME}/examples/tests/test_data/uniref202104_esm2_qc_test200_val200.zip
    export DATASET_DIR=/esm2-workspace/test_data
    unzip $ZIP_FILE -d $DATASET_DIR/
    

    The mapping.tsv file is used to associate protein sequences with their respective clusters. This helps to reduce redundancy, organize data, and evaluate model performance by tracking sequence similarity and ensuring diverse training data.

    Using the unzipped contents of this file, we first create the preprocessed /train, /val, and /test folders, organizing protein sequences into batch CSV files. It is important to use both datasets if you plan to train ESM-2 as it was originally trained.

    The same approach applies to the clustering mapping file. The ESM2Preprocess class can handle clustering indirectly as part of the dataset preparation process. It leverages UniRef50 to UniRef90 clustering mappings to organize protein sequences, ensuring that data is appropriately clustered for training and validation.

    Please note that this script does not perform clustering itself but relies on pre-defined clustering mappings provided in a TSV file format to organize protein sequences. The expected format is a TSV file where the first column represents the cluster ID (FASTA header in UniRef50), and the second column lists the members separated by commas. The members correspond to entries in the UniRef90 FASTA file.
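
    Once the archive has been extracted, you can quickly inspect the mapping file to confirm it matches this format. This is a minimal check using the path from the steps above.

    # First column: UniRef50 cluster ID; second column: comma-separated UniRef90 member IDs
    head -n 3 /esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200/mapping.tsv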

  3. Now, we can run the preprocessing script. The preprocessing steps are:

    1. Download the dataset from a specified URL or NGC registry.

    2. Extract and decompress the downloaded data if necessary.

    3. Index the FASTA file using pyfastx to facilitate data access.

    4. Split the dataset into training, validation, and test sets.

    5. Convert the FASTA sequences into CSV format, dividing them into multiple files if needed.

    6. Generate additional files like memmaps or sorted FASTA files if required for specific use cases.

    For more details about the preprocessing steps, please consult the ../bionemo/data/preprocess/protein/preprocess.py file and the documentation found here.

    To preprocess the data, use the pretrain.py script and set the do_training parameter to False, as shown below. For more information on the command line and YAML configurations mentioned, see the dropdown below.

    cd ${BIONEMO_HOME}
    
    export TRAIN_UF50_FASTA=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200/uniref50_train_filt.fasta
    export TRAIN_UF90_FASTA=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200/ur90_ur50_sampler.fasta
    export TRAIN_CLUSTER_MAPPING_TSV=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200/mapping.tsv
    export DATAPATH=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200
    
    python examples/protein/esm2nv/pretrain.py \
    --config-path=conf \
    --config-name=pretrain_esm2_650M \
    ++do_training=False \
    ++do_preprocessing=True \
    ++model.data.val_size=500 \
    ++model.data.test_size=100 \
    ++model.data.train.uf50_datapath=${TRAIN_UF50_FASTA} \
    ++model.data.train.uf90_datapath=${TRAIN_UF90_FASTA} \
    ++model.data.train.cluster_mapping_tsv=${TRAIN_CLUSTER_MAPPING_TSV} \
    ++model.data.dataset_path=${DATAPATH}
    
    Command Line and YAML Configurations

    Parameters starting with -- are passed as command line arguments to pretrain.py. These parameters include:

    --config-path and --config-name: These specify the folder and the YAML file name for the configuration. The path is relative to pretrain.py. For instance:

    config-path: Refers to the configuration folder, e.g., examples/protein/esm2nv/conf.

    config-name: Refers to the YAML configuration file, e.g., pretrain_esm2_650M.yaml.

    The full path for the configuration file in this example would be: {BIONEMO_HOME}/examples/protein/esm2nv/conf/pretrain_esm2_650M.yaml. Parameters starting with ++ are configurable within the YAML file. Below are some examples of such parameters found in the pretrain_esm2_650M.yaml file, which inherits from base_config.yaml:

    do_training : Set to False if you only want to preprocess the data without initiating training.

    model.data.val_size and model.data.test_size : These specify the sizes of the validation and test datasets, respectively.

    model.data.train.uf50_datapath : Specifies the path to the UniRef50 FASTA file.

    model.data.train.uf90_datapath : Specifies the path to the UniRef90 FASTA file.

    model.data.train.cluster_mapping_tsv : Specifies the path to the mapping file that maps UniRef50 clusters to UniRef90 sequences.

    model.data.dataset_path : Specifies the path to the output directory for the preprocessed UniRef50 and UniRef90 data. After processing, the following directories will be created:

    uf50 : Contains train/test/val splits, each with files like x000.csv.

    uf90 : Contains a folder named uf90_csvs, with files like x000.csv. Note that there will be no train/test/val splits in this directory, as UniRef90 is only used during training.

    Changes can also be made directly to the YAML file instead of overriding arguments through the command line. A quick check of the preprocessing output directories is shown below.
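
    After the preprocessing command finishes, you can confirm that the expected output directories exist in the PVC. This is a minimal check using the dataset path from the command above; the exact file layout may differ slightly between releases.

    ls /esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200/uf50
    ls /esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200/uf90/uf90_csvs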

    Now we are ready for pretraining on a single node. For multi-node training, continue with the following step.

  4. For multi-node training, we need to create a run script in our PVC. We’ll launch a run.sh script, which will run the pretrain.py script using torchrun, a tool that runs a multi-process application where each process is assigned a unique rank. The PyTorch Training Operator available on DGX Cloud coordinates with torchrun to automatically set the RANK, LOCAL_RANK, and WORLD_SIZE, among other environment variables, based on the total number of GPUs allocated for a job. Using the terminal tab, create the run script in your PVC. This code will run a two-node distributed training job. To change the number of nodes, modify the ++trainer.num_nodes configuration.

    cat << 'EOF' > /esm2-workspace/run.sh
    #!/bin/bash
    
    cd ${BIONEMO_HOME}/examples/protein/esm2nv/ && torchrun pretrain.py \
    --config-path=conf \
    --config-name=pretrain_esm2_650M \
    ++do_training=True \
    ++trainer.max_steps=1000 \
    ++trainer.val_check_interval=100 \
    ++trainer.devices=8 \
    ++trainer.num_nodes=2 \
    ++model.data.dataset_path=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200 \
    ++model.micro_batch_size=2 \
    ++exp_manager.create_wandb_logger=False \
    ++exp_manager.exp_dir=/esm2-workspace/results
    EOF
    

    Note

    To continue training from an existing checkpoint, use exp_manager.resume_if_exists=True.

    To use Weights and Biases for experiment management and training visualization, set ++exp_manager.create_wandb_logger=True. Your WANDB_API_KEY environment variable will also need to be added when creating your workload to allow WandB to log in. Refer to Using WandB with a Workspace.
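
    Optionally, you can verify the generated script before launching the multi-node job. The bash -n option parses the script without executing it.

    bash -n /esm2-workspace/run.sh && cat /esm2-workspace/run.sh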

6.11.5. Pretraining

Now we will perform pretraining of ESM-2 from scratch using our prepared data and the parameters provided in the pretrain_esm2_650M.yaml config file located in the ${BIONEMO_HOME}/examples/protein/esm2nv/conf folder. This can be done inside the Jupyter notebook or using distributed training.

6.11.5.1. Single GPU Training in Jupyter Notebook

For this demo example, we will shorten the time required for training by setting the following parameters: ++trainer.max_steps=1 and ++trainer.val_check_interval=1. Users can update these parameters by editing the .yaml config file or by overriding the config arguments at runtime using Hydra, as shown in the example below.

python examples/protein/esm2nv/pretrain.py \
--config-path=conf \
--config-name=pretrain_esm2_650M \
name=esm2_pretrain \
++do_training=True \
++model.data.dataset_path=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200 \
++exp_manager.exp_dir=/esm2-workspace \
++trainer.devices=1 \
++model.micro_batch_size=1 \
++trainer.max_steps=1 \
++trainer.val_check_interval=1 \
++exp_manager.create_wandb_logger=False \
++trainer.limit_train_batches=1 \
++trainer.limit_val_batches=1

You can change the configuration used to pretrain_esm2_8M if you have hardware constraints. See explanations for the YAML configuration below.

YAML Configuration

trainer.devices : Specifies the number of GPUs per node to use.

trainer.max_steps : Sets the maximum number of training steps.

trainer.val_check_interval : Determines how often to run validation.

trainer.limit_train_batches and trainer.limit_val_batches : Limit the number of batches for training and validation, respectively.

model.micro_batch_size : Refers to the number of samples processed per rank in a single forward/backward pass before performing a weight update.

6.11.5.2. Multi-Node Distributed Pretraining

  1. To create the training job, navigate to the Workloads page and click New Workload > Training.

  2. On the project creation form page, select the desired project to run the job on.

  3. Select Distributed for the workload architecture to run a multi-node job. This adds a drop-down menu which lists the available frameworks for running distributed jobs. Since we are running a PyTorch job, select the PyTorch framework.

  4. For the distributed training configuration, there is an option for either Workers & master or Workers only.

    Note

    In this documentation, we will use the term ‘primary’ instead of ‘master’ to align with modern terminology. Please note that the UI and commands may still refer to it as ‘master’.

  5. A primary node is typically required when nodes need to coordinate with one another, such as in a job that does an all_reduce operation. The primary node can either be identical to the workers and run the training process or be a lightweight pod that only does coordination. If a job doesn’t need coordination between processes, the Workers only option can be used. For our example, we want the pods to coordinate with each other. Select the Workers & master option.

  6. Ensure the Start from scratch option is selected for the template.

  7. Give the job a name like esm2nv-pretraining and click Continue.

  8. In the new environment form that opens, select New environment. In the new form, provide a name for the environment, such as “distributed-bionemo-env”, and optionally add a description.

  9. Under Image URL, put nvcr.io/nvidia/clara/bionemo-framework:1.9. This pulls the BioNeMo container from NGC. If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check for updates to the image.

  10. Most of the remaining settings on the form can be kept as default. We will change the runtime settings in the worker to keep the environment flexible for future use.

  11. After selecting the Create Environment button, you will be returned to the job creation page with the new environment selected.

  12. In the Commands and Arguments field, enter sh /esm2-workspace/run.sh for the command. This will run the distributed training job using the run script we created during preprocessing.

  13. In the compute resource pane, select the desired number of workers. Note that since we are using a primary and worker configuration where the primary participates in training, we specify one fewer worker than the total number of nodes expected for the job. In other words, to run an 8-node job where the primary coordinates processes as well as trains, we would specify seven workers plus one primary (which is configured on the next form). In our example, we will train with two total nodes, so select one worker in the form.

  14. For the compute resource, we want to train on a full DGX node with all available GPU resources. Select the compute resource in your project with a GPU capacity of eight.

  15. In the Data Sources pane, select the name of the PVC that was created in the previous section. The name of the PVC in this example is esm2-workspace. Click Continue to navigate to the primary configuration page.

  16. As mentioned, the primary pod can have a different configuration from the worker pods for distributed PyTorch jobs. For our example, we want to use the same setup for workers and the primary so ensure the Allow different setup for the master radio button is deselected.

  17. Once finished, click Create Training to queue the training job.

After creating the training workload, the job will be scheduled with Run:ai and launched once resources become available. The job will appear in the Run:ai Workloads page after submission. After completion, the model checkpoints and results will be stored in your PVC for future use in other downstream tasks such as fine-tuning or inference. For more information, refer to ESM-2nv Model Overview.