6. Workload Examples
In this section, we give instructions for running several Workloads on your Run:ai on DGX Cloud cluster. The examples are not exhaustive, but can be adapted for your own workloads.
The following images are used in the workload examples and have been tested and validated as functional. Any other images are untested and may require adjustments to environment variables or additional debugging to function optimally.
nvcr.io/nvidia/nemo:24.05.01
nvcr.io/nvidia/nemo:24.05
nvcr.io/nvidia/nemo:24.07
nvcr.io/nvidia/pytorch:24.06-py3
nvcr.io/nvidia/pytorch:24.07-py3
nvcr.io/nvidia/pytorch:24.08-py3
nvcr.io/nvidia/rapidsai/notebooks:24.08-cuda11.8-py3.9
6.1. Interactive NeMo Workload Job
In this example, we step through the process of creating an interactive workload using the NeMo container from NGC. Interactive workloads in Run:ai are called Workspaces. In this particular example, we will run a Jupyter notebook to fine-tune an LLM (Llama3-8B Instruct) using LoRA against the PubMedQA dataset.
6.1.1. Prerequisites and Requirements
The following are required before running the interactive NeMo job:
You must have accepted an invitation to your NGC org and added your NGC credentials to Run:ai. See details here.
You must have the user role L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.
Your user must be able to access a project and department.
At least one A100 or newer GPU with 80 GB of memory.
You must have access to a compute resource with at least one GPU created in your scope that you can use.
You must create a Hugging Face account and agree to the Meta Llama 3 Community License Agreement while signed in to your Hugging Face account. You must then generate a Hugging Face read access token in your account settings. This token is required to access the Llama3 model in the Jupyter Notebook.
6.1.2. Creating the Data Source
To make it easier to reuse code and checkpoints in future jobs, a persistent volume claim (PVC) is created as a data source. The PVC can be mounted in jobs and will persist after the job completes, allowing any generated data to be reused.
To create a new PVC, go to the Data Sources page. Click New Data Source and then PVC to open the PVC creation form.
On the new form, set the desired scope.
Important
PVC Data Sources created at the cluster level do not replicate data across projects or namespaces. Each project or namespace will be provisioned as a separate PVC replica with different underlying PVs; therefore, the data in each PVC is not replicated.
Give the PVC a memorable name like nemo-lora-checkpoints and add a description if desired.
For the data options, select a new PVC storage class that suits your needs according to the PVC recommendations here. In this example, zonal-rwx is sufficient. To allow all nodes to read and write from/to the PVC, select Read-write by many nodes for the access mode. Enter 10 TB for the size to ensure we have plenty of capacity for future jobs. Select Filesystem as the volume mode. Lastly, set the Container path to /checkpoints, where the PVC will be mounted inside containers.
Once the form is complete, create the data source using the button at the bottom of the page.
Note
The first time a job is launched using a new PVC, it will take longer to start as the storage gets provisioned only once the first claim to the PVC is made.
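For reference, the PVC that this form creates corresponds roughly to the following Kubernetes manifest. This is only an illustrative sketch (the namespace placeholder and the 10Ti request are assumptions based on the values entered above); the Data Sources UI is the supported way to create the PVC.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nemo-lora-checkpoints
  namespace: runai-<Run:ai projectName>   # placeholder for your project namespace
spec:
  accessModes:
    - ReadWriteMany            # "Read-write by many nodes" in the form
  volumeMode: Filesystem
  storageClassName: zonal-rwx
  resources:
    requests:
      storage: 10Ti            # the 10 TB entered in the form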
6.1.3. Creating the Workspace
With the PVC created, we can configure the workspace to tell Run:ai how the job should be launched. To do so, go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Workspace. You will be taken to the New workspace creation page.
Select the desired project to run your job in.
Leave the Template pane of the form set to Start from scratch.
Enter a descriptive name for your Workspace, such as nemo-pubmedqa-lora. Click Continue. After a few seconds, the Environment pane of the creation process will appear.
An environment is a template of common settings that can be reused for similar jobs. For example, an environment based on a PyTorch container with Jupyter exposed as a tool lets anyone who needs to run interactive PyTorch jobs in a notebook reuse the same template instead of launching unique jobs and reconfiguring the settings each time they want to experiment. In this example, a new environment will be created based on the NeMo container, exposing Jupyter as a tool to make the template easier to reuse in the future.
To create a new environment for the NeMo job, click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.
In the Environment creation form, enter a name for the environment, such as “nemo-jupyter” and optionally add a description.
Under Image URL, put nvcr.io/nvidia/nemo:24.07. This pulls the NeMo container from NGC. If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check if there are updates to the image.
Under the Workload architecture & type pane, select Standard and Workspace if they are not selected already. These options are likely already selected and grayed out.
Click the Tools pane to open the tools menu, and click +Tool. This list allows you to add additional tools to the workspace that make it easier to connect to the services running inside it, such as Jupyter for connecting to external Jupyter lab sessions, VS Code for connecting to VS Code servers inside the job, and more.
Since we want to run an interactive Jupyter notebook, select Jupyter in the Select tool list. The Connection type drop-down should show External URL, Auto generate and Container port “8888”.
Click the Runtime settings pane, then click to expand the commands and arguments pane. This section allows us to override the default container run settings as needed. Since we want to launch a Jupyter lab session, enter jupyter-lab as the command, and
--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --ServerApp.allow_remote_access=true --allow-root --port=8888 --no-browser
for the arguments. This tells Run:ai how to connect to the Jupyter lab session in the UI.
For the directory, specify /opt/NeMo as the directory to launch the job in. This will launch the Jupyter lab session inside the /opt/NeMo directory inside the container.
Click Create Environment in the bottom right of the page. You should see a pop-up saying that your environment has been created. The page will refresh, and you will be taken back to the top of the Environment pane of the New workspace creation page. The environment you just created should now be selected.
Go to the Compute resource pane and select a single-GPU compute resource for your environment. While it is possible to use more GPUs for LoRA, most tasks can be completed in a relatively short time with just a single Ampere or newer GPU with 80 GB of memory.
In the Data Sources section, select your nemo-lora-checkpoints PVC.
Go to the bottom of the page and click Create Workspace. This will add your job to the queue, and it will automatically get scheduled once cluster resources are available.
After creating the workspace, you will be taken to the workload overview page, where you can view the status of your workload. Your interactive workload is ready once the status reads “Running”. This may take a while if the container you chose has not previously been pulled to the cluster and the PVC needs to be provisioned for the first time.
Once the status is “Running”, you can launch the Jupyter lab session by selecting the checkbox next to your workload and clicking CONNECT → Jupyter on the top menu bar. Jupyter lab will open in a new window.
6.1.4. Running the Notebook
Once connected to the Jupyter lab session, navigate to the example notebook by opening the tutorials/llm/llama-3 directory inside the /opt/NeMo folder that the container automatically opens to. Inside this directory are a couple of notebooks, including llama3-lora-nemofw.ipynb, which is the one we will be following. Double-click that file to open the notebook on the main screen.
The notebook walks through what each step does, but at a high level, it does the following:
Downloads the Llama3-8B Instruct model from Hugging Face
Converts the model to the .nemo format to be easily ingested by the NeMo toolkit
Downloads and preprocesses the PubMedQA dataset
Fine-tunes the Llama3-8B Instruct model using the LoRA PEFT technique against the PubMedQA dataset
Deploys the model for local inference
Evaluates the model against the test dataset to determine the overall accuracy of the fine-tuned model
The notebook can be stepped through cell-by-cell or run all at once, and the results viewed later. The only important note is that cell 2 requires you to add your Hugging Face access token, as listed in the requirements section in the cell text, in order to authenticate with Hugging Face to download the model.
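For reference, authentication can also be done programmatically in a notebook cell. This is a minimal sketch, assuming the huggingface_hub package is available in the NeMo container; the token shown is a placeholder that must be replaced with your own read access token.
from huggingface_hub import login

# Log in to Hugging Face so the Llama3 model can be downloaded (placeholder token).
login(token="hf_xxxxxxxxxxxxxxxxxxxx")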
As the notebook runs, you can navigate back to the Run:ai workloads page, where your job now shows metrics based on your resource usage. The metrics tab shows the GPU and CPU compute and memory usage during the job. This is a great resource for determining how your job is performing.
6.1.5. Persisting the Model to Storage
As the PVC was added to the workspace and is mounted at /checkpoints inside the container, any files or directories that are saved to /checkpoints inside the container will be saved in the PVC and available in future jobs.
To save the generated model to the PVC, open a terminal tab inside the Jupyter lab session and navigate to the notebook directory with:
cd /opt/NeMo/tutorials/llm/llama-3
The model is saved at results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo. It can be verified with:
$ du -sh results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo
21M     results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo
To copy the file to the PVC, run:
cp results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo /checkpoints
Once the checkpoint is copied, it will be available in all future workloads that use that same PVC unless the file is intentionally removed.
6.1.6. Cleaning up the Environment
After the job finishes, it can be deleted to free up resources for other jobs. Note that certain jobs do not self-terminate and should be monitored to ensure resources are not idle for completed jobs.
6.2. RAPIDS and Polars Workspace
In this example, we step through the process of creating another Workspace, which uses the RAPIDS container from NGC. We will launch a Jupyter environment to experiment with the new integration between RAPIDS and Polars.
6.2.1. Prerequisites and Requirements
The following are required before running the RAPIDS workspace:
You must have the user role L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.
Your user must be able to access a project and department.
At least one idle GPU within your DGX Cloud cluster.
You must have access to a compute resource with at least one GPU created in your scope that you can use.
6.2.2. Creating the Workspace
Go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Workspace. You will be taken to the New workspace creation page.
Select the desired project to run your job in.
Leave the Template pane of the form set to Start from scratch.
Enter a descriptive name for your Workspace, such as rapids-polars. Click Continue. After a few seconds, the Environment pane of the creation process will appear.
To create a new environment for the RAPIDS workspace, click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.
In the Environment creation form, enter a name for the environment, such as “rapids-jupyter” and optionally add a description.
Under Image URL, put nvcr.io/nvidia/rapidsai/notebooks:24.08-cuda11.8-py3.9. If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check if there are updates to the image.
Under the Workload architecture & type pane, select Standard and Workspace if they are not selected already. These options are likely already selected and grayed out.
Click the Tools pane to open the tools menu, and click +Tool. This list allows you to add additional tools to the workspace that make it easier to connect to the services running inside it, such as Jupyter for connecting to external Jupyter lab sessions, VS Code for connecting to VS Code servers inside the workspace, and more.
Since we want to run a Jupyter notebook, select Jupyter in the Select tool list. The Connection type drop-down should show External URL, Auto generate and Container port “8888”.
Click the Runtime settings pane, then click to expand the commands and arguments pane. This section allows us to override the default container run settings as needed. Since we want to launch a Jupyter lab session, enter jupyter-lab as the command, and
--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --ServerApp.allow_remote_access=true --allow-root --port=8888 --no-browser
for the arguments. This tells Run:ai how to connect to the Jupyter lab session in the UI.
Do not modify any other settings.
Click Create Environment in the bottom right of the page. You should see a pop-up saying that your environment has been created. The page will refresh, and you will be taken back to the top of the Environment pane of the New workspace creation page. The environment you just created should now be selected.
Go to the Compute resource pane and select a single GPU compute resource for your environment.
Go to the bottom of the page and click Create Workspace. This will add your job to the queue and it will automatically get scheduled once cluster resources are available.
After creating the workspace, you will be taken to the workload overview page, where you can view the status of your workload. Your workspace is ready once the status reads “Running”. This may take a while if the container you chose has not previously been pulled to the cluster.
Once the status is “Running”, you can launch the Jupyter lab session by selecting the checkbox next to your workload, and clicking CONNECT → Jupyter on the top menu bar. Jupyter lab will open in a new window.
6.2.3. Downloading the Polars Notebook
Once connected to the Jupyter lab session, you can navigate to the integrated RAPIDS notebook samples within the notebooks directory and experiment within your workspace.
To download the GPU-accelerated Polars example notebook, open a terminal interface by clicking the Terminal box.
In the resulting terminal, run the following command to download the new notebook.
wget https://raw.githubusercontent.com/rapidsai-community/showcase/refs/heads/main/accelerated_data_processing_examples/polars_gpu_engine_demo.ipynb
6.2.4. Running the Polars Notebook
The notebook provides an introduction to what Polars is and where it is best used, along with code to:
Install Polars with GPU Engine, along with other notebook dependencies
Download a sample Kaggle dataset
Initialize Polars to use the GPU in DGX Cloud
Run various sample analyses to provide performance comparisons between Polars with CPU only and Polars with GPU
To open the notebook, double-click on the file named polars_gpu_engine_demo.ipynb when it becomes visible in the Jupyter file navigation pane.
The notebook can be stepped through cell-by-cell or run all at once, and the results can be viewed later.
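For orientation, the core pattern the notebook demonstrates is selecting the GPU engine when collecting a lazy Polars query. The following is a minimal sketch, assuming Polars and its GPU engine have been installed as described in the notebook's first step; the sample data is purely illustrative.
import polars as pl

# Build a small lazy query over illustrative data.
lf = pl.LazyFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
query = lf.group_by("key").agg(pl.col("value").sum())

# Collect on the CPU (default engine) and on the GPU engine for comparison.
cpu_result = query.collect()
gpu_result = query.collect(engine="gpu")
print(gpu_result)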
As the notebook is running, you can navigate back to the Run:ai workloads page, where your job now shows metrics based on your resource usage. The metrics tab shows the GPU and CPU compute and memory usage during the job. This is a great resource for determining how your job is performing. Within the Jupyter notebook page you are using, you can also use an integrated GPU performance visualization plugin from the RAPIDS NGC container (it is the tab on the left side of the page with a GPU icon on it).
6.2.5. Cleaning up the Environment
After you are finished with the workload, it can be stopped or deleted to free up resources for other workloads. Note that certain workloads do not self-terminate and should be monitored to ensure resources are not sitting idle after completion.
6.3. Download Data From S3 in a Training Workload
In this example, we will create a new Workload to download data from a private S3 bucket. We will use a Training workload due to the noninteractive nature of this example. We will configure the workload with private information to access the S3 bucket and execute the commands necessary to install and run the s5cmd tool to download the data from that bucket to a PVC.
6.3.1. Prerequisites and Requirements
The following are required before running the data download Training workload:
You must have the user role L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.
Your user must be able to access a project and department.
You must have access to a compute resource in your scope that you can use—it can be CPU-only, which is ideal to avoid using GPU resources when it is unnecessary.
You must have a private S3 bucket with data in it.
You must have the necessary permissions to download that data from S3 via access key ID and secret access key.
6.3.2. Creating the Credentials
To securely insert the necessary keys into the workload, we will create a Credential for the access key ID and the secret access key, allowing access to our S3 bucket.
To create a new Credential, go to the Credentials page. Click + NEW CREDENTIALS and then Access key to open the access key creation form.
On the form, set the desired scope. The scope should be as narrow as possible—in this case, set the scope to match the targeted project for your workload.
Give the access key a descriptive name, such as aws-access-keys, and add a description if desired.
In the Secret pane, select New secret. Set the Access key field to your AWS access key ID, and the Access secret field to your AWS secret access key.
Click CREATE CREDENTIALS to save the new Credential.
6.3.3. Creating the Data Source
We will create a PVC Data Source to allow the resulting dataset to be shared with future workloads.
To create a new PVC, go to the Data Sources page. Click New Data Source and then PVC to open the PVC creation form.
On the new form, set the desired scope.
Important
PVC Data Sources created at the cluster level do not replicate data across projects or namespaces. Each project or namespace will be provisioned as a separate PVC replica with different underlying PVs; therefore, the data in each PVC is not replicated.
Give the PVC a memorable name like s3-dataset and add a description if desired.
For the data options, select a new PVC storage class that suits your needs according to the PVC recommended storage classes. In this example, zonal-rwx is sufficient. To allow all nodes to read and write from/to the PVC, select Read-write by many nodes for the access mode. Enter a capacity that will fit your target dataset—we will specify 10 TB for this example. Select Filesystem as the volume mode. Lastly, set the Container path to /dataset, where the PVC will be mounted inside containers.
Once the form is complete, click the CREATE DATA SOURCE button at the bottom of the page.
Note
The first time a job is launched using a new PVC, it will take longer to start as the storage gets provisioned only once the first claim to the PVC is made.
6.3.4. Creating the Training Workload
Go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Training. You will be taken to the New training creation page.
Select the desired project to run your job in.
Leave the Template pane of the form set to Start from scratch.
Enter a descriptive name for your workload, such as s3-download. Click Continue. After a few seconds, the Environment pane of the creation process will appear.
To create a new environment for the workload, click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.
In the Environment creation form, enter a name for the environment, such as s3-download-env, and optionally add a description.
Under Image URL, put nvcr.io/nvidia/pytorch:24.08-py3. If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check for updates to the image.
Under the Workload architecture & type pane, select Standard and Training if they are not selected already. These options are likely already selected and grayed out.
Do not modify any other settings, including the Runtime settings section—we will fill that in for the specific workload, keeping this environment multipurpose for future workloads.
Click Create Environment in the bottom right of the page. You should see a pop-up saying that your environment has been created. The page will refresh, and you will be taken back to the top of the Environment pane of the New training creation page. The environment you just created should now be selected.
Open the Runtime settings section for this training workload. We will execute a sequence of several commands within this workload. Set the Command field to bash -c, and the Arguments field to:
"wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz; tar -xf s5cmd_2.2.2_Linux-64bit.tar.gz; ./s5cmd cp s3://your-s3-bucket-name-here/* /dataset"
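For clarity, the command and arguments above are equivalent to running the following shell sequence inside the container; the bucket name is a placeholder that must be replaced with your own.
# Download and unpack the s5cmd release, then copy the bucket contents into the mounted PVC.
wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
tar -xf s5cmd_2.2.2_Linux-64bit.tar.gz
./s5cmd cp "s3://your-s3-bucket-name-here/*" /dataset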
To embed the Credentials into the workload, click the + ENVIRONMENT VARIABLE button twice.
For the first entry’s Name field, enter AWS_ACCESS_KEY_ID, set the Source to Credentials, set the Credentials name field to aws-access-keys, and set the Secret Key field to AccessKeyId.
For the second entry’s Name field, enter AWS_SECRET_ACCESS_KEY, set the Source to Credentials, set the Credentials name field to aws-access-keys, and set the Secret Key field to SecretKey.
Go to the Compute resource pane and select your desired compute resource—ideally a CPU-only resource to avoid consuming GPU resources.
In the Data Sources section, select your s3-dataset PVC.
Go to the bottom of the page and click CREATE TRAINING. This will add your workload to the queue, and it will automatically get scheduled once cluster resources are available.
After creating the training workload, you will be taken to the workload overview page, where you can view the status of your workload. Your training workload is running once the status reads “Running”. This may take a while if the container you chose has not previously been pulled to the cluster.
Once the status is “Completed”, you can review the outcome by clicking the workload and selecting the SHOW DETAILS button on the right side of the blue banner. Select the LOGS tab in the workload’s detail pane—it should indicate that data is being copied to your new PVC from your S3 bucket.
6.3.5. Cleaning up the Environment
After the job finishes, it can be deleted to free up resources for other jobs. Note that certain jobs do not self-terminate and should be monitored to ensure resources are not idle for completed jobs.
6.4. Using Your Data From Google Cloud Storage
There are two ways to use your data residing in Google or other clouds:
Copy the data into the DGX Cloud environment, from Google Cloud Storage (GCS) or other cloud, onto the shared store NFS filesystems.
Access the data directly from services within the cloud provider. We recommend running this job as a pure Kubernetes job at this time.
The following subsections provide examples of these scenarios.
6.4.1. Example: Using GCS to copy data to PVC within Run:ai
In GCP, create a custom GCP role with the required permissions.
The following is a minimal set of permissions, which may be created by cloning the Storage Object Viewer role and adding the storage.buckets.get permission.
storage.buckets.get
storage.buckets.getIamPolicy
storage.buckets.listEffectiveTags
storage.buckets.listTagBindings
storage.managedFolders.create
storage.managedFolders.get
storage.managedFolders.list
storage.objects.list
Create a GCP service account (SA).
Assign the custom role to the service account on the target bucket(s).
Create a service account key for the SA created and download the .json key file. Protect the key file as it contains your private key.
Create a Kubernetes secret with the SA key in your Run:ai project (namespace) using the following command.
Note
You must have your kubeconfig set to the Run:ai researcher configuration provided by DGX Cloud.
kubectl create secret generic google-gcs-creds --from-file=key=projectID-deadbeef.json -n runai-<Run:ai projectName>
Within Run:ai, create a data source as a PVC using the following sequence:
Specify the scope, name, and an optional description for the PVC.
In the Data Mount section, select New PVC. Choose a storage class and access mode, and always select Filesystem as the volume mode. Then, set a default mount path within any containers.
Note
The Block Volume mode is unsupported. Selecting it may lead to errors when launching a workload that uses the resulting PVC.
Use the kubectl get pvc command to determine the actual name of the PVC created by the Run:ai console. The actual name will be similar to name-for-pvc-<cluster-identifier>-<digits>. Use this for the next step.
Create a Kubernetes manifest YAML utilizing the Google Cloud command line to copy the data from GCS into the PVC. The example below uses the secret you created and the PVC as mounted volumes in the container. The project ID and bucket name are passed to a gcloud storage rsync command via environment variables.
apiVersion: v1
kind: Pod
metadata:
  name: gcp-gcs-copy-2-pv-pod
  namespace: runai-<Run:ai projectName>
spec:
  volumes:
    - name: pv-storage
      persistentVolumeClaim:
        claimName: name-for-pvc
    - name: gcp-sa-credential
      secret:
        secretName: google-gcs-creds
  restartPolicy: OnFailure
  schedulerName: runai-scheduler
  containers:
    - name: gcp-cli-container
      image: google/cloud-sdk
      imagePullPolicy: IfNotPresent
      env:
        - name: PROJECTNAME
          value: projectID
        - name: BUCKET
          value: bucket-name
      volumeMounts:
        - mountPath: "/data/shared"
          name: pv-storage
        - name: gcp-sa-credential
          mountPath: /etc/gcp_cred
          readOnly: true
      command: ["sh","-c"]
      args:
        - gcloud auth activate-service-account --key-file=/etc/gcp_cred/key;
          gcloud storage rsync -r gs://${BUCKET} /data/shared/ --project=${PROJECTNAME};
Submit the Kubernetes manifest for execution in the DGX Cloud environment using the following command: kubectl apply -f manifest.yaml -n runai-<Run:ai ProjectName>.
Monitor the job for completion using kubectl get pod gcp-gcs-copy-2-pv-pod.
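To inspect progress in more detail, the container logs can also be streamed with a standard kubectl command (using the pod name and namespace from the manifest above):
kubectl logs -f gcp-gcs-copy-2-pv-pod -n runai-<Run:ai projectName>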
You can now use the PVC in your Run:ai jobs.
6.4.2. Example: Using data directly within Google Cloud Services
Run:ai training jobs can directly access data in Google Cloud storage. This section gives an example of how to do this in a Python job, and the small configuration steps needed to make it happen.
The following Python code demonstrates how to read all text files from a bucket into a list of strings. While this approach may not be practical for AI training workloads, it will help illustrate the necessary configurations.
1 from google.cloud import storage
2 import argparse
3 import os
4
5 def get_bucket_name():
6 parser = argparse.ArgumentParser(description="Read text files from a Google Cloud Storage bucket.")
7 parser.add_argument("bucket_name", help="The name of the bucket to read from.", nargs='?')
8 args = parser.parse_args()
9
10 if args.bucket_name:
11 return args.bucket_name
12 elif "BUCKET" in os.environ:
13 return os.environ["BUCKET"]
14 else:
15 print("Error: Bucket name not provided. Please provide it as a command line argument or set the BUCKET environment variable.")
16 return None
17
18 def read_bucket_files(bucket_name):
19 storage_client = storage.Client()
20 bucket = storage_client.bucket(bucket_name)
21
22 file_contents = []
23 blobs = bucket.list_blobs()
24 for blob in blobs:
25 if blob.name.endswith(".txt"):
26 # Download the file content
27 content = blob.download_as_string().decode("utf-8")
28 file_contents.append(content)
29
30 return file_contents
31
32 if __name__ == "__main__":
33 bucket_name = get_bucket_name()
34 if bucket_name:
35 file_contents = read_bucket_files(bucket_name)
36 else:
37 exit(1)
38
39 for content in file_contents:
40 print(content)
To execute the above code in a containerized training job on Kubernetes, you will need to do the following:
Include the Google Cloud Storage Python libraries into the container.
Inject a secret into the container, allowing proper authorization to the GCS API.
To install the google-cloud-storage Python library, use pip to install it into the container image. For example, use the following Dockerfile:
1 FROM python:3.12
2
3 RUN pip install google-cloud-storage
4 RUN mkdir /app
5 COPY read.py /app/
6
7 ENTRYPOINT ["python"]
8 CMD ["/app/read.py"]
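A sketch of building and pushing this image to your NGC private registry is shown below; the gcs-read image name and <orgID> placeholder match the manifest that follows, and it assumes you have already run docker login nvcr.io with your NGC key.
# Build the image locally, then push it to your NGC private registry.
docker build -t nvcr.io/<orgID>/gcs-read:latest .
docker push nvcr.io/<orgID>/gcs-read:latest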
To inject the secret into the container, use a Kubernetes secret mounted as a file, with the environment variable GOOGLE_APPLICATION_CREDENTIALS pointed at the file. The example below uses the same secret as the one created above.
1 apiVersion: v1
2 kind: Pod
3 metadata:
4 name: gcs-reader
5 spec:
6 containers:
7 - name: gcs-reader
8 image: nvcr.io/<orgID>/gcs-read:latest
9 imagePullPolicy: IfNotPresent
10 command: ["python", "/app/read.py"]
11 env:
12 - name: BUCKET
13 value: bucket-name
14 - name: GOOGLE_APPLICATION_CREDENTIALS
15 value: /etc/gcp_cred/key
16 volumeMounts:
17 - name: gcs-creds
18 mountPath: /etc/gcp_cred
19 readOnly: true
20 imagePullSecrets:
21 - name: nvcr.io-creds
22 volumes:
23 - name: gcs-creds
24 secret:
25 secretName: google-gcs-creds
26 restartPolicy: OnFailure
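The pod can then be submitted and its output checked with standard kubectl commands; the manifest filename below is a placeholder for whatever you saved the YAML as.
kubectl apply -f gcs-reader.yaml -n runai-<Run:ai projectName>
kubectl logs gcs-reader -n runai-<Run:ai projectName>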
Note
Currently, there isn’t a method to mount Kubernetes secrets as files in Run:ai jobs. Run:ai supports only environment variables. Since the Google libraries require a token file (which should be a secret), you will need workarounds to use them with Run:ai jobs. The recommended workarounds are:
only use Kubernetes native jobs for data manipulation, or
copy the GCP service account key into shared storage and point the GOOGLE_APPLICATION_CREDENTIALS environment variable at that shared storage location.
6.4.3. Example: Reading data from Google BigQuery
You can use data from Google BigQuery directly in your training jobs on Run:ai. This example shows the Python script and the small configuration code needed.
In GCP, the BigQuery Data Viewer role contains the necessary permissions and may be assigned at the table, dataset or project levels. Following the principle of least privilege, NVIDIA recommends assigning the role at the Table level, unless you need to read multiple tables from the dataset or project. The following is a Python code example for a BigQuery reader.
1 import argparse
2 import os
3 from google.cloud import bigquery
4
5 def read_bigquery_table(project, dataset, table):
6 client = bigquery.Client()
7 table_ref = client.dataset(dataset).table(table)
8 table = client.get_table(table_ref)
9
10 rows = client.list_rows(table)
11 return rows
12
13 if __name__ == "__main__":
14 parser = argparse.ArgumentParser(description='Read a BigQuery table.')
15 parser.add_argument('--project', help='The project ID.')
16 parser.add_argument('--dataset', help='The dataset ID.')
17 parser.add_argument('--table', help='The table ID.')
18
19 args = parser.parse_args()
20
21 project = args.project or os.environ.get('PROJECTNAME')
22 dataset = args.dataset or os.environ.get('DATASET')
23 table = args.table or os.environ.get('TABLENAME')
24
25 if not project or not dataset or not table:
26 print("Error: Missing project, dataset, or table name.")
27 exit(1)
28
29 rows = read_bigquery_table(project, dataset, table)
30
31 for row in rows:
32 print(row)
Include the above code in a container using the following Dockerfile:
1 FROM python:3.12
2
3 RUN pip install google-cloud-bigquery
4 RUN mkdir /app
5 COPY read-bq.py /app/
6
7 ENTRYPOINT ["python"]
8 CMD ["/app/read-bq.py"]
Finally, you can start a job/pod with the following Kubernetes manifest:
1 apiVersion: v1
2 kind: Pod
3 metadata:
4 name: read-bq
5 spec:
6 containers:
7 - name: read-bq
8 image: nvcr.io/<orgID>/read-from-bq:latest
9 imagePullPolicy: Always
10 command: ["python", "/app/read-bq.py"]
11 env:
12 - name: PROJECTNAME
13 value: project
14 - name: DATASET
15 value: austin_training
16 - name: TABLENAME
17 value: example
18 - name: GOOGLE_APPLICATION_CREDENTIALS
19 value: /etc/gcp_cred/key
20 volumeMounts:
21 - name: gcs-creds
22 mountPath: /etc/gcp_cred
23 readOnly: true
24 imagePullSecrets:
25 - name: nvcr.io-creds
26 volumes:
27 - name: gcs-creds
28 secret:
29 secretName: google-gcs-creds
30 restartPolicy: OnFailure
6.4.4. Alternative to using Service Account Keys
Service account keys are powerful credentials and can present a security risk if they are not managed correctly. Workload Identity Federation eliminates the maintenance and security burden associated with service account keys. With Workload Identity Federation, you can use Identity and Access Management (IAM) to grant external identities IAM roles and direct access to Google Cloud resources. You can also grant access through service account impersonation.
As part of environment onboarding, NVIDIA will provide you with the OIDC issuer URL for the environment. Navigate to your organization’s Google Cloud Console, then go to IAM & Admin > Workload Identity Federation. Click the CREATE POOL button to proceed.
In the subsequent workflow wizard, provide a name for the pool and an optional description and select OpenID Connect (OIDC) as the provider.
After selecting the provider type, enter “dgx-cloud” as the provider name and paste the Issuer URL provided with your environment into the Issuer (URL) field. Under Audiences, select Default audience.
Lastly, add a mapping from google.subject to assertion.sub in the Configure provider attributes step.
Once complete, the Workload Identity Pool will confirm that it can communicate with the “dgx-cloud” Kubernetes cluster. You should see a green checkmark in the upper left corner near the provider you created, as shown below.
Then, click CONNECTED SERVICE ACCOUNTS in the upper right corner and then select the DOWNLOAD CONFIG button. When the Configure your application dialog comes up, select the provider you created and then paste the path /var/run/service-account/token into the OIDC ID token path field, leaving the format as “text”, and then click DOWNLOAD CONFIG to download the required configuration file.
After saving the configuration file, click DISMISS. On the main Identity Federation Pool page, locate and copy the value in the IAM Principal field. It will look similar to:
principal://iam.googleapis.com/projects/1091797687346/locations/global/workloadIdentityPools/nvidia-dgx-cloud-run-ai-env/subject/SUBJECT_ATTRIBUTE_VALUE
where SUBJECT_ATTRIBUTE_VALUE represents the Kubernetes identifier.
For example, if the service account is default in the Run:ai project runai-projectname and we substitute system:serviceaccount:runai-projectname:default as the SUBJECT_ATTRIBUTE_VALUE, the overall IAM Principal will be:
principal://iam.googleapis.com/projects/1068792787346/locations/global/workloadIdentityPools/nvidia-dgx-cloud-run-ai-env/subject/system:serviceaccount:runai-projectname:default
This is the IAM Principal to use (within GCP IAM) to grant access to workloads in that Run:ai project as shown below:
1 gcloud projects add-iam-policy-binding projects/PROJECT_ID \
2 --role=roles/container.clusterViewer \
3 --member=principal://iam.googleapis.com/projects/1068792787346/locations/global/workloadIdentityPools/nvidia-dgx-cloud-run-ai-env/subject/system:serviceaccount:runai-projectname:default \
4 --condition=None
Alternatively, the same binding can be granted through the Cloud Console UI.
Once IAM permissions have been granted, create a ConfigMap on the “dgx-cloud” Kubernetes cluster from the configuration file you downloaded, using the following command.
kubectl create configmap google-wi --from-file clientLibraryConfig-dgx-cloud.json --namespace runai-projectname
Once the ConfigMap has been created, each workload should be augmented with three items: a volume projecting the SA token, a volume mapping the ConfigMap, and an environment variable pointing to the ConfigMap projection. The following example manifest shows a simple container with the three requirements.
Note
The audience property in the serviceAccountToken below may be copied from the clientLibraryConfig file’s audience field (be sure to add the https: prefix).
1 apiVersion: v1
2 kind: Pod
3 metadata:
4 name: oidc-example
5 spec:
6 containers:
7 - name: example
8 image: google/cloud-sdk
9 command: ["/bin/sh", "-c", "gcloud auth login --cred-file $GOOGLE_APPLICATION_CREDENTIALS && sleep infinity"]
10 volumeMounts:
11 - name: token
12 mountPath: "/var/run/service-account"
13 readOnly: true
14 - name: workload-identity-credential-configuration
15 mountPath: "/etc/workload-identity"
16 readOnly: true
17 env:
18 - name: GOOGLE_APPLICATION_CREDENTIALS
19 value: "/etc/workload-identity/clientLibraryConfig-dgx-cloud.json"
20 volumes:
21 - name: token
22 projected:
23 sources:
24 - serviceAccountToken:
25 audience: https://iam.googleapis.com/projects/<Your PROJECT_ID>/locations/global/workloadIdentityPools/<Your WORKLOAD_POOL>/providers/<Your WORKLOAD_PROVIDER>
26 expirationSeconds: 3600
27 path: token
28 - name: workload-identity-credential-configuration
29 configMap:
30 name: google-wi
To confirm the configuration, run kubectl exec -it oidc-example -n runai-projectname -- bash to access the container, then run gcloud config list inside it.
The GCP resources configured in IAM should be accessible using either gcloud or gsutil commands.
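For example, once a bucket has been granted to the IAM principal above, a quick check from inside the container might look like the following; the bucket name is a placeholder.
# List and copy objects using the federated identity; no service account key file is needed.
gcloud storage ls gs://your-bucket-name
gcloud storage cp gs://your-bucket-name/sample.txt /tmp/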
6.5. Using Your Data From Amazon S3
There are two ways to use your data residing in AWS or other clouds:
Copy the data into the DGX Cloud environment, from Amazon S3 or other cloud storage, onto the shared store NFS filesystems.
Access the data directly from services within the cloud provider. We recommend running this job as a pure Kubernetes job at this time.
Examples of both scenarios are provided below.
6.5.1. Example: Using S3 to copy data to PVC within Run:ai
In AWS, create a custom IAM policy with the required permissions.
The following is a minimal set of permissions, which may be created by granting S3 read access.
s3:GetObject
s3:ListBucket
s3:GetBucketPolicy
s3:PutObject
Create an AWS IAM user or role.
Attach the custom policy to the IAM user or role to allow access to the target S3 bucket(s).
Create an access key for the IAM user and download the .csv key file. Protect the key file as it contains your private credentials.
Create a Kubernetes secret with the IAM access keys in your Run:ai project (namespace) using the following command.
Note
You must have your kubeconfig set to the Run:ai researcher configuration provided by DGX Cloud.
kubectl create secret generic aws-s3-creds --from-literal=aws_access_key_id=<ACCESS_KEY_ID> --from-literal=aws_secret_access_key=<SECRET_ACCESS_KEY> -n runai-<Run:ai projectName>
Within Run:ai, create a data source as a PVC using the following sequence:
Specify the scope, name, and an optional description for the PVC.
In the Data Mount section, select New PVC. Choose a storage class and access mode, and always select Filesystem as the volume mode. Then, set a default mount path within any containers.
Note
The Block Volume mode is unsupported. Selecting it may lead to errors when launching a workload that uses the resulting PVC.
Use the kubectl get pvc command to determine the actual name of the PVC created by the Run:ai console. The actual name will be similar to name-for-pvc-<cluster-identifier>-<digits>. Use this for the next step.
Create a Kubernetes manifest YAML utilizing the AWS CLI to copy the data from S3 into the PVC. The example below exposes the secret you created as environment variables and mounts the PVC as a volume in the container. The bucket name is passed to an aws s3 sync command via an environment variable.
apiVersion: v1
kind: Pod
metadata:
  name: aws-s3-copy-2-pv-pod
  namespace: runai-<Run:ai projectName>
spec:
  volumes:
    - name: pv-storage
      persistentVolumeClaim:
        claimName: name-for-pvc
  restartPolicy: OnFailure
  schedulerName: runai-scheduler
  containers:
    - name: aws-cli-container
      image: amazon/aws-cli
      imagePullPolicy: IfNotPresent
      env:
        - name: BUCKET
          value: <bucket-name>
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-s3-creds
              key: aws_access_key_id
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-s3-creds
              key: aws_secret_access_key
      volumeMounts:
        - mountPath: "/data/shared"
          name: pv-storage
      command: ["sh","-c"]
      args:
        - aws s3 sync s3://${BUCKET} /data/shared/;
Submit the Kubernetes manifest for execution in the DGX Cloud environment using the following command: kubectl apply -f manifest.yaml -n runai-<Run:ai ProjectName>.
Monitor the job for completion using kubectl get pod aws-s3-copy-2-pv-pod.
You can now use the PVC in your Run:ai jobs.
6.5.2. Example: Using data directly within AWS Services
Run:ai training jobs can directly access data in Amazon S3. This section gives an example of how to do this in a Python job, and the small configuration steps needed to make it happen.
The following Python code demonstrates how to read all text files from a bucket into a list of strings. While this approach may not be practical for AI training workloads, it will help illustrate the necessary configurations.
1 import boto3
2 import argparse
3 import os
4
5 def get_bucket_name():
6 parser = argparse.ArgumentParser(description="Read text files from an AWS S3 bucket.")
7 parser.add_argument("bucket_name", help="The name of the bucket to read from.", nargs='?')
8 args = parser.parse_args()
9
10 if args.bucket_name:
11 return args.bucket_name
12 elif "BUCKET" in os.environ:
13 return os.environ["BUCKET"]
14 else:
15 print("Error: Bucket name not provided. Please provide it as a command line argument or set the BUCKET environment variable.")
16 return None
17
18 def read_bucket_files(bucket_name):
19 s3_client = boto3.client('s3')
20 file_contents = []
21 response = s3_client.list_objects_v2(Bucket=bucket_name)
22 for obj in response.get('Contents', []):
23 if obj['Key'].endswith(".txt"):
24 file_obj = s3_client.get_object(Bucket=bucket_name, Key=obj['Key'])
25 content = file_obj['Body'].read().decode('utf-8')
26 file_contents.append(content)
27 return file_contents
28
29 if __name__ == "__main__":
30 bucket_name = get_bucket_name()
31 if bucket_name:
32 file_contents = read_bucket_files(bucket_name)
33 else:
34 exit(1)
35
36 for content in file_contents:
37 print(content)
To execute the above code in a containerized training job on Kubernetes, you will need to do the following:
Include the AWS SDK for Python (Boto3) into the container.
Inject a secret into the container, allowing proper authorization to the S3 API.
To install the boto3 Python library, use pip to install it into the container image. For example, use the following Dockerfile:
1 FROM python:3.12
2
3 RUN pip install boto3
4 RUN mkdir /app
5 COPY read.py /app/
6
7 ENTRYPOINT ["python"]
8 CMD ["/app/read.py"]
To inject the secret into the container, expose the Kubernetes secret through the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The example below uses the same secret as the one created above.
1 apiVersion: v1
2 kind: Pod
3 metadata:
4 name: s3-reader
5 spec:
6 containers:
7 - name: s3-reader
8 image: nvcr.io/<orgID>/s3-read:latest
9 imagePullPolicy: IfNotPresent
10 command: ["python", "/app/read.py"]
11 env:
12 - name: BUCKET
13 value: <bucket-name>
14 - name: AWS_ACCESS_KEY_ID
15 valueFrom:
16 secretKeyRef:
17 name: aws-s3-creds
18 key: aws_access_key_id
19 - name: AWS_SECRET_ACCESS_KEY
20 valueFrom:
21 secretKeyRef:
22 name: aws-s3-creds
23 key: aws_secret_access_key
24 volumeMounts:
25 - name: aws-credentials
26 mountPath: /root/.aws/
27 readOnly: true
28 imagePullSecrets:
29 - name: nvcr.io-creds
30 volumes:
31 - name: aws-credentials
32 secret:
33 secretName: aws-s3-creds
34 restartPolicy: OnFailure
Note
Currently, there isn’t a method to mount Kubernetes secrets as files in Run:ai jobs. Run:ai supports only environment variables. Since the AWS SDK for Python requires access keys (which should be stored as secrets), you will need workarounds to use them with Run:ai jobs. The recommended workarounds are:
only use Kubernetes native jobs for data manipulation, or
copy the AWS credentials into shared storage and set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables from that shared storage location.
6.6. Running Visual Studio Code Inside a Workload
In this example, we step through the process of creating a Workspace via the Run:ai Command-Line Interface (CLI) that runs a VS Code instance, and exposing the necessary port to our local workstation in order to access the VS Code web interface.
6.6.1. Prerequisites and Requirements
The following are required before running the VS Code workspace:
You must have the user role L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.
Your user must be able to access a project and department.
You must have installed the Run:ai CLI tool, as detailed in the Advanced documentation.
6.6.2. Creating the Workspace via CLI on a local workstation
Open a terminal on your workstation (or other system that has the Run:ai CLI installed and configured on it).
Note
If the system you have installed the Run:ai CLI on is a remote system, you can SSH to that system and create a local port forward, connecting the port Run:ai will use with the same port on your local workstation. For example, ssh -L 8080:localhost:8080 the-ip-or-hostname-of-your-remote-system.
Make note of your target project.
Enter the following command and arguments into your terminal (inserting your target project as the argument to the --project flag) and hit enter.
runai submit vscode --project your-project-name-here --image linuxserver/code-server:4.92.2 --environment PUID=0 --environment PGID=0 --interactive
(Optional) Validate that the workload is running with the following command.
runai describe job vscode --project your-project-name-here
Forward a port to your local workstation through the Run:ai CLI with the following command. The first port in the argument to --port is arbitrary; it can be changed to any unused port on your system instead of 8080.
runai port-forward vscode --project your-project-name-here --port 8080:8443
To access the VS Code web interface, navigate to http://localhost:8080.
6.6.3. Cleaning up the Environment
After you are finished with the workload, it can be stopped or deleted to free up resources for other workloads. To delete the workload from the CLI, use the following command.
runai delete --project your-project-name-here job vscode
6.7. Using WandB with a Workspace
In this example, we will create an interactive Workload with a Weights and Biases (WandB) integration for experiment tracking. For a training Workload, a similar process to creating and adding the credential and environment variable can be followed. Once the API Key is connected to your Workload, your WandB run should automatically connect to your account.
6.7.1. Prerequisites and Requirements
The following are required before using the WandB integration with an interactive Workload:
You must have the user role of L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.
Your user must be able to access a project and department.
You must have access to a compute resource in your scope that you can use.
You must have access to a Weights and Biases account and have an API key.
6.7.2. Creating the Credentials
To securely insert the necessary key into the workload, we will create a Credential that stores your WandB API key.
To create a new Credential, go to the Credentials page. Click + NEW CREDENTIALS and then Access key to open the access key creation form.
On the form, set the desired scope. The scope should be as narrow as possible—in this case, set the scope to match the targeted project for your workload.
Give the access key a descriptive name, such as wandb-access-key, and add a description if desired.
In the Secret pane, select New secret. Set the Access key field to WANDB_API_KEY, and the Access secret field to your WandB API key (this can be found in your profile settings on the WandB site).
Click CREATE CREDENTIALS to save the new Credential.
6.7.3. Creating the Workspace
Go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Workspace. You will be taken to the New workspace creation page.
Select the desired project to run your job in.
Leave the Template pane of the form set to Start from scratch.
Enter a descriptive name for your workload, such as wandb-training. Click Continue. After a few seconds, the Environment pane of the creation process will appear.
To create a new environment for the workload, click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.
In the Environment creation form, enter a name for the environment, such as wandb-training-env, and optionally add a description.
Under Image URL, put nvcr.io/nvidia/pytorch:24.08-py3. If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check if there are updates to the image.
Under the Workload architecture & type pane, select Standard and Workspace if they are not selected already. These options are likely already selected and grayed out.
Click the Tools pane to open the tools menu, and click +Tool. This list allows you to add additional tools to the workspace that make it easier to connect to the services running inside it, such as Jupyter for connecting to external Jupyter lab sessions, VS Code for connecting to VS Code servers inside the workspace, and more.
Since we want to run a Jupyter notebook, select Jupyter in the Select tool list. The Connection type drop-down should show External URL, Auto generate, and Container port “8888”.
Click the Runtime settings pane, then click to expand the commands and arguments pane. This section allows us to override the default container run settings as needed. Since we want to launch a Jupyter lab session, enter jupyter-lab as the command, and
--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --ServerApp.allow_remote_access=true --allow-root --port=8888 --no-browser
for the arguments. This tells Run:ai how to connect to the Jupyter lab session in the UI.
Do not modify any other settings.
Click Create Environment in the bottom right of the page. You should see a pop-up saying that your environment has been created. The page will refresh, and you will be taken back to the top of the Environment pane of the New workspace creation page. The environment you just created should now be selected.
Go to the Compute resource pane and select a single GPU compute resource for your environment.
To embed the Credentials into the workload, click the + ENVIRONMENT VARIABLE button.
For the Name field, enter WANDB_API_KEY, set the Source to Credentials, set the Credentials name field to wandb-access-key, and set the Secret Key field to SecretKey.
Note
If your WandB project is connected to a specific WandB team, an additional environment variable may be needed. For the Name field, enter WANDB_ENTITY, set the Source to Custom, and set the Value field to the name of your team.
Go to the bottom of the page and click CREATE WORKSPACE. This will add your workload to the queue and it will automatically get scheduled once cluster resources are available.
After creating the Workspace, you will be taken to the workload overview page, where you can view the status of your workload. Your workspace is ready once the status reads “Running”. This may take a while if the container you chose has not previously been pulled to the cluster.
Once the status is “Running”, you can launch the Jupyter lab session by selecting the checkbox next to your workload and clicking CONNECT → Jupyter on the top menu bar. Jupyter lab will open in a new window.
6.7.4. Connecting to WandB in your Jupyter Notebook
The WandB client is not included in the PyTorch container by default. Open a terminal tab in your Jupyter lab session to install it. Run the following:
pip install wandb
Open a new notebook and run the following code to test your WandB connection:
import wandb
import random

# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    project="my-awesome-project",
    # track hyperparameters and run metadata
    config={
        "learning_rate": 0.02,
        "architecture": "CNN",
        "dataset": "CIFAR-100",
        "epochs": 10,
    }
)

# simulate training
epochs = 10
offset = random.random() / 5
for epoch in range(2, epochs):
    acc = 1 - 2 ** -epoch - random.random() / epoch - offset
    loss = 2 ** -epoch + random.random() / epoch + offset
    # log metrics to wandb
    wandb.log({"acc": acc, "loss": loss})

# [optional] finish the wandb run, necessary in notebooks
wandb.finish()
After running this code, you should see a new project called my-awesome-project in your WandB account.
For your own WandB experiments, adding the API key will automate the login process so your own code should run automatically connected to your account.
6.7.5. Cleaning up the Environment
After the workload has completed, it can be deleted from the Workloads page, or left present as a reference for a future data download workload.
6.8. Distributed PyTorch Training Job
In this example, we demonstrate how to run a multi-node training job using the PyTorch training operator from Kubeflow.
6.8.1. Requirements
For running a Distributed PyTorch training job, a custom Docker container needs to be built. This is necessary to get relevant code into a container that can be run and shared for all future jobs.
To build a custom Docker container, the Docker runtime needs to be installed on a local machine with the same CPU architecture as the DGX Cloud hosts (that is, if the DGX Cloud hosts have AMD64-based CPUs, the container needs to be built on an AMD64 machine. If the DGX Cloud hosts are arm-based, the container needs to be built on an ARM CPU). To install Docker locally, follow the Docker Engine Install guide.
Additionally, you need to authenticate with your NGC private registry using your NGC key created during the Credentials section above. To authenticate with NGC, run the following commands (note, the username must be exactly “$oauthtoken”):
$ docker login nvcr.io
Username: $oauthtoken
Password: <NGC Key here>
6.8.2. Creating a Custom Docker Container
On your local machine where Docker is installed, create and navigate to a directory to save the Dockerfile, such as pytorch-distributed:
$ mkdir -p pytorch-distributed
$ cd pytorch-distributed
In the new directory, open a new file named run.sh and copy the following contents to the file:
#!/bin/bash

torchrun multinode.py --batch_size 32 100 25
cp snapshot.pt /checkpoints
This is a very simple script that uses torchrun to launch a distributed training job and copies the generated checkpoint to the /checkpoints directory inside the container so it can be used again later.
Save and close the file. Next, open another new file named Dockerfile and copy the following contents to the file:
FROM nvcr.io/nvidia/pytorch:24.07-py3

WORKDIR /runai-distributed
RUN git clone https://github.com/pytorch/examples

WORKDIR /runai-distributed/examples/distributed/ddp-tutorial-series
COPY run.sh .
This Dockerfile uses the 24.07 PyTorch container hosted on NGC as a base, clones the official PyTorch examples repository inside the container, and copies the run.sh file created previously into the container. Save and close the file.

Once both files have been saved locally, build a container with the following command, replacing <ORG ID> with the ID of your private registry on NGC:
docker build -t nvcr.io/<ORG ID>/pytorch-ddp-example:24.07-py3 .
This will build the custom container locally.
Once the build has finished, push the image to the NGC private registry with:
docker push nvcr.io/<ORG ID>/pytorch-ddp-example:24.07-py3
The custom container will be available in your private NGC registry and can be used immediately for jobs.
6.8.3. Creating a Data Source
To make it easier to reuse code and checkpoints in future jobs, a data source is created, specifically a PVC. The PVC can be mounted in jobs and will persist after the job completes so any data created can be reused.
To create a new PVC, go to the Data Sources page. Click New Data Source then PVC to open the PVC creation form.
On the new form, set the desired scope.
Important
PVC Data Sources created at the cluster level do not replicate data across projects or namespaces. Each project or namespace will be provisioned as a separate PVC replica with different underlying PVs; therefore, the data in each PVC is not replicated.
Give the PVC a memorable name like distributed-pytorch-checkpoint and add a description if desired.

For the data options, select a new PVC storage class that suits your needs according to the PVC recommendations here. In this example, zonal-rwx is sufficient. To allow all nodes to read and write from/to the PVC, select Read-write by many nodes for the access mode. Enter 10 TB for the size to ensure we have plenty of capacity for future jobs. Select Filesystem as the volume mode. Lastly, set the Container path to /checkpoints, which is where the PVC will be mounted inside containers. This completed section should look as shown below.

Once the form is complete, create the data source using the button at the bottom of the page.
Note
The first time a job is launched using a new PVC, it will take longer to start as the storage gets provisioned only once the first claim to the PVC is made.
6.8.4. Creating the Workload via the UI
To create the training job, navigate to the Workloads page and click New Workload > Training.
On the project creation form page, select the desired project to run the job in.
Select Distributed for the workload architecture to run a multi-node job. This adds a drop-down menu which lists the available frameworks for running distributed jobs. Since we are running a PyTorch job, select the PyTorch framework.
The distributed training configuration has an option for either Workers & master or Workers only.
Note
In this documentation, we will use the term ‘primary’ instead of ‘master’ to align with modern terminology. Please note that the UI and commands may still refer to it as ‘master’.
A primary node is typically required when running multi-node jobs that need to coordinate with one another, such as a job that does an all_reduce operation. The primary node can either be identical to the workers and run the training process or be a lightweight pod that only does coordination. If a job doesn’t need coordination between processes, the Workers only option can be used. For our example, we want the pods to coordinate with each other. Select the Workers & master option.
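For reference, the sketch below illustrates the kind of coordination the primary enables: every pod joins the same process group (using the environment variables injected by the PyTorch Training Operator and torchrun) and participates in a collective such as all_reduce. This is illustrative only and is not part of the example container; it assumes one GPU per process.

import os

import torch
import torch.distributed as dist

# RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are injected by the operator/torchrun.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Every rank contributes its rank number; after all_reduce each rank holds the same sum.
value = torch.tensor([float(dist.get_rank())], device="cuda")
dist.all_reduce(value, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} sees {value.item()}")

dist.destroy_process_group()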
Ensure the Start from scratch option is selected for the template.
Give the job a name like “distributed-training-pytorch” and click Continue.
In the new environment form that opens, select New environment. In the new form, provide a name for the environment such as “distributed-pytorch-env” and optionally add a description.
In the Image URL, provide the name of the image tag that was specified during the container build in the Creating a Custom Docker Container section above, such as nvcr.io/<ORG ID>/pytorch-ddp-example:24.07-py3. This will use the container that was pushed to your private NGC org for the job.

If desired, the image pull policy can be changed. The default value of "if not already present" is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, "always pull the image from the registry" should be used to check if there are updates to the image.
Most of the remaining settings on the form can be kept as default, with the exception of the runtime settings. The runtime settings indicate the directory the pods should open to and the command to launch inside the containers if something other than the default container values is desired.
In our example, we want to launch the multinode.py script with torchrun, which runs a multi-process application where each process has its own unique rank. The PyTorch Training Operator coordinates with torchrun to automatically set the RANK, LOCAL_RANK, and WORLD_SIZE environment variables, among others, based on the total number of GPUs allocated for a job. In the Commands and Arguments field, enter torchrun multinode.py --batch_size 32 100 25 for the command. This will run the distributed training job on all allocated GPUs with a batch size of 32 for 100 total epochs and save a checkpoint every 25 epochs.

For the directory field, put /runai-distributed/examples/distributed/ddp-tutorial-series. This is the location of the scripts that were saved inside of the container we built and is the path we want the pod to open to inside the container when it starts.

After selecting the Create Environment button, you will be returned to the job creation page with the new environment selected. In the compute resource pane, select the desired number of workers. Note that since we are using a primary and worker configuration where we want the primary to participate in the training, we will specify one fewer worker node than the total number of nodes expected for the job. In other words, if we want to run an 8 node job where the primary will be coordinating processes as well as training, we will specify 7 workers and 1 primary (which is the next form). In our example we will train with 2 total nodes. As a result, select 1 worker in the form.
For the compute resource, we want to train on a full DGX node with all available GPU resources. Select the compute resource in your project with a GPU capacity of 8.
In the Data Sources pane, select the name of the PVC that was created in the previous section. The name of the PVC in this example is "distributed-pytorch-checkpoint". Click Continue to navigate to the primary configuration page.
As mentioned previously, the primary pod can have a different configuration from the worker pods for distributed PyTorch jobs. For our example, we want to use the same setup for workers and the primary so ensure the Allow different setup for the master radio button is deselected.
Once finished, click Create Training to queue the training job.
6.8.5. Monitoring the Job
After the job is submitted, the status can be viewed in the Workloads page. Find the job name in the list, such as “distributed-training-pytorch” per the previous example. Clicking on the job will show a form with additional information about the job, including event history, metrics, and logs.
The event history tab shows the current status of the job, such as Creating, Initializing, Pending, Running, and so on. Additionally, it shows a log with timestamps of all of the states over time including any output messages when applicable. This is useful to see when a job is running or if it failed for any reason.
The metrics tab contains several graphs showing pod resource usage, including GPU, CPU, and memory utilization. This helps determine how much a job is stressing the hardware resources or potentially seeing if the application is under-utilizing resources. At the top of the tab, there is a dropdown to select metrics for a specific pod or for an aggregate of all pods for finer granularity.
Lastly, the logs tab shows the live STDOUT for each of the pods, including the aggregate output. In our example application, there will be output showing the training progress, including the epoch, batch size, and steps on a per-GPU basis. The logs are a good place to view training progress or possibly see if there are any errors in the process.
6.8.6. Getting the Checkpoint
The end of the run.sh script we created copies the latest generated checkpoint to the PVC attached to the job. Any job that uses this same PVC will now be able to load the checkpoint from /checkpoints/snapshot.pt. Similarly, any data can be saved to the PVC in a job at the specified filesystem path. This is helpful while generating checkpoints in long-running training jobs to continue progress in future jobs or change hyperparameters between jobs.
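A later job could load that checkpoint with a few lines of PyTorch, as in the hedged sketch below. The exact keys stored in snapshot.pt depend on what the tutorial's multinode.py saves, so inspect the file before relying on specific key names.

import torch

# Load the snapshot copied to the PVC by run.sh.
snapshot = torch.load("/checkpoints/snapshot.pt", map_location="cpu")
print(snapshot.keys())  # inspect what was stored before resuming

# model.load_state_dict(snapshot["MODEL_STATE"])  # assumed key name; verify against multinode.py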
6.8.7. Cleaning up the Environment
After the job finishes, it can be deleted to free up resources for other jobs. Note that certain jobs do not self-terminate, and jobs should be monitored to ensure resources are not sitting idle for completed jobs.
6.9. Using Zero Quota Projects for Lower Priority Workloads
In this example, we walk through the process of creating a zero quota project for lower priority workloads, ensuring that they will always be preemptible.
This example is intended for cluster administrators managing the usage of the cluster, and demonstrates that certain workloads and projects can be prioritized over others for mixed usage of the cluster. In this manner, lower priority workloads can be preempted when higher priority workloads need to be run, and resumed once higher priority workloads are completed.
Once the project and quota configuration is completed by the cluster administrator, users can then run workloads in their assigned projects that fit the lower or higher priority use cases, and their workloads will be scheduled and preempted as configured.
6.9.1. Requirements
Access to the default department or a department that has enough quota to use the entire cluster's resources. For more information on configuring departments and projects, see Departments.

Ability to create projects and configure quotas. For more information on required user roles, see Cluster Users and Managing Users.
Note
Only users with the role of Application Administrator, Department Administrator, Editor, and Research Manager can create projects.
No workloads are currently running in the cluster (if other workloads are running, they may interfere with the scheduler behavior).
Note
The Run:ai scheduler provides more configuration capabilities, such as over-quota priority, that can influence the behavior of workloads on the cluster (such as if other workloads are also running). For more information about the Run:ai scheduler, refer to the Run:ai documentation.
6.9.2. Creating the Maximum Quota Project via the UI
In this section, we will create a project with enough quota to use the entire cluster’s compute resources.
Note
This is a setup step that only needs to be executed once. The maximum quota project will persist unless removed.
Navigate to the Projects page and click the + NEW PROJECT button.
In the Scope field, ensure the default department is selected (or an alternative department with the ability to use all of the cluster's available resources).

Enter a name like full-quota-project in the Project name field.

Under Quota management, set the GPU devices field to use all GPUs available to the department, and set the CPUs (Cores) and CPU memory fields to Unlimited.

Change no other fields, and click the CREATE PROJECT button.
6.9.3. Creating the Zero Quota Project via the UI
In this section, we will create a project with zero quota. Workloads running in this project will still be able to use compute resources, but will always be over quota and subject to preemption.
Note
This is a setup step that only needs to be executed once. The zero quota project will persist unless removed.
Navigate to the Projects page and click the + NEW PROJECT button.
In the Scope field, ensure the same department is selected as the one used in the previous section.
Enter a name like zero-quota-project in the Project name field.

Under Quota management, set the GPU devices, CPUs (Cores), and CPU memory fields to 0.

Change no other fields, and click the CREATE PROJECT button.
6.9.4. Creating the Lower Priority Interactive Workload via the UI
In this section, we will create and start a workspace in the zero-quota project.
Note
The same scheduling behaviors exist when using a training workload in a zero quota project. Some of the exact steps and environments used will differ.
Navigate to the Workloads page and click the + NEW WORKLOAD button. Select the Workspace option.
Select the zero-quota-project option under the Projects section.

Select Start from scratch under the Template section.

Enter a name like zero-quota-workspace under the Workspace name section and click the CONTINUE button.

Select the jupyter-lab option under the Environment section.

Select the h100-1g option under the Compute resource section.

Open the General section and toggle the Allow the workspace to exceed the project's quota switch to on (it will appear blue when on). Otherwise, the workspace will not start.

Note

You can set a policy for workspaces to enable this option by default. To do so, set the spec.preemptible.rules.required: true parameter in an interactive workspace policy, following the instructions in Workspaces Policy. More details about applying policies via the CLI can be found in the Run:ai documentation under Former Policies.
Click the CREATE WORKSPACE button.
The UI will automatically return to the Workloads page - observe the zero-quota-workspace workload enter a Running state.
6.9.5. Creating the Higher Priority Training Workload via the UI
In this section, we will create and start a training workload in the full quota project. This training workload will preempt the workload initiated in the zero quota project.
Navigate to the Workloads page and click the + NEW WORKLOAD button. Select the Training option.
Select the full-quota-project option under the Projects section.

Select the Distributed option under the Workload architecture section.

Select the PyTorch option when the Set the framework for distributed trainings dropdown menu appears in the Workload architectures section.

Select the Workers only option when the Set the distributed training configuration options appear in the Workload architectures section.

Select Start from scratch under the Template section.

Enter a name like full-quota-training under the Training name section and click the CONTINUE button.

Click the + NEW ENVIRONMENT button under the Environment section.

In the next screen, enter a name like simple-pytorch-environment under the Environment name & description section.

Set the Image URL field to nvcr.io/nvidia/pytorch:24.07-py3 under the Image section.

Click the CREATE ENVIRONMENT button.

Upon being returned to the previous screen, ensure that the simple-pytorch-environment that we just created is selected in the Environment section.

Click the Runtime settings section.
Click the + COMMAND & ARGUMENTS button.
Enter sleep 5m under the Command field.

In the Set the number of workers for your training field, choose a number that uses all available GPU nodes in the cluster. For example, in a 32 GPU cluster, that number will be 4 (8 GPUs per node, 8*4=32).

Select the h100-8g option under the Compute resource section.

Click the CREATE TRAINING button.

The UI will automatically return to the Workloads page - observe the zero-quota-workspace workload entering a Pending state first, followed by the full-quota-training workload entering a Running state.

If the full-quota-training workload is completed or is stopped, the zero-quota-workspace workload will resume if there are free resources available in the cluster.
6.9.6. Cleaning up the Environment
All test workloads and projects used in this example can be safely deleted.
6.10. End-to-end NeMo Framework Workflow
This workflow provides a full end-to-end example of preparing a dataset, training a foundation model based on Mixtral-8x7B, and deploying the model for non-production inference using NeMo Framework. This guide will be split into sub-sections to describe each part in detail.
6.10.1. Requirements
The following is a list of requirements to follow this complete workflow:
A local machine with the Run:ai CLI installed following this link.
A Hugging Face account with an API token (setup steps in the following section).
A Weights and Biases account with an API token (setup steps in the following section).
A Run:ai cluster with at least 4x A100 or newer GPU nodes.
A Run:ai project identified for use along with the corresponding Kubernetes namespace for that project (by default, the project name with a runai- prefix).

Argo Workflows set up and installed in the identified Run:ai namespace, as documented here.
A Run:ai user account with at least L1 Researcher privileges.
6.10.2. Initial setup
This guide uses two external services to simplify the LLM development process: Hugging Face and Weights & Biases.
Hugging Face contains resources for many of the most popular language models and datasets in the community. We can leverage these resources while training the model to minimize deployment steps and be consistent with community model assumptions.
This workflow walks through training a Mixtral-8x7B model from scratch. The dataset we use needs to be tokenized using a custom tokenizer. Luckily, MistralAI, the company that produced the Mixtral models, published their tokenizer for the Mixtral models on Hugging Face. In order to use the tokenizer, we need to create a Hugging Face account and accept the Mixtral-8x7B-v0.1 license on their model repository page. The following walks through that process.
6.10.2.1. Hugging Face Account Creation
If you don’t have a Hugging Face account already, create one by going to https://huggingface.co/join and signing up with your corporate email account.
Once your account is set up, go to https://huggingface.co/settings/tokens while logged in to create a personal access token. Create a new token with Read access and give it a memorable name. Save the generated token in a safe place, as it won’t be viewable again for security reasons.
6.10.2.2. Accepting Mixtral-8x7B License
As mentioned earlier, this example uses the official Mixtral-8x7B tokenizer available on Hugging Face, which requires agreeing to their license on their model page. To do so, navigate to https://huggingface.co/mistralai/Mixtral-8x7B-v0.1 while logged in. Read the privacy policy at the top of the model card page, then click the Agree and access repository button towards the top of the page to accept the license. Now, you can download resources from this repository using your personal access token.
6.10.2.3. Creating Weights & Biases Account
Weights & Biases is a tool that allows developers to easily track experiments for AI applications. NeMo Framework natively supports logging many values such as training loss, learning rate, and gradient norm as well as resource utilization with Weights & Biases. Weights & Biases is highly recommended for tracking NeMo Framework jobs.
To get started with Weights & Biases, navigate to https://wandb.ai in a web browser and click the Sign Up button in the top right to create a free account. Once logged in, go to https://wandb.ai/settings and scroll to the bottom of the page to create a new API key. This API key will be used while launching workflows to automatically log to Weights & Biases.
6.10.2.4. Create a PVC
Training an LLM requires a lot of data, including pre-training datasets, multiple checkpoints, long log files, configs, and scripts. These files typically need to be read from all nodes, so we need shared storage that all pods can access concurrently. For this, we can use a PVC that will store all of our training resources.
Note
It is strongly recommended to allocate as much storage as practical for a PVC. With GCP clusters, the larger the PVC size, the faster the read and writes will be, enabling large checkpoints to be saved more quickly during training, reducing the overall training time. If possible, allocate the entire cluster storage capacity in a single PVC and share that PVC amongst all workloads with unique subdirectories for each workload to take advantage of the highest possible storage performance.
To create a PVC, go to the Data Sources tab in the Run:ai web UI and follow these steps:
Click the New Data Source button at the top of the page followed by PVC.
Select the scope that corresponds with the project you will be training the model in. For example, if your project is named default, then select the default project scope.

Enter a memorable name for the PVC, such as nemo-workspace, and optionally give it a description.

For the data mount, select New PVC.
Select the zonal-rwx storage class and Read-write by many nodes access mode.
For the claim size, enter at least 10 TB. If training a larger model and using a larger dataset, it might be necessary to request more storage capacity.
Enter /nemo-workspace for the container path. This will mount the PVC at /nemo-workspace inside all pods that attach this PVC.

Click Create Data Source once you have finished filling out the form to create the PVC.
6.10.3. Create Training Helper with JupyterLab
We will be creating a workflow that sets up JupyterLab to save and modify helper scripts as well as monitor the data prep and training processes. The process will be very similar to the setup in the Interactive NeMo Workload Job guide.
Go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Workspace. You will be taken to the New workspace creation page.
Select the desired project to run your job in.
Leave the Template pane of the form set to Start from scratch.
Enter a descriptive name for your Workspace, such as nemo-training-helper. Click Continue. After a few seconds, the Environment pane of the creation process will appear.

Click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.
In the Environment creation form, enter a name for the environment, such as “nemo-jupyter” and optionally add a description.
Under Image URL, put nvcr.io/nvidia/nemo:24.07. This pulls the latest NeMo container from NGC as of the time of writing.

Under the Workload architecture & type pane, select Standard and Workspace if they are not selected already.

Click the Tools pane to open the tools menu, and click +Tool. Select Jupyter in the Select tool list. The Connection type drop-down should show External URL, Auto generate, and Container port 8888.

Click the Runtime settings pane, then click to expand the commands and arguments pane. Enter jupyter-lab as the command, and --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --ServerApp.allow_remote_access=true --allow-root --port=8888 --no-browser for the arguments.

For the directory, specify /nemo-workspace as the directory to launch the job in.

Click Create Environment in the bottom right of the page. The environment you just created should now be selected.
Go to the Compute resource pane and select a CPU-only compute resource for your environment.
Select the nemo-workspace PVC created in the previous section under the Data sources form.

Go to the bottom of the page and click Create Workspace.
After creating the workspace, you will be taken to the workload overview page, where you can view the status of your workload. Your interactive workload is ready once the status reads “Running”.
Once the status is “Running”, you can launch the JupyterLab session by selecting the checkbox next to your workload and clicking CONNECT → Jupyter on the top menu bar. Jupyter lab will open in a new window.
With the JupyterLab session running, we can open a Terminal in the browser to interact with the PVC. It is helpful to keep this workspace running throughout the entire project in case files need to be inspected.
6.10.4. Data Preparation
The NeMo Framework supports processing custom text-based datasets for pre-training new models. The data preprocessor requires datasets to be cleansed, excluding any sensitive or improperly formatted data that is unsuitable for use during pre-training. Each file in the dataset must be in .json or, ideally, .jsonl format. Datasets can be downloaded from external sources or uploaded directly to the PVC.
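As an illustration only, the sketch below shows the general shape of a .jsonl dataset file: one JSON object per line with a text field. The text field name matches NeMo's default JSON key for preprocessing, but this is an assumption to verify against your own dataset.

import json

samples = [
    {"text": "The mitochondria is the powerhouse of the cell."},
    {"text": "DGX Cloud clusters schedule workloads through Run:ai."},
]

# Write one JSON object per line, which is the .jsonl convention.
with open("example.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")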
The following example walks through downloading, extracting, concatenating, and preprocessing the SlimPajama dataset which includes a large corpus of text from several domains and has been deduplicated and cleaned to make it a great candidate for pre-training LLMs. While the remainder of the document will be based on the SlimPajama dataset, this general process can be followed for most custom datasets and will provide guidance on how to adapt as needed.
6.10.4.1. Script Setup
We will leverage four different scripts to prepare the SlimPajama dataset for pre-training a Mixtral-8x7B-based LLM. These scripts will be saved in the PVC that was created during the initial setup step. The scripts are as follows:
Download
The first script downloads the entire SlimPajama-627B training dataset from Hugging Face to the mounted PVC. The dataset is spread across nearly 60,000 individual shards, all needing to be downloaded independently. To make the process faster, the job leverages PyTorch distributed communication to spread the download equally amongst all workers in the cluster. Using the JupyterLab session created previously, save the following file in the PVC at /nemo-workspace/download.py.

Note
The dataset is evenly divided amongst ten chunks on Hugging Face, each being its own subdirectory of files. The download.py script below has a CHUNKS = 10 variable at the top of the file to download all ten chunks. If desired, this value can be reduced to only download the first N chunks of the dataset. This is useful for quick workload validations that don't rely on a complete dataset. The remainder of this document will assume all ten chunks are pulled from, but the steps will still work if using fewer.

import os
import requests
import time

import torch

CHUNKS = 10
SHARDS = 6000

torch.distributed.init_process_group()
wrank = int(os.environ.get('RANK', 0))
wsize = int(os.environ.get('WORLD_SIZE', 0))

def download(url, filename, retry=False):
    if os.path.exists(filename):
        return
    response = requests.get(url)

    # In case of getting rate-limited, wait 3 seconds and retry the
    # download once.
    if response.status_code == 429 and not retry:
        time.sleep(3)
        download(url, filename, retry=True)
    if response.status_code != 200:
        return
    with open(filename, 'wb') as fn:
        fn.write(response.content)

def split_shards(wsize):
    shards = []
    shards_to_download = list(range(SHARDS))

    for shard in range(wsize):
        idx_start = (shard * SHARDS) // wsize
        idx_end = ((shard + 1) * SHARDS) // wsize
        shards.append(shards_to_download[idx_start:idx_end])
    return shards

for chunk in range(1, CHUNKS + 1):
    shards_to_download = split_shards(wsize)

    for shard in shards_to_download[wrank]:
        filename = f'example_train_chunk{chunk}_shard{shard}.jsonl.zst'
        url = f'https://huggingface.co/datasets/cerebras/SlimPajama-627B/resolve/main/train/chunk{chunk}/example_train_{shard}.jsonl.zst'
        download(url, filename)

# Block individual processes from exiting until all ranks are finished
# with their work to start the next step at the same time
torch.distributed.barrier()

Extract
The individual dataset shards are compressed in the Zstandard or .zst format and must be decompressed. The following script divides the downloaded files amongst all ranks and decompresses each shard before removing the compressed downloads to keep the PVC clean. Using the JupyterLab session, save the script in the PVC as /nemo-workspace/extract.py.

import os
import requests
import subprocess

import torch
from glob import glob

torch.distributed.init_process_group()
wrank = int(os.environ.get('RANK', 0))
wsize = int(os.environ.get('WORLD_SIZE', 0))

def split_shards(wsize, dataset):
    shards = []

    for shard in range(wsize):
        idx_start = (shard * len(dataset)) // wsize
        idx_end = ((shard + 1) * len(dataset)) // wsize
        shards.append(dataset[idx_start:idx_end])
    return shards

dataset = glob('example_train*')
shards_to_extract = split_shards(wsize, dataset)

for shard in shards_to_extract[wrank]:
    subprocess.run([f"unzstd --rm {shard}"], shell=True)

# Block individual processes from exiting until all ranks are finished
# with their work to start the next step at the same time
torch.distributed.barrier()

Concatenate
Given the SlimPajama dataset contains nearly 60,000 files, it is helpful to concatenate them into fewer, larger files. It is much faster to process a smaller number of large files than a large number of small files, which, in turn, eliminates potential data bottlenecks during the pre-training stage.
The following script takes 1,200 individual shards at a time and combines them into one large file, repeating for the entire dataset. Each rank concatenates a unique subsection of the dataset and deletes the individual shards in the end. Using the JupyterLab session, save the script in the PVC as /nemo-workspace/concat.sh.

Note
The script combines 1,200 individual shards by default into a single file. For the complete dataset, this will yield 50 larger combined files representing the data, each being approximately 51 GB in size. To change how many shards are used in each file, increase or decrease the shards_per_file variable below. A larger number will result in fewer files that are larger in size. A smaller number will result in more files that are smaller in size.

#!/bin/bash
shards_per_file=1200
num_files=`find -name 'example_train_chunk*.jsonl' | wc -l`
files=(example_train_chunk*.jsonl)
rank=$RANK
world_size=$WORLD_SIZE

# Find the ceiling of the result
shards=$(((num_files+shards_per_file-1)/shards_per_file ))

echo "Creating ${shards} combined chunks comprising ${shards_per_file} files each"

for ((i=0; i<$shards; i++)); do
  if (( (( $i - $rank )) % $world_size )) ; then
    continue
  fi
  file_start=$((i*shards_per_file))

  if [[ $(((i+1)*shards_per_file)) -ge ${#files[@]} ]]; then
    file_stop=$((${#files[@]}-1))
  else
    file_stop=$(((i+1)*shards_per_file))
  fi

  echo "  Building chunk $i with files $file_start to $file_stop"
  cat ${files[@]:$file_start:$shards_per_file} > slim_pajama_${i}.jsonl
  rm ${files[@]:$file_start:$shards_per_file}
done

Preprocess
Once all of the files have been concatenated, it is time to preprocess the dataset. The preprocessing phase tokenizes each dataset file using the Mixtral-8x7B tokenizer, which is downloaded from Hugging Face, and creates .bin and .idx files for each concatenated file. As with the other scripts, this one divides the work amongst all available workers to speed up preprocessing. Using the JupyterLab session, save the following script in the PVC as /nemo-workspace/preprocess.py.

Note
As mentioned, this script uses the Mixtral-8x7B tokenizer, as the intent is to use this data for pre-training a Mixtral-8x7B model. However, the tokenizer can be swapped out for a different one available on Hugging Face if pre-training a different model is desired. For example, the Llama3.1-8B tokenizer from Meta can be used instead by replacing both references of mistralai/Mixtral-8x7B-v0.1 in the script with the repo ID of the Llama3.1-8B model, meta-llama/Meta-Llama-3.1-8B, and updating the filename and path to the tokenizer in that model repo, which is filename=original/tokenizer.model. Be sure to accept any applicable licenses on the model repository page.

import os
import requests
import subprocess

import torch
from datetime import timedelta
from glob import glob
from huggingface_hub import hf_hub_download

# Wait for all processes to be finished with the previous step
# before starting preprocessing
torch.distributed.init_process_group(backend='gloo')
torch.distributed.monitored_barrier(timeout=timedelta(hours=4))
wrank = int(os.environ.get('RANK', 0))
wsize = int(os.environ.get('WORLD_SIZE', 1))

def split_shards(wsize, dataset):
    shards = []

    for shard in range(wsize):
        idx_start = (shard * len(dataset)) // wsize
        idx_end = ((shard + 1) * len(dataset)) // wsize
        shards.append(dataset[idx_start:idx_end])
    return shards

dataset = sorted(glob('slim_pajama*jsonl'))
shards_to_extract = split_shards(wsize, dataset)

if wrank == 0:
    # Download a specific file from a repository
    hf_hub_download(
        repo_id="mistralai/Mixtral-8x7B-v0.1",
        filename="tokenizer.model",
        local_dir="/nemo-workspace/tokenizers/mixtral-8x7b"
    )

for num, shard in enumerate(shards_to_extract[wrank]):
    shard_num = wrank + (num * wsize)  # Counter for which file is processed
    command = (
        "python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py "
        f"--input {shard} "
        f"--output-prefix /nemo-workspace/mixtral-slim-pajama-{shard_num} "
        f"--dataset-impl mmap "
        f"--tokenizer-type mistralai/Mixtral-8x7B-v0.1 "
        f"--tokenizer-library huggingface "
        f"--tokenizer-model /nemo-workspace/tokenizers/mixtral-8x7b/tokenizer.model "
        f"--workers 80"
    )
    subprocess.run([command], shell=True)
6.10.4.2. Launching Data Preparation
Note
In this documentation, we will use the term ‘primary’ instead of ‘master’ to align with modern terminology. Please note that the UI and commands may still refer to it as ‘master’.
Once all the scripts are saved in the PVC, it is time to launch the preprocessing job.
First, determine the total number of workers for the job. The greater the number of workers, the quicker the entire data preparation process will finish. The preprocessing step requires a GPU so you are limited by the number of GPUs available in your cluster. Assuming the resources aren’t needed for other purposes, it is recommended to create as many workers as there are GPUs available in the cluster for fastest end-to-end preparation time. Make a note of the desired number of workers, as this will be required in a few steps below.
Note
The example here will use single-GPU compute types for each worker. Run:ai limits the amount of CPU resources available in compute types that are smaller than a full node (that is, single GPU or two GPUs). Given this phase is primarily a CPU-bound process, the lower CPU resources will be a bottleneck in the process. If desired, a larger compute type with more GPUs can be used to take advantage of additional CPU resources for each worker. The tradeoff is that with larger compute types, there will be fewer available resources, resulting in fewer workers. Both paths are valid, but if your cluster has a large number of GPU resources, it is recommended to use larger compute types with multiple GPUs. More on this later.
Go to the Run:ai web UI in a browser and open the Workloads page.
Click the blue New Workload button followed by Training to start the creation process for a distributed training job.
In the new form that opens, select the desired project to run the job in.
In the workload architecture box, select the Distributed radio button, set the framework to PyTorch in the drop-down list, and select Workers & master for the workload configuration.
Select Start from scratch for the template if not already selected.
Give the job a name such as slim-pajama-data-prep-mixtral and hit the continue button.

On the new page, select +New Environment to create a new environment with our container. This opens another form for creating the environment.
In the environment creation page, enter a name for the environment, such as nemo-2407, and optionally add a description.

For the image URL, enter nvcr.io/nvidia/nemo:24.07, the latest training container at the time of writing. As newer containers are released, the tag can be updated to reflect the latest version.

In the Runtime settings section, add a new Environment Variable with the name LD_LIBRARY_PATH and the following value, which is the expected path for loading CUDA libraries in the NeMo 24.07 container:

/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/tensorrt/lib:/usr/local/cuda/lib64:/usr/local/tensorrt/lib
Once finished setting up the environment, click the Create Environment button at the bottom of the page which will take you back to the worker setup form.
In the worker setup form, ensure the newly created environment is selected for the environment.
Expand the Runtime settings option and add a new command. For the command, put bash -c in the command box and the following code in the arguments field. Replace <HUGGING FACE TOKEN HERE> with your Hugging Face token created earlier and all three instances of <NUM WORKERS> with the desired number of workers determined at the beginning of this step. Be sure to include the single quotes at the start and end of the arguments field.

'huggingface-cli login --token <HUGGING FACE TOKEN HERE> && apt update && apt install -y zstd && cd /nemo-workspace && torchrun --nnodes=<NUM WORKERS> --nproc-per-node=1 download.py && torchrun --nnodes=<NUM WORKERS> --nproc-per-node=1 extract.py && bash concat.sh && torchrun --nnodes=<NUM WORKERS> --nproc-per-node=1 preprocess.py'
Confirm that the LD_LIBRARY_PATH environment variable is still set in the Environment Variable field and set it again if not.

Select the number of workers for your job. Note that since we used Workers & master for our distributed architecture, we will enter one fewer worker in this field than the total we determined at the start of this section. This is because the primary pod will coordinate all of the processes but still process its share of the dataset, so it will be the last worker on the list. In other words, if the desire is to use 16 total workers, put 15 for the number of workers on this form.
Select a compute resource for the job. It is recommended to use a single GPU compute resource for all of the workers but as stated in the note at the top of this section, compute resources with more than one GPU per worker can be used to leverage more CPU resources in each pod.
In the Data sources section select the PVC that was created earlier in this example. This will mount the PVC at /nemo-workspace inside all of the pods. Click the Continue button at the bottom of the page to open the primary configuration form.

In several distributed PyTorch applications, it is common to use a different setup for the primary resource to coordinate distributed communication. A typical example is setting up the primary as an etcd server that acts as the PyTorch rendezvous agent. For our purposes, we want the primary to be just another worker and do everything that the workers do. Ensure the Allow different setup for the master toggle is deselected so the master uses an identical configuration to the workers. This will copy all of the settings from the previous workers' form to the primary form. Click Create Training to launch the dataset preparation workflow.
After creating the data preparation job, a pod for each worker and primary will be scheduled and started once resources become available on the cluster. The process can be monitored by viewing the logs in the Run:ai UI and by connecting to the helper JupyterLab terminal and viewing the data in the PVC. The /nemo-workspace will evolve throughout the process with the following changes at the end of each stage (a quick verification sketch follows the list):
After downloading, there will be 59,166 compressed data shards named example_train_chunkX_shardY.jsonl.zst, where X is the chunk number from 1-10 and Y is the individual shard number within that chunk. Each file is approximately 15 MB in size.

After extraction, there will be 59,166 unzipped data shards named example_train_chunkX_shardY.jsonl, and all of the compressed .zst files will be removed. Each file is approximately 44 MB in size.

After concatenation, there will be 50 large, combined files named slim_pajama_N.jsonl, where N ranges from 0-49. Each file will be approximately 51 GB in size. It is normal for the last file to be smaller in size as it doesn't contain an even 1,200 shards. All of the individual example_train* files will be removed.

After preprocessing, there will be 50 .bin files and 50 .idx files named mixtral-slim-pajama-N_text_document, where N corresponds to the combined data file number. Each .bin file should be approximately 26 GB in size and .idx files should be 229 MB.
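As a quick sanity check between stages, the file counts above can be verified from the JupyterLab helper terminal with a short snippet like the sketch below (assuming all ten chunks were downloaded and the default of 1,200 shards per concatenated file).

from glob import glob

# Expect 50 files after concatenation and 50 of each extension after preprocessing.
print(len(glob("/nemo-workspace/slim_pajama_*.jsonl")))
print(len(glob("/nemo-workspace/mixtral-slim-pajama-*_text_document.bin")))
print(len(glob("/nemo-workspace/mixtral-slim-pajama-*_text_document.idx")))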
Once all 50 files have been preprocessed, it is time to begin pre-training the model.
6.10.5. Pre-Training
NeMo Framework contains many predefined configuration files for various models, including the Mixtral-8x7B model. This section will demonstrate how to initiate training a Mixtral-8x7B model on Run:ai using the preprocessed SlimPajama dataset.
Pre-training is the most compute-intensive phase of the LLM training process as the model is typically trained for hundreds of billions to trillions of tokens while it learns the vocabulary and word pairings of the underlying dataset. Depending on the size of the dataset and model as well as the amount of compute resources available to train the model, this process can take anywhere from several days to a few months to finish. Therefore it is strongly recommended to leverage as much of your available compute power as possible for pre-training the model.
6.10.5.1. Setup
Some minor setup is required prior to launching the job. First, we need to get the path and weight to each dataset file so NeMo Framework knows which files to use for pre-training. The NeMo container has a script to generate this information.
To generate the data, reopen the nemo-training-helper JupyterLab session in the browser that was used during data prep setup. In a terminal session, run the following code:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/nemo_launcher/collections/auto_blend.py model_type=gpt preprocessed_dir=/nemo-workspace
This will output a list of the full paths to each item in the dataset as well as the proportional weight of each file based on the file size. The weight indicates how much of the dataset should be sampled from each file, giving the larger files a greater portion of the sample size. The script automatically does this for the most efficient sampling of all files. The list should look similar to the following (truncated for brevity):
[0.020294,'/nemo-workspace/mixtral-slim-pajama-6_text_document',0.020406,'/nemo-workspace/mixtral-slim-pajama-27_text_document',0.020262,'/nemo-workspace/mixtral-slim-pajama-45_text_document',0.020367,'/nemo-workspace/mixtral-slim-pajama-14_text_document',0.020332,'/nemo-workspace/mixtral-slim-pajama-36_text_document',...]
Copy the generated output and save it for later, as it will be used while launching the training job.
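Conceptually, the weights are just each file's share of the total preprocessed data size. The sketch below reproduces that idea for illustration only; the exact normalization and rounding used by auto_blend.py may differ.

import os
from glob import glob

bins = sorted(glob("/nemo-workspace/mixtral-slim-pajama-*_text_document.bin"))
total = sum(os.path.getsize(b) for b in bins)

# Alternate weight and path prefix, mirroring the structure of the generated list.
blend = []
for b in bins:
    blend.extend([round(os.path.getsize(b) / total, 6), b.removesuffix(".bin")])
print(blend)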
Next, we need to download the launcher repository and set up our environment. Do so with the following:
Clone the launcher repository on a machine with kubectl access to the Run:ai cluster using:

git clone https://github.com/NVIDIA/nemo-framework-launcher
cd nemo-framework-launcher
Install the Python dependencies (a virtual or conda environment is recommended):

pip3 install -r requirements.txt
cd launcher_scripts
Find the name of the PVC in Kubernetes with kubectl get pvc. In the example below, the name of the PVC according to Kubernetes is nemo-workspace-project-u2l6h. Save the name for later as it will be used to specify which PVC to use.

$ kubectl get pvc
NAME                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
nemo-workspace-project-u2l6h   Bound    pvc-a184eba2-2e16-4530-a2d8-7f0839df27d0   50Ti       RWX            zonal-rwx      <unset>                 37d
Now the training job can be launched. NeMo Framework has default configs for each of the supported models available on GitHub. The specific config we will use as the baseline for our job can be found here.
NeMo Framework uses Hydra to specify all of the settings used for training jobs which are shown in the linked config file above. These settings can be overridden from the command line as needed. The following command will launch the pre-training job with the explanation of each flag below:
PVC_NAME=<INSERT PVC NAME HERE>  # Replace with your PVC name captured earlier
MOUNT_PATH=/nemo-workspace

python3 main.py \
    launcher_scripts_path=$PWD \
    data_dir=/$MOUNT_PATH \
    cluster=k8s_v2 \
    "+env_vars={WANDB_API_KEY: <INSERT WANDB API KEY HERE>, HF_TOKEN: <INSERT HF TOKEN HERE>}" \
    ~cluster.volumes.dshm \
    cluster.volumes.workspace.persistent_volume_claim.claim_name=$PVC_NAME \
    cluster.volumes.workspace.mount_path=$MOUNT_PATH \
    cluster.scheduler=runai-scheduler \
    cluster.service_account=argo \
    cluster.namespace=runai-demo-project \
    cluster.pull_secret=null \
    'stages=[training]' \
    training=mixtral/mixtral_8x7b \
    training.exp_manager.create_wandb_logger=true \
    "training.exp_manager.explicit_log_dir=$MOUNT_PATH/\${training.run.name}/training_\${training.run.name}/results" \
    training.trainer.num_nodes=8 \
    training.trainer.max_steps=300000 \
    training.model.context_parallel_size=2 \
    training.model.sequence_parallel=false \
    training.model.data.data_prefix=<INSERT DATA PATHS AND WEIGHTS HERE> \
    training.model.encoder_seq_length=8192 \
    training.model.data.seq_length=8192 \
    training.model.ub_tp_comm_overlap=false
The flag explanations are as follows:
PVC_NAME: Specify the name of the PVC captured earlier.

MOUNT_PATH: This is the path where the PVC will be mounted inside the container. Keep this as /nemo-workspace.

data_dir=/$MOUNT_PATH: This is the location of the dataset inside the PVC.

cluster=k8s_v2: This indicates the job will run on Kubernetes.

"+env_vars={WANDB_API_KEY: <INSERT WANDB API KEY HERE>, HF_TOKEN: <INSERT HF TOKEN HERE>}": Replace <INSERT WANDB API KEY HERE> with your Weights and Biases API key created during the setup phase. This will allow the job to authenticate with the W&B servers. Similarly, replace <INSERT HF TOKEN HERE> with your Hugging Face token created earlier to authenticate with Hugging Face.

~cluster.volumes.dshm: DGX Cloud on Run:ai automatically allocates shared memory in each pod, and a conflict will occur if NeMo Framework tries to allocate this memory as well. To prevent this, we need to remove the dshm line in the config so it isn't allocated. The ~ at the beginning of the line removes the key.

cluster.volumes.workspace.persistent_volume_claim.claim_name: This is the name of the PVC to attach to the job.

cluster.volumes.workspace.mount_path: This tells NeMo Framework where the PVC will be mounted inside the container.

cluster.scheduler=runai-scheduler: By default, the job will try to use the default Kubernetes scheduler, but we want to use the Run:ai scheduler instead.

cluster.service_account=argo: Running Argo workflows on the cluster requires a service account. By default, the argo service account should be used. Replace this with the name of your service account as needed.

cluster.namespace=runai-demo-project: This is the Kubernetes namespace the job will run in. The namespace will typically map to the Project name in Run:ai with runai- prepended to it. For example, if the Project is named demo-project, the namespace will typically be runai-demo-project. You can list all namespaces with kubectl get ns.

cluster.pull_secret=null: Run:ai automatically injects the Docker container registry secret into jobs, so this field can be kept empty.

'stages=[training]': Specify that we want to run the training phase of NeMo Framework.

training=mixtral/mixtral_8x7b: This indicates we want to train a Mixtral-8x7B model. To train a different model, specify one of the configs listed in the repository here.

training.exp_manager.create_wandb_logger=true: Use Weights and Biases for logging all metrics.

"training.exp_manager.explicit_log_dir=$MOUNT_PATH/\${training.run.name}/training_\${training.run.name}/results": This is the location to save all of the training results for the job. The results will be saved in the PVC for future use.

training.trainer.num_nodes=8: Specify the number of nodes to run on. As mentioned at the beginning of the section, it is highly recommended to use as many nodes as possible.

training.trainer.max_steps=300000: This is the total number of steps we want to train our model for. Given our dataset, SlimPajama, has 627 billion tokens, we want to train for at least one epoch. Each step represents a forward pass of the transformer network. In the forward pass, the number of tokens trained equals the global batch size times the sequence length. Given the default batch size of the Mixtral-8x7B model is 256 and we are using a sequence length of 8K tokens, each forward pass trains on 8,192 x 256 = 2,097,152 tokens. So, to train on all 627 billion tokens of the dataset, we run for 627,000,000,000 / 2,097,152 = 298,976 steps, which we round up to 300,000 total steps.

training.model.context_parallel_size=2: This adds a level of parallelism for the context to reduce the memory footprint on each GPU. As context windows get larger, the overall memory requirements grow. Context parallelism makes it easier to fit large context windows in GPU memory.

training.model.sequence_parallel=false: Disable sequence parallelism in favor of other parallelism techniques.

training.model.data.data_prefix: This is where you specify the weights and paths of the data that was generated from the python3 command earlier.

training.model.encoder_seq_length=8192: Specify the sequence length for the model. This is also referred to as the context length and represents the number of input and output tokens that the model can handle. Recent research suggests doing the bulk of the training on a smaller context length of around 8K tokens and increasing the context length after fully pre-training the model offers the best stability.

training.model.data.seq_length=8192: As with the previous flag, specify an 8K context length.

training.model.ub_tp_comm_overlap=false: DGX Cloud does not support overlapping communications, so this should be disabled.

Note
The global batch size (GBS) is dependent on the number of GPUs in the job, the micro batch size (MBS), tensor parallelism (TP), and pipeline parallelism (PP) sizes. Specifically, GBS % (MBS * num GPUs) / (PP * TP) must equal 0. For example, the Mixtral-8x7B model has a default GBS of 256, MBS of 1, TP of 1, and PP of 4. Given we are using 8 nodes with 8 GPUs each, this results in 256 % (1 * (8 * 8)) / (1 * 4) == 0 and is a valid config. If changing the node or parallelism sizes, be sure the global batch size still satisfies this equation.
After running the python3 main.py command above, the job will be scheduled with Run:ai and launched once resources become available. The job will appear in the Run:ai Workloads page after submission. The following images show the workload details after it has been running for a few days.
NeMo Framework is fully integrated with Weights and Biases and logs multiple metrics that can be viewed on the W&B website. If the W&B key was provided in the command, a new W&B project will automatically be created and metrics will be uploaded there. Viewing logs on W&B is recommended as the best path to monitor training progress.
To view your charts, navigate to https://wandb.ai. You should see a link to the newly created project on your home page. Clicking the link will take you to your project dashboard which should look similar to the following. Note that the figure below includes training results for two different runs where the second run is a continuation of the first.
Two of the most important charts to monitor during pre-training are the reduced_train_loss and val_loss charts, which show how the model is learning over time. In general, these charts should have an exponential decay shape.
The job will take around four weeks to complete on 8 nodes. Since NeMo Framework pre-training scales linearly, doubling the number of nodes should halve the amount of time required to pre-train the model.
While the model trains, a checkpoint will be saved every 2,000 steps in the PVC. Per the command above, the checkpoints will be saved in the /nemo-workspace/mixtral_8x7b/training_mixtral_8x7b/results/checkpoints directory. Only the 10 checkpoints with the best val_loss values as well as the latest checkpoint will be saved.

After the job finishes, a .nemo file will be saved in the checkpoint directory. This represents the pre-trained model and can be used for several downstream tasks including fine-tuning and inference.
6.10.6. Inference Deployment
Now that we have finished pre-training a base model, we can deploy it for inference and send requests to the deployed model to do quick human evaluations.
Warning
This section is NOT intended for production inference deployments. The purpose of this section is to provide a quick way for engineers, QA teams, and other internal stakeholders to evaluate the model with user-generated prompts and inform decisions on the model’s readiness. A production deployment would include load balancing, auto-scaling, optimized inference code, a full API, and more.
To deploy the model for inference, navigate to the Workloads page and click the + New Workload > Inference button and follow these steps:
In the new form that opens, select the desired project to run the job in.
Enter a name for the inference deployment, such as mixtral-8x7b-base-model-deploy, and click the Continue button.

Create a new environment by clicking the + New Environment button.

In the environment creation page, enter a name for the environment such as nemo-2409-inference and optionally add a description.

For the image URL, enter nvcr.io/nvidia/nemo:24.09, which is the latest training container at the time of writing. As newer containers are released, the tag can be updated to reflect the latest version.

In the Endpoint section, ensure HTTP is selected for the protocol. Enter 8080 for the container port. When we specify the command later on, we designate port 8080 to listen to requests. If a model is deployed on a different port, specify that here.

Once finished setting up the environment, click the Create Environment button at the bottom of the page, which will take you back to the worker setup form.
Ensure the newly-created nemo-2409-inference environment is selected in the Environment section.

In the Runtime settings section for the environment, set the command to:
bash -c
And the arguments to:
'cd /opt/NeMo && git checkout main && git pull && pip install lightning && python3 scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /nemo-workspace/mixtral_8x7b/training_mixtral_8x7b/results/checkpoints/megatron_mixtral.nemo --tensor_parallelism_size 4 --start_rest_service True --triton_model_name mixtral-8x7b --model_type mixtral --max_input_len 4096 --max_output_len 8192'
Note, if deploying a different model, the settings above can be changed to fit your model. These settings are as follows:
--nemo_checkpoint: The path to your .nemo file to deploy.

--tensor_parallelism_size: The number of GPUs to deploy on. Larger models will require additional GPUs. In general, for every 1 billion parameters the model has, you will need 2 GB of total GPU memory. Increasing the tensor parallelism size will split the required memory amongst GPUs (a rough sizing sketch follows this list).

--triton_model_name: What the model should be deployed as with Triton. This will be used while sending requests.

--model_type: The model type, such as mixtral or llama.

--max_input_len: The maximum number of tokens to allow in the input prompt.

--max_output_len: The maximum number of output tokens to generate for responses.
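The sizing sketch below applies the rule of thumb from the --tensor_parallelism_size description. The parameter count for Mixtral-8x7B (~46.7B) is an assumption for illustration, and the result is only a lower bound for the weights; runtime overheads such as the KV cache add to it.

# Rough lower-bound estimate: ~2 GB of GPU memory per billion parameters,
# split across the tensor-parallel GPUs.
params_billions = 46.7          # assumed parameter count for Mixtral-8x7B
tensor_parallelism_size = 4

total_gb = params_billions * 2
per_gpu_gb = total_gb / tensor_parallelism_size
print(f"~{total_gb:.0f} GB total, ~{per_gpu_gb:.0f} GB per GPU across {tensor_parallelism_size} GPUs")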
Add two environment variables in the Runtime settings section:
Add your Hugging Face token with HF_TOKEN as the key and your token as the value.

Specify the port to listen to Triton requests with key TRITON_PORT and value 8000.
In the Compute resource section select the compute type that includes four GPUs as the model needs four GPUs to fit in GPU memory. If using a different model, the number of GPUs should match the tensor parallelism size.
In the Data sources section select the PVC that was created earlier in this example. This will mount the PVC at /nemo-workspace inside the pod.

Click the Create Inference button at the bottom of the page to create the deployment.
Back in the Workloads page you will see the newly created inference workload. It will take some time for the model to be converted to TRT-LLM engines before it transitions to the Running state. Once the deployment is running, it can start to handle requests.
6.10.6.1. Sending Requests to Deployed Model
The easiest way to send a request to the deployed model is via curl in a terminal from an IP address in the cluster's allowlist. The basic request structure is as follows:

curl -X POST https://X.X.X.X/v1/chat/completions/ \
  -H 'content-type: application/json' \
  -H 'accept: application/json' \
  -d '{"prompt": "Write me a short story about a baby dragon that learns to fly", "model": "<trt-model-name>", "max-tokens": 2048, "top_p": 0, "top_k": 0.9, "temperature": 1.0}'
To find the URL, run kubectl get ksvc in a terminal with the kubeconfig for the cluster configured. Find the Knative service that corresponds to the deployment. The URL will be in the second column. In the following example, the URL would be https://mixtral-8x7b-base-model-deploy-runai-demo-project.inference.<cluster>.ai.
$ kubectl get ksvc
NAME                             URL                                                                                 LATESTCREATED                          LATESTREADY                            READY   REASON
mixtral-8x7b-base-model-deploy   https://mixtral-8x7b-base-model-deploy-runai-demo-project.inference.<cluster>.ai   mixtral-8x7b-base-model-deploy-00001   mixtral-8x7b-base-model-deploy-00001   True
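If you prefer to script the lookup instead of reading the table, Knative exposes the same URL in the service status. A minimal sketch, assuming the service name shown above:

# Hedged sketch: print only the URL of the Knative service backing the deployment.
kubectl get ksvc mixtral-8x7b-base-model-deploy -o jsonpath='{.status.url}'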
In the above curl command, replace https://X.X.X.X with the service URL captured in the previous step. Additionally, replace Write me a short story about a baby dragon that learns to fly with your prompt of choice, and <trt-model-name> with the Triton model name specified during deployment (mixtral-8x7b in this example). This command will generate up to 2048 tokens, but this can be changed as needed depending on the prompt.
After submitting the command, it will be passed to the deployed model which will generate a response to the prompt.
The response should look similar to the following (response truncated - actual responses will vary):
{"output":"and having adventures.\nAsked by: Dayanida (6 years, 4 months ago)\nEdit: I am drawing it with Paint Tool SAI and Photoshop CS3.\nUpdated to try and get better.\nAnswered by: Rebecca (12 years, 5 months ago)\nWrite me a story about an adventure in the land of Wandreon where you can choose your own adventure..."}
The model’s response will be in the output
key and will follow directly after the last token in the prompt. For example, combining the end of the input prompt and the start of the response would be “…that learns to fly and having adventures…”
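If jq happens to be installed on the machine you are calling from (an assumption; this guide does not otherwise require it), the generated text can be extracted from the JSON response directly:

# Hedged sketch: same request as above, printing only the "output" field.
# Replace the URL, prompt, and <trt-model-name> as described earlier.
curl -s -X POST https://X.X.X.X/v1/chat/completions/ \
  -H 'content-type: application/json' \
  -H 'accept: application/json' \
  -d '{"prompt": "Write me a short story about a baby dragon that learns to fly", "model": "<trt-model-name>", "max-tokens": 2048, "top_p": 0, "top_k": 0.9, "temperature": 1.0}' \
  | jq -r '.output'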
6.10.6.2. Cleaning Up
When the deployment is no longer needed, it can be stopped to free up additional compute resources.
To stop the job, go to the Workloads page in Run:ai, select the mixtral-8x7b-base-model-deploy job, and click the Delete button towards the top left of the panel.
6.11. Using BioNeMo Framework for ESM-2nv Data Preprocessing and Model Training
This tutorial provides an example use case of training a BioNeMo Large Language model using the BioNeMo framework. The tutorial will give users experience in:
Preprocessing the UniRef50 and UniRef90 data for ESM-2nv
Pretraining and continuing training from a checkpoint for ESM-2nv
Launching interactive and training workloads on Run:ai on DGX Cloud
For more information on the NVIDIA BioNeMo framework and the ESM-2nv model, refer to the BioNeMo documentation.
6.11.1. Prerequisites and Requirements
The following are required before running the BioNeMo jobs in this section:
You must have accepted an invitation to your NGC org and added your NGC credentials to Run:ai. Refer to Accessing Your NGC Org.
You must have the user role of L1 researcher, ML Engineer, or System Administrator to run through all sections of this tutorial.
Your user must be able to access a project and department.
At least one A100 or newer GPU with 80 GB of memory.
You must have access to a compute resource with at least one GPU created in your scope that you can use.
6.11.2. Creating the Data Source
We will create a PVC Data Source to share the resulting dataset with future workloads.
To create a new PVC, go to the Data Sources page. Click New Data Source and then PVC to open the PVC creation form.
On the new form, set the desired scope.
Important
PVC Data Sources created at the cluster level do not replicate data across projects or namespaces. Each project or namespace will be provisioned as a separate PVC replica with different underlying PVs; therefore, the data in each PVC is not replicated.
Give the PVC a memorable name like esm2-workspace and add a description if desired.
For the data options, select a new PVC storage class that suits your needs according to the PVC recommended storage classes. In this example, zonal-rwx is sufficient. To allow all nodes to read and write from/to the PVC, select Read-write by many nodes for the access mode. Enter a capacity that will fit your target dataset; we will specify 10 TB for this example. Select Filesystem as the volume mode. Lastly, set the Container path to /esm2-workspace, where the PVC will be mounted inside containers.
Once the form is complete, click the CREATE DATA SOURCE button at the bottom of the page.
Note
The first time a job is launched using a new PVC, it will take longer to start as the storage gets provisioned only once the first claim to the PVC is made.
6.11.3. Creating an Interactive Workspace for Preprocessing
With the PVC created, we can configure the workspace to tell Run:ai how the job should be launched. To do so, go to the Workloads overview page and click the + New Workload button in the top left. A drop-down menu will appear. From the drop-down menu, select Workspace. You will be taken to the New workspace creation page.
Select the desired project to run your job in.
Leave the Template pane of the form set to Start from scratch.
Enter a descriptive name for your Workspace, such as esm2-preprocessing. Click Continue. After a few seconds, the Environment pane of the creation process will appear.
The environment is a template with common settings that can be reused for similar jobs. For example, an environment might be based on a PyTorch container and have Jupyter exposed as a tool, enabling anyone who needs to run interactive PyTorch jobs inside a Jupyter notebook to use the same template instead of launching unique jobs and re-configuring the settings each time they want to experiment. In this example, a new environment will be created based on the BioNeMo container and expose Jupyter as a tool to make it easier to reuse the template in the future.
To create a new environment for the BioNeMo job, click the + New Environment button in the top right of the Environment pane. The Environment creation form will open.
In the Environment creation form, enter a name for the environment, such as “bionemo-jupyter” and optionally add a description.
Under Image URL, put nvcr.io/nvidia/clara/bionemo-framework:1.9. This pulls the BioNeMo container from NGC.
Note
If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check for updates to the image.
Under the Workload architecture & type pane, select Standard and Workspace if they are not selected already. These options are likely already selected and grayed out.
Click the Tools pane to open the tools menu, and click +Tool. This list allows you to add additional tools to the workspace to make it easier to connect to their resources, such as Jupyter for connecting to external Jupyter lab sessions, VS Code for connecting to VS Code servers inside the job, and more.
Since we want to run an interactive Jupyter notebook, select Jupyter in the Select tool list. The Connection type drop-down should show External URL, Auto generate, and Container port “8888”.
Click the Runtime settings pane, then click to expand the commands and arguments pane. This section allows us to override the default container run settings as needed. Since we want to launch a Jupyter lab session, enter jupyter-lab as the command, and --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='' --ServerApp.allow_remote_access=true --allow-root --port=8888 --no-browser for the arguments. This tells Run:ai how to connect to the Jupyter lab session in the UI.
Click Create Environment in the bottom right of the page. You should see a pop-up saying that your environment has been created. The page will refresh, and you will be taken back to the top of the Environment pane of the New workspace creation page. The environment you just created should now be selected.
Go to the Compute resource pane and select a single-GPU compute resource for your environment.
In the Data Sources section, select your esm2-workspace PVC.
Go to the bottom of the page and click Create Workspace. This will add your job to the queue, and it will automatically get scheduled once cluster resources are available.
After creating the workspace, you will be taken to the workload overview page, where you can view the status of your workload. Your interactive workload is ready once the status reads “Running”. This may take a while if the container you chose has not previously been pulled to the cluster and the PVC needs to be provisioned for the first time.
Once the status is “Running”, you can launch the Jupyter lab session by selecting the checkbox next to your workload and clicking CONNECT → Jupyter on the top menu bar. Jupyter lab will open in a new window.
6.11.4. Using Workspace to Preprocess Data for ESM-2nv
To briefly showcase the model training capabilities of the BioNeMo Framework, we will use the UniRef50 and UniRef90 datasets to provide a diverse yet non-redundant set of protein sequences. By using both, the model can learn from a wide range of sequence variants while avoiding redundancy. This helps capture diverse features and patterns relevant to protein function and structure prediction while preventing overfitting and improving generalization. For demo purposes, a portion of the sample datasets is located in ${BIONEMO_HOME}/examples/tests/test_data/uniref202104_esm2_qc.
Open a terminal tab in your Jupyter lab session.
The data is stored in a zip file, so run the following commands to extract the raw FASTA files and a cluster mapping file into your PVC. You will also save your PVC datapath as DATASET_DIR for future steps.
export ZIP_FILE=${BIONEMO_HOME}/examples/tests/test_data/uniref202104_esm2_qc_test200_val200.zip
export DATASET_DIR=/esm2-workspace/test_data
unzip $ZIP_FILE -d $DATASET_DIR/
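As an optional sanity check (this assumes the archive unpacks into a uniref202104_esm2_qc_test200_val200 subfolder, which is the layout the preprocessing paths below expect), you can list the extracted files:

# Hedged sketch: confirm the FASTA files and mapping.tsv landed in the PVC.
ls -lh $DATASET_DIR/uniref202104_esm2_qc_test200_val200/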
The mapping.tsv file is used to associate protein sequences with their respective clusters. This helps to reduce redundancy, organize data, and evaluate model performance by tracking sequence similarity and ensuring diverse training data.
Using the unzipped contents of this file, we first create the preprocessed /train, /val, and /test folders, organizing protein sequences into batch CSV files. It is important to utilize both datasets if you plan to use ESM-2 as originally created.
The same approach applies to the clustering mapping file. The ESM2Preprocess class can handle clustering indirectly as part of the dataset preparation process. It leverages UniRef50 to UniRef90 clustering mappings to organize protein sequences, ensuring that data is appropriately clustered for training and validation.
Please note that this script does not perform clustering itself but relies on pre-defined clustering mappings provided in a TSV file format to organize protein sequences. The expected format is a TSV file where the first column represents the cluster ID (FASTA header in UniRef50), and the second column lists the members separated by commas. The members correspond to entries in the UniRef90 FASTA file.
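If you want to inspect the mapping file yourself, a quick look at its first rows shows this layout; the IDs in the comment below are hypothetical and only illustrate the expected format:

# Hedged sketch: peek at the cluster mapping file extracted earlier.
# Expected shape (tab-separated columns; member IDs comma-separated; IDs hypothetical):
#   UniRef50_A0A009   UniRef90_A0A009,UniRef90_B1B123
#   UniRef50_C2C456   UniRef90_C2C456
head -n 5 /esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200/mapping.tsv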
Now, we can run the preprocessing script. The preprocessing steps are:
Download the dataset from a specified URL or NGC registry.
Extract and decompress the downloaded data if necessary.
Index the FASTA file using pyfastx to facilitate data access.
Split the dataset into training, validation, and test sets.
Convert the FASTA sequences into CSV format, dividing them into multiple files if needed.
Generate additional files like memmaps or sorted FASTA files if required for specific use cases.
For more details about the preprocessing steps, please consult the ../bionemo/data/preprocess/protein/preprocess.py file and the documentation found here.
To preprocess the data, use the pretrain.py script and set the do_training parameter to False, as shown below. For more information on the command line and YAML configurations mentioned, see the dropdown below.
cd ${BIONEMO_HOME}
export TRAIN_UF50_FASTA=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200/uniref50_train_filt.fasta
export TRAIN_UF90_FASTA=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200/ur90_ur50_sampler.fasta
export TRAIN_CLUSTER_MAPPING_TSV=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200/mapping.tsv
export DATAPATH=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200
python examples/protein/esm2nv/pretrain.py \
  --config-path=conf \
  --config-name=pretrain_esm2_650M \
  ++do_training=False \
  ++do_preprocessing=True \
  ++model.data.val_size=500 \
  ++model.data.test_size=100 \
  ++model.data.train.uf50_datapath=${TRAIN_UF50_FASTA} \
  ++model.data.train.uf90_datapath=${TRAIN_UF90_FASTA} \
  ++model.data.train.cluster_mapping_tsv=${TRAIN_CLUSTER_MAPPING_TSV} \
  ++model.data.dataset_path=${DATAPATH}
Command Line and YAML Configurations
Parameters starting with -- are passed as command line arguments to pretrain.py. These parameters include:
--config-path and --config-name: These specify the folder and the YAML file name for the configuration. The path is relative to pretrain.py. For instance:
config-path: Refers to the configuration folder, e.g., examples/protein/esm2nv/conf.
config-name: Refers to the YAML configuration file, e.g., pretrain_esm2_650M.yaml.
The full path for the configuration file in this example would be: {BIONEMO_HOME}/examples/protein/esm2nv/conf/pretrain_esm2_650M.yaml.
Parameters starting with ++ are configurable within the YAML file. Below are some examples of such parameters found in the pretrain_esm2_650M.yaml file, which inherits from base_config.yaml:
do_training: Set to False if you only want to preprocess the data without initiating training.
model.data.val_size and model.data.test_size: These specify the sizes of the validation and test datasets, respectively.
model.data.train.uf50_datapath: Specifies the path to the UniRef50 FASTA file.
model.data.train.uf90_datapath: Specifies the path to the UniRef90 FASTA file.
model.data.train.cluster_mapping_tsv: Specifies the path to the mapping file that maps UniRef50 clusters to UniRef90 sequences.
model.data.dataset_path: Specifies the path to the output directory for the preprocessed UniRef50 and UniRef90 data. After processing, the following directories will be created:
uf50: Contains train/test/val splits, each with files like x000.csv.
uf90: Contains a folder named uf90_csvs, with files like x000.csv. Note that there will be no train/test/val splits in this directory, as UniRef90 is only used during training.
Changes can also be made directly to the YAML file instead of overwriting arguments through the command line.
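After the preprocessing command finishes, a quick hedged check that the output matches the layout described above (file counts will vary with the sample dataset):

# Hedged sketch: confirm the preprocessed CSV splits exist under the dataset path.
ls ${DATAPATH}/uf50/train | head
ls ${DATAPATH}/uf90/uf90_csvs | head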
Now we are ready for pretraining on a single node. For multi-node training, continue with the following step.
For multi-node training, we need to create a run script in our PVC. We'll launch a run.sh script, which will run the pretrain.py script using torchrun, a tool that runs a multi-process application where each process is assigned a unique rank. The PyTorch Training Operator available on DGX Cloud coordinates with torchrun to automatically set RANK, LOCAL_RANK, and WORLD_SIZE, among other environment variables, based on the total number of GPUs allocated for a job.
Using the terminal tab, create the run script in your PVC. This code will run a two-node distributed training job. To change the number of nodes, modify the ++trainer.num_nodes configuration.
cat << EOF > /esm2-workspace/run.sh
#!/bin/bash
cd \${BIONEMO_HOME}/examples/protein/esm2nv/ && torchrun pretrain.py \
  --config-path=conf \
  --config-name=pretrain_esm2_650M \
  ++do_training=True \
  ++trainer.max_steps=1000 \
  ++trainer.val_check_interval=100 \
  ++trainer.devices=8 \
  ++trainer.num_nodes=2 \
  ++model.data.dataset_path=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200 \
  ++model.micro_batch_size=2 \
  ++exp_manager.create_wandb_logger=False \
  ++exp_manager.exp_dir=/esm2-workspace/results
EOF
Note
To continue training from an existing checkpoint, use exp_manager.resume_if_exists=True.
To use Weights and Biases for experiment management and training visualization, set ++exp_manager.create_wandb_logger=True. Your WANDB_API_KEY environment variable will also need to be added when creating your workload to allow WandB to log in. Refer to Using WandB with a Workspace.
6.11.5. Pretraining
Now we will perform pretraining of ESM-2 from scratch using our prepared data and the parameters provided in the pretrain_esm2_650M.yaml config file located in the ${BIONEMO_HOME}/examples/protein/esm2nv/conf folder. This can be done inside the Jupyter notebook or using distributed training.
6.11.5.1. Single GPU Training in Jupyter Notebook
For this demo example, we will shorten the time required for training by setting the following parameters: ++trainer.max_steps=1 and ++trainer.val_check_interval=1. Users can update these parameters by editing the .yaml config file or by overriding config arguments at runtime using Hydra, as shown in the example below.
python examples/protein/esm2nv/pretrain.py \
  --config-path=conf \
  --config-name=pretrain_esm2_650M \
  name=esm2_pretrain \
  ++do_training=True \
  ++model.data.dataset_path=/esm2-workspace/test_data/uniref202104_esm2_qc_test200_val200 \
  ++exp_manager.exp_dir=/esm2-workspace \
  ++trainer.devices=1 \
  ++model.micro_batch_size=1 \
  ++trainer.max_steps=1 \
  ++trainer.val_check_interval=1 \
  ++exp_manager.create_wandb_logger=False \
  ++trainer.limit_train_batches=1 \
  ++trainer.limit_val_batches=1
You can change the configuration used to pretrain_esm2_8M
if you have hardware constraints. See explanations for the YAML configuration below.
YAML Configuration
trainer.devices: Specifies the number of GPUs per node to use.
trainer.max_steps: Sets the maximum number of training steps.
trainer.val_check_interval: Determines how often to run validation.
trainer.limit_train_batches and trainer.limit_val_batches: Limit the number of batches for training and validation, respectively.
model.micro_batch_size: Refers to the number of samples processed per rank in a single forward/backward pass before performing a weight update.
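As a hedged rule of thumb (based on common NeMo/Megatron conventions rather than anything stated in this guide), the effective number of sequences per optimizer step is the micro batch size multiplied by the number of data-parallel ranks and any gradient accumulation steps. A minimal sketch for the two-node run.sh example, assuming no tensor or pipeline parallelism and no gradient accumulation:

# Hedged sketch: effective sequences per optimizer step for the two-node example.
MICRO_BATCH_SIZE=2   # ++model.micro_batch_size=2
DEVICES=8            # ++trainer.devices=8 (GPUs per node)
NUM_NODES=2          # ++trainer.num_nodes=2
GRAD_ACCUM=1         # assumed default
echo $(( MICRO_BATCH_SIZE * DEVICES * NUM_NODES * GRAD_ACCUM ))   # prints 32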
6.11.5.2. Multi-Node Distributed Pretraining
To create the training job, navigate to the Workloads page and click New Workload > Training.
On the project creation form page, select the desired project to run the job on.
Select Distributed for the workload architecture to run a multi-node job. This adds a drop-down menu which lists the available frameworks for running distributed jobs. Since we are running a PyTorch job, select the PyTorch framework.
For the distributed training configuration, there is an option for either Workers & master or Workers only.
Note
In this documentation, we will use the term ‘primary’ instead of ‘master’ to align with modern terminology. Please note that the UI and commands may still refer to it as ‘master’.
A primary node is typically required when nodes need to coordinate with one another, such as in a job that does an all_reduce operation. The primary node can either be identical to the workers and run the training process or be a lightweight pod that only does coordination. If a job doesn’t need coordination between processes, the Workers only option can be used. For our example, we want the pods to coordinate with each other. Select the Workers & master option.
Ensure the Start from scratch option is selected for the template.
Give the job a name like esm2nv-pretraining and click Continue.
In the new environment form that opens, select New environment. In the new form, provide a name for the environment, such as “distributed-bionemo-env”, and optionally add a description.
Under Image URL, put nvcr.io/nvidia/clara/bionemo-framework:1.9. This pulls the BioNeMo container from NGC. If desired, the image pull policy can be changed. The default value of “if not already present” is recommended for most cases, but if you are pushing new containers to the NGC org with the same tag, “always pull the image from the registry” should be used to check for updates to the image.
Most of the remaining settings on the form can be kept as default. We will change the runtime settings in the worker to keep the environment flexible for future use.
After selecting the Create Environment button, you will be returned to the job creation page with the new environment selected.
In the Commands and Arguments field, enter sh /esm2-workspace/run.sh for the command. This will run the distributed training job using the run script we created during preprocessing.
In the compute resource pane, select the desired number of workers. Note that since we are using a primary and worker configuration where we want the primary to participate in the training, we will specify one fewer worker node than the total number of nodes expected for the job. In other words, if we want to run an 8-node job where the primary will coordinate processes as well as train, we will specify seven workers and one primary (which is the next form). In our example, we will train with two total nodes. As a result, select one worker in the form.
For the compute resource, we want to train on a full DGX node with all available GPU resources. Select the compute resource in your project with a GPU capacity of eight.
In the Data Sources pane, select the name of the PVC that was created in the previous section. The name of the PVC in this example is esm2-workspace. Click Continue to navigate to the primary configuration page.
As mentioned, the primary pod can have a different configuration from the worker pods for distributed PyTorch jobs. For our example, we want to use the same setup for workers and the primary, so ensure the Allow different setup for the master radio button is deselected.
Once finished, click Create Training to queue the training job.
After creating the training workload, the job will be scheduled with Run:ai and launched once resources become available. The job will appear in the Run:ai Workloads page after submission. After completion, the model checkpoints and results will be stored in your PVC for future use in other downstream tasks such as fine-tuning or inference. For more information, refer to ESM-2nv Model Overview.
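To confirm that checkpoints and logs were written to the PVC, a minimal hedged check is sketched below. It assumes the ++exp_manager.exp_dir=/esm2-workspace/results setting from run.sh and can be run from any workload that mounts the esm2-workspace PVC (for example, the terminal in the earlier Jupyter workspace); the exact subdirectory layout depends on the exp_manager defaults.

# Hedged sketch: list training artifacts written by the distributed job.
ls -R /esm2-workspace/results | head -n 50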