4. Cluster User Guide
Welcome to your Run:ai on DGX Cloud cluster! This section of the guide is targeted at practitioners who want to run workloads on the cluster. For complete Run:ai documentation, refer to the Run:ai Documentation Library.
4.1. Logging Into Your Cluster
There are two ways to interact with a Run:ai on DGX Cloud cluster:
Using the Run:ai UI: Your cluster administrator should provide you with a URL to access the cluster. Use this URL to log in to the Run:ai UI.
Using the Run:ai CLI: This option is limited to certain user types. To use the Run:ai CLI, you must first log in using the UI. To learn how to set up the CLI, visit Accessing the Run:ai CLI in Advanced Usage.
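For reference, once the CLI is installed, a first session typically looks like the following (a sketch; the project name my-project is illustrative):

# Authenticate via the browser-based SSO flow
runai login
# Set a default project for subsequent commands
runai config project my-project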
4.1.1. Accessing the Run:ai UI
To access the Run:ai console in a browser, follow these steps:
Go to the cluster URL given to you by your cluster admin.
In the login dialog, click Continue with SSO. Your SSO login dialog will open.
Enter your details. Once logged in, you will be taken to the Run:ai Cluster Overview page.
4.2. Understanding Your Role and Scope
When working in Run:ai, your user roles determine the actions you can perform within the cluster. This section explains how to identify your role and understand your permissions. You may be assigned multiple roles within a single cluster.
4.2.1. Learning Your User Role(s)
Your cluster administrator should have informed you of your assigned role(s) when you first received the cluster URL. If you haven’t received this information, contact your cluster administrator.
To view the permissions and actions allowed by your user type, visit the Run:ai RBAC documentation.
4.3. Walking Through Your Run:ai Environment on DGX Cloud
This section covers the key concepts of Run:ai on DGX Cloud, focusing on areas which are specific to the DGX Cloud-based cluster. For more information, refer to the full Run:ai Documentation Library.
4.3.1. Overview
After logging into the Run:ai UI, you will be taken to the Run:ai Overview page.
The Overview page displays key metrics about your cluster, including resource utilization and the number of allocated GPU devices. The name of your cluster is listed in the top navigation bar.
The left navigation menu provides quick access to different sections of the UI. Use it to navigate to the Projects, Workloads, Departments, and Nodes pages, among others, as discussed below.
For more information about analytics, quota management, and nodes, visit the Run:ai documentation.
4.3.2. Departments and Projects
Visit Run:ai Departments for a detailed introduction to departments and projects.
4.3.2.1. Departments
Departments allow you to associate quotas with different teams and groups of users. Users are members of a department, and all projects must be created within a department.
Note
Only users with the Application Administrator or Editor role can create departments.
4.3.2.1.1. Creating a Department
For instructions on creating a department, refer to the Creating a Department section of the Cluster Administrators guide.
4.3.2.2. Projects
Projects are used to allocate resources and define clear guardrails between different research initiatives. Groups of users (or, in some cases, individuals) are associated with a project and can run workloads within that project against a fixed project allocation.
All projects are associated with a department. It is important to take note of the department’s quota limits, as these will impact the allocation and quota of any project associated with that department.
Multiple users can be scoped into the same project. Project-level information, including credentials, is visible to all users in that project.
4.3.2.2.1. Creating Projects
Note
Only users with the Application Administrator, Department Administrator, Editor, or Research Manager role can create and manage projects within their scope on the Run:ai cluster.
Follow Creating a Project in the Cluster Administrators guide to create a project.
4.3.2.2.2. Editing an Existing Project
If your permissions allow, you can update an existing project by clicking the checkbox to the left of the project name and then clicking Edit on the menu bar at the top of the projects overview page. This opens the project creation form, where you can update any entries as required.
You can update user access for the project by clicking on the checkbox to the left of the project name, then clicking Access Rules on the menu bar at the top of the projects overview page. This will bring up a pop-up where you can enter additional user email addresses to grant them access to the project.
4.3.3. Workloads
A workload is a computational job or task that you submit to the cluster for execution. There are two types of workloads enabled on your Run:ai on DGX Cloud cluster:
Workspaces - interactive environments best suited for development and exploration of data and code.
Trainings - distributed training jobs which are not interactive.
Note
Inference is not enabled in the cluster, and the deprecated Jobs workload has not been validated.
Workloads run within a project scope and can consume a range of resources. Detailed information about these resources is available in the next section of this guide. Here is a brief overview of the resources:
Environments - A workload must have an environment. An environment acts as a container, holding libraries, requirements, and optionally code and data.
Compute Resources - A workload requires allocated compute resources, which determine the number of CPUs and GPUs it can utilize.
Volumes - Volumes can be associated with a job, giving access to your data. They are dedicated to specific workloads and cannot be transferred.
Data Sources - You can connect to a range of data sources including Git and Persistent Volume Claims (PVCs). Data sources are created separately from workloads, and can be used between different workloads, if desired.
Credentials - Credentials scoped at the project level will automatically be injected into any workload running in the project.
Templates - Templates allow you to pre-fill the workload creation form and save it for use by users in the project at a later date. This is useful for frequently used workloads.
In the next sections, we cover each of these workload components in more detail.
We also provide full examples of creating and launching a variety of workloads in the Workload Examples guide.
4.3.4. Environments
Run:ai environments specify container URLs as well as image pull policies and entry commands for the containers. All workloads must specify an environment.
When creating an environment, you can specify the scope, indicating which departments and projects can view and use it.
4.3.4.1. Creating a New Environment
Note
Only users with the Application Administrator, Department Administrator, Editor, Environment Administrator, L1 Researcher, or Research Manager role can create environments.
To create a new environment using the Run:ai UI:
In the left navigation menu, select Environments. You will be taken to the Environments overview page which shows information on existing environments in the cluster.
Click + NEW ENVIRONMENT in the top left of the page. You will be taken to the New environment creation page.
Select a Scope for the environment. This determines which clusters, departments, groups or projects can deploy the environment.
Enter an Environment name and description.
Insert the URL for a container image, and select the pull policy for the image.
Select the architecture and type of the workload.
(Optional) Select connections for your tools.
(Optional) Under the Runtime settings drop-down menu, set any commands and arguments for running the container in the pod. Add any environment variables and a working directory for the container.
Note
When using a container image, it’s important to be aware of your container’s launch command or entry point.
(Optional) Under the Security menu, set additional Linux capabilities for the container and indicate whether the UID, GID and groups should be taken from the image or from a custom location.
Click CREATE ENVIRONMENT. You will be taken back to the Environments overview page where your environment is listed in the table.
4.3.4.2. Editing an Environment
Note
Only users with the Application Administrator, Department Administrator, Editor, Environment Administrator, L1 Researcher, or Research Manager role can edit environments.
You can edit an existing environment by creating an ‘edited copy.’ To edit an environment, from the Environment overview page in the UI:
Click on the environment you wish to edit. The menu bar will appear at the top of the page.
Click COPY & EDIT. You will be taken to the New environment overview page, which will be pre-filled with the selected environment details.
Edit the environment as you wish, then click CREATE ENVIRONMENT.
4.3.5. Compute Resources
In Run:ai, a compute resource refers to a resource request consisting of CPU devices and memory, and optionally GPUs and GPU memory. When a workspace is requested with a specific compute resource, the scheduler searches for those resources; if they are available, the workspace will be launched with access to them.
Note
On your Run:ai on DGX Cloud cluster, you cannot use partial GPUs. Therefore, avoid selecting or creating compute resources which use partial GPUs.
4.3.5.1. Creating a New Compute Resource
Note
Only users with the Application Administrator, Compute Resource Administrator, Department Administrator, Editor, L1 Researcher, or Research Manager role can create new compute resources.
To create a new compute resource:
Navigate to the Compute Resources overview page using the left navigation menu.
Click + NEW COMPUTE RESOURCE. You will be taken to the New compute resource creation page.
Set a Scope for the resource. All users with access to create workloads within that scope will be able to use that compute resource.
Give your compute resource a name and description.
Under the Resources section of the page, select the number of CPUs and GPU devices per pod, as well as memory for both CPU and GPU devices.
Note
Partial GPUs are not supported on your Run:ai on DGX Cloud cluster.
Once you have filled in the page, click CREATE COMPUTE RESOURCE. You will be taken to the Compute resources overview page, where you can see your new compute resource in the list of resources.
4.3.6. Storage
Within Run:ai, there are two types of storage:
Data Sources - persistent storage which can be attached to workloads.
Volumes - workload-specific storage created when the workload is launched; depending on configuration, the contents persist either only while the workload is running or until the workload is deleted (see Volumes below).
In the next sections, we will cover data sources and volumes in more detail and give examples of using these storage types to access and read data.
DGX Cloud provides a GCP Filestore for persistent storage in the current release. The Filestore has several storage classes available to users.
4.3.6.1. Data Sources
Data sources are persistent general-purpose storage or connections to external data, like a Git repository.
The data sources supported by DGX Cloud are:
Persistent Volume Claim (PVC) (see details below)
Git
ConfigMap
DGX Cloud does not support these data sources:
NFS
S3 Bucket
Host path
Any data source you create must be assigned a scope, which defines which projects and departments can use or connect to it. The following section gives more details on the supported data sources.
4.3.6.1.1. PVC
A Persistent Volume Claim (PVC) is a Kubernetes request for storage capacity initiated by a user. In the context of a Run:ai data source, the PVC is a pool of storage that can be provisioned against a set of available storage types in the DGX Cloud deployment, with varying capabilities and capacity.
Conceptually, a PVC data source should be used as a high-performance shared directory for storing data that needs to be accessed across the cluster by your workloads.
Important
Creating a PVC counts against the storage quota available for your cluster. Use each PVC efficiently so that you do not exhaust your total available storage quota too quickly.
Recommended Storage Classes
The following PVC storage class is recommended for use in DGX Cloud. Its capabilities and performance characteristics are presented here.

Service Tier | GKE Storage Class | GKE PVC Size | Deployment | Access Mode
---|---|---|---|---
Zonal | zonal-rwx | 10-100 TiB | Zonal | read/write/many
The following storage classes will work, but are not recommended for use:

Service Tier | GKE Storage Class | GKE PVC Size | Deployment | Access Mode
---|---|---|---|---
Basic HDD | standard-rwx | 1-63.9 TiB | Zonal | read/write/many
Basic SSD | premium-rwx | 1-63.9 TiB | Zonal | read/write/many
Enterprise | enterprise-rwx | 1-10 TiB | Regional | read/write/many
Enterprise | enterprise-multishare-rwx | 10 GiB-1 TiB | Regional | read/write/many
All other PVC storage classes (none, standard, standard-rwo, premium-rwo) are unsupported and their use will result in job failures.
Note
For information on total storage capacity available for each storage class, or current storage utilization, please reach out to your NVIDIA TAM.
Creating a PVC
Note
Only users with the Application Administrator, Data Source Administrator, Department Administrator, Editor, L1 Researcher, or Research Manager role can create a PVC.
To create a PVC from the Run:ai UI:
From the Data Sources overview page, click + NEW DATA SOURCE. A drop-down menu will appear.
From the drop-down menu, select PVC. You will be taken to the New data source creation page.
Set a Scope for the PVC, and enter a name and description.
Fill out the Data mount section of the form:
Select a Storage class. Be sure to review the DGX Cloud recommended storage classes.
Select the access mode configuration for the PVC - either read/write by one node, read only by many nodes, or read/write by many nodes.
Specify a claim size to ensure a minimum capacity for the PVC.
Note
The PVC may be larger than necessary, depending on the minimum partition size of the underlying storage. Refer to the table above when provisioning storage greater than 1 TiB to ensure compatibility with the target storage class.
Choose the Filesystem option as the Volume mode.
Note
The Block Volume mode is unsupported. Selecting it may lead to errors when launching a workload that uses the resulting PVC.
Specify a Container path to define what path the PVC will be accessible from in a running job.
Note
Do not specify a value of /scratch, as this will conflict with default storage provisioned for GPU-based compute resources.
(Optional) In the Restrictions pane, you can use the toggle switch to make the storage read-only if desired.
Note
This is an alternative to the PVC access mode configuration.
Click CREATE DATA SOURCE. You will be taken to the Data sources overview page, where you can view your new PVC data source.
Note
When creating a new data source, you can also select an existing PVC. This PVC can be created using the K8s API or kubectl on the cluster directly. For more information on interacting with the K8s cluster, refer to Setting up Your Kubernetes Configuration File.
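For illustration, a minimal manifest for such a pre-created PVC using the recommended zonal-rwx storage class might look like the following (the name, namespace, and size are examples; zonal-rwx requires at least 10 TiB per the table above):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-shared-pvc
  namespace: runai-<Run:ai projectName>
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: zonal-rwx
  resources:
    requests:
      storage: 10Ti

Apply it with kubectl apply -f my-shared-pvc.yaml, after which it can be selected when creating the data source.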
4.3.6.1.2. Git
Git is a version control system (VCS) used to manage iterative code modification and collaboration within a software project. As a Run:ai data source, it takes a public or private git repository and makes it available inside a workload as a filesystem path.
Creating a Git Data Source
Note
Only users with the Application Administrator, Data Source Administrator, Department Administrator, Editor, L1 Researcher, or Research Manager role can create a Git data source.
To create a Git data source within your Run:ai cluster using the UI:
From the Data sources overview page, click + NEW DATA SOURCE. A drop-down menu will appear.
From the drop-down menu, select Git. You will be taken to the New data source creation page.
Set the scope for the Git data source.
Enter a name and a description into the relevant fields.
Fill in the URL for the Git repository, and optionally specify the branch.
Select the relevant credentials for the data source. These are required if the Git repository is private. (You can create the necessary credentials in the Run:ai Credentials web interface.)
Set the Container path. This defines the path from which the Git data source will be accessible in a running job.
Note
Do not specify a value of /scratch, as this will conflict with default storage provisioned for GPU-based compute resources.
Click CREATE DATA SOURCE. You will be taken to the Data sources overview page, where your new data source will be shown in the table.
4.3.6.1.3. ConfigMap
A ConfigMap is a Kubernetes object used to store non-confidential data as key-value pairs. As a Run:ai data source, it takes a ConfigMap that exists in the Run:ai Kubernetes cluster and makes it available inside a workload as a file.
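The ConfigMap you select must already exist in the cluster. As a sketch, you could create one with kubectl (the name, key, and namespace are illustrative):

kubectl create configmap my-settings --from-literal=LOG_LEVEL=info -n runai-<Run:ai projectName>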
Note
Only users with the Application Administrator, Data Source Administrator, Department Administrator, Editor, L1 Researcher, or Research Manager role can create a ConfigMap.
To create a ConfigMap in your Run:ai cluster from the UI:
From the Data sources overview page, click + NEW DATA SOURCE. A drop-down menu will appear.
From the drop-down menu, select ConfigMap. You will be taken to the New data source creation page.
Select a Scope for the ConfigMap.
Note
ConfigMaps can only be scoped at the individual project level. You cannot select the whole cluster or a department as the scope for a ConfigMap.
Enter a name and description.
From the ConfigMap name drop-down menu, select a ConfigMap which exists on your Run:ai Kubernetes cluster.
Set the Container path, which defines what path the ConfigMap will be accessible from in a running job.
Note
Do not specify /scratch as the path, as this will conflict with the default storage provisioned for GPU-based compute resources.
Click CREATE DATA SOURCE. You will be taken to the Data sources overview page, where the new ConfigMap will appear in the table.
4.3.6.2. Volumes
Volumes are workload-specific storage in Run:ai. At workload creation time, a user can create a new volume. Most configuration choices for a volume mirror those documented in the PVC Data Source section. Additionally, a user can specify the volume’s contents as either persistent or ephemeral. A persistent volume retains all data written to it until the workload it is attached to is deleted. An ephemeral volume’s contents are removed every time a workload is stopped.
Workloads can use ephemeral local storage for scratch space, caching, and logs. The lifetime of local ephemeral storage does not extend beyond the life of the individual pod. It is exposed to pods through the container’s writable layer, logs directory, and emptyDir volumes. The data in an emptyDir volume is preserved across container crashes.
When a user submits a job that requires a GPU, an emptyDir volume on the local SSD of the node is attached to the pod as a scratch disk. The /scratch directory appears to have 6 TiB, which is the total storage on the node; however, each pod is limited to 200 GiB. If a pod exceeds this 200 GiB limit, it will be evicted from the node (i.e., terminated).
CPU-only workloads do not get a /scratch directory provisioned.
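To see the scratch disk from inside a running GPU workload, you can inspect the mount point (a sketch; the pod name is illustrative):

kubectl exec -it <pod-name> -n runai-<Run:ai projectName> -- df -h /scratch

Keep in mind that the size reported reflects the node's total local SSD capacity, not the 200 GiB per-pod limit.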
4.3.7. Credentials
The Run:ai UI supports credentials for accessing containers and applications that are otherwise gated. The supported credential types on your cluster are:
Docker registries
Access keys
Usernames and passwords
Credentials of type Generic Secret are not enabled on the cluster.
When you add credentials to your cluster, you must select a scope. The credentials are then usable by any workload which is deployed at that scope level. The smallest scope which can be applied is the project-level scope. There is no user-specific scoping available for credentials.
As part of your subscription to Run:ai on DGX Cloud, you are granted access to the NVIDIA GPU Cloud (NGC) Catalog. The NGC Catalog contains GPU-accelerated AI models and SDKs that enable you to quickly infuse AI into your applications.
4.3.7.1. Accessing Your NGC Org
Your cluster administrator can invite you to NVIDIA GPU Cloud (NGC). Once you have received the invitation, follow the instructions in the email to set up your account.
4.3.7.2. Setting Up Your NGC API Key
To generate your NGC API key, follow these steps:
Log into NGC.
Click on your user account menu at the top right of the screen and select Setup.
Click Generate Personal Key and generate the key in the form that opens. Save the displayed key in a safe place; it is shown only once and is required for later steps.
4.3.7.3. Adding NGC Credentials to the Run:ai Cluster
Note
Only users with the Application Administrator, Credentials Administrator, Department Administrator, Editor, or L1 Researcher role can add credentials to the cluster.
To add NGC Credentials to the Run:ai Cluster:
Access the Credentials page from the Run:ai left navigation menu.
Click + NEW CREDENTIALS and select Docker registry from the drop-down menu. You will be taken to the New credential creation page.
Select the Scope for your NGC credential. The secret will be usable by any workload launched within the scope. For example, if your scope is set at the department level, all workloads launched in any project associated with that department can use the secret, regardless of which user created the credential, or launched the workload.
Enter a name and description for the credential. This will be visible to any cluster user.
Select New secret.
For username, use $oauthtoken.
For password, paste your token.
Under Docker Registry URL, enter nvcr.io.
Click CREATE CREDENTIALS. Your credentials will now be saved in the cluster.
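Optionally, you can verify the key locally before adding it by logging in to nvcr.io with Docker (a sketch; enter the personal key when prompted for a password — the single quotes keep the shell from expanding the literal username):

docker login nvcr.io --username '$oauthtoken'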
4.3.7.4. Adding Git Credentials
To add Git credentials to your Run:ai on DGX Cloud cluster, you must first create a Personal Access Token from your Git server, granting access to the desired repositories as part of the creation process. You can create a token on GitHub by following these instructions, and on GitLab by following these instructions.
Note
Only users with the Application Administrator, Credentials Administrator, Department Administrator, Editor, or L1 Researcher role can add credentials to the cluster.
Once you have generated a Personal Access Token, add your credentials to your Run:ai cluster by following these instructions:
Access the Credentials page from the Run:ai left navigation menu.
Click + NEW CREDENTIALS and select Username & password from the drop-down menu. You will be taken to the New credential creation page.
Select the Scope for your credential. The secret will be usable by any workload launched within the scope.
Enter a name and description for the credential. This will be visible to any cluster user.
Select New secret.
For username, use your Git username.
For password, paste your Personal Access Token.
Click CREATE CREDENTIALS. Your credentials will now be saved in the cluster.
4.3.8. Templates
Templates are used to pre-define commonly used workloads. They make it quick and easy to launch a workload based on pre-set instructions.
Note
Only users with the Application Administrator, Credentials Administrator, Department Administrator, Editor, L1 Researcher, Research Manager, or Template Administrator role can create templates.
To create a template, navigate to the Templates overview page from the left navigation menu then:
Click + NEW TEMPLATE. The New Template Creation page will open.
Set the Scope for the template. This determines which departments and projects will be able to use this template.
Enter a template name and description.
Select or create an Environment for your template to use.
Choose a compute resource.
(Optional) Select values for Volumes and Data Sources.
Once you have selected all your desired settings for the template, click CREATE TEMPLATE. You will be taken back to the Templates overview page.
Once created, templates can be selected for use as part of the New Workload creation process.
4.4. Using Your Data From Cloud Storage
There are two ways to use your data residing in Google or other clouds:
Copy the data into the DGX Cloud environment, from Google Cloud Storage (GCS) or another cloud, onto the shared NFS filesystems.
Access the data directly from services within the cloud provider. We recommend running such jobs as pure Kubernetes jobs at this time.
Examples of both scenarios follow.
4.4.1. Example: Using GCS to copy data to PVC within Run:ai
In GCP, create a custom GCP role with the required permissions. The following is a minimal set of permissions, which may be created by cloning the Storage Object Viewer role and adding the storage.buckets.get permission:

storage.buckets.get
storage.buckets.getIamPolicy
storage.buckets.listEffectiveTags
storage.buckets.listTagBindings
storage.managedFolders.create
storage.managedFolders.get
storage.managedFolders.list
storage.objects.list
Create a GCP service account (SA).
Assign the custom role to the service account on the target bucket(s).
Create a service account key for the SA you created and download the .json key file. Protect the key file, as it contains your private key.
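The preceding four steps can also be performed from the command line; a sketch with gcloud (the role ID, SA name, and file names are illustrative):

# Create the custom role with the minimal permission set listed above
gcloud iam roles create gcsRsyncRole --project=PROJECT_ID \
  --permissions=storage.buckets.get,storage.buckets.getIamPolicy,storage.buckets.listEffectiveTags,storage.buckets.listTagBindings,storage.managedFolders.create,storage.managedFolders.get,storage.managedFolders.list,storage.objects.list

# Create the service account
gcloud iam service-accounts create gcs-rsync-sa --project=PROJECT_ID

# Assign the custom role to the SA on the target bucket
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
  --member=serviceAccount:gcs-rsync-sa@PROJECT_ID.iam.gserviceaccount.com \
  --role=projects/PROJECT_ID/roles/gcsRsyncRole

# Create and download a JSON key for the SA
gcloud iam service-accounts keys create projectID-deadbeef.json \
  --iam-account=gcs-rsync-sa@PROJECT_ID.iam.gserviceaccount.com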
Create a Kubernetes secret with the SA key in your Run:ai project (i.e., namespace) using the following command.
Note
You must have your kubeconfig set to the Run:ai researcher configuration provided by DGX Cloud.
kubectl create secret generic google-gcs-creds --from-file=key=projectID-deadbeef.json -n runai-<Run:ai projectName>
Within Run:ai, create a data source as a PVC using the following sequence:
Specify the scope, name, and an optional description for the PVC.
In the Data Mount section, select New PVC. Choose a storage class and access mode, and always select Filesystem as the Volume mode. Then, set a default mount path within any containers.
Note
The Block Volume mode is unsupported. Selecting it may lead to errors when launching a workload that uses the resulting PVC.
Use the kubectl get pvc command to determine the actual name of the PVC created by the Run:ai console. The actual name will be similar to name-for-pvc-<cluster-identifier>-<digits>. Use this name in the next step.
. Use this for the next step.Create a Kubernetes manifest YAML utilizing the Google Cloud command line to copy the data from GCS into the PVC. The example below uses the secret you created and the PVC as mounted volumes to the container. The ProjectID and Bucket name are passed to a
gcloud rsync
via environment variables.1 apiVersion: v1 2 kind: Pod 3 metadata: 4 name: gcp-gcs-copy-2-pv-podd 5 namespace: runai-<Run:ai projectName> 6 spec: 7 volumes: 8 - name: pv-storage 9 persistentVolumeClaim: 10 claimName: name-for-pvc 11 - name: gcp-sa-credential 12 secret: 13 secretName: google-gcs-creds 14 restartPolicy: OnFailure 15 schedulerName: runai-scheduler 16 containers: 17 - name: gcp-cli-container 18 image: google/cloud-sdk 19 imagePullPolicy: IfNotPresent 20 env: 21 - name: PROJECTNAME 22 value: projectID 23 - name: BUCKET 24 value: bucket-name 25 volumeMounts: 26 - mountPath: "/data/shared" 27 name: pv-storage 28 - name: gcp-sa-credential 29 mountPath: /etc/gcp_cred 30 readOnly: true 31 command: ["sh","-c"] 32 args: 33 - gcloud auth activate-service-account --key-file=/etc/gcp_cred/key; 34 gcloud storage rsync -r gs://${BUCKET} /data/shared/ --project=${PROJECTNAME};
Submit the Kubernetes manifest for execution in the DGX Cloud environment using the following command:

kubectl apply -f manifest.yaml -n runai-<Run:ai projectName>
Monitor the job for completion using kubectl get pod gcp-gcs-copy-2-pv-pod.
You can now use the PVC in your Run:ai jobs.
4.4.2. Example: Using data directly within Google Cloud Services
Run:ai training jobs can directly access data in Google Cloud storage. This section gives an example of how to do this in a Python job, and the small configuration steps needed to make it happen.
The following Python code demonstrates how to read all text files from a bucket into a list of strings. While this approach may not be practical for AI training workloads, it will help illustrate the necessary configurations.
from google.cloud import storage
import argparse
import os

def get_bucket_name():
    parser = argparse.ArgumentParser(description="Read text files from a Google Cloud Storage bucket.")
    parser.add_argument("bucket_name", help="The name of the bucket to read from.", nargs='?')
    args = parser.parse_args()

    if args.bucket_name:
        return args.bucket_name
    elif "BUCKET" in os.environ:
        return os.environ["BUCKET"]
    else:
        print("Error: Bucket name not provided. Please provide it as a command line argument or set the BUCKET environment variable.")
        return None

def read_bucket_files(bucket_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    file_contents = []
    blobs = bucket.list_blobs()
    for blob in blobs:
        if blob.name.endswith(".txt"):
            # Download the file content
            content = blob.download_as_string().decode("utf-8")
            file_contents.append(content)

    return file_contents

if __name__ == "__main__":
    bucket_name = get_bucket_name()
    if bucket_name:
        file_contents = read_bucket_files(bucket_name)
    else:
        exit(1)

    for content in file_contents:
        print(content)
To execute the above code in a containerized training job on Kubernetes, you will need to do the following:
Include the Google Cloud Storage Python libraries in the container.
Inject a secret into the container, allowing proper authorization to the GCS API.
To install the google-cloud-storage Python library, use pip to install it into the container image. For example, use the following Dockerfile:
FROM python:3.12

RUN pip install google-cloud-storage
RUN mkdir /app
COPY read.py /app/

ENTRYPOINT ["python"]
CMD ["/app/read.py"]
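As an example, you might build and push this image to your NGC private registry so the manifest below can pull it (the org ID and tag are placeholders; this assumes you have already run docker login nvcr.io):

docker build -t nvcr.io/<orgID>/gcs-read:latest .
docker push nvcr.io/<orgID>/gcs-read:latest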
To inject the secret into the container, mount a Kubernetes secret as a file, with the environment variable GOOGLE_APPLICATION_CREDENTIALS pointing at the file. The example below uses the same secret as the one created above.
apiVersion: v1
kind: Pod
metadata:
  name: gcs-reader
spec:
  containers:
    - name: gcs-reader
      image: nvcr.io/<orgID>/gcs-read:latest
      imagePullPolicy: IfNotPresent
      command: ["python", "/app/read.py"]
      env:
        - name: BUCKET
          value: bucket-name
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /etc/gcp_cred/key
      volumeMounts:
        - name: gcs-creds
          mountPath: /etc/gcp_cred
          readOnly: true
  imagePullSecrets:
    - name: nvcr.io-creds
  volumes:
    - name: gcs-creds
      secret:
        secretName: google-gcs-creds
  restartPolicy: OnFailure
Note
Currently, there isn’t a method to mount Kubernetes secrets as files in Run:ai jobs; Run:ai supports only environment variables. Since the Google libraries require a token file (which should be a secret), you will need workarounds to use them with Run:ai jobs. The recommended workarounds are:
only use Kubernetes-native jobs for data manipulation, or
copy the GCP service account key into shared storage and point the GOOGLE_APPLICATION_CREDENTIALS environment variable at that shared storage location, as sketched below.
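One way to stage the key for the second workaround is to copy it onto PVC-backed shared storage from any pod that mounts the PVC, then reference that path in the Run:ai workload (the pod name and paths are illustrative):

# Copy the SA key file into the PVC via a pod that mounts it at /data/shared
kubectl cp projectID-deadbeef.json <pod-name>:/data/shared/keys/key.json -n runai-<Run:ai projectName>

In the Run:ai workload, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to /data/shared/keys/key.json.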
4.4.3. Example: Reading data from Google BigQuery
You can use data from Google BigQuery directly in your training jobs on Run:ai. This example shows the Python script and the small configuration code needed.
In GCP, the BigQuery Data Viewer role contains the necessary permissions and may be assigned at the table, dataset, or project level. Following the principle of least privilege, NVIDIA recommends assigning the role at the table level, unless you need to read multiple tables from the dataset or project. The following is a Python code example for a BigQuery reader.
import argparse
import os
from google.cloud import bigquery

def read_bigquery_table(project, dataset, table):
    client = bigquery.Client()
    table_ref = client.dataset(dataset).table(table)
    table = client.get_table(table_ref)

    rows = client.list_rows(table)
    return rows

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Read a BigQuery table.')
    parser.add_argument('--project', help='The project ID.')
    parser.add_argument('--dataset', help='The dataset ID.')
    parser.add_argument('--table', help='The table ID.')

    args = parser.parse_args()

    project = args.project or os.environ.get('PROJECTNAME')
    dataset = args.dataset or os.environ.get('DATASET')
    table = args.table or os.environ.get('TABLENAME')

    if not project or not dataset or not table:
        print("Error: Missing project, dataset, or table name.")
        exit(1)

    rows = read_bigquery_table(project, dataset, table)

    for row in rows:
        print(row)
Include the above code in a container using the following Dockerfile:
FROM python:3.12

RUN pip install google-cloud-bigquery
RUN mkdir /app
COPY read-bq.py /app/

ENTRYPOINT ["python"]
CMD ["/app/read-bq.py"]
Finally, you can start a job/pod with the following Kubernetes manifest:
apiVersion: v1
kind: Pod
metadata:
  name: read-bq
spec:
  containers:
    - name: read-bq
      image: nvcr.io/<orgID>/read-from-bq:latest
      imagePullPolicy: Always
      command: ["python", "/app/read-bq.py"]
      env:
        - name: PROJECTNAME
          value: project
        - name: DATASET
          value: austin_training
        - name: TABLENAME
          value: example
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /etc/gcp_cred/key
      volumeMounts:
        - name: gcs-creds
          mountPath: /etc/gcp_cred
          readOnly: true
  imagePullSecrets:
    - name: nvcr.io-creds
  volumes:
    - name: gcs-creds
      secret:
        secretName: google-gcs-creds
  restartPolicy: OnFailure
4.4.4. Alternative to using Service Account Keys
Service account keys are powerful credentials and can present a security risk if they are not managed correctly. Workload Identity Federation eliminates the maintenance and security burden associated with service account keys. With Workload Identity Federation, you can use Identity and Access Management (IAM) to grant external identities IAM roles and direct access to Google Cloud resources. You can also grant access through service account impersonation.
As part of environment onboarding, NVIDIA will provide you with the OIDC issuer URL for the environment. Navigate to your organization’s Google Cloud Console, then go to IAM & Admin > Workload Identity Federation. Click the CREATE POOL button to proceed.
In the subsequent workflow wizard, provide a name for the pool and an optional description, and select OpenID Connect (OIDC) as the provider.
After selecting the provider type, enter “dgx-cloud” as the provider name and paste the Issuer URL provided with your environment into the Issuer (URL) field. Under Audiences, select Default audience.
Lastly, in the Configure provider attributes step, add a mapping from google.subject to assertion.sub.
Once complete, the Workload Identity Pool will confirm that it can communicate with the “dgx-cloud” Kubernetes cluster; you should see a green checkmark in the upper left corner near the provider you created.
Then, click CONNECTED SERVICE ACCOUNTS in the upper right corner and select the DOWNLOAD CONFIG button. When the Configure your application dialog appears, select the provider you created, paste the path /var/run/service-account/token into the OIDC ID token path field, and leave the format as “text”. Then click DOWNLOAD CONFIG to download the required configuration file.
After saving the configuration file, click DISMISS. On the main Identity Federation Pool page, locate and copy the value in the IAM Principal field. It will look similar to:

principal://iam.googleapis.com/projects/1091797687346/locations/global/workloadIdentityPools/nvidia-dgx-cloud-run-ai-env/subject/SUBJECT_ATTRIBUTE_VALUE

where SUBJECT_ATTRIBUTE_VALUE represents the Kubernetes identifier.
For example, if the service account is default in the Run:ai project runai-projectname and we substitute system:serviceaccount:runai-projectname:default as the SUBJECT_ATTRIBUTE_VALUE, the overall IAM Principal will be:

principal://iam.googleapis.com/projects/1068792787346/locations/global/workloadIdentityPools/nvidia-dgx-cloud-run-ai-env/subject/system:serviceaccount:runai-projectname:default
This is the IAM Principal to use (within GCP IAM) to grant access to workloads in that Run:ai project as shown below:
gcloud projects add-iam-policy-binding projects/PROJECT_ID \
    --role=roles/container.clusterViewer \
    --member=principal://iam.googleapis.com/projects/1068792787346/locations/global/workloadIdentityPools/nvidia-dgx-cloud-run-ai-env/subject/system:serviceaccount:runai-projectname:default \
    --condition=None
Alternatively, you can grant the same role binding through the Cloud Console UI.
Once IAM permissions have been granted, create a ConfigMap on the “dgx-cloud” Kubernetes cluster containing the configuration file you downloaded, using the following command:

kubectl create configmap google-wi --from-file clientLibraryConfig-dgx-cloud.json --namespace runai-projectname
Once the ConfigMap has been created, each workload should be augmented with three items: a volume projecting the SA token, a volume mapping the ConfigMap, and an environment variable pointing to the ConfigMap projection. The following example manifest shows a simple container with the three requirements.
Note
The audience property in the serviceAccountToken below may be copied from the audience field of the clientLibraryConfig file (be sure to add the https: prefix).
apiVersion: v1
kind: Pod
metadata:
  name: oidc-example
spec:
  containers:
    - name: example
      image: google/cloud-sdk
      command: ["/bin/sh", "-c", "gcloud auth login --cred-file $GOOGLE_APPLICATION_CREDENTIALS && sleep infinity"]
      volumeMounts:
        - name: token
          mountPath: "/var/run/service-account"
          readOnly: true
        - name: workload-identity-credential-configuration
          mountPath: "/etc/workload-identity"
          readOnly: true
      env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: "/etc/workload-identity/clientLibraryConfig-dgx-cloud.json"
  volumes:
    - name: token
      projected:
        sources:
          - serviceAccountToken:
              audience: https://iam.googleapis.com/projects/<Your PROJECT_ID>/locations/global/workloadIdentityPools/<Your WORKLOAD_POOL>/providers/<Your WORKLOAD_PROVIDER>
              expirationSeconds: 3600
              path: token
    - name: workload-identity-credential-configuration
      configMap:
        name: google-wi
To confirm the configuration, run kubectl exec -it oidc-example -n runai-projectname -- bash to access the container, then run gcloud config list to verify the configuration.
The GCP resources configured in IAM should be accessible with either gcloud or gsutil commands.
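For example, after the pod’s gcloud auth login step completes, listing a bucket that the IAM principal can access should succeed (the bucket name is illustrative):

kubectl exec -it oidc-example -n runai-projectname -- gcloud storage ls gs://my-bucket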