Release Notes#
NVIDIA Run:ai on DGX Cloud (Current)#
The latest release introduces fundamental changes from the previous Create-branded release of the platform.
Feature#
Fractional GPU Support
See the existing public documentation here.
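As an illustrative sketch of what a fractional GPU request can look like: the `gpu-fraction` pod annotation and the `runai-scheduler` scheduler name below follow the public Run:ai documentation, but the pod name, image, and exact annotation values are assumptions; verify them against the linked documentation for your platform version.

```yaml
# Hypothetical pod requesting half a GPU via the Run:ai fractional GPU
# annotation. Only the annotation mechanism comes from the Run:ai docs;
# names and values here are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: frac-gpu-example        # hypothetical name
  annotations:
    gpu-fraction: "0.5"         # request 50% of a single GPU
spec:
  schedulerName: runai-scheduler  # the Run:ai scheduler handles fractions
  containers:
  - name: notebook
    image: nvcr.io/nvidia/pytorch:24.05-py3  # hypothetical image tag
```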
Kubernetes Cluster Administrator Access
Cluster administrators will be granted the cluster-admin role. The customer is fully responsible for their own configurations in their tenant cluster.
Changes#
Storage Management
The RunaiDGXCStorage API has been renamed to NvStorage; the interface is unchanged. Usage instructions remain as described in the Storage User Guide.
Shared Memory and High-Performance Networking (EFA) Configuration
Previously, pods resulting from MPIJob, PyTorchJob, or RunaiJob resources that requested a full node's worth of GPUs were auto-mutated by the system to add resource requests for EFA and Hugepages. The system no longer auto-mutates these workloads to add these resource requests.
NVIDIA Run:ai provides the Compute Resources asset type. A compute resource is a preconfigured building block that encapsulates all the specifications of compute requirements for a workload, such as CPU, GPU, and memory. In addition, compute resources can specify additional requirements, called Extended Resources. Researchers can use this feature to request EFA and Hugepages for their workloads. For example:
hugepages-2Mi: 5Gi
vpc.amazonaws.com/efa: "32"
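For illustration, a sketch of where these extended resources sit in a resulting pod spec: only the two keys above come from these release notes; the surrounding fields are standard Kubernetes, and the pod name, image, and quantities are assumptions.

```yaml
# Illustrative pod spec fragment. Only hugepages-2Mi and
# vpc.amazonaws.com/efa come from the release notes; everything
# else is standard Kubernetes with hypothetical names and values.
apiVersion: v1
kind: Pod
metadata:
  name: efa-training-example     # hypothetical name
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.05-py3  # hypothetical image tag
    resources:
      limits:
        nvidia.com/gpu: 8              # a full node's worth of GPUs
        hugepages-2Mi: 5Gi             # hugepages extended resource
        vpc.amazonaws.com/efa: "32"    # EFA devices on AWS
      requests:
        memory: 64Gi                   # hugepages also require a memory request
```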
External Access to In-Cluster Workloads
For external access to in-cluster workloads (for example, NIM inference), the configuration is set up manually by the NVIDIA Run:ai engineering team.
Removed#
Cluster Network Policy Configuration
DGX Cloud Create had a self-service API for configuring the Kubernetes cluster network policy. This is no longer available or supported. Instead, as part of the cluster migration process, the engineering team will migrate and add the relevant CIDRs to the cluster network policy on your behalf. Future changes will also be made by the engineering team.
Exporting Kubernetes API Server Audit Logs
DGX Cloud Create had a self-service API for exporting the Kubernetes API server’s audit logs to an S3 bucket of your choice. This is no longer available or supported. Instead, Kubernetes API Server audit logs will be provided upon request.
SaaS#
For the full list of the features and bug fixes in the NVIDIA Run:ai SaaS platform itself, please refer here.
1.3.1#
Feature#
This release supports a cluster upgrade path for AWS customers. This includes AWS customers who make use of the Private Access feature.
This release supports GCP cluster rebuilds only. DGX Cloud Create customers on GCP will require a new Kubernetes cluster provided by NVIDIA.
Fixed#
[NGCC-25312] Fixed an infrastructure issue that blocked firewall deletion when taking down AWS clusters.
[NGCC-25338] Addressed an issue where serviceAccounts were not found in a cluster spec manifest file.
[NGCC-25340] Fixed a DNS issue in Private Access that prevented a jupyter-tensorboard workload from starting.
[NGCC-25352] Fixed an issue in Private Access with VPC Peering that was blocking access to a pod using the "runai bash" command.
[NGCC-25454] Addressed an issue where the gpu-operator version did not match the expected version in the deployment manifest.
[NGCC-25470] Fixed a problem with Private Access that caused newly created jobs to remain in Pending status.
Known Issues#
Only H100 nodes are supported on AWS and GCP; GB200 NVL72 is not supported.
NVCF deployment is not supported.
1.3.0#
Feature#
This release supports AWS cluster rebuilds only. DGX Cloud Create customers on AWS will require a new Kubernetes cluster provided by NVIDIA.
Run:ai Agent 2.21 new features as listed here.
PVC data replication is now supported in DGX Cloud Create, allowing teams of AI researchers to easily share data across departments and projects. Refer to the NVIDIA Run:ai documentation on Data Volumes for more information, and note the important guidance on protecting replicas in the Data Volumes in Run:ai on DGX Cloud section.
The Kubeflow version has been upgraded to 1.9.1.
The NVIDIA GPU Operator is 25.3.0 with its features and fixes described here.
Note that the version of NVIDIA GDRCopy has been set to 2.5.
Updated NVIDIA Data Center GPU drivers:
AWS: 570.133.20
GCP: 570.124.06
Kubernetes version: 1.30
Updated AWS AMI to the 6.8 LTS kernel.
Fixed#
Run:ai Agent 2.21 fixes as listed here.
[NGCC-24989] Addressed an issue where a NIM could not start. This was due to a limitation where Knative could not use multiple secrets to pull the image from nvcr.io.
Known Issues#
Only H100 nodes are supported on AWS and GCP; GB200 NVL72 is not supported.
NVCF deployment is not supported.
1.2.2#
Feature#
This release adds the Storage Control feature, as documented in the Storage User Guide. This feature provides:
Storage class encapsulation: a standard set of storage classes that encapsulate shared storage across the CSP storage variants, allowing customers to clearly understand their quota management.
Support for object storage mounts: an API for mounting object storage into customer workloads.
Volume protection: an API for the retention and deletion of provisioned storage volumes.
Run:ai Agent 2.20.38 features and fixes as listed here.
Fixed#
[NGCC-23214] Fixed an issue where a secret could not be mounted as a volume.
[NGCC-23229] Fixed an issue where a runaidgxcnetworkpolicies resource would have an error if no egress policy was explicitly set.
[NGCC-23232] Fixed an issue with unhandled XIDs sent to Grafana.