Release Notes#
1.3.1#
Feature#
This release supports a cluster upgrade path for AWS customers. This includes AWS customers who make use of the Private Access feature.
This release supports GCP cluster rebuilds only. DGX Cloud Create customers on GCP will require a new Kubernetes cluster provided by NVIDIA.
Fixed#
[NGCC-25312] Fixed an infrastructure issue that blocked firewall deletion when taking down AWS clusters.
[NGCC-25338] Addressed an issue where serviceAccounts were not found in a cluster spec manifest file.
[NGCC-25340] Fixed a DNS issue in Private Access that prevented a jupyter-tensorboard workload from starting.
[NGCC-25352] Fixed an issue in Private Access with VPC Peering that was blocking access to a pod using the “runai bash” command.
[NGCC-25454] Addressed an issue where the gpu-operator version did not match the expected version in the deployment manifest.
[NGCC-25470] Fixed a problem with Private Access that caused newly created jobs to remain in Pending status.
Known Issues#
Only H100 nodes are supported for AWS and GCP, not GB200 NVL72.
NVCF deployment is not supported.
Please refer to the Security Restrictions and Cluster Limitations section for more details.
1.3.0#
Feature#
This release supports AWS cluster rebuilds only. DGX Cloud Create customers on AWS will require a new Kubernetes cluster provided by NVIDIA.
Run:ai Agent 2.21 new features as listed here.
PVC data replication is now supported in DGX Cloud Create. This allows teams of AI researchers to easily share data across departments and projects. Please refer to the NVIDIA Run:ai documentation on Data Volumes for more information. For important notes on protecting replicas in DGX Cloud Create, see the Data Volumes in DGX Cloud Create section.
The Kubeflow version has been upgraded to 1.9.1.
The NVIDIA GPU Operator is 25.3.0 with its features and fixes described here.
Note that the version of NVIDIA GDRCopy has been set to 2.5.
Updated NVIDIA Data Center GPU drivers:
AWS: 570.133.20
GCP: 570.124.06
Kubernetes version: 1.30
Updated AWS AMI to the 6.8 LTS kernel.
Fixed#
Run:ai Agent 2.21 fixes as listed here.
[NGCC-24989] Addressed an issue where a NIM could not start because Knative could not use multiple secrets to pull the image from nvcr.io.
Known Issues#
Only H100 nodes are supported for AWS and GCP, not GB200 NVL72.
NVCF deployment is not supported.
Please refer to the Security Restrictions and Cluster Limitations section for more details.
1.2.2#
Feature#
Storage Control feature as documented in the Storage User Guide. This feature provides:
Storage class encapsulation: a standard set of storage classes to encapsulate shared storage across the CSP storage variants, allowing customers to clearly understand their quota management.
Support for object storage mounts: an API for mounting object storage into customer workloads.
Volume protection: an API for the retention and deletion of provisioned storage volumes.
Run:ai Agent 2.20.38 features and fixes as listed here.
Fixed#
[NGCC-23214] Fixed an issue where a secret could not be mounted as a volume.
[NGCC-23229] Fixed an issue where a runaidgxcnetworkpolicies resource would have an error if no egress policy was explicitly set.
[NGCC-23232] Fixed an issue with unhandled XIDs sent to Grafana.
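For readers unfamiliar with the scenario behind NGCC-23214, the following is a minimal sketch of what mounting a secret as a volume looks like in a plain Kubernetes Pod manifest. It is not taken from the DGX Cloud Create or Run:ai documentation. The pod name, secret name, mount path, and container image are all hypothetical, and the sketch does not show how a Run:ai workload requests such a mount.

```python
import json


def pod_with_secret_volume(pod_name, secret_name, mount_path):
    """Build a minimal Pod manifest that mounts a Secret as a read-only volume.

    All argument values are illustrative; the structure follows the core
    Kubernetes v1 Pod schema (spec.volumes[].secret.secretName).
    """
    volume_name = "secret-vol"  # hypothetical volume name linking the two sections
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "containers": [
                {
                    "name": "main",
                    "image": "busybox:1.36",  # placeholder image for the sketch
                    "command": ["sleep", "3600"],
                    # Each key in the Secret appears as a file under mount_path.
                    "volumeMounts": [
                        {
                            "name": volume_name,
                            "mountPath": mount_path,
                            "readOnly": True,
                        }
                    ],
                }
            ],
            # The volume refers to an existing Secret in the same namespace.
            "volumes": [
                {"name": volume_name, "secret": {"secretName": secret_name}}
            ],
        },
    }


if __name__ == "__main__":
    manifest = pod_with_secret_volume("secret-demo", "my-secret", "/etc/secret")
    # Kubernetes accepts JSON manifests as well as YAML.
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with standard Kubernetes tooling in a namespace where the named secret exists. Whether direct Pod creation is permitted in a given DGX Cloud Create project depends on the cluster's restrictions.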
Known Issues#
Please refer to the Security Restrictions and Cluster Limitations section.