Release Notes

1.3.1

Feature

  • This release supports a cluster upgrade path for AWS customers, including those who use the Private Access feature.

  • This release supports GCP cluster rebuilds only. DGX Cloud Create customers on GCP will require a new Kubernetes cluster provided by NVIDIA.

Fixed

  • [NGCC-25312] Fixed an infrastructure issue that blocked firewall deletion when taking down AWS clusters.

  • [NGCC-25338] Addressed an issue where serviceAccounts were not found in a cluster spec manifest file.

  • [NGCC-25340] Fixed a DNS issue in Private Access that prevented a jupyter-tensorboard workload from starting.

  • [NGCC-25352] Fixed an issue in Private Access with VPC Peering that was blocking access to a pod using the “runai bash” command.

  • [NGCC-25454] Addressed an issue where the gpu-operator version did not match the expected version in the deployment manifest.

  • [NGCC-25470] Fixed a problem with Private Access that caused newly created jobs to remain in Pending status.

Known Issues

1.3.0

Feature

  • This release supports AWS cluster rebuilds only. DGX Cloud Create customers on AWS will require a new Kubernetes cluster provided by NVIDIA.

  • Run:ai Agent 2.21 new features as listed here.

    • PVC data replication is now supported in DGX Cloud Create. This allows teams of AI researchers to share data across departments and projects. Refer to the NVIDIA Run:ai documentation on Data Volumes for more information. For important notes on protecting replicas, see the Data Volumes in DGX Cloud Create section.

    • The Kubeflow version has been upgraded to 1.9.1.

  • The NVIDIA GPU Operator is 25.3.0 with its features and fixes described here.

    • Note that the version of NVIDIA GDRCopy has been set to 2.5.

  • Updated NVIDIA Data Center GPU drivers:

  • Kubernetes versions: 1.30

  • Updated AWS AMI to the 6.8 LTS kernel.

Fixed

  • Run:ai Agent 2.21 fixes as listed here.

  • [NGCC-24989] Addressed an issue where a NIM could not start. This was due to a limitation where knative could not use multiple secrets to pull the image from nvcr.io.

Known Issues

1.2.2

Feature

  • Storage Control feature as documented in the Storage User Guide. This feature provides:

    • Storage class encapsulation:

      • a standard set of storage classes to encapsulate shared storage across the CSP storage variants, allowing customers to clearly understand their quota management

    • Support for object storage mounts:

      • an API for mounting object storage into customer workloads

    • Volume protection:

      • an API for the retention and deletion of provisioned storage volumes

  • Run:ai Agent 2.20.38 features and fixes as listed here.

Fixed

  • [NGCC-23214] Fixed an issue where a secret could not be mounted as a volume.

  • [NGCC-23229] Fixed an issue where a runaidgxcnetworkpolicies resource reported an error if no egress policy was explicitly set.

  • [NGCC-23232] Fixed an issue with unhandled XIDs sent to Grafana.

Known Issues