NVIDIA DGX SuperPOD: Release Notes 10.24.01

Introduction

This document covers the NVIDIA Base Command™ Manager (BCM) 10.24.01 software release on NVIDIA DGX SuperPOD™ configurations. Except for Component Versions, the information herein is the same as in the NVIDIA Base Command Manager Release Notes.

Information about BCM and DGX SuperPOD is available at:

Important

The NVIDIA DGX SuperPOD: Release Notes 10.24.01 is also available as a PDF.

Component Versions

DGX SuperPOD component versions for this release are in Table 1.

Table 1. Common component versions

| Component | Version |
| --- | --- |
| BCM ISO | 10.24.01 |
| DGX OS | 6.1.0 |
| Ubuntu | Ubuntu 22.04.1 LTS |
| Enroot | 3.4.1-1 |
| CUDA toolkit | 12.2 |
| DCGM | 3.1.8 |
| Cumulus OS | 5.5.1 |
| Mellanox InfiniBand Switch (DGX H100) | MLNX OS version: 3.11.2016; HCA firmware: CX7 - 28.39.2048 |
| Mellanox InfiniBand Switch (DGX A100) | MLNX OS version: 3.11.2016; HCA firmware: CX7 - 28.39.2048 |
| Slurm | 23.02.7 |
| Mellanox OFED Driver (A100 and H100) | 23.10-1.1.9.0 LTS (Slogin and DGX nodes) |
| DGX kernel | 5.15.0-1042-nvidia |
| GPU Driver | 535.129.03 |
| Lustre Client | lustre-client-modules-5.19.0-45-generic 2.14.0-ddn125 |
| UFM | UFM Enterprise SW: 6.15.1-4 |
| HPL | hpc-benchmarks:23.10 |
| NCCL | tensorrt:23.12-py3 |
| DGX FW | 1.1.3 |

Change Requests

General

New Features

  • The head node installer now creates a new /etc/cm-install-release file that records the installation time and the installation media that was used (see the sketch after this list)

  • Added support for upgrading BCM3/Bright9.2 clusters to BCM 10
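
The exact layout of /etc/cm-install-release is not documented here; the sketch below assumes a simple text file and, where possible, a hypothetical "key: value" layout. It only reads and prints the recorded information.

```python
# Minimal sketch: read /etc/cm-install-release on the head node and print its
# contents. The "key: value" layout assumed here is hypothetical; if the file
# is free-form text, the unparsed lines are reported as-is.
from pathlib import Path

RELEASE_FILE = Path("/etc/cm-install-release")

def read_install_release(path: Path = RELEASE_FILE) -> dict:
    """Return any 'key: value' pairs found in the file; keep other lines raw."""
    info, raw = {}, []
    for line in path.read_text().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
        elif line.strip():
            raw.append(line)
    if raw:
        info["_unparsed"] = raw
    return info

if __name__ == "__main__":
    if RELEASE_FILE.exists():
        for key, value in read_install_release().items():
            print(f"{key}: {value}")
    else:
        print(f"{RELEASE_FILE} not found (pre-10.24.01 installation?)")
```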

Improvements

  • Added cuda-driver-535 package

  • The mlnx-ofed packages’ installation scripts now pin the kernel packages on Ubuntu when deploying MOFED (see the sketch after this list)

  • Updated mlnx-ofed58 to 5.8-4.1.5.0

  • Updated mlnx-ofed23.10 to 23.10-1.1.9.0

  • Updated cuda12.3 to 12.3 update 2

  • Updated cm-nvhpc to 23.11
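
On Ubuntu, pinning kernel packages is commonly done with `apt-mark hold`. The sketch below only lists any kernel packages that are currently held, as a way to verify the effect of a MOFED deployment; it does not reproduce the mlnx-ofed installation scripts themselves, and the use of `apt-mark` is an assumption about the pinning mechanism.

```python
# Sketch: list Ubuntu kernel packages currently held (pinned) by apt.
# Assumes pinning was done via "apt-mark hold", which is one common mechanism,
# not necessarily the exact one used by the mlnx-ofed installation scripts.
import subprocess

def held_kernel_packages() -> list[str]:
    out = subprocess.run(
        ["apt-mark", "showhold"], capture_output=True, text=True, check=True
    ).stdout
    return [pkg for pkg in out.splitlines() if pkg.startswith("linux-")]

if __name__ == "__main__":
    held = held_kernel_packages()
    if held:
        print("Held kernel packages:")
        for pkg in held:
            print(f"  {pkg}")
    else:
        print("No kernel packages are currently held.")
```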

CMDaemon

Improvements

  • Update the Kubernetes users’ configuration files with Run:ai configuration settings

  • Redirect the output from cm-burn to tty1

  • Added new GPU totals metrics for temperature and NVLink bandwidth (see the Prometheus query sketch after this list)

  • The BCM GPU autodetection configuration mechanism can now also be selected in the Slurm WLM cluster settings, not only in the Slurm WLM client role

  • Ensure that kubelets can join a Kubernetes cluster even after the initial certificates have expired (which typically happens after 4 hours)
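
Monitoring data can be queried through a Prometheus-compatible HTTP API once the Prometheus integration is set up. The sketch below issues an instant query against such an endpoint; the endpoint URL and the metric name are placeholders, not the actual names of the new GPU totals metrics.

```python
# Sketch: run a Prometheus instant query for a GPU metric.
# PROMETHEUS_URL and METRIC are placeholders; substitute the real endpoint
# and the actual metric name exported for the new GPU totals.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"   # placeholder endpoint
METRIC = "gpu_temperature_total"           # hypothetical metric name

def instant_query(expr: str) -> list[dict]:
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

if __name__ == "__main__":
    for series in instant_query(METRIC):
        print(series["metric"], series["value"])
```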

Fixed Issues

  • An issue with sorting the data passed to the PromQL engine, which can result in an error “expanding series: closed SeriesSet” when running instant queries

  • An issue where the exclude list snippets are not being cloned when cloning a software image

  • A rare deadlock in CMDaemon that could occur while committing a head node

  • An issue with configuring the GATEWAYDEV on RHEL 9 when a VLAN network interface is configured on top of the BOOTIF interface

  • An issue where /etc/systemd/resolved.conf is not being added to the imageupdate exclude list for the compute nodes

  • An issue where install-license may not copy some certificates to all /cm/shared* directories on a multi-arch or multi-OS cluster

  • An issue with the Prometheus exporter when entities have recently been removed

  • An issue with parsing multiple pending Kubernetes CSRs per node, which could result in none of the CSRs being approved (see the sketch after this list)

  • On the SLES base distribution, an issue with updating the cluster landing page with links to the dashboards of other integrations such as Kubernetes or Ceph

  • An issue where CMDaemon may not restart the Slurm services automatically when the configured number of CPUs changes for some nodes

  • An issue where CMDaemon may hang waiting for events while stopping

  • An issue where the cmsh call to create a certificate may return before the certificate is written

  • An issue where entering the cmsh biossettings mode may result in an “Error parsing JSON” error message

  • In some cases, an issue with configuring Slurm when GPU automatic configuration by BCM has been selected

  • In some cases, an issue with setting up Etcd due to insufficient permissions to access the Etcd certificate files

  • An issue where a WLM job process ID may be added to an incorrect cgroup, which in some cases may result in the process being killed when another WLM job running on the same node completes

  • An issue with collecting GPU job metrics for containerized Pyxis jobs
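
For the CSR item above, the sketch below lists pending node CSRs, grouped by requesting user, using the official Kubernetes Python client. It is only an inspection aid for an administrator with a valid kubeconfig; it is not the CMDaemon code path that approves the CSRs.

```python
# Sketch: list pending (not yet approved or denied) Kubernetes CSRs, grouped
# by requesting username, using the official kubernetes Python client.
from collections import defaultdict
from kubernetes import client, config

def pending_csrs_by_user() -> dict[str, list[str]]:
    config.load_kube_config()
    api = client.CertificatesV1Api()
    pending = defaultdict(list)
    for csr in api.list_certificate_signing_request().items:
        conditions = (csr.status and csr.status.conditions) or []
        if not any(c.type in ("Approved", "Denied") for c in conditions):
            pending[csr.spec.username].append(csr.metadata.name)
    return dict(pending)

if __name__ == "__main__":
    for user, names in pending_csrs_by_user().items():
        print(f"{user}: {', '.join(names)}")
```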

cm-kubernetes-setup

New Features

  • Use Calico 3.27, Run:ai 2.15.2, and GPU operator v23.9.1 for new Kubernetes deployments using cm-kubernetes-setup

Improvements

  • Allow the option to choose Network Operator version 23.10.0

  • Allow the option to configure a custom Kubernetes Ingress certificate
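
In Kubernetes, a custom Ingress certificate is typically stored as a TLS secret that the Ingress references via spec.tls[].secretName. The sketch below creates such a secret with the Kubernetes Python client; the namespace, secret name, and file paths are placeholders, and this does not reproduce what cm-kubernetes-setup does internally.

```python
# Sketch: create a kubernetes.io/tls secret holding a custom certificate,
# which an Ingress can reference via spec.tls[].secretName. Names and file
# paths are placeholders; this is not cm-kubernetes-setup itself.
from pathlib import Path
from kubernetes import client, config

NAMESPACE = "ingress-nginx"         # placeholder namespace
SECRET_NAME = "custom-ingress-tls"  # placeholder secret name

def create_tls_secret(cert_file: str, key_file: str) -> None:
    config.load_kube_config()
    secret = client.V1Secret(
        metadata=client.V1ObjectMeta(name=SECRET_NAME, namespace=NAMESPACE),
        type="kubernetes.io/tls",
        string_data={
            "tls.crt": Path(cert_file).read_text(),
            "tls.key": Path(key_file).read_text(),
        },
    )
    client.CoreV1Api().create_namespaced_secret(NAMESPACE, secret)

if __name__ == "__main__":
    create_tls_secret("/path/to/tls.crt", "/path/to/tls.key")
```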

cm-lite-daemon

Improvements

  • Added new metrics for the total traffic on network interfaces
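
On Linux, per-interface traffic totals come from kernel counters such as /sys/class/net/<iface>/statistics/rx_bytes and tx_bytes. The sketch below reads those counters directly to show the kind of data behind such metrics; it is not cm-lite-daemon's own implementation.

```python
# Sketch: read the cumulative rx/tx byte counters for every network interface
# from /sys/class/net. These kernel counters are the kind of source a "total
# traffic" metric is built from; this is not cm-lite-daemon code.
from pathlib import Path

def interface_totals() -> dict[str, dict[str, int]]:
    totals = {}
    for iface in Path("/sys/class/net").iterdir():
        stats = iface / "statistics"
        try:
            totals[iface.name] = {
                "rx_bytes": int((stats / "rx_bytes").read_text()),
                "tx_bytes": int((stats / "tx_bytes").read_text()),
            }
        except OSError:
            continue  # interface disappeared or counters unreadable
    return totals

if __name__ == "__main__":
    for name, counters in sorted(interface_totals().items()):
        print(f"{name}: rx={counters['rx_bytes']} B, tx={counters['tx_bytes']} B")
```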

cm-wlm-setup

Fixed Issues

  • In some cases, an issue with installing Pyxis on multi-arch or multi-distro software images

  • Pyxis’ enroot is now configured to use its internal default for the cache directory, which was previously set to a directory under /run

cmsh

New Features

  • Added a new cmsh “multiplexers” command in monitoring / setup mode, which shows which nodes will run a specified data producer on behalf of other entities

pythoncm

Improvements

  • Added a new pythoncm example script, total-job-power-usage.py, for calculating the power usage of WLM jobs (a conceptual sketch of the calculation follows)
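
The actual example script ships with pythoncm; the sketch below only illustrates the underlying arithmetic of turning sampled node power readings into a job's energy use. The sample data is made up, and trapezoidal integration is one straightforward way to do the calculation, not necessarily the method used by the shipped script.

```python
# Conceptual sketch (not the shipped total-job-power-usage.py): estimate a WLM
# job's energy use by integrating power samples over the job's runtime with
# the trapezoidal rule. The samples below are made-up (time in seconds,
# power in watts) for the nodes the job ran on.
samples = {
    "dgx-01": [(0, 5200.0), (60, 6100.0), (120, 6050.0), (180, 5300.0)],
    "dgx-02": [(0, 5150.0), (60, 6000.0), (120, 6080.0), (180, 5250.0)],
}

def node_energy_joules(points: list[tuple[float, float]]) -> float:
    """Trapezoidal integration of (time_s, power_W) samples -> energy in joules."""
    return sum(
        (t1 - t0) * (p0 + p1) / 2.0
        for (t0, p0), (t1, p1) in zip(points, points[1:])
    )

if __name__ == "__main__":
    total_j = sum(node_energy_joules(pts) for pts in samples.values())
    print(f"Estimated job energy: {total_j / 3.6e6:.3f} kWh")
```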