System Control and Configuration

System Operations

To bring a DGX GB200 rack online, power on and configure the NVLink switch trays first to establish the NVLink fabric, then power on the compute trays. This sequence ensures that high-bandwidth GPU-to-GPU communication is active from the start, which prevents inter-node communication issues and avoids the need to restart compute nodes.

Before bringing the system online, ensure all infrastructure components are operational: power shelves energized, top-of-rack switches active, and NMX-M services available. You can verify these components are operational through Base Command Manager (BCM).

BCM manages all system components. Use BCM to power on the NVLink switch trays first, followed by the compute trays. Each NVLink switch tray boots with rack-specific configuration, enabling direct GPU communication across all compute trays via NVLink.

After the switch trays are operational, power on the compute trays. As they boot, BCM confirms that the latest updates and configurations are installed in the operating system. The compute trays establish communication with the NVLink switch trays and other resources such as storage and additional compute trays.
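The ordering described above can be sketched as a small gate: power on every switch tray, wait until all of them report ready, and only then power on the compute trays. This is a minimal illustration, not BCM or Mission Control code. The `power_on` and `is_ready` callables, the tray names, and the timeout values are all hypothetical stand-ins for whatever your management tooling provides.

```python
import time
from typing import Callable, Iterable


def bring_rack_online(
    switch_trays: Iterable[str],
    compute_trays: Iterable[str],
    power_on: Callable[[str], None],   # hypothetical: powers on one tray
    is_ready: Callable[[str], bool],   # hypothetical: True once a tray is operational
    timeout_s: float = 600.0,          # assumed value, tune for your site
    poll_s: float = 5.0,               # assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Power on switch trays, wait for all of them, then power on compute trays.

    Returns the trays in the order they were powered on.
    """
    switch_trays = list(switch_trays)
    compute_trays = list(compute_trays)
    order: list[str] = []

    # Phase 1: switch trays come up first to establish the NVLink fabric.
    for tray in switch_trays:
        power_on(tray)
        order.append(tray)

    # Gate: leave compute trays off until every switch tray reports ready.
    waited = 0.0
    while not all(is_ready(t) for t in switch_trays):
        if waited >= timeout_s:
            raise TimeoutError("switch trays not ready; compute trays left powered off")
        sleep(poll_s)
        waited += poll_s

    # Phase 2: compute trays boot against an already active fabric.
    for tray in compute_trays:
        power_on(tray)
        order.append(tray)
    return order
```

The explicit gate between the two phases is the point of the sketch: if the switch trays never become ready, the function raises before any compute tray is powered on, so no compute tray boots without the fabric and needs a restart later.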

Power operations are orchestrated through NVIDIA Mission Control. Refer to the NVIDIA Mission Control Administration Guide for a complete list of available system operations.

Operating System Updates

The operating system comes pre-installed on each tray but is managed by Base Command Manager. When a compute tray boots, it checks for updates over the network. If updates are available, they are installed before the operating system starts; otherwise, the tray boots directly from local storage without waiting for an installation step. This ensures that each compute tray boots with the correct operating system and the latest configuration.
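The boot-time decision described above can be sketched as a comparison between the image on local storage and the image currently assigned to the tray. This is only an illustration of the logic. The `ImageState` type, its fields, and the step strings are hypothetical and do not reflect how Base Command Manager represents or compares images internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageState:
    """Hypothetical description of a software image: a name plus a revision counter."""
    name: str
    revision: int


def plan_boot(local: ImageState, assigned: ImageState) -> list[str]:
    """Return the ordered boot steps for a compute tray."""
    steps = ["check head node for updates"]
    # Install updates only when the local copy differs from the assigned image.
    if local != assigned:
        steps.append(f"sync {assigned.name} r{assigned.revision} to local storage")
    steps.append("boot from local storage")
    return steps
```

When the local and assigned images match, the plan contains only the check and the boot step, which corresponds to the fast path where no installation is needed.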

You can install additional software and apply system configuration changes directly to the software images on the Base Command Manager head node. These changes persist on a compute tray once the image is assigned to it, which gives you a single point of management for all operating systems needed in the cluster.

For details on customizing software images configured to boot on system trays, refer to the NVIDIA Mission Control Administration Guide.

Firmware Updates

Compute and switch trays require periodic firmware updates. Each firmware release ships with its own update instructions, because different components may require different actions. Not every component needs an update in every release, so always review the firmware release notes for the complete recipe (the firmware packages and their version numbers).

While firmware can be applied to individual components, it is typically released as a recipe to update all applicable components simultaneously. NVIDIA Mission Control provides features to perform firmware updates efficiently instead of manually updating components one at a time.
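To illustrate the recipe idea, the sketch below compares the firmware versions installed on a tray with the versions listed in a recipe and reports only the components that need updating. It is a hedged example of the comparison, not a Mission Control feature. The function name and the dictionary layout are assumptions, and any component names or version strings you pass in are your own; the real recipe comes from the firmware release notes.

```python
def pending_firmware_updates(
    installed: dict[str, str],
    recipe: dict[str, str],
) -> dict[str, tuple[str, str]]:
    """Map each component that differs from the recipe to (installed, target).

    Components absent from the recipe are ignored, because not every
    component is updated in every release.
    """
    pending: dict[str, tuple[str, str]] = {}
    for component, target in recipe.items():
        # A component missing from the inventory is reported as "unknown".
        current = installed.get(component, "unknown")
        if current != target:
            pending[component] = (current, target)
    return pending
```

An empty result means the tray already matches the recipe. A non-empty result lists every applicable component in one pass, which mirrors the approach of updating by recipe rather than one component at a time.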

Health Checks

NVIDIA Mission Control continuously monitors system health and verifies that the compute and switch trays are functioning. It also monitors critical conditions, such as leaks, that may require immediate operator attention.

Health check issues are highlighted on the NVIDIA Mission Control dashboards, allowing system administrators to investigate the source of problems. Issues may stem from software misconfiguration, hardware problems, or data center conditions such as air temperature or liquid cooling system issues.