NVIDIA NVOS User Manual for NVLink Switches v25.02.2141

Cluster Management

NVOS includes a robust infrastructure that enables the execution of cluster applications on the CPU of the switch. This release supports two cluster applications: NMX-Controller and NMX-Telemetry. The NVOS cluster infrastructure streamlines the management and monitoring of these applications, providing a seamless user experience.

Installation and Upgrade

The NVOS cluster infrastructure includes the cluster applications package files within the NVOS image. The packages are automatically installed along with the NVOS image installation and upgrade process, ensuring a hassle-free setup.

Management and Configuration

The NVOS cluster infrastructure provides a user-friendly Command Line Interface (CLI) and RESTful APIs to manage and configure the cluster applications. Users can perform the following tasks:

  • Start and stop the execution of the cluster applications

  • Manage the log verbosity level of the cluster applications

  • Configure common functionalities across all cluster applications, such as the gRPC connection with an external manager

  • Monitor the operational health of the cluster applications

The gRPC connection with the external manager supports three modes: unencrypted, TLS, and mTLS. The cluster applications act as the server-side of the gRPC connection. For encrypted gRPC modes (TLS and mTLS), the cluster applications facilitate the installation of security certificates and support key rotations to maintain a secure communication channel.
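
For illustration, the following sketch shows how an external manager, acting as the gRPC client, might build mTLS channel credentials before connecting to a cluster application. It is a minimal example using the Python grpcio package; the endpoint address, port, and certificate file names are placeholder assumptions rather than values defined by NVOS.

```python
# Hedged sketch: an external manager (gRPC client) opening an mTLS channel
# to a cluster application on the switch. The endpoint and certificate
# paths are illustrative assumptions, not values defined by NVOS.
import grpc

# CA certificate that signed the cluster application's server certificate,
# plus the manager's own key/certificate pair for mutual authentication.
with open("ca.crt", "rb") as f:
    ca_cert = f.read()
with open("manager.key", "rb") as f:
    client_key = f.read()
with open("manager.crt", "rb") as f:
    client_cert = f.read()

# For TLS-only mode, omit private_key/certificate_chain; for unencrypted
# mode, use grpc.insecure_channel() instead.
credentials = grpc.ssl_channel_credentials(
    root_certificates=ca_cert,
    private_key=client_key,
    certificate_chain=client_cert,
)

# "switch.example.com:9339" is a placeholder for the cluster application's
# gRPC listener address.
channel = grpc.secure_channel("switch.example.com:9339", credentials)
```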

Persistence and Recovery

Once the user starts the operation of the cluster applications, any subsequent NVOS boot will automatically restart the cluster applications, ensuring continuous availability. Additionally, upgrading the NVOS will also upgrade the cluster applications seamlessly. User configurations for the cluster infrastructure persist across NVOS reboots and upgrades, eliminating the need for manual reconfiguration. In case of a factory reset of the NVOS, the cluster infrastructure and cluster applications will also be reset to their default state.

The NMX-Controller is a cluster application that provides fabric SDN services. In the GB200 NVL system, the SDN services are the Subnet Manager (SM) and the Global Fabric Manager (GFM).

The configuration and operation of the SDN services are managed either through the gRPC interface with an external manager or through the NVUE CLIs and REST APIs.
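
As a rough illustration of the REST option, the snippet below queries the switch over HTTPS with the Python requests library. The base URL, resource path, and credentials are hypothetical placeholders; consult the NVOS REST API reference for the actual endpoints and authentication details.

```python
# Hedged sketch: reading cluster application state over the REST API.
# The base URL, port, path, and credentials are assumptions for
# illustration only.
import requests

SWITCH = "https://switch.example.com:443"   # placeholder management address
PATH = "/nvue_v1/cluster/apps"              # hypothetical resource path

response = requests.get(
    SWITCH + PATH,
    auth=("admin", "admin-password"),       # placeholder credentials
    verify="/path/to/switch-ca.crt",        # validate the switch's HTTPS certificate
    timeout=10,
)
response.raise_for_status()
print(response.json())
```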

NMX-Telemetry (NMX-T) is a robust and scalable telemetry service application designed to collect, aggregate, filter, and stream telemetry data from various sources, such as network devices, compute nodes, and sensors, using multiple protocols. Its primary objective is to provide a centralized platform for ingesting and processing telemetry data, enabling real-time monitoring, analysis, and decision-making across diverse systems and applications.

Key Features

  1. Multi-Source Data Collection: NMX supports ingesting telemetry data from a wide range of sources, including applications, network devices, sensors, and more. It can handle various protocols such as HTTP, gNMI, OTLP, Redfish, Syslog, and custom protocols, ensuring seamless integration with existing infrastructure.

  2. Data Aggregation and Filtering: NMX employs advanced data aggregation and filtering techniques to process incoming telemetry data streams. It can aggregate data from multiple sources, apply custom filters based on predefined rules or conditions, and perform data transformations as needed. This feature enables efficient data management and reduces the overhead of processing irrelevant or redundant data.

  3. Real-Time Streaming: NMX provides real-time streaming capabilities, allowing interested clients to subscribe to specific telemetry data streams (a hedged subscription sketch follows this list). It leverages high-performance messaging protocols, such as gRPC, to ensure low-latency delivery of telemetry data to downstream consumers, such as monitoring tools, analytics platforms, or custom applications.

  4. Scalability and High Availability: NMX is designed to handle large volumes of telemetry data and can scale horizontally to accommodate increasing data loads. It supports load balancing, failover mechanisms, and distributed deployment architectures, ensuring high availability and fault tolerance.

  5. Extensible Plugin Architecture: NMX features a modular and extensible plugin architecture, enabling developers to create custom plugins for data ingestion, processing, and output. This flexibility allows NMX to adapt to new protocols, data formats, and integration requirements as needed.

  6. Security and Access Control: NMX incorporates robust security measures, including authentication, authorization, and encryption mechanisms, ensuring that only authorized clients can access and consume specific telemetry data streams.

  7. Monitoring and Observability: NMX provides comprehensive monitoring and observability capabilities, allowing administrators to track system health, performance metrics, and operational insights. It integrates with popular monitoring tools like Prometheus and Grafana, enabling real-time visualization and alerting.
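
The following sketch illustrates the real-time streaming subscription mentioned in item 3. It assumes a hypothetical TelemetryService with a server-streaming Subscribe RPC generated from a hypothetical telemetry.proto; these names are illustrative only and are not taken from the NMX-T protobuf definitions.

```python
# Hedged sketch of a streaming subscriber. The service, RPC, and message
# names are hypothetical stand-ins generated from a hypothetical
# telemetry.proto, not the actual NMX-T definitions.
import grpc

# Hypothetical modules generated with:
#   python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. telemetry.proto
import telemetry_pb2
import telemetry_pb2_grpc


def subscribe(endpoint: str) -> None:
    # Unencrypted channel for brevity; production deployments would use
    # the TLS/mTLS credentials shown earlier.
    with grpc.insecure_channel(endpoint) as channel:
        stub = telemetry_pb2_grpc.TelemetryServiceStub(channel)
        request = telemetry_pb2.SubscribeRequest(stream_name="port-counters")
        # Server-streaming RPC: the iterator yields samples as they arrive.
        for sample in stub.Subscribe(request):
            print(sample)


if __name__ == "__main__":
    subscribe("switch.example.com:9339")  # placeholder endpoint
```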

In an NVLink5 cluster domain, the cluster function can be enabled on one of the NVL switches. More information can be found in the "Cluster Provisioning Flow" section of the NVIDIA GB200 NVL System Bring-up Guide, which is part of the documentation package.

Partitions enable the creation of groups of GPUs dedicated to a specific tenant or workload, ensuring isolation from a security perspective. This means there is no communication path between GPUs assigned to different partitions. Within a partition, GPUs have multicast groups and shared memory mapping to enhance communication efficiency.

The NMX-Controller offers APIs with the following functionalities:

  • Create a partition

  • Add or remove GPUs from a partition

  • View partition information

  • Delete partitions
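
As a rough sketch of how these partition APIs could be driven over gRPC, the example below creates a partition using hypothetical PartitionService, CreatePartitionRequest, and field names; the actual service and message definitions are provided in the attached protobuf file referenced below.

```python
# Hedged sketch: creating a partition through the NMX-Controller gRPC API.
# PartitionService, CreatePartition, and the message fields are hypothetical
# stand-ins; the real names come from the attached protobuf file.
import grpc

# Hypothetical modules generated from the attached .proto file.
import partition_pb2
import partition_pb2_grpc


def create_partition(endpoint, name, gpu_guids):
    with grpc.insecure_channel(endpoint) as channel:  # use TLS/mTLS in practice
        stub = partition_pb2_grpc.PartitionServiceStub(channel)
        request = partition_pb2.CreatePartitionRequest(
            name=name,
            gpu_guids=gpu_guids,  # GPUs to isolate in this partition
        )
        response = stub.CreatePartition(request)
        print("created partition:", response)


if __name__ == "__main__":
    create_partition("switch.example.com:9339", "tenant-a", ["0x1234", "0x5678"])
```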

For information about the protobuf file, see the attached file. To view its content, use any standard OpenAPI viewer tool.

© Copyright 2025, NVIDIA. Last updated on Apr 23, 2025.