Rack-Level Administration
NICo allows site administrators to manage bare metal for NVIDIA Multi-Node NVLink (MNNVL) systems such as GB200. It provides the rack-level topology, operations, and automation needed for hardware lifecycle management and resource provisioning.
NICo supports the rack component (“tray”) grouping schema with topology and location for rack, NVLink domain, and future larger rack grouping units such as row, scalable unit, and super pods. It manages the following components on GBx00 racks:
NICo provides APIs and automated workflows to manage these components for the following core lifecycle management task categories:
To use rack-level administration features today, a NICo deployment must include NICo Flow, NSM, and PSM, properly configured with the REST API, site agent, temporal workflow, and NICo Core. The following diagram shows the control and data flows within NICo services and dependencies.
The end-to-end service path for NICo rack-level administration passes through the services in the following order:
HW Lifecycle REST API -> NICo Flow -> NICo Core -> Component HW Backend
HW Lifecycle REST API: NICo provides all rack-level administration operations via a set of HW Lifecycle REST APIs. These APIs provide inventory, bringup, validation, power control, and firmware update functionality for racks and trays (rack components). Racks can be referenced by the DCIM-supplied rack ID (e.g., “A12”). Trays can be referenced by rack-based tray addressing (e.g., “Rack A12 Tray Slot 19” or “Rack A12 Compute Tray #3”), by tray serial number, or by BMC MAC address. The APIs also expose the task sequences that are running, pending, or completed on racks and trays, and allow them to be cancelled. Non-rack compute machines are also supported.
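The tray addressing modes described above can be illustrated with a small client-side helper. This is only a sketch: `TrayRef` and `parse_tray_ref` are hypothetical names, not part of the NICo API, and the accepted string format is assumed from the “Rack A12 Tray Slot 19” example. Only the slot form, the BMC MAC form, and the serial-number fallback are handled.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrayRef:
    """Hypothetical tray reference; one addressing mode is populated."""
    rack_id: Optional[str] = None
    slot: Optional[int] = None
    serial: Optional[str] = None
    bmc_mac: Optional[str] = None


# Assumed textual forms, modeled on the examples in this document.
_SLOT_RE = re.compile(r"^Rack (\S+) Tray Slot (\d+)$")
_MAC_RE = re.compile(r"^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$")


def parse_tray_ref(text: str) -> TrayRef:
    """Classify a tray reference as rack/slot, BMC MAC, or serial number."""
    text = text.strip()
    slot_match = _SLOT_RE.match(text)
    if slot_match:
        return TrayRef(rack_id=slot_match.group(1), slot=int(slot_match.group(2)))
    if _MAC_RE.match(text):
        # Normalize MAC addresses to lower case for stable comparison.
        return TrayRef(bmc_mac=text.lower())
    # Anything else is treated as a serial number in this sketch.
    return TrayRef(serial=text)
```

A caller could use the resulting `TrayRef` to decide which identifier to send in a request; how the real REST API encodes these identifiers is not specified here.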
NICo Flow: The REST APIs are served by NICo Flow, a discrete NICo service that uses NICo Core APIs to orchestrate hardware operation sequences and automate hardware operations, with user-defined customization via dynamic rules and policies. Its purpose is to scale both the operation of AI-factory hardware and the evolution of the AI-factory software stack by abstracting the topology and the mechanics of updates, maintenance, and responses to state changes in the datacenter.
NICo Flow contains the software-defined states for managing a site, such as task and workflow states, user-customizable operation sequences and automation, and user-defined policies for HW gating (preventing risky conditions in HW automation), remediation (dealing with broken or degraded HW or services), and leak handling.
NICo Core: NICo Flow calls the NICo Core service to perform the actual HW management operations. NICo Core is the original main NICo service. It provides all the critical features for bare metal management, such as HW inventory and credential management, discovery and ingestion, state machines for power control and firmware update of component HW, managed host and VPC resources, and HW health reports. NICo Core contains all hardware-defined states for component HW and state-based automations.
Backend: Previously, NICo Core accessed machines directly via BMC. Rack-scale systems introduce more types of component HW (compute, switch, and powershelf) as well as more ways to access these components (BMC and NVUE). The complexity of these HW access and management operations is now moved out of NICo Core into the NICo backend, an extensible interface through which different types of hardware can be plugged into and managed by NICo.
Today there is an NVSwitch Manager (NSM) backend and a Powershelf Manager (PSM) backend. Both are called from NICo Core and provide access to the switch and powershelf trays in racks.
In the near future, NVIDIA Rack Manager Service (RMS) will be shipped as a backend for NICo to provide unified access to and management of compute, switch, and powershelf trays, as well as optimized default HW sequencing for rack power control and firmware update.
NICo needs to be loaded with the expected rack equipment inventory to be managed. In most cases, the information should be available from a DCIM service.
The expected inventory often contains the following information:
After NICo imports the expected inventory, it goes through the following workflow to discover and ingest the trays and bring up the rack and NVLink domain.
NICo monitors the discovered and ingested trays and racks. It reports the actual inventory with dynamic information such as power status and installed firmware versions. It also compares the actual inventory against the expected inventory and reports any discrepancies (e.g., a tray installed in the wrong rack or the wrong slot, or a tray with the wrong serial number).
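The comparison between expected and actual inventory can be sketched as follows. This is an illustration under stated assumptions, not NICo's implementation: `TrayRecord` and `find_discrepancies` are hypothetical names, and trays are matched by serial number, which may differ from how NICo matches them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrayRecord:
    """Hypothetical minimal inventory entry for one tray."""
    serial: str
    rack_id: str
    slot: int


def find_discrepancies(expected: List[TrayRecord], actual: List[TrayRecord]) -> List[str]:
    """Return readable discrepancies between expected and actual trays."""
    actual_by_serial = {tray.serial: tray for tray in actual}
    expected_serials = {tray.serial for tray in expected}
    issues = []
    for exp in expected:
        act = actual_by_serial.get(exp.serial)
        if act is None:
            issues.append(f"{exp.serial}: expected in rack {exp.rack_id} slot {exp.slot}, not discovered")
        elif act.rack_id != exp.rack_id:
            issues.append(f"{exp.serial}: wrong rack (expected {exp.rack_id}, found {act.rack_id})")
        elif act.slot != exp.slot:
            issues.append(f"{exp.serial}: wrong slot (expected {exp.slot}, found {act.slot})")
    # Trays that were discovered but never declared in the expected inventory.
    for act in actual:
        if act.serial not in expected_serials:
            issues.append(f"{act.serial}: unexpected serial number in rack {act.rack_id} slot {act.slot}")
    return issues
```

An empty result means every expected tray was discovered in its expected rack and slot, and no undeclared trays were found.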
NICo provides power control for racks as well as arbitrary groupings of trays in racks, following predefined or customized power operation sequences.
Sample Rack Power On Sequence
Sample Rack Power Off Sequence
NICo provides firmware update management for racks as well as arbitrary groups of trays in racks, following predefined or customized firmware update operation sequences.
Sample Rack Firmware Update Sequence
When a rack FW update completes successfully, all compute trays in the rack have the same firmware version, all switch trays in the rack have the same firmware version, and the compute tray firmware and switch tray firmware are expected to be compatible with each other.
Currently, NICo only supports GB200 NVL72 racks, where a rack and an NVLink domain overlap exactly. Hence, domain endpoints are currently not exposed and rack endpoints should be used. This will change in the future.
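The uniformity condition after a rack firmware update can be checked with a few lines of code. The sketch below assumes each tray is a plain dictionary with hypothetical `type` and `firmware` keys; it is not the NICo inventory schema. It does not check compute/switch compatibility, because that requires a compatibility matrix that is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def firmware_versions_by_type(trays: Iterable[dict]) -> Dict[str, Set[str]]:
    """Group the distinct firmware versions seen for each tray type."""
    versions = defaultdict(set)
    for tray in trays:
        versions[tray["type"]].add(tray["firmware"])
    return dict(versions)


def is_firmware_uniform(trays: Iterable[dict], tray_types=("compute", "switch")) -> bool:
    """True when every listed tray type reports exactly one firmware version.

    A tray type with no trays at all counts as non-uniform in this sketch.
    """
    versions = firmware_versions_by_type(trays)
    return all(len(versions.get(tray_type, set())) == 1 for tray_type in tray_types)
```

For example, a rack whose compute trays all report one version and whose switch trays all report one version passes the check, while a single compute tray left on an older version makes it fail.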
on, off, cycle, forceoff, forcecycle.
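Assuming the names listed above are the accepted power actions, a client could validate user input before sending a power control request. `PowerAction` and `parse_power_action` are hypothetical helpers for illustration; they are not part of the NICo API.

```python
from enum import Enum


class PowerAction(str, Enum):
    """Power action names as listed in this document."""
    ON = "on"
    OFF = "off"
    CYCLE = "cycle"
    FORCEOFF = "forceoff"
    FORCECYCLE = "forcecycle"


def parse_power_action(value: str) -> PowerAction:
    """Normalize and validate a power action name, raising ValueError if unknown."""
    try:
        return PowerAction(value.strip().lower())
    except ValueError:
        allowed = ", ".join(action.value for action in PowerAction)
        raise ValueError(
            f"unknown power action {value!r}; expected one of: {allowed}"
        ) from None
```

Rejecting unknown actions on the client side gives a clearer error message than a failed request would, although the server remains the authority on which actions are valid.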