NVIDIA Infra Controller (NICo) is a Bare-Metal-As-A-Service (BMaaS) solution. It manages the lifecycle of hosts, including user OS installation, host cleanup, validation tests, and automated software updates. It also provides host monitoring and virtualized private networking capabilities on Ethernet and InfiniBand.
To enable virtual private networks (overlay networks), NICo uses DPUs as the primary Ethernet interfaces of hosts.
This document describes how NICo controls DPUs in order to achieve this behavior.
The following guiding principles apply to DPU configuration:
DPUs are configured by the NICo site controller via a declarative and stateless mechanism:
- The agent running on the DPU (dpu-agent) requests the current desired configuration via the GetManagedHostNetworkConfig gRPC API call. Example data of the returned configuration is provided in the Appendix below.
- [...] nv config apply).
- dpu-agent also reconfigures a DHCP server running on the DPU, which responds to DHCP requests from the attached host.
- dpu-agent implements health checks that supervise whether the desired configurations are in place and check whether the DPU is healthy (e.g. the agent continuously checks whether the DPU has established BGP peering with TORs and route servers according to the desired configuration).
- dpu-agent uses the RecordDpuNetworkStatus gRPC API call to report back to the site control plane whether the desired configurations are applied, and whether all health checks are succeeding.
- [...] PostConfigCheckWait alert. This gives the DPU some time to monitor the stability and health of the new configuration before the site controller assumes that the new configuration is fully applied and operational.

NICo uses versioned immutable configuration data in order to detect whether any intended changes have not yet been deployed:
- [...] RecordDpuNetworkStatus call.
- [...] Provisioning/Configuring/Terminating to the administrator.

The DPU configuration that is applied can be understood as coming from two different sources:
In order to separate these concerns, NICo internally uses two different configuration data structs and associated version numbers (instance_network_config versus managedhost_network_config). It can thereby distinguish whether a setting that is required by the tenant has not been applied, compared to whether a setting that is required by the control plane has not been applied.
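The version comparison described above can be illustrated with a minimal sketch. It is not NICo's implementation: the `ConfigVersions` type, the `pending_changes` helper, and the plain-string version values are assumptions made for illustration. Only the two field names, instance_network_config and managedhost_network_config, come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConfigVersions:
    """Versions of the two configuration sources (hypothetical layout)."""
    instance_network_config: str      # settings required by the tenant
    managedhost_network_config: str   # settings required by the control plane


def pending_changes(desired: ConfigVersions,
                    reported: Optional[ConfigVersions]) -> list:
    """Return the names of the sources whose desired version the DPU
    has not yet reported as applied."""
    if reported is None:
        # No status report received yet, so nothing is confirmed.
        return ["instance_network_config", "managedhost_network_config"]
    pending = []
    if desired.instance_network_config != reported.instance_network_config:
        pending.append("instance_network_config")
    if desired.managedhost_network_config != reported.managedhost_network_config:
        pending.append("managedhost_network_config")
    return pending


# Example: the tenant-facing config changed, the control-plane config did not.
desired = ConfigVersions("v2", "v5")
reported = ConfigVersions("v1", "v5")
print(pending_changes(desired, reported))  # ['instance_network_config']
```

Keeping the two versions separate is what lets the caller tell a tenant-requested change that is still pending apart from a control-plane change that is still pending.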
Some example workflows that lead to updating configurations are shown in the following diagram:
One important requirement for NICo is that Hosts/DPUs that are not confirmed to be part of the site are isolated from the remaining hosts on the site.
A DPU might get isolated from the cluster without the DPU software stack being erased (e.g. by site operators removing the knowledge of the DPU from the site database).
In order to satisfy the isolation requirements and to prevent unknown DPUs on the site from using resources (e.g. IPs on overlay networks), an additional mechanism is implemented: If the GetManagedHostNetworkConfig gRPC API call returns a NotFound error, the dpu-agent will configure the DPU/Host into an isolated mode.
The isolated configuration is only applied when the site controller is unaware of the DPU and its expected configuration. In case of any other errors (for example, intermittent communication issues), the DPU retains its last known configuration.
Note: This is not the only mechanism that NICo utilizes to provide security on the networking layer. In addition to this, ACLs and routing table separation are used to implement secure virtual private networks (VPCs).
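The isolation rule described above (NotFound leads to isolated mode, any other error keeps the last known configuration) can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not the dpu-agent code: `NotFoundError`, `ISOLATED_CONFIG`, `select_config`, and the dictionary-shaped configuration are hypothetical stand-ins, and a real agent would inspect the gRPC status code of the GetManagedHostNetworkConfig call instead.

```python
from typing import Callable, Optional


class NotFoundError(Exception):
    """Stand-in for a gRPC NotFound status (hypothetical type)."""


# Placeholder for whatever the agent applies to isolate the host (hypothetical).
ISOLATED_CONFIG = {"mode": "isolated"}


def select_config(fetch: Callable[[], dict],
                  last_known: Optional[dict]) -> Optional[dict]:
    """Decide which configuration the DPU should run after one poll.

    `fetch` stands in for the GetManagedHostNetworkConfig call.
    """
    try:
        return fetch()                # normal case: use the desired config
    except NotFoundError:
        return ISOLATED_CONFIG        # site controller does not know this DPU
    except Exception:
        return last_known             # e.g. transient network error: keep as-is


def unknown_dpu() -> dict:
    raise NotFoundError("DPU not present in the site database")


def flaky_network() -> dict:
    raise ConnectionError("site controller unreachable")


current = {"mode": "tenant", "vpc": "example"}
print(select_config(unknown_dpu, current))    # {'mode': 'isolated'}
print(select_config(flaky_network, current))  # {'mode': 'tenant', 'vpc': 'example'}
```

Treating only NotFound as a signal to isolate means a brief outage of the site controller does not disconnect a host that is legitimately in use.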