Ingesting Hosts

Once you have NVIDIA Infra Controller (NICo) up and running, you can begin ingesting machines.

Prerequisites

Ensure the following prerequisites are met before ingesting machines:

  1. You have the carbide-admin-cli command available. You can compile it from source, use the pre-compiled binary, or run a containerized version.

  2. You can access the NICo site using the carbide-admin-cli.

  3. The NICo API service is running at IP address NICo_API_EXTERNAL. It is recommended that you add this IP address to your trusted list.

  4. DHCP requests from all managed host IPMI networks have been forwarded to the NICo service running at IP address NICo_DHCP_EXTERNAL.

  5. You have the following information for all hosts that need to be ingested:

    • The MAC address of the host BMC
    • The chassis serial number
    • The host BMC username (typically this is the factory default username)
    • The host BMC password (typically this is the factory default password)

Update Site

NICo requires knowledge of the current and desired BMC and UEFI credentials for hosts and DPUs. When ingesting a host, NICo resets the current credentials on the BMC and UEFI to the desired credentials. You can use these credentials when accessing the host or DPU BMC yourself, and NICo uses them for its automated processes.

The required credentials include the following:

  • Host BMC Credential
  • DPU BMC Credential
  • Host UEFI password
  • DPU UEFI password

Note: The following commands use the <api-url> placeholder, which is typically the following:

https://api-<ENVIRONMENT_NAME>.<SITE_DOMAIN_NAME> --forge-root-ca-path <FORGE_ROOT_CA_PATH> --client-cert-path <CLIENT_CERT_PATH> --client-key-path <CLIENT_KEY_PATH>
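
When scripting the commands in this guide, it can help to assemble the `<api-url>` expansion once and reuse it. The following is a minimal sketch, not part of NICo: the function names, environment name, domain, and file paths are hypothetical, and only the flags shown in the note above are used.

```python
from typing import List


def api_url_args(environment: str, site_domain: str, root_ca: str,
                 client_cert: str, client_key: str) -> List[str]:
    """Return the tokens that the <api-url> placeholder expands to."""
    return [
        f"https://api-{environment}.{site_domain}",
        "--forge-root-ca-path", root_ca,
        "--client-cert-path", client_cert,
        "--client-key-path", client_key,
    ]


def build_command(url_args: List[str], *subcommand: str) -> List[str]:
    """Assemble `carbide-admin-cli -c <api-url> <subcommand...>` as an argv list."""
    return ["carbide-admin-cli", "-c", *url_args, *subcommand]


# Hypothetical environment, domain, and file paths, for illustration only.
url_args = api_url_args("prod", "example.com", "/etc/nico/root-ca.pem",
                        "/etc/nico/client.pem", "/etc/nico/client.key")
command = build_command(url_args, "managed-host", "show", "--all")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting problems with passwords and paths.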

Update Host and DPU BMC Password

Run this command to update the desired Host and DPU BMC password:

$ carbide-admin-cli -c <api-url> credential add-bmc --kind=site-wide-root --password='x'

Update Host UEFI Password

Run this command to generate the desired host UEFI password:

$ carbide-admin-cli -c <api-url> host generate-host-uefi-password

Run this command to update the host UEFI password:

$ carbide-admin-cli -c <api-url> credential add-uefi --kind=host --password='<password-generated-in-previous-step>'

Run this command to update the DPU UEFI password:

$ carbide-admin-cli -c <api-url> credential add-uefi --kind=dpu --password='x'

Add Expected Machines Table

NICo needs to know the factory default credentials for each BMC, which are expressed as a JSON table of “Expected Machines”. The serial number is used to verify that the BMC at the listed MAC address reports the expected chassis serial number.

Prepare an expected_machines.json file as follows:

{
  "expected_machines": [
    {
      "bmc_mac_address": "C4:5A:B1:C8:38:0D",
      "bmc_username": "root",
      "bmc_password": "default-password1",
      "chassis_serial_number": "SERIAL-1"
    },
    {
      "bmc_mac_address": "C4:5A:FF:FF:FF:FF",
      "bmc_username": "root",
      "bmc_password": "default-password2",
      "chassis_serial_number": "SERIAL-2"
    }
  ]
}

Only servers listed in this table will be ingested, so you must include all servers in this file.
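
Because a missing or malformed entry means a server is not ingested, a local sanity check before upload can catch mistakes early. The following is a minimal sketch, assuming only the four fields shown above are required. The `find_problems` helper is hypothetical, and the duplicate-MAC check is a local convention of this sketch rather than documented NICo behavior.

```python
import re
from typing import Any, Dict, List

REQUIRED_FIELDS = (
    "bmc_mac_address",
    "bmc_username",
    "bmc_password",
    "chassis_serial_number",
)
# Six colon-separated hex octets, as in the example file above.
MAC_PATTERN = re.compile(r"[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5}")


def find_problems(document: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    entries = document.get("expected_machines")
    if not isinstance(entries, list):
        return ['top-level "expected_machines" must be a list']
    problems: List[str] = []
    seen_macs = set()
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: not a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            if not isinstance(value, str) or not value:
                problems.append(f"entry {index}: missing or empty {field}")
        mac = entry.get("bmc_mac_address")
        if isinstance(mac, str) and mac:
            if not MAC_PATTERN.fullmatch(mac):
                problems.append(f"entry {index}: malformed MAC {mac!r}")
            elif mac.upper() in seen_macs:
                problems.append(f"entry {index}: duplicate MAC {mac}")
            seen_macs.add(mac.upper())
    return problems
```

To check a real file, load it with `json.load` and pass the resulting dictionary to `find_problems`.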

Optional Per-Host Fields

Each entry supports additional optional fields:

  • host_lifecycle_profile (object): Per-host profile for settings that affect state-machine progression. Future per-host knobs should be added here.

    • disable_lockdown (bool, default false): When true, the state machine does not lock down the host during lifecycle management. This is useful for automation workflows that need lockdown persistently disabled.

      {
        "bmc_mac_address": "C4:5A:B1:C8:38:0D",
        "bmc_username": "root",
        "bmc_password": "default-password1",
        "chassis_serial_number": "SERIAL-1",
        "host_lifecycle_profile": {
          "disable_lockdown": true
        }
      }

  • dpf_enabled (bool): Enable/disable DPF for this host.

  • dpu_mode ("dpu_mode" | "nic_mode" | "no_dpu"): Per-host DPU operating mode.

  • bmc_retain_credentials (bool): Skip BMC password rotation.

  • default_pause_ingestion_and_poweron (bool): Pause ingestion and power-on for this host.

  • bmc_ip_address (string): Static BMC IP (pre-allocates a machine interface).
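
When entries are generated from an inventory rather than written by hand, a small builder keeps the optional fields consistent. The following is a minimal sketch using only the field names listed above. The `make_entry` helper is hypothetical, and the checks on `dpu_mode` and `bmc_ip_address` are local validation in this sketch, not NICo's own.

```python
import ipaddress
from typing import Any, Dict, Optional

# Allowed values as listed for the dpu_mode field above.
DPU_MODES = ("dpu_mode", "nic_mode", "no_dpu")


def make_entry(
    bmc_mac_address: str,
    bmc_username: str,
    bmc_password: str,
    chassis_serial_number: str,
    *,
    disable_lockdown: Optional[bool] = None,
    dpf_enabled: Optional[bool] = None,
    dpu_mode: Optional[str] = None,
    bmc_retain_credentials: Optional[bool] = None,
    default_pause_ingestion_and_poweron: Optional[bool] = None,
    bmc_ip_address: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one Expected Machines entry, leaving out optional fields that are unset."""
    entry: Dict[str, Any] = {
        "bmc_mac_address": bmc_mac_address,
        "bmc_username": bmc_username,
        "bmc_password": bmc_password,
        "chassis_serial_number": chassis_serial_number,
    }
    if disable_lockdown is not None:
        # disable_lockdown is nested under host_lifecycle_profile.
        entry["host_lifecycle_profile"] = {"disable_lockdown": disable_lockdown}
    if dpu_mode is not None:
        if dpu_mode not in DPU_MODES:
            raise ValueError(f"dpu_mode must be one of {DPU_MODES}, got {dpu_mode!r}")
        entry["dpu_mode"] = dpu_mode
    if bmc_ip_address is not None:
        ipaddress.ip_address(bmc_ip_address)  # raises ValueError if malformed
        entry["bmc_ip_address"] = bmc_ip_address
    optional_flags = {
        "dpf_enabled": dpf_enabled,
        "bmc_retain_credentials": bmc_retain_credentials,
        "default_pause_ingestion_and_poweron": default_pause_ingestion_and_poweron,
    }
    for name, value in optional_flags.items():
        if value is not None:
            entry[name] = value
    return entry
```

Wrapping the resulting entries as `{"expected_machines": [...]}` and writing them with `json.dump` produces a file in the shape shown earlier.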

When the file is ready, upload it to the site with the following command:

$ carbide-admin-cli -c <api-url> em replace-all --filename expected_machines.json

Approve all Machines for Ingestion

NICo uses Measured Boot with the on-host Trusted Platform Module (TPM) v2.0 to enforce the cryptographic identity of the host hardware and firmware. The following command configures NICo to approve all pending machines based on Platform Configuration Registers (PCRs) 0, 3, 5, and 6.

$ carbide-admin-cli -c <api-url> mb site trusted-machine approve \* persist --pcr-registers="0,3,5,6"
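
For background on why PCR values identify a boot chain, the sketch below replays the standard TPM 2.0 extend operation for the SHA-256 bank, where each new value is the hash of the old value concatenated with a measurement digest. It is an illustration of the general mechanism only, not NICo's attestation code. The helper names are hypothetical, and the all-zero starting value is an assumption that holds for typical boot-time PCRs.

```python
import hashlib
from typing import Iterable, List

PCR_SIZE = 32  # bytes in one SHA-256 bank PCR


def extend(pcr_value: bytes, digest: bytes) -> bytes:
    """TPM 2.0 extend for the SHA-256 bank: new PCR = SHA256(old PCR || digest)."""
    return hashlib.sha256(pcr_value + digest).digest()


def replay(components: Iterable[bytes]) -> bytes:
    """Fold measured components into one PCR value, starting from all zeros."""
    pcr_value = bytes(PCR_SIZE)
    for component in components:
        pcr_value = extend(pcr_value, hashlib.sha256(component).digest())
    return pcr_value


def parse_pcr_registers(value: str) -> List[int]:
    """Parse a comma-separated list such as the --pcr-registers value "0,3,5,6"."""
    return [int(part) for part in value.split(",")]
```

Because each value depends on every earlier measurement and on their order, two hosts only share a PCR value if they measured the same components in the same sequence.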

What Happens After Approval: Ingestion to Ready

Once machines are approved, NICo’s Site Explorer begins automatically ingesting them. No further operator action is required under normal circumstances.

The high-level flow is:

  1. DHCP discovery: the host BMC sends a DHCP request; NICo assigns an IP and Site Explorer probes the BMC over Redfish to collect a full inventory. Site Explorer authenticates using the factory default credentials from the expected machines table, then rotates the BMC password to the site-wide credential. See Redfish Workflow for details.
  2. Preingestion: before pairing, NICo runs a preingestion state machine against each discovered BMC endpoint (both host and DPU). It checks that the BMC clock is within an acceptable drift of the site time, resetting the BMC if not. For host endpoints, firmware components are upgraded if they are below the minimum version required for ingestion.
  3. DPU-host pairing: Site Explorer correlates host and DPU serial numbers to form matched pairs. Once all DPUs are validated and matched, the ManagedHost object is created and the state machine starts.
  4. DpuDiscoveringState / DPUInit: NICo configures Secure Boot on the DPU, installs the DPU OS (BFB image), and power-cycles the host to apply the new DPU configuration.
  5. HostInit: NICo configures BIOS, sets the host boot order, optionally collects TPM attestation measurements, waits for hardware discovery via the scout agent, and applies UEFI lockdown. When the scout agent reports back, NICo replaces the temporary predicted host ID (prefix fm100p) with a stable host ID (prefix fm100h) derived from the host’s own DMI serial data or TPM certificate.
  6. BomValidating / Validation: NICo validates the discovered hardware against the expected SKU. If hardware validation is enabled, the host is rebooted and tested before proceeding.
  7. Ready: the host transitions through HostInit/Discovered and enters the available pool, ready for an instance to be assigned to it.
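
The flow above can be captured as an ordered list for use in monitoring scripts. The following is a minimal sketch: the stage labels are taken from the numbered list and are not necessarily the exact state names NICo reports, and both helper functions are hypothetical. The host ID prefixes are the ones described in the HostInit step.

```python
from typing import Tuple

# Labels follow the numbered list above; they are names for this sketch,
# not necessarily the exact state names reported by NICo.
STAGES: Tuple[str, ...] = (
    "DHCP discovery",
    "Preingestion",
    "DPU-host pairing",
    "DPUInit",
    "HostInit",
    "Validation",
    "Ready",
)


def remaining_stages(current: str) -> Tuple[str, ...]:
    """Return the stages that still follow `current` in the high-level flow."""
    return STAGES[STAGES.index(current) + 1:]


def host_id_kind(machine_id: str) -> str:
    """Classify a host ID by the prefixes described in the HostInit step."""
    if machine_id.startswith("fm100p"):
        return "predicted"
    if machine_id.startswith("fm100h"):
        return "stable"
    return "other"
```

A host whose ID still starts with fm100p has not yet reported back through the scout agent, which narrows down where in HostInit it is waiting.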

For the complete state transitions, including substates, retry logic, and reprovision paths, see the Managed Host State Diagrams.


Troubleshooting: Host and DPU Ingestion Issues

When a machine is not being created or is stuck in a pre-Ready state, carbide-api logs are the primary investigation tool. Filtering logs by the host BMC IP or DPU BMC IP is often the fastest way to understand where ingestion or pairing is failing.
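
Filtering by IP with a plain substring search can produce false hits, for example 10.0.0.1 also matching 10.0.0.12. The following is a minimal sketch of a stricter filter. It makes no assumption about the carbide-api log format beyond IP addresses appearing as text, and the `lines_mentioning` helper is hypothetical.

```python
import re
from typing import Iterable, List


def lines_mentioning(lines: Iterable[str], *ips: str) -> List[str]:
    """Return the lines that contain any of the given IPv4 addresses as whole addresses."""
    if not ips:
        raise ValueError("at least one IP address is required")
    alternatives = "|".join(re.escape(ip) for ip in ips)
    # Reject matches that are part of a longer dotted number on either side.
    pattern = re.compile(r"(?<![\d.])(?:" + alternatives + r")(?!\d|\.\d)")
    return [line for line in lines if pattern.search(line)]
```

Passing both the host BMC IP and the DPU BMC IP in one call returns the interleaved lines for the pair, which is usually what is needed when investigating a pairing failure.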

You can check the current detailed state of any managed host using:

$ carbide-admin-cli -c <api-url> managed-host show --all
$ carbide-admin-cli -c <api-url> managed-host show <machine-id>

For a full guide on diagnosing stuck objects, including how to use the NICo Grafana dashboard and how to read state handler error logs, see Stuck Objects Runbook.

Endpoint Exploration Errors

Before pairing can occur, Site Explorer must successfully explore each BMC endpoint. Exploration failures are logged and surfaced in carbide-api logs and the NICo Grafana dashboard. Common error types:

| Error type | Likely cause |
| --- | --- |
| ConnectionTimeout | BMC unreachable on the OOB network; check cabling and DHCP routing |
| ConnectionRefused | No Redfish API exposed at the target IP; the DPU admin IP is often mistakenly probed here |
| Unauthorized / AvoidLockout | BMC credentials do not match the expected machines table or site vault; see Adding New Machines: BMC Password Requirements |
| MissingCredentials | Credentials not yet available in vault; check that site-wide BMC credentials are configured |
| UnsupportedVendor | BMC vendor is not supported by this version of NICo |
| RedfishError | Unexpected Redfish response; check BMC firmware version and carbide-api logs for the full response body |
| InvalidDpuRedfishBiosResponse | DPU BIOS endpoint returned an unexpected response; the DPU may need a fresh OS install |

For a complete reference of all Redfish endpoints and required response fields, see Redfish Endpoints Reference.

Common Blockers During Host + DPU Pairing

The following are the conditions in which Site Explorer cannot complete pairing and logs a host_dpu_pairing_blockers_count metric. Each requires operator investigation.

| Metric label | Description | Action |
| --- | --- | --- |
| dpu_nic_mode_unknown | DPU mode cannot be determined; DPU BMC firmware is likely too old. | Install a fresh DPU OS (which also upgrades firmware); see Installing a Fresh DPU OS below |
| dpu_pf0_mac_missing | DPU is in DPU mode but its pf0 MAC address is not retrievable. | Install a fresh DPU OS; see Installing a Fresh DPU OS below |
| manual_power_cycle_required | DPU mode was changed but the host vendor does not support automated power cycling. | Manually power-cycle the host at the data center level |
| host_system_report_missing | Host BMC Redfish returned no valid system report; likely a BMC firmware issue or transient error. | Check carbide-api logs for the host BMC IP |
| no_dpu_reported_by_host | Host BMC reports no BlueField PCIe devices. | Check DPU seating and host BMC firmware version |
| boot_interface_mac_mismatch | Host boot MAC does not match the pf0 MAC of any discovered DPU. | Check exploration reports and carbide-api logs for both the host and DPU BMC IPs |
| viking_cpld_version_issue | NVIDIA Viking (DGX): CPLDMB_0 firmware below minimum required version (0.2.1.9). | Contact the data center team for a full DC power cycle |
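
For alerting or triage scripts, the table above can be turned into a lookup from metric label to recommended action. The following is a minimal sketch. It assumes the per-label counts of host_dpu_pairing_blockers_count have already been collected into a dictionary by some means not shown here, and the `next_actions` helper is hypothetical.

```python
from typing import Dict, List

# Labels and actions copied from the table above.
PAIRING_BLOCKER_ACTIONS: Dict[str, str] = {
    "dpu_nic_mode_unknown": "Install a fresh DPU OS (which also upgrades firmware)",
    "dpu_pf0_mac_missing": "Install a fresh DPU OS",
    "manual_power_cycle_required": "Manually power-cycle the host at the data center level",
    "host_system_report_missing": "Check carbide-api logs for the host BMC IP",
    "no_dpu_reported_by_host": "Check DPU seating and host BMC firmware version",
    "boot_interface_mac_mismatch": "Check exploration reports and carbide-api logs for both BMC IPs",
    "viking_cpld_version_issue": "Contact the data center team for a full DC power cycle",
}


def next_actions(counts: Dict[str, int]) -> List[str]:
    """Return one "label (count): action" line per blocker with a non-zero count."""
    lines: List[str] = []
    for label, count in sorted(counts.items()):
        if count <= 0:
            continue
        action = PAIRING_BLOCKER_ACTIONS.get(
            label, "Unrecognized label; check carbide-api logs")
        lines.append(f"{label} ({count}): {action}")
    return lines
```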

For DPU pairing failures, including dpu_pf0_mac_missing and cases where the DPU is in an unknown or corrupt state, a common fix is to install a vanilla pre-ingestion BFB image via rshim to return the DPU to a clean state. This runs as part of the preingestion state machine:

$ carbide-admin-cli -c <api-url> site-explorer copy-bfb-to-dpu-rshim \
    --host-bmc-ip <host-bmc-ip> \
    <dpu-bmc-ip>

This command copies the NICo BFB image directly to the DPU via rshim (SSH to the DPU BMC) and triggers a DPU reboot to complete the installation. After the BFB is installed, NICo power-cycles the host automatically to apply the new DPU image.

Note: The --host-bmc-ip flag is required. NICo uses it to power-cycle the host after the BFB copy completes. Use --pre-copy-powercycle if the host needs to release rshim control to the DPU BMC before the copy can start.

For additional DPU-specific troubleshooting including Secure Boot configuration, BMC password resets, and firmware version checks, see Adding New Machines to an Existing Site.


Managing the Expected Machines Table

The expected machines table in the carbide-api database holds the following fields per host:

  • Chassis Serial Number
  • BMC MAC Address
  • BMC factory default username (set by the manufacturer)
  • BMC factory default password (set by the manufacturer)
  • DPU chassis serial number (only needed for DGX-H100 or other machines where the NetworkAdapter serial number is not available in the host Redfish)

Individual operations

Use carbide-admin-cli to operate on individual entries:

$ carbide-admin-cli -c <api-url> em update ...
$ carbide-admin-cli -c <api-url> em add ...
$ carbide-admin-cli -c <api-url> em delete ...

Bulk operations

Replace all entries from a JSON file:

$ carbide-admin-cli -c <api-url> em replace-all --filename expected_machines.json

Erase all entries:

$ carbide-admin-cli -c <api-url> em erase

Export

Export the current table as JSON:

$ carbide-admin-cli -c <api-url> -f json em show
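
An export can be compared against the local file to confirm that the site holds the entries you expect. The following is a minimal sketch. It assumes the exported JSON uses the same top-level expected_machines list and bmc_mac_address field as the upload file, which is not confirmed here, and the helper names are hypothetical.

```python
from typing import Any, Dict, Set, Tuple


def macs(document: Dict[str, Any]) -> Set[str]:
    """Collect BMC MAC addresses, upper-cased so that case differences are ignored."""
    return {
        entry["bmc_mac_address"].upper()
        for entry in document.get("expected_machines", [])
    }


def diff_by_mac(local: Dict[str, Any],
                exported: Dict[str, Any]) -> Tuple[Set[str], Set[str]]:
    """Return (MACs only in the local file, MACs only in the exported table)."""
    local_macs, exported_macs = macs(local), macs(exported)
    return local_macs - exported_macs, exported_macs - local_macs
```

An empty pair of sets means both sides list the same BMCs. A MAC that appears only in the local file has not been uploaded, so that server will not be ingested.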