Configuring InfiniBand Partitions
This page is the Day-1 configuration guide for InfiniBand partitions in NICo. It describes how an operator points NICo at UFM, how partitions are allocated and assigned to tenant instances, and how to verify that a host has ended up in the partitions it should. Tenant isolation is a property of how partitions are assigned — for the cross-fabric isolation picture, see Network Isolation.
The InfiniBand fabric itself — UFM installation, gv.cfg / opensm.conf
tuning, M_Key and SA_Key configuration, and static topology files — is
covered separately in the InfiniBand Setup
runbook. That runbook is a prerequisite for
this page: NICo’s partition guarantees rest on a properly hardened UFM and
subnet manager.
Related pages
- Network Isolation — cross-fabric isolation overview; explains how IB partitions fit alongside Ethernet and NVLink
- InfiniBand Setup Runbook — UFM and OpenSM hardening (prerequisite)
- InfiniBand NIC and Port Selection — how NICo picks which host NICs / ports are managed
- Networking Integrations — shared architectural patterns across all three fabrics
The Partition Model
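InfiniBand partitioning in NICo is built on the InfiniBand-native P_Key mechanism enforced by the subnet manager. The operator-facing chain runs from tenant instance to IB interface to IbPartition to P_Key on UFM, with enforcement at the subnet manager.
Read it top to bottom: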
- A tenant instance has one or more IB interfaces, one per InfiniBand port on the host the instance is allocated to.
- Each IB interface references exactly one IbPartition by ID; this is the NICo object the tenant manipulates.
- The IbPartition corresponds to a single P_Key on UFM. NICo either allocates the P_Key value from a configured range or honours an explicit P_Key the operator supplied.
- Enforcement happens at the subnet manager. A host port that is not a member of a P_Key cannot exchange InfiniBand traffic with any other member of that P_Key, regardless of physical connectivity. This is fabric-side, not host-side: a misconfigured host cannot bypass it.
Two instances in different P_Keys cannot send any IB traffic to each other; two instances in the same P_Key can. There is no operator-visible “peering” concept for InfiniBand the way there is for Ethernet VPCs — sharing requires placing both instances’ interfaces in the same partition.
What Lives Where
NICo treats UFM as the authoritative source for observed fabric state. It does not cache UFM partition membership separately from what the monitor last read. This means that direct out-of-band changes to UFM (an operator editing partitions in the UFM UI, for example) will be detected on the next monitor iteration and reconciled back to NICo’s intended state.
Operations: Who Does What
InfiniBand splits cleanly between operator site setup and tenant partition management. See Network Isolation → Who configures what, and how for the role and interface model.
The UFM-facing setup (the first two rows) is the operator’s responsibility and
is described in Configuring NICo to Talk to UFM.
Everything a tenant does — creating partitions and attaching interfaces — goes
through the REST API or nicocli; the gRPC nico-admin-cli rows are operator
triage and break-glass paths that the REST API does not expose.
Configuring NICo to Talk to UFM
Two TOML blocks are involved. Both live in the API server’s config file.
[ib_fabrics.<name>] — Endpoints and P_Key Pool
Each named entry defines one InfiniBand fabric that NICo manages.
Fields:
P_Key ranges may be extended in future restarts but never shrunk: removing or narrowing a range under live tenants would orphan allocated partitions. Plan the pool with headroom.
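The append-only rule above can be checked mechanically before a restart. The following is a minimal sketch, not NICo code: it assumes ranges are modelled as inclusive (start, end) integer pairs, and the function names are hypothetical. It reports whether a proposed set of ranges still covers every value the previous set covered.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int]  # inclusive (start, end); an illustrative representation


def merge_ranges(ranges: Iterable[Range]) -> List[Range]:
    """Merge overlapping or directly adjacent inclusive ranges."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_safe_extension(old: Iterable[Range], new: Iterable[Range]) -> bool:
    """True if every P_Key covered by `old` is still covered by `new`."""
    merged_new = merge_ranges(new)
    return all(
        any(start <= old_start and old_end <= end for start, end in merged_new)
        for old_start, old_end in old
    )
```

For example, growing `(0x100, 0x1ff)` to `(0x100, 0x2ff)` passes the check, while narrowing it to `(0x100, 0x17f)` does not.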
[ib_config] — Fabric Toggles
UFM credentials
UFM API credentials are not stored in TOML. They are read from the configured secrets backend (Vault under standard deployments) by the UFM client during initialisation. Rotate them at the secrets backend; NICo picks up the new value on its next client re-initialisation.
P_Key Allocation
When a tenant creates an InfiniBand partition (REST
POST …/nico/infiniband-partition, or nicocli infiniband-partition create),
the request may include or omit a desired P_Key:
- pkey omitted. NICo allocates a free P_Key from the configured pool ranges and returns it on the response. This is the normal tenant flow.
- pkey specified (hex, for example "0x76b"). NICo accepts the request only if the requested value falls inside a pool range that is not marked as auto-assigned and is not already in use. Any other request is rejected.
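The two request shapes above can be sketched as a small in-memory allocator. This is an illustration under stated assumptions, not NICo's implementation: the `PkeyRange` and `PkeyPool` types and the `auto_assign` flag are hypothetical names, and the sketch assumes that automatic allocation draws the lowest free value from ranges flagged as auto-assigned, which the text above does not specify.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class PkeyRange:
    start: int          # inclusive
    end: int            # inclusive
    auto_assign: bool   # hypothetical flag: range is reserved for automatic allocation


class PkeyPool:
    def __init__(self, ranges: Iterable[PkeyRange]) -> None:
        self.ranges = list(ranges)
        self.in_use: Set[int] = set()

    def allocate(self, requested: Optional[str] = None) -> int:
        """Allocate a P_Key; `requested` is an optional hex string such as "0x76b"."""
        if requested is None:
            # Assumption: auto-allocation takes the lowest free value in auto-assigned ranges.
            for r in self.ranges:
                if not r.auto_assign:
                    continue
                for value in range(r.start, r.end + 1):
                    if value not in self.in_use:
                        self.in_use.add(value)
                        return value
            raise RuntimeError("no free P_Key left in the auto-assigned ranges")

        value = int(requested, 16)
        in_explicit_range = any(
            r.start <= value <= r.end and not r.auto_assign for r in self.ranges
        )
        if not in_explicit_range or value in self.in_use:
            raise ValueError(f"P_Key {requested} is outside an explicit range or already in use")
        self.in_use.add(value)
        return value
```

Keeping explicit and automatic ranges disjoint means an explicit request can never collide with a value the allocator might hand out later.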
The model’s IbPartition object retains the allocated P_Key for the
lifetime of the partition. There is no “renumber” operation; to change a
P_Key, delete the partition and create a new one (which the tenant flow
handles via instance reconfiguration).
Updating a partition (nicocli infiniband-partition update) is restricted to
fields other than P_Key (name, MTU, rate limit, and so on). Deleting one
(nicocli infiniband-partition delete) requires that no instance still
references the partition.
Membership: Full vs Limited
UFM distinguishes full and limited P_Key membership. Full members can communicate with both full and limited members of the same P_Key; limited members can only talk to full members.
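The membership rule can be written down as a small predicate. The sketch below is illustrative only: `Membership` and `PortBinding` are hypothetical types, not NICo's IbPortMembership enum, and the function simply encodes the rule stated above (same P_Key, and at least one side is a full member).

```python
from dataclasses import dataclass
from enum import Enum


class Membership(Enum):
    FULL = "full"
    LIMITED = "limited"


@dataclass(frozen=True)
class PortBinding:
    """One port's membership in one partition (illustrative type)."""
    guid: str
    pkey: int
    membership: Membership


def can_communicate(a: PortBinding, b: PortBinding) -> bool:
    """Return True if the two bindings may exchange InfiniBand traffic."""
    if a.pkey != b.pkey:
        # Different partitions never communicate.
        return False
    # Within a partition, two limited members cannot talk to each other.
    return Membership.FULL in (a.membership, b.membership)
```

Two limited members of the same P_Key are the only same-partition pair that is blocked, which is why restricting the default partition to limited membership prevents hosts from reaching each other through it.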
NICo’s posture:
- The IbPortMembership enum is read-only from NICo’s perspective. The monitor records whatever UFM reports for each port-in-partition binding.
- NICo does not expose a config option to choose full or limited membership when creating a partition. Whatever UFM is configured to use (typically full members, with the default partition restricted by the default_membership = limited hardening in the IB runbook) is what appears on a NICo-managed binding.
- The monitor flags a security alert if the default partition shows full membership on a NICo-managed port; that is an indication the UFM hardening described in the runbook has not been applied.
The IbFabricMonitor
IbFabricMonitor is the background reconciler inside the API server.
Every iteration it:
- Reads UFM state: port information (state, GUID, LID), partition membership lists, fabric version, M_Key / SM_Key / SA_Key configuration, and default-partition membership.
- Compares observed UFM state to the desired state implied by each instance’s InstanceInfinibandConfig.
- For each IB interface in an instance config, if the host GUID is not already a member of the expected P_Key, calls bind_ib_ports() to add it.
- For any GUID found in a NICo-managed P_Key that no longer matches a live instance config, calls unbind_ib_ports() to remove it.
- Updates per-machine InfiniBand status observations in the NICo database, which is what feeds the configs_synced.infiniband field on InstanceStatus.
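The compare-then-bind/unbind steps above amount to a set difference per P_Key. The sketch below shows that idea only; it is not the monitor's code. It assumes desired and observed state are both modelled as a mapping from P_Key to a set of port GUIDs, that `observed` contains only NICo-managed P_Keys, and the function name is hypothetical. It plans the changes and leaves the actual bind_ib_ports() / unbind_ib_ports() calls to the caller.

```python
from typing import Dict, Set, Tuple

Members = Dict[int, Set[str]]  # P_Key -> port GUIDs


def plan_membership_changes(desired: Members, observed: Members) -> Tuple[Members, Members]:
    """Return (to_bind, to_unbind) so that applying both makes observed equal desired."""
    to_bind: Members = {}
    to_unbind: Members = {}
    for pkey in desired.keys() | observed.keys():
        want = desired.get(pkey, set())
        have = observed.get(pkey, set())
        missing = want - have      # GUIDs that should be members but are not yet
        stale = have - want        # GUIDs that no live instance config accounts for
        if missing:
            to_bind[pkey] = missing
        if stale:
            to_unbind[pkey] = stale
    return to_bind, to_unbind
```

When both returned mappings are empty, observed state already matches intent, which corresponds to the converged case the status field reports.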
Cadence is set by fabric_monitor_run_interval (default 60 seconds). After
applying any UFM changes, the monitor accelerates the next iteration to
~1 second so that convergence shows up quickly in observed state. Once a
steady iteration completes with no changes, the monitor returns to the
configured interval.
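The cadence rule can be summarised in a few lines. This is a sketch of the scheduling decision only, with a hypothetical function name; the 60-second default and the roughly 1-second fast follow-up are taken from the paragraph above.

```python
def next_delay_seconds(
    changes_applied: bool,
    run_interval: float = 60.0,   # fabric_monitor_run_interval default
    fast_interval: float = 1.0,   # accelerated follow-up after UFM changes
) -> float:
    """Pick the wait before the next monitor iteration."""
    return fast_interval if changes_applied else run_interval


# Example: an idle pass, two passes that changed UFM, then a clean pass.
outcomes = [False, True, True, False]
delays = [next_delay_seconds(changed) for changed in outcomes]
```

The fast follow-up exists so that the read-back of newly applied membership happens within about a second rather than a full interval later.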
The monitor exposes metrics under the nico_ib_monitor_* namespace:
How a Tenant Ends Up in a Partition
For a tenant instance with an IB interface attached to a partition:
- The tenant updates the instance (REST PATCH …/nico/instance, or nicocli instance update) with an InfiniBand interface configuration that references the desired partition ID for each IB port.
- NICo validates the config (every referenced partition exists, ownership matches the tenant) and stores it in the database.
- The next IbFabricMonitor iteration observes the new desired state and issues bind_ib_ports() for each host GUID that is not already a member of the expected P_Key.
- UFM updates partition membership. On the following iteration the monitor reads back the new membership and updates the machine’s IB status observation.
- InstanceStatus::infiniband::configs_synced flips to true once observed UFM state matches desired state. The aggregate configs_synced and therefore the instance’s Ready state follow.
Tenants observe the in-flight state as Configuring and the
InstanceStatus machine remains in WaitingForNetworkConfig until the
monitor reports convergence.
Force-Delete and Cleanup
When an instance is released or its host is force-deleted, NICo clears
the IB interfaces from the instance config. The reconciler then sees host
GUIDs in NICo-managed P_Keys that no longer correspond to any live
instance and removes them via unbind_ib_ports().
NICo tracks the cleanup with an IbCleanupPending health alert on the
machine. The alert is set when cleanup is required and cleared once the
monitor confirms that every GUID has been removed from UFM-side
partitions. A machine with an outstanding IbCleanupPending alert is
ineligible for reuse by another tenant: this is the IB equivalent of the
Ethernet termination guard described in
Default Isolation.
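The cleanup guard can be pictured as follows. This is a sketch under assumptions rather than NICo's logic: alerts are modelled as a plain set of strings, observed membership as a mapping from P_Key to GUIDs, and both function names are hypothetical. It shows the alert being cleared only once none of the machine's GUIDs remain in any UFM-side partition.

```python
from typing import Dict, Set

CLEANUP_ALERT = "IbCleanupPending"


def clear_cleanup_if_done(
    alerts: Set[str],
    machine_guids: Set[str],
    observed_members: Dict[int, Set[str]],
) -> Set[str]:
    """Return the alert set, dropping the cleanup alert once no GUID is still bound."""
    still_bound = any(machine_guids & members for members in observed_members.values())
    updated = set(alerts)
    if not still_bound:
        updated.discard(CLEANUP_ALERT)
    return updated


def eligible_for_reuse(alerts: Set[str]) -> bool:
    """A machine with an outstanding cleanup alert must not be handed to another tenant."""
    return CLEANUP_ALERT not in alerts
```

Tying eligibility to observed UFM state, rather than to the unbind request having been sent, is what keeps a previous tenant's partition membership from leaking to the next one.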
There is no dedicated “force-delete IB partition” operation. Partitions
persist in the NICo database independent of instance churn; their membership
is what is reconciled against UFM. To remove a partition entirely, every
instance referencing it must release first, then the tenant’s
nicocli infiniband-partition delete (REST DELETE …/nico/infiniband-partition/{id})
will succeed.
Configuration Workflow
Operator (per site)
- Stand up UFM per the InfiniBand Setup runbook. Confirm default_membership = limited, M_Key / SA_Key hardening, and any required static topology configuration.
- Provision a UFM API user with permission to read ports / partitions and to create / update / delete partitions. Store its credentials in the secrets backend.
- In the NICo API server config:
  - Set [ib_config].enabled = true and any fabric-wide MTU / rate / service-level defaults.
  - For each fabric, define [ib_fabrics.<name>] with the UFM endpoint(s) and one or more pkeys ranges. Size the ranges with room for future growth.
- Restart the API server. The IbFabricMonitor begins its periodic reconciliation on the next tick.
Tenant (per partition)
All tenant steps use the REST API or nicocli; none require TOML or
nico-admin-cli.
- nicocli infiniband-partition create (REST POST …/nico/infiniband-partition) for each isolation domain the tenant needs (typically one per workload).
- nicocli instance update (REST PATCH …/nico/instance) to attach each IB interface to the appropriate partition.
- Wait for configs_synced.infiniband to become true.
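The final wait step is a plain poll-with-timeout. The sketch below deliberately does not call the NICo REST API, whose full paths are not given here: the caller supplies a hypothetical `fetch_synced` callable that returns the current value of configs_synced.infiniband, and the timeout and poll interval defaults are illustrative.

```python
import time
from typing import Callable


def wait_for_ib_sync(
    fetch_synced: Callable[[], bool],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until fetch_synced() returns True; return False if the timeout elapses first."""
    deadline = clock() + timeout_s
    while True:
        if fetch_synced():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Because the monitor follows up about a second after applying changes, convergence is usually visible within a small number of polls after the monitor's next regular iteration picks up the new config.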
Verification
NICo does not ship a single “is IB healthy” command. Verification is a short, repeatable checklist.
- UFM is reachable. Check the API server log for “Failed to create UFM client” or similar startup errors. Confirm nico_ib_monitor_iteration_latency is being recorded (the monitor is running) and that UFM error counters are flat.
- Configured partitions exist and have converged. List partitions with nicocli infiniband-partition list (REST GET …/nico/infiniband-partition) and confirm each is in a converged state. For deeper internal state during triage (the state-machine outcome field that surfaces UFM sync failures), an operator can use nico-admin-cli ib_partition show (--id, --tenant-org-id, or --name), which the REST API does not expose.
- Host is a member of the expected partitions. Inspect the machine’s infiniband_status_observation via the machine debug tooling and confirm each managed port reports membership in the intended P_Key. Cross-check with the live UFM partition table for the same partition.
- No anomalies in the monitor metrics. nico_ib_monitor_machines_with_missing_pkeys_count and nico_ib_monitor_machines_with_unexpected_pkeys_count should both be 0 in steady state. Either being non-zero is a divergence between intent and UFM state and warrants investigation.
- No outstanding cleanup. Confirm no managed machine has an IbCleanupPending alert.
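The metrics item in the checklist is easy to script. The sketch below assumes the metrics are scraped in the Prometheus text exposition format without trailing timestamps, which this page does not state; the function name is hypothetical, while the two metric names are the ones listed above.

```python
from typing import Dict

STEADY_STATE_ZERO = (
    "nico_ib_monitor_machines_with_missing_pkeys_count",
    "nico_ib_monitor_machines_with_unexpected_pkeys_count",
)


def divergent_metrics(exposition_text: str) -> Dict[str, float]:
    """Return the watched metrics whose summed value is non-zero."""
    totals = {name: 0.0 for name in STEADY_STATE_ZERO}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumption: each sample is "<name>{labels} <value>" with no timestamp.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in totals:
            totals[name] += float(value)
    return {name: total for name, total in totals.items() if total != 0}
```

An empty result means both counters are at zero; any entry in the result names the metric to investigate.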
Limitations Worth Knowing
The IB integration is in production but with the following gaps:
- Single-endpoint UFM today. Only the first entry in endpoints is used at runtime. UFM HA is handled by UFM itself (virtual IP / HA pair); the multi-endpoint config field exists for forward compatibility.
- P_Key ranges are append-only in practice. Restarting with a narrowed range under live partitions is unsafe; ranges should only grow.
- No tenant-facing UFM-reachability probe. Operators rely on metrics and log lines rather than a ping-style health command.
- SHARP, index-0, and default-membership knobs are not yet wired. The model has fields reserved for these but they are not honoured by the integration.
- UFM error taxonomy is incomplete. Some UFM failure modes surface as a generic error in the API server log. Distinguishing transient network issues from misconfiguration may require correlating against UFM-side logs.