Switch Certificate Configuration (ConfigureCertificate)
This document describes how the switch state controller configures switch TLS certificates during the Configuring phase. The handler delegates device operations to Component Manager (CM), which in turn calls Rack Manager Service (RMS) asynchronously and polls job status until completion.
Goals
- Install or rotate the switch NVOS certificate as part of initial switch
  bring-up, before NVOS admin credentials are stored (RotateOsPassword).
- Keep RMS-specific protobuf and job semantics behind the CM NvSwitchManager
  abstraction so the state handler stays backend-agnostic (RMS, NSM, mock).
- Persist the async job ID in controller state so restarts can resume polling.
Placement in the Switch FSM
ConfigureCertificate is a sub-state of SwitchControllerState::Configuring,
before RotateOsPassword, FetchInfo, and Validating.
Transient CM or transport failures during Start or WaitForComplete return
StateHandlerError and leave the switch in the same sub-state, so the next
handler iteration retries. Only a terminal RMS job status of Failed (or a
missing component manager while polling) transitions the switch to Error.
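The retry-versus-error rule above can be sketched as a single decision function. This is a minimal illustration, not the handler's actual code: PollOutcome, Transition, decide, and the StateHandlerError stand-in are hypothetical names introduced here, and only the four job status values come from this document.

```rust
/// Job status values as named in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Started,
    InProgress,
    Completed,
    Failed,
}

/// Hypothetical outcome of one poll attempt against Component Manager.
#[derive(Debug)]
pub enum PollOutcome {
    /// CM answered with a job status.
    Status(JobStatus),
    /// CM or the transport failed in a way that may succeed later.
    Transient(String),
    /// No component manager is available while polling.
    ComponentManagerMissing,
}

/// Stand-in for the real StateHandlerError type (assumed shape).
#[derive(Debug, PartialEq, Eq)]
pub struct StateHandlerError(pub String);

/// Hypothetical transition decision for the WaitForComplete sub-state.
#[derive(Debug, PartialEq, Eq)]
pub enum Transition {
    /// Remain in WaitForComplete and poll again next iteration.
    Stay,
    /// Certificate job finished; move on to RotateOsPassword.
    RotateOsPassword,
    /// Terminal failure; move the switch to Error.
    Error(String),
}

/// Maps one poll outcome to either a transition or a retryable handler error.
pub fn decide(outcome: PollOutcome) -> Result<Transition, StateHandlerError> {
    match outcome {
        // Returning Err leaves the persisted sub-state untouched, so the
        // next handler iteration retries the same step.
        PollOutcome::Transient(msg) => Err(StateHandlerError(msg)),
        PollOutcome::ComponentManagerMissing => Ok(Transition::Error(
            "component manager missing while polling".to_string(),
        )),
        PollOutcome::Status(JobStatus::Failed) => {
            Ok(Transition::Error("certificate job failed".to_string()))
        }
        PollOutcome::Status(JobStatus::Completed) => Ok(Transition::RotateOsPassword),
        PollOutcome::Status(JobStatus::Started | JobStatus::InProgress) => Ok(Transition::Stay),
    }
}
```

Keeping transient failures on the Err path, rather than as a Transition variant, mirrors the distinction drawn above: an error return changes nothing in stored state, while Error is a persisted transition.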
Sub-states (ConfigureCertificateState)
Job status values use ConfigureSwitchCertificateState: Started,
InProgress, Completed, Failed.
Domain name (domain_name) and mTLS services
The switch state handler passes:
- domain_name = None for both bring-up and maintenance reconfiguration. RMS receives an unset domain field; rack association is enforced separately when deciding whether certificate configuration can run.
- services from SwitchStateHandlerServices.switch_mtls_services, sourced from [switch_state_controller].switch_mtls_services in site config. When omitted or empty, all supported switch mTLS services are used.
Rack NMX cluster maintenance uses a separate service list:
[rack_state_controller].nmx_cluster_switch_mtls_services (defaults to
ScaleUpFabric manager and telemetry interface services). See
Rack State Machine.
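The "omitted or empty means all supported services" rule for switch_mtls_services can be sketched as follows. The function name effective_services and the use of plain strings for service identifiers are assumptions made for illustration; the real code presumably uses a typed service enum.

```rust
/// Returns the service list to send with the certificate request.
///
/// `configured` models the optional site-config value; `all_supported` models
/// the full set of supported switch mTLS services. Both are hypothetical
/// stand-ins for the real types.
pub fn effective_services(
    configured: Option<&[String]>,
    all_supported: &[String],
) -> Vec<String> {
    match configured {
        // A non-empty configured list is used as-is.
        Some(list) if !list.is_empty() => list.to_vec(),
        // Omitted or empty: fall back to every supported service.
        _ => all_supported.to_vec(),
    }
}
```

Treating an empty list the same as an omitted one means an operator cannot disable every service by configuring an empty list, which matches the behavior described above.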
Component Manager API
CM exposes two methods used by the switch configuration handler:
SwitchEndpoint is built from:
- Switch BMC MAC and BMC IP (required)
- Associated NVOS machine interface MAC and IP (both required; matches power-control validation in maintenance.rs)
- NVOS admin credentials from the credential vault (SwitchNvosAdmin); endpoint resolution failures during Start transition to Error (they do not return StateHandlerError).
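The required-field checks for the endpoint can be sketched as below. The field names, the SwitchRecord input type, and EndpointError are hypothetical; only the set of required addresses comes from the list above. The credential vault lookup is left out to keep the sketch self-contained.

```rust
use std::net::IpAddr;

/// Assumed shape of the resolved endpoint (credentials omitted).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SwitchEndpoint {
    pub bmc_mac: String,
    pub bmc_ip: IpAddr,
    pub nvos_mac: String,
    pub nvos_ip: IpAddr,
}

/// Hypothetical view of what the database knows about a switch.
#[derive(Debug, Default, Clone)]
pub struct SwitchRecord {
    pub bmc_mac: Option<String>,
    pub bmc_ip: Option<IpAddr>,
    pub nvos_mac: Option<String>,
    pub nvos_ip: Option<IpAddr>,
}

/// Hypothetical resolution error naming the first missing field.
#[derive(Debug, PartialEq, Eq)]
pub enum EndpointError {
    Missing(&'static str),
}

/// Builds an endpoint only when all four addresses are present.
pub fn resolve_endpoint(rec: &SwitchRecord) -> Result<SwitchEndpoint, EndpointError> {
    Ok(SwitchEndpoint {
        bmc_mac: rec.bmc_mac.clone().ok_or(EndpointError::Missing("bmc_mac"))?,
        bmc_ip: rec.bmc_ip.ok_or(EndpointError::Missing("bmc_ip"))?,
        nvos_mac: rec.nvos_mac.clone().ok_or(EndpointError::Missing("nvos_mac"))?,
        nvos_ip: rec.nvos_ip.ok_or(EndpointError::Missing("nvos_ip"))?,
    })
}
```

In this sketch a caller in the Start sub-state would turn an Err result into a transition to Error rather than a retryable handler error, since a missing address is unlikely to resolve on its own.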
Backend matrix
RMS integration
Identity resolution (RMS backend only)
Before calling RMS, RmsBackend:
- Looks up switch.id and switch.rack_id via find_rms_identities_by_macs.
- Builds rms::NodeInfo from the SwitchEndpoint and resolved identity.
- Passes optional domain, services, and device info to RMS.
If the switch has no rack_id in the database, identity resolution fails and CM
returns an internal error (the state handler normally skips earlier when
switch.rack_id is unset during bring-up).
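The early skip mentioned above, which keeps most switches without a rack from ever reaching identity resolution, might look like this. StartDecision and start_decision are hypothetical names, and the sketch assumes the skip applies at the Start sub-state when either the rack association or the component manager is absent.

```rust
/// Hypothetical decision taken at the Start sub-state.
#[derive(Debug, PartialEq, Eq)]
pub enum StartDecision {
    /// Skip certificate configuration and go straight to RotateOsPassword.
    Skip,
    /// Submit the certificate job through Component Manager.
    Submit,
}

/// Skips when the switch has no rack association or no component manager
/// is available; otherwise submits the job.
pub fn start_decision(rack_id: Option<&str>, has_component_manager: bool) -> StartDecision {
    if rack_id.is_none() || !has_component_manager {
        StartDecision::Skip
    } else {
        StartDecision::Submit
    }
}
```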
RMS RPCs
RMS job states are mapped in map_rms_configure_switch_certificate_job_state.
Sequence diagrams
Happy path (RMS backend)
One state-controller iteration runs Start; a later iteration runs
WaitForComplete until RMS reports completion.
Skip path (no rack association)
Error path (job failed)
Maintenance reconfiguration (ReconfigureCertificate)
Operator maintenance can reinstall switch certificates without leaving Ready
permanently. The flow reuses certificate.rs with
ConfigureSwitchCertificateMode::Reconfigure:
Differences from bring-up:
Entry point: switch_maintenance_requested.operation = ReconfigureCertificate
from Ready or Error. See Switch State Diagram.
Persistence
Controller state is stored in switches.controller_state (JSON). Example
after job submission:
The job ID is only in controller state (unlike rack firmware upgrade, which
also stores a separate firmware_upgrade_job row). This is sufficient for a
single-switch, single-job certificate operation.
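Why persisting the job ID is enough to resume after a restart can be sketched as follows. The variant layout, including the job_id field name, is an assumption and not the actual serialized schema; NextAction and next_action are hypothetical, and JSON (de)serialization is left out.

```rust
/// Assumed shape of the ConfigureCertificate sub-state.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ConfigureCertificateState {
    /// No job submitted yet.
    Start,
    /// A job was submitted; its ID is all that is needed to poll.
    WaitForComplete { job_id: String },
}

/// Hypothetical action the handler takes for a loaded state.
#[derive(Debug, PartialEq, Eq)]
pub enum NextAction {
    SubmitJob,
    PollJob(String),
}

/// Derives the next action purely from persisted state, so a controller
/// restart between iterations does not resubmit the job.
pub fn next_action(state: &ConfigureCertificateState) -> NextAction {
    match state {
        ConfigureCertificateState::Start => NextAction::SubmitJob,
        ConfigureCertificateState::WaitForComplete { job_id } => {
            NextAction::PollJob(job_id.clone())
        }
    }
}
```

Because the action depends only on what was stored, the handler holds no in-memory job tracking that a restart could lose.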
Implementation map
Testing
Integration tests cover:
- Skip when rack_id or component manager is absent → RotateOsPassword
- Start → WaitForComplete with mock CM
- WaitForComplete → RotateOsPassword on success
- WaitForComplete → Error on failed job status
- ConfigureCertificate (completed or skipped) → RotateOsPassword → FetchInfo → Validating
- Maintenance ReconfigureCertificate success and failure paths
Run the tests with DATABASE_URL set (required by the sqlx test harness) and
filter by name: cargo test -p carbide-api-core configure_certificate
Future work
- Decide whether domain should be set explicitly (for example to rack_id) once the RMS certificate catalog contract is finalized.
- Decide whether NSM backend should support certificate configuration or remain explicitly unsupported.