Site Controller Health
Use this playbook when the problem affects site-controller nodes, core NICo services, cloud sync, or shared infrastructure rather than a single managed host.
Site-controller nodes run the NICo control plane. They are not managed hosts.
Quick Health Checklist
Check critical namespaces for the site:
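As a minimal sketch of this check, the helper below scans the JSON from `kubectl get pods -A -o json` and lists pods that are not running and ready. The function name and the namespace list you pass in are assumptions for illustration; substitute the namespaces your site actually uses.

```python
import json


def unhealthy_pods(pod_list_json, namespaces):
    """Return (namespace, pod, reason) tuples for pods that look unhealthy.

    pod_list_json: text produced by `kubectl get pods -A -o json`.
    namespaces: namespaces to inspect (site-specific, supplied by the caller).
    """
    wanted = set(namespaces)
    problems = []
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod.get("metadata", {})
        ns = meta.get("namespace", "")
        if ns not in wanted:
            continue
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase == "Succeeded":
            continue  # completed Job pods are expected, not a fault
        if phase != "Running":
            problems.append((ns, meta.get("name", ""), phase))
            continue
        # Running pods can still hold containers that are not ready.
        for cs in status.get("containerStatuses") or []:
            if not cs.get("ready", False):
                waiting = (cs.get("state") or {}).get("waiting") or {}
                reason = waiting.get("reason", "NotReady")
                problems.append((ns, meta.get("name", ""), reason))
                break
    return problems
```

An empty result only means that no pod in the listed namespaces is pending, failed, or unready. It does not prove the services behind those pods are working.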
Node and Kubernetes Layer
Core NICo Services
Postgres health needs more than kubectl get pods:
Control-Plane Networking
Check:
- MetalLB BGP peers
- IP pools
- LoadBalancer services
- FRR speaker status
- DNS and service routing
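A LoadBalancer service that never receives an address is often the first visible symptom of a pool or BGP problem. The sketch below, assuming input from `kubectl get svc -A -o json`, lists such services. The function name is hypothetical, and the check covers only address assignment, not BGP peer state or FRR health.

```python
import json


def pending_load_balancers(service_list_json):
    """Return "namespace/name" for LoadBalancer services with no assigned address.

    service_list_json: text produced by `kubectl get svc -A -o json`.
    """
    pending = []
    for svc in json.loads(service_list_json).get("items", []):
        if svc.get("spec", {}).get("type") != "LoadBalancer":
            continue
        # status.loadBalancer.ingress is empty or absent until an IP is allocated.
        load_balancer = svc.get("status", {}).get("loadBalancer") or {}
        ingress = load_balancer.get("ingress") or []
        if not ingress:
            meta = svc.get("metadata", {})
            pending.append(f"{meta.get('namespace', '')}/{meta.get('name', '')}")
    return pending
```

If the list is non-empty, check the IP pools and speaker status next. If it is empty but traffic still fails, the problem is more likely in BGP peering, DNS, or service routing.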
Site Agent and Cloud Sync
Cloud-to-site sync failures can make the cloud UI and site state disagree.
Check site-agent logs:
Common causes:
- site agent cannot reach nico-api
- mTLS cert projection problem
- DNS cold-cache or startup race
- cloud API connectivity issue
- site agent crash loop
Upgrades and Configuration
For config or upgrade issues:
- lint changed TOML where possible
- confirm generated ConfigMaps contain expected values
- confirm ArgoCD or deployment sync completed
- confirm required secrets were projected
Certificate and Secret Rotation
Credential and certificate issues often surface as unrelated BMC, API, or probe failures.
Check:
- Vault pod health
- nico-api to Vault connectivity
- certificate renewal on every control-plane node
- projected secrets in affected pods
- carbide_api_vault_requests_failed_total
The metric prefix may remain carbide_* even when the service is now named NICo.
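To read that counter without a dashboard, the sketch below sums every sample of one metric from Prometheus text-format output. It assumes you have already scraped the nico-api metrics endpoint into a string; the endpoint address is not specified here, and the function name is hypothetical.

```python
def sum_metric(exposition_text, metric_name):
    """Sum all samples of metric_name in Prometheus text exposition output.

    Returns None when the metric does not appear at all.
    """
    total = 0.0
    found = False
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is: name{labels} value [timestamp], labels optional.
        if "{" in line:
            name, _, rest = line.partition("{")
            _, _, tail = rest.rpartition("}")
        else:
            name, _, tail = line.partition(" ")
        if name.strip() != metric_name:
            continue
        fields = tail.split()
        if not fields:
            continue
        total += float(fields[0])
        found = True
    return total if found else None
```

Because the metric is a counter, a single non-zero value is less informative than its growth. Compare two scrapes taken a few minutes apart; a rising total suggests Vault requests are failing now.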