Reference Installation — Manual Phase-by-Phase
This guide breaks down every phase of NICo's setup.sh installation, with the exact commands each phase runs. Use it if you need to re-run a single phase, debug a failure, or understand what the script does before running it.
For the automated end-to-end installation using setup.sh, see the Quick Start Guide.
Prerequisites: complete all configuration steps in Step 3 of the Quick Start Guide before running any phase manually.
All commands below assume you are in the helm-prereqs/ directory and have set the required environment variables from Step 3 of the Quick Start Guide.
Phase 0 — DNS check
Detects cluster type and verifies DNS is ready before any workloads are deployed.
- Kubespray clusters — checks if the `nodelocaldns` DaemonSet is ready; deploys `operators/nodelocaldns-daemonset.yaml` if missing and waits for rollout
- kubeadm / other — checks CoreDNS `readyReplicas >= 1`; warns but does not fail if not ready
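These checks can be reproduced manually; a sketch assuming the stock Kubespray and kubeadm object names:

```shell
# Kubespray: is the NodeLocal DNSCache DaemonSet fully rolled out?
kubectl -n kube-system rollout status daemonset/nodelocaldns --timeout=60s

# kubeadm / other: at least one ready CoreDNS replica?
kubectl -n kube-system get deployment coredns -o jsonpath='{.status.readyReplicas}'
```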
Phase 1 — local-path-provisioner
Deploys StorageClasses for Vault and PostgreSQL PVCs. The local-path-persistent StorageClass uses reclaimPolicy: Retain so data survives pod deletion and node restarts.
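A minimal sketch of such a StorageClass; everything except `reclaimPolicy: Retain` is an assumption based on upstream local-path-provisioner defaults:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path-persistent
provisioner: rancher.io/local-path      # upstream local-path-provisioner
reclaimPolicy: Retain                   # PVs (and data) survive PVC deletion
volumeBindingMode: WaitForFirstConsumer # bind only once a pod is scheduled
```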
Phase 1b — postgres-operator
Installs the Zalando PostgreSQL Operator. Must be up before Phase 5 creates the forge-pg-cluster resource — the postgresql.acid.zalan.do CRD must be registered first.
Phase 1c — MetalLB
Installs MetalLB 0.14.5 with the FRR BGP speaker, then applies your site-specific IP pool and BGP configuration.
Expected result: MetalLB controller and speaker pods running in metallb-system. BGPPeer sessions established with your TOR switches.
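The site-specific part boils down to two MetalLB resources; a hedged sketch with placeholder addresses and ASNs, so substitute your own site values:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ncx-pool                 # placeholder name
  namespace: metallb-system
spec:
  addresses:
    - 192.0.2.10-192.0.2.50      # replace with your site's range
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: tor-1                    # one peer per TOR switch
  namespace: metallb-system
spec:
  myASN: 64512                   # placeholder ASNs
  peerASN: 64513
  peerAddress: 192.0.2.1
```

A `BGPAdvertisement` referencing the pool is also required before MetalLB announces any routes.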
Phase 2 — cert-manager + Vault TLS bootstrap
Three sub-steps — all must complete before Phase 3 (Vault).
2a — cert-manager
2b — Vault TLS bootstrap
Vault requires TLS to start — but the Vault-backed issuer can’t exist before Vault is running. This step breaks the chicken-and-egg problem by using site-issuer (backed by site-root CA) to issue Vault’s own TLS certs before Vault starts.
Phase 3 — Vault
Installs HashiCorp Vault 0.25.0 in 3-replica HA Raft mode. TLS secrets exist in the vault namespace by this point so pods start immediately.
Phase 4 — Initialize and unseal Vault
unseal_vault.sh handles both first-run init and re-unseal on subsequent runs:
- First run: `vault operator init -key-shares=5 -key-threshold=3`, stores the init JSON as the `vault-cluster-keys` secret, unseals all three pods
- Creates the `forge-system` namespace with Helm ownership labels
- Copies the root token to `carbide-vault-token` in `forge-system` for the `vault-pki-config` Job
bootstrap_ssh_host_key.sh pre-creates the ssh-host-key Secret in OpenSSH PEM format (idempotent — skips if the secret already exists).
To verify Vault is unsealed:
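A quick check across the three HA replicas:

```shell
# Sealed should be false on all three replicas
for i in 0 1 2; do
  kubectl exec -n vault "vault-$i" -c vault -- vault status | grep -E 'Sealed|HA Mode'
done
```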
Phase 5 — external-secrets + carbide-prereqs
After carbide-prereqs installs, wait for the PostgreSQL cluster to provision and for ESO to sync credentials:
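A sketch of the wait, using the cluster and namespace names from this guide:

```shell
# Operator status on the cluster resource; re-run until STATUS shows Running
kubectl -n postgres get postgresql forge-pg-cluster

# Then confirm ESO has synced the credentials into forge-system
kubectl -n forge-system get externalsecret
```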
Phase 6 — NCX Core
Deploys the main NCX Core application chart. Run from the repo root (ncx-infra-controller-core/), not from helm-prereqs/.
Verify LoadBalancer IPs were assigned from your MetalLB pool:
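A sketch of the check:

```shell
# Any LoadBalancer service still showing <pending> has no pool IP assigned
kubectl get svc -A | grep LoadBalancer
```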
Phase 7 — NCX REST (carbide-rest)
All sub-steps run from the NCX REST repo directory ($NCX_REPO).
7a — CA signing secret
Generates the ca-signing-secret used by the carbide-rest-ca-issuer ClusterIssuer for Temporal mTLS. Idempotent — skips if the secret already exists.
7b — carbide-rest-ca-issuer
7c — NCX REST postgres
7d — Keycloak
7e — Temporal TLS bootstrap
7f — Temporal
7g — NCX REST helm chart
7h — NCX REST site-agent
The deployment order is critical — do not skip steps.
PKI architecture
The PKI has three layers, built bottom-up:
NCX REST has its own parallel PKI chain for internal services:
The site-agent uses the Vault PKI CA for both directions of mTLS with carbide-api:
- Site-agent presents its client cert (Vault-signed) — carbide-api trusts it via the same CA.
- Site-agent verifies carbide-api's server cert using `ca.crt` from the issued secret (Vault PKI CA).
Layer 1 — Bootstrap (no external dependencies)
selfsigned-bootstrap is a cert-manager selfSigned ClusterIssuer with no dependencies. It issues site-root: a 10-year CA certificate stored as Secret site-root in the cert-manager namespace. This is the trust anchor for the entire cluster.
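A sketch of the two Layer 1 resources; names and placement come from this section, while the exact spec fields are assumptions:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-bootstrap
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: site-root
  namespace: cert-manager
spec:
  isCA: true
  commonName: site-root
  secretName: site-root        # the trust anchor Secret
  duration: 87600h             # 10 years
  issuerRef:
    name: selfsigned-bootstrap
    kind: ClusterIssuer
```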
Layer 2 — site-issuer (Vault TLS bootstrap)
site-issuer is a ca ClusterIssuer backed by site-root. It can issue certificates without Vault being up.
This solves the Vault TLS chicken-and-egg problem. Vault requires TLS to start — but vault-forge-issuer (the Vault-backed issuer) can’t exist before Vault is running. site-issuer breaks the cycle by issuing Vault’s own TLS secrets before Vault starts:
These secrets must exist before helmfile sync -l name=vault — setup.sh creates them explicitly in Phase 2 using helm template | kubectl apply.
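site-issuer itself is small; a sketch grounded in the description above:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: site-issuer
spec:
  ca:
    secretName: site-root   # read from the cert-manager namespace
```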
Layer 3 — vault-forge-issuer (workload PKI)
Once Vault is running and unsealed, the vault-pki-config Job (Helm post-install hook) configures Vault as a PKI backend:
- Enables the `forgeca` PKI secrets engine, tunes it to a 10-year max TTL.
- Imports `site-root` (cert + key) into Vault PKI — Vault becomes an intermediate CA under the same trust root.
- Creates PKI role `forge-cluster` — allows any name, allows SPIFFE URI SANs, 720h max TTL, EC P-256.
- Enables Kubernetes auth and writes two policies: `cert-manager-forge-policy` (sign via PKI) and `forge-vault-policy` (read KV secrets).
- Enables KV v2 at `secrets/` and AppRole auth for the `carbide` role.
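The same configuration, sketched as Vault CLI commands; names and values come from the list above, but exact flags and policy contents are assumptions about the Job's implementation:

```shell
vault secrets enable -path=forgeca pki
vault secrets tune -max-lease-ttl=87600h forgeca          # 10-year max TTL
vault write forgeca/config/ca pem_bundle=@site-root.pem   # import site-root cert + key
vault write forgeca/roles/forge-cluster \
  allow_any_name=true allowed_uri_sans='spiffe://*' \
  max_ttl=720h key_type=ec key_bits=256
vault auth enable kubernetes
vault policy write cert-manager-forge-policy cert-manager-forge-policy.hcl
vault policy write forge-vault-policy forge-vault-policy.hcl
vault secrets enable -path=secrets kv-v2
vault auth enable approle
vault write auth/approle/role/carbide token_policies=forge-vault-policy
```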
vault-forge-issuer is then created as a cert-manager ClusterIssuer authenticating to Vault via Kubernetes auth. All NCX Core workload SPIFFE certificates and the site-agent’s gRPC client certificate are issued through this issuer.
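A hedged sketch of what such an issuer looks like; the server address, role name, and token Secret here are illustrative assumptions, not the chart's actual values:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-forge-issuer
spec:
  vault:
    server: https://vault.vault.svc:8200   # assumed in-cluster address
    path: forgeca/sign/forge-cluster       # PKI engine + role from Layer 3
    caBundle: <base64 site-root ca.crt>
    auth:
      kubernetes:
        mountPath: /v1/auth/kubernetes
        role: cert-manager                 # hypothetical Vault k8s-auth role
        secretRef:
          name: cert-manager-vault-token   # hypothetical SA token Secret
          key: token
```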
forge-roots — CA distribution
The forge-roots Secret (containing site-root’s ca.crt) must be present in every namespace where NCX workloads run so pods can verify each other’s SPIFFE certificates.
creationPolicy: Orphan prevents Kubernetes GC from cascading a delete to forge-roots if the ExternalSecret is recreated on helm upgrade.
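A sketch of the distribution resource; the store name and source mapping are assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: forge-roots
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: <store-name>            # placeholder
  target:
    name: forge-roots
    creationPolicy: Orphan        # do not GC the Secret with the ExternalSecret
  data:
    - secretKey: ca.crt
      remoteRef:
        key: site-root            # source Secret holding the CA
        property: ca.crt
```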
PostgreSQL architecture
PostgreSQL is deployed as a production-grade 3-node HA cluster managed by the Zalando PostgreSQL Operator (acid.zalan.do). NCX REST also deploys its own simpler postgres StatefulSet in the same postgres namespace for temporal, keycloak, and NCX REST databases.
Credential flow (NCX Core)
The operator automatically creates a per-user credential Secret in the postgres namespace:
ESO’s carbide-db-eso ClusterExternalSecret mirrors this into forge-system as:
forge-pg-cluster-env ConfigMap
The operator injects the forge-pg-cluster-env ConfigMap (in the postgres namespace) into every postgres pod as environment variables. Currently provides:
The ConfigMap is rendered by the carbide-prereqs chart (from Values.siteName) so it flows in at install time and can be overridden per-site with --set siteName=<name>.
ssh-host-key format
ssh-console-rs requires the SSH host key in OpenSSH PEM format (-----BEGIN OPENSSH PRIVATE KEY-----). Helm’s genPrivateKey "ed25519" produces PKCS8 format which the binary rejects at startup. bootstrap_ssh_host_key.sh pre-creates the secret using ssh-keygen before helmfile sync -l name=carbide-prereqs runs. The lookup in templates/_helpers.tpl detects the existing secret and reuses it, so Helm never overwrites it.
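The same format can be produced by hand with ssh-keygen, which writes OpenSSH PEM natively for ed25519 keys; the path and comment here are illustrative:

```shell
# Generate a fresh ed25519 host key; ed25519 keys are always OpenSSH format
rm -f /tmp/ssh_host_ed25519_key /tmp/ssh_host_ed25519_key.pub
ssh-keygen -t ed25519 -N "" -q -f /tmp/ssh_host_ed25519_key -C "nico-ssh-console"
head -n1 /tmp/ssh_host_ed25519_key   # OpenSSH header, not PKCS8's "BEGIN PRIVATE KEY"
```

The resulting file can then be loaded into a Secret with `kubectl create secret generic ssh-host-key --from-file=...` (the exact key name inside the Secret depends on the chart).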
Secrets reference
All secrets created by setup.sh. The Vault unseal keys (vault-cluster-keys) are the most sensitive — back them up to a secure location after first install.
ClusterIssuers
ClusterSecretStores
Troubleshooting
carbide-api CrashLoopBackOff — siteConfig parse error
If carbide-api crashes immediately after Phase 6 with a config parse error, the most common cause is empty required fields in the carbideApiSiteConfig TOML block. Fields that must be non-empty:
- `[networks.admin]` — `prefix` and `gateway` (empty string crashes the binary)
- `[pools.lo-ip]`, `[pools.vlan-id]`, `[pools.vni]` — `ranges` must have at least one entry
Check the pod logs for the specific field:
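A sketch, assuming carbide-api runs in forge-system; the pod name is a placeholder:

```shell
# The parse error names the offending field
kubectl -n forge-system logs <carbide-api-pod> --previous | grep -i -E 'parse|invalid|missing'
```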
Fix the value in values/ncx-core.yaml and re-run:
DNS resolution failing in pods
On Kubespray clusters, setup.sh deploys the NodeLocal DNSCache DaemonSet automatically. If it is not ready:
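A sketch of the recovery steps, using the manifest path from Phase 0:

```shell
kubectl -n kube-system rollout status daemonset/nodelocaldns --timeout=120s
# If the DaemonSet is missing entirely, re-apply the Phase 0 manifest
kubectl apply -f operators/nodelocaldns-daemonset.yaml
```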
On kubeadm clusters, NodeLocal DNSCache is not used — setup.sh checks CoreDNS readyReplicas instead:
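The equivalent manual check:

```shell
kubectl -n kube-system get deployment coredns -o jsonpath='{.status.readyReplicas}'
```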
Vault TLS bootstrap certificates not Ready
Common cause: cert-manager webhook not ready yet. Wait 30 seconds and re-run Phase 2.
Vault pods stuck in Init or CrashLoop
vault-pki-config Job failing
Common causes:
- Vault still sealed — `kubectl exec -n vault vault-0 -c vault -- vault status`
- `carbide-vault-token` missing — re-run `./unseal_vault.sh`
- `site-root` Secret not readable by the Job's service account
forge-pg-cluster not reaching Running state
Common causes:
- `local-path-persistent` StorageClass missing — re-run Phase 1
- `forge-pg-cluster-env` ConfigMap missing in `postgres` namespace — re-run Phase 5
- Insufficient node resources — tune `postgresql.resources` in `values.yaml`
DB credentials not appearing in forge-system
The source secret (forge-system.carbide.forge-pg-cluster.credentials.postgresql.acid.zalan.do) is created by the operator only after the cluster reaches Running state. If the ClusterSecretStore shows Invalid, check that the eso-postgres-ns ServiceAccount token exists in the postgres namespace:
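A sketch of the check:

```shell
kubectl -n postgres get serviceaccount eso-postgres-ns
kubectl -n postgres get secret | grep eso-postgres-ns
kubectl get clustersecretstore        # READY/STATUS columns show the failure reason
```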
forge-roots Secret not appearing
If the label is missing:
Site-agent gRPC connection to carbide-api failing (nil CarbideClient)
The site-agent connects to carbide-api at startup with a 5-second deadline. If the connection fails, the CarbideClient stays nil permanently and all inventory activities panic with a nil-pointer dereference. setup.sh detects this and restarts the StatefulSet automatically, but you can also diagnose manually:
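A sketch of the manual diagnosis; the workload and namespace names here are assumptions for illustration:

```shell
# Is carbide-api resolvable and serving?
kubectl -n forge-system get svc | grep carbide-api

# Look for the gRPC dial error near startup
kubectl -n forge-system logs statefulset/site-agent --tail=100 | grep -i -E 'grpc|carbide|deadline'

# Restart once the cause is fixed (setup.sh does the same automatically)
kubectl -n forge-system rollout restart statefulset/site-agent
```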
Common causes and fixes:
Temporal namespace not found (site-agent startup panic)
If the site-agent panics on startup with a nil pointer in RegisterCron:
If the namespace for the site UUID is missing, create it manually:
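One way to do it, from the Temporal admin-tools pod; the deployment name and namespace are assumptions:

```shell
kubectl exec -n temporal deploy/temporal-admintools -- \
  tctl --namespace <site-uuid> namespace register
```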
Then restart the site-agent.
MetalLB LoadBalancer services stuck in <pending>
If NCX Core services never get an external IP:
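A sketch of the first things to look at; the speaker label selector is an assumption based on the upstream MetalLB chart:

```shell
kubectl -n metallb-system get ipaddresspool,bgpadvertisement,bgppeer
kubectl -n metallb-system logs -l app.kubernetes.io/component=speaker --tail=50
kubectl describe svc <stuck-service>   # Events often name the exact problem
```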
Common causes: