# Reference Installation — Manual Phase-by-Phase

This guide breaks down every phase of NICo's `setup.sh` installation, showing the exact commands each phase runs. Use it to re-run a single phase, debug a failure, or understand what the script does before running it.

For the automated end-to-end installation using `setup.sh`, see the [Quick Start Guide](/infra-controller/documentation/getting-started/quick-start-guide).

**Prerequisites:** Complete all configuration steps in [Step 3 of the Quick Start Guide](/infra-controller/documentation/getting-started/quick-start-guide#step-3--configure-the-site) before running any phase manually.

All commands below assume you are in the `helm-prereqs/` directory with the required environment variables set:

```bash
cd helm-prereqs/
export KUBECONFIG=/path/to/kubeconfig
export REGISTRY_PULL_SECRET=<your-pull-secret>
export NCX_IMAGE_REGISTRY=<your-registry>
export NCX_CORE_IMAGE_TAG=<ncx-core-tag>
export NCX_REST_IMAGE_TAG=<ncx-rest-tag>
export NCX_REPO=/path/to/ncx-infra-controller-rest   # or let preflight auto-detect
```

---

## Phase 0 — DNS check

Detects cluster type and verifies DNS is ready before any workloads are deployed.

- **Kubespray clusters** — checks if the `nodelocaldns` DaemonSet is ready; deploys `operators/nodelocaldns-daemonset.yaml` if missing and waits for rollout
- **kubeadm / other** — checks that the CoreDNS Deployment has `readyReplicas >= 1`; warns but does not fail if it is not ready

```bash
# Kubespray: deploy NodeLocal DNSCache if missing
if kubectl get configmap nodelocaldns -n kube-system &>/dev/null; then
    kubectl apply -f operators/nodelocaldns-daemonset.yaml 2>/dev/null || true
    kubectl rollout status daemonset/nodelocaldns -n kube-system --timeout=120s
else
    # kubeadm: just verify CoreDNS is up
    kubectl get deployment coredns -n kube-system
fi
```

---

## Phase 1 — local-path-provisioner

Deploys StorageClasses for Vault and PostgreSQL PVCs. The `local-path-persistent` StorageClass uses `reclaimPolicy: Retain` so data survives pod deletion and node restarts.

```bash
kubectl apply -f operators/local-path-provisioner.yaml
# Delete before re-apply - the provisioner field is immutable
kubectl delete -f operators/storageclass-local-path-persistent.yaml --ignore-not-found 2>/dev/null || true
kubectl apply -f operators/storageclass-local-path-persistent.yaml
kubectl rollout status deployment/local-path-provisioner -n local-path-storage --timeout=120s
# Mark local-path as the cluster default StorageClass
kubectl annotate storageclass local-path \
    storageclass.kubernetes.io/is-default-class=true --overwrite
```
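
A quick sanity check, using the StorageClass names created by the manifests above:

```bash
# Both StorageClasses should exist; local-path-persistent should report
# reclaimPolicy Retain so PVC data survives pod deletion
kubectl get storageclass local-path local-path-persistent
kubectl get storageclass local-path-persistent -o jsonpath='{.reclaimPolicy}'
```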

---

## Phase 1b — postgres-operator

Installs the Zalando PostgreSQL Operator. It must be running before Phase 5 creates the `forge-pg-cluster` resource, because the `postgresql.acid.zalan.do` CRD has to be registered first.

```bash
helmfile sync -l name=postgres-operator
```
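
Verify the operator is up before moving on — a sketch assuming the stock Zalando deployment name and CRD:

```bash
# Deployment and CRD names assume a standard Zalando operator install
kubectl rollout status deployment/postgres-operator -n postgres --timeout=120s
kubectl get crd postgresqls.acid.zalan.do
```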

---

## Phase 1c — MetalLB

Installs MetalLB 0.14.5 with the FRR BGP speaker, then applies your site-specific IP pool and BGP configuration.

```bash
helmfile sync -l name=metallb
kubectl wait --for=condition=Available deployment/metallb-controller \
    -n metallb-system --timeout=120s
kubectl apply -f values/metallb-config.yaml
```

Expected result: MetalLB controller and speaker pods running in `metallb-system`, with BGPPeer sessions established to your top-of-rack (ToR) switches.
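
To check both:

```bash
kubectl get pods -n metallb-system
kubectl get bgppeer -n metallb-system
# Session establishment details appear in the FRR speaker logs
kubectl logs -n metallb-system -l app=metallb,component=speaker --tail=20
```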

---

## Phase 2 — cert-manager + Vault TLS bootstrap

Two sub-steps — both must complete before Phase 3 (Vault).

### 2a — cert-manager

```bash
helmfile sync -l name=cert-manager
```

### 2b — Vault TLS bootstrap

Vault requires TLS to start — but the Vault-backed issuer can't exist before Vault is running. This step breaks the chicken-and-egg problem by using `site-issuer` (backed by `site-root` CA) to issue Vault's own TLS certs before Vault starts.

```bash
kubectl create namespace vault --dry-run=client -o yaml | kubectl apply -f -
# Run from the helm-prereqs/ directory (the chart root)
helm template carbide-prereqs . \
    --show-only templates/site-root-certificate.yaml \
    --show-only templates/vault-tls-certs.yaml \
    | kubectl apply --server-side --field-manager=helm -f -
# Wait for all three certs to be issued
kubectl wait --for=condition=Ready certificate/site-root -n cert-manager --timeout=120s
kubectl wait --for=condition=Ready certificate/forgeca-vault-client -n vault --timeout=120s
kubectl wait --for=condition=Ready certificate/vault-raft-tls -n vault --timeout=120s
```

---

## Phase 3 — Vault

Installs HashiCorp Vault (Helm chart 0.25.0) in 3-replica HA Raft mode. The TLS secrets already exist in the `vault` namespace at this point, so the pods start immediately.

```bash
helmfile sync -l name=vault
```
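
The pods come up but stay sealed, so they report `0/1 Ready` until Phase 4 unseals them — this is expected:

```bash
kubectl get pods -n vault
# Expect vault-0..vault-2 Running but not Ready (readiness gates on unseal)
```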

---

## Phase 4 — Initialize and unseal Vault

```bash
./unseal_vault.sh
./bootstrap_ssh_host_key.sh
```

`unseal_vault.sh` handles both first-run init and re-unseal on subsequent runs:
- First run: runs `vault operator init -key-shares=5 -key-threshold=3`, stores the init JSON as the `vault-cluster-keys` Secret, and unseals all three pods
- Creates the `forge-system` namespace with Helm ownership labels
- Copies the root token to `carbide-vault-token` in `forge-system` for the `vault-pki-config` Job

`bootstrap_ssh_host_key.sh` pre-creates the `ssh-host-key` Secret in OpenSSH PEM format (idempotent — it skips if the secret already exists).

To verify Vault is unsealed:

```bash
kubectl exec -n vault vault-0 -c vault -- vault status
```
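
Or, to check all three replicas at once (a minimal loop; each unsealed node should report `Sealed false`):

```bash
for i in 0 1 2; do
    kubectl exec -n vault "vault-${i}" -c vault -- vault status \
        | grep -E '^(Sealed|HA Mode)'
done
```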

---

## Phase 5 — external-secrets + carbide-prereqs

```bash
helmfile sync -l name=external-secrets
helmfile sync -l name=carbide-prereqs
```

After `carbide-prereqs` installs, wait for the PostgreSQL cluster to provision and for ESO to sync credentials:

```bash
# Wait for the Patroni cluster to reach Running state (can take 3-5 minutes)
kubectl wait --for=jsonpath='{.status.PostgresClusterStatus}'=Running \
    postgresql/forge-pg-cluster -n postgres --timeout=600s

# Verify ESO synced the DB credentials into forge-system
kubectl get secret forge-system.carbide.forge-pg-cluster.credentials -n forge-system
```

---

## Phase 6 — NCX Core

Deploys the main NCX Core application chart. Run from the **repo root** (`ncx-infra-controller-core/`), not from `helm-prereqs/`.

```bash
cd ..   # repo root (ncx-infra-controller-core/)
helm upgrade --install carbide ./helm \
    --namespace forge-system \
    -f helm-prereqs/values/ncx-core.yaml \
    --set global.image.repository="${NCX_IMAGE_REGISTRY}/nvmetal-carbide" \
    --set global.image.tag="${NCX_CORE_IMAGE_TAG}" \
    --timeout 600s --wait
```

Verify LoadBalancer IPs were assigned from your MetalLB pool:

```bash
kubectl get svc -n forge-system | grep LoadBalancer
```

---

## Phase 7 — NCX REST (carbide-rest)

The sub-steps below pull manifests and charts from the NCX REST repo via `$NCX_REPO`; commands that need the repo use subshells into it. Run them from the `helm-prereqs/` directory (cd back after Phase 6) so relative paths such as `values/ncx-rest.yaml` resolve.

### 7a — CA signing secret

Generates the `ca-signing-secret` used by the `carbide-rest-ca-issuer` ClusterIssuer for Temporal mTLS. Idempotent — skips if the secret already exists.

```bash
(cd "${NCX_REPO}" && bash scripts/gen-site-ca.sh)
```

### 7b — carbide-rest-ca-issuer

```bash
(cd "${NCX_REPO}" && kubectl apply -k deploy/kustomize/base/cert-manager-io)
```

### 7c — NCX REST postgres

```bash
(cd "${NCX_REPO}" && kubectl apply -k deploy/kustomize/base/postgres)
kubectl rollout status statefulset/postgres -n postgres --timeout=300s
```

### 7d — Keycloak

```bash
(cd "${NCX_REPO}" && kubectl apply -k deploy/kustomize/base/keycloak -n carbide-rest)
kubectl rollout status deployment/keycloak -n carbide-rest --timeout=300s
```

### 7e — Temporal TLS bootstrap

```bash
(cd "${NCX_REPO}" && kubectl apply -f deploy/kustomize/base/temporal-helm/namespace.yaml)
(cd "${NCX_REPO}" && kubectl apply -f deploy/kustomize/base/temporal-helm/db-creds.yaml)
(cd "${NCX_REPO}" && kubectl apply -f deploy/kustomize/base/temporal-helm/certificates.yaml)
# Wait for the three mTLS certs to be issued by carbide-rest-ca-issuer
kubectl wait --for=condition=Ready certificate/server-interservice-cert -n temporal --timeout=120s
kubectl wait --for=condition=Ready certificate/server-cloud-cert -n temporal --timeout=120s
kubectl wait --for=condition=Ready certificate/server-site-cert -n temporal --timeout=120s
```

### 7f — Temporal

```bash
helm upgrade --install temporal "${NCX_REPO}/temporal-helm/temporal" \
    --namespace temporal \
    -f "${NCX_REPO}/temporal-helm/temporal/values-kind.yaml" \
    --timeout 600s --wait

# Create the Temporal namespaces for NCX REST workers
_TEMPORAL_ADDR="temporal-frontend.temporal:7233"
_TEMPORAL_TLS="--tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \
    --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \
    --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \
    --tls-server-name interservice.server.temporal.local"
kubectl exec -n temporal deploy/temporal-admintools -- \
    sh -c "temporal operator namespace create -n cloud --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
kubectl exec -n temporal deploy/temporal-admintools -- \
    sh -c "temporal operator namespace create -n site --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true
```
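
To confirm both namespaces exist (reusing `_TEMPORAL_ADDR` and `_TEMPORAL_TLS` from the block above):

```bash
kubectl exec -n temporal deploy/temporal-admintools -- \
    sh -c "temporal operator namespace list --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" \
    | grep -E 'cloud|site'
```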

### 7g — NCX REST helm chart

```bash
# Build the image pull secret dockerconfigjson
_ncx_docker_cfg="$(printf '{"auths":{"nvcr.io":{"username":"$oauthtoken","password":"%s"}}}' \
    "${REGISTRY_PULL_SECRET}" | base64 | tr -d '\n')"

helm upgrade --install carbide-rest "${NCX_REPO}/helm/charts/carbide-rest" \
    --namespace carbide-rest \
    -f values/ncx-rest.yaml \
    --set global.image.repository="${NCX_IMAGE_REGISTRY}" \
    --set global.image.tag="${NCX_REST_IMAGE_TAG}" \
    --set "carbide-rest-common.secrets.imagePullSecret.dockerconfigjson=${_ncx_docker_cfg}" \
    --timeout 600s --wait
```

### 7h — NCX REST site-agent

The deployment order is critical — do not skip steps.

```bash
NCX_SITE_UUID="${NCX_SITE_UUID:-a1b2c3d4-e5f6-4000-8000-000000000001}"
NCX_SITE_AGENT_CHART="${NCX_REPO}/helm/charts/carbide-rest-site-agent"

# Step 1 - pre-apply the gRPC client cert so it exists before the pod starts
helm template carbide-rest-site-agent "${NCX_SITE_AGENT_CHART}" \
    --namespace carbide-rest \
    -f values/ncx-site-agent.yaml \
    --set global.image.repository="${NCX_IMAGE_REGISTRY}" \
    --set global.image.tag="${NCX_REST_IMAGE_TAG}" \
    --show-only templates/certificate.yaml | kubectl apply -f -
kubectl annotate certificate/core-grpc-client-site-agent-certs -n carbide-rest \
    "meta.helm.sh/release-name=carbide-rest-site-agent" \
    "meta.helm.sh/release-namespace=carbide-rest" --overwrite
kubectl label certificate/core-grpc-client-site-agent-certs -n carbide-rest \
    "app.kubernetes.io/managed-by=Helm" --overwrite
kubectl wait --for=condition=Ready certificate/core-grpc-client-site-agent-certs \
    -n carbide-rest --timeout=120s

# Step 2 - create per-site Temporal namespace (site-agent panics without it)
_TEMPORAL_ADDR="temporal-frontend.temporal:7233"
_TEMPORAL_TLS="--tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \
    --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \
    --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \
    --tls-server-name interservice.server.temporal.local"
kubectl exec -n temporal deploy/temporal-admintools -- \
    sh -c "temporal operator namespace create -n '${NCX_SITE_UUID}' --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true

# Step 3 - install site-agent (pre-install hook registers site and creates site-registration secret)
helm upgrade --install carbide-rest-site-agent "${NCX_SITE_AGENT_CHART}" \
    --namespace carbide-rest \
    -f values/ncx-site-agent.yaml \
    --set global.image.repository="${NCX_IMAGE_REGISTRY}" \
    --set global.image.tag="${NCX_REST_IMAGE_TAG}" \
    --set "envConfig.CLUSTER_ID=${NCX_SITE_UUID}" \
    --set "envConfig.TEMPORAL_SUBSCRIBE_NAMESPACE=${NCX_SITE_UUID}" \
    --set "envConfig.TEMPORAL_SUBSCRIBE_QUEUE=site" \
    --timeout 300s --wait

# Step 4 - verify gRPC connection to carbide-api
kubectl logs -n carbide-rest -l app.kubernetes.io/name=carbide-rest-site-agent --prefix \
    | grep "CarbideClient:"
```

---

## PKI architecture

The PKI has three layers, built bottom-up:

```
selfsigned-bootstrap ClusterIssuer
  └── site-root CA Certificate  (10-year self-signed CA, Secret "site-root" in cert-manager ns)
        └── site-issuer ClusterIssuer  (issues Vault's own TLS certs - no Vault dependency)
              ├── forgeca-vault-client  (Vault port 8200 listener TLS, Secret in vault ns)
              └── vault-raft-tls        (Vault Raft port 8201 peer TLS, Secret in vault ns)

vault (running, unsealed)
  └── vault-pki-config Job  (imports site-root CA into Vault PKI engine "forgeca")
        └── vault-forge-issuer ClusterIssuer  (issues all workload SPIFFE certs via Vault PKI)
```

NCX REST has its own parallel PKI chain for internal services:

```
carbide-rest-ca-issuer ClusterIssuer  (backed by ca-signing-secret in carbide-rest ns)
  └── Temporal mTLS certificates      (server-interservice-cert, server-cloud-cert, server-site-cert)

vault-forge-issuer ClusterIssuer      (same Vault PKI CA as NCX Core)
  └── site-agent gRPC client cert     (core-grpc-client-site-agent-certs in carbide-rest ns)
        SPIFFE URI: spiffe://forge.local/forge-system/sa/elektra-site-agent
```

The site-agent uses the Vault PKI CA for both directions of mTLS with carbide-api:
- Site-agent presents its client cert (Vault-signed) — carbide-api trusts it via the same CA.
- Site-agent verifies carbide-api's server cert using `ca.crt` from the issued secret (Vault PKI CA).
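
To inspect the SPIFFE URI on the issued client cert (assumes `openssl` is available on the machine running `kubectl`):

```bash
# Decode the cert-manager-issued tls.crt and print its SAN extension
kubectl get secret core-grpc-client-site-agent-certs -n carbide-rest \
    -o jsonpath='{.data.tls\.crt}' | base64 -d \
    | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
# Expect: URI:spiffe://forge.local/forge-system/sa/elektra-site-agent
```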

### Layer 1 — Bootstrap (no external dependencies)

`selfsigned-bootstrap` is a cert-manager `selfSigned` ClusterIssuer with no dependencies. It issues `site-root`: a 10-year CA certificate stored as Secret `site-root` in the `cert-manager` namespace. This is the trust anchor for the entire cluster.

### Layer 2 — site-issuer (Vault TLS bootstrap)

`site-issuer` is a `ca` ClusterIssuer backed by `site-root`. It can issue certificates without Vault being up.

**This solves the Vault TLS chicken-and-egg problem.** Vault requires TLS to start — but `vault-forge-issuer` (the Vault-backed issuer) can't exist before Vault is running. `site-issuer` breaks the cycle by issuing Vault's own TLS secrets before Vault starts:

| Secret | Namespace | Purpose |
|--------|-----------|---------|
| `forgeca-vault-client` | `vault` | Port 8200 listener cert (mounted at `/vault/userconfig/forgeca-vault/`) |
| `vault-raft-tls` | `vault` | Raft port 8201 peer cert (mounted at `/vault/userconfig/vault-raft-tls/`) |

These secrets must exist **before** `helmfile sync -l name=vault` — setup.sh creates them explicitly in Phase 2 using `helm template | kubectl apply`.

### Layer 3 — vault-forge-issuer (workload PKI)

Once Vault is running and unsealed, the `vault-pki-config` Job (Helm post-install hook) configures Vault as a PKI backend:

1. Enables the `forgeca` PKI secrets engine, tunes it to a 10-year max TTL.
2. Imports `site-root` (cert + key) into Vault PKI — Vault becomes an intermediate CA under the same trust root.
3. Creates PKI role `forge-cluster` — allows any name, allows SPIFFE URI SANs, 720h max TTL, EC P-256.
4. Enables Kubernetes auth and writes two policies: `cert-manager-forge-policy` (sign via PKI) and `forge-vault-policy` (read KV secrets).
5. Enables KV v2 at `secrets/` and AppRole auth for the `carbide` role.

`vault-forge-issuer` is then created as a cert-manager ClusterIssuer authenticating to Vault via Kubernetes auth. All NCX Core workload SPIFFE certificates and the site-agent's gRPC client certificate are issued through this issuer.
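
To confirm the issuer is healthy (cert-manager sets a `Ready` condition once it can authenticate to Vault):

```bash
kubectl get clusterissuer vault-forge-issuer \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# Expect: True
```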

### forge-roots — CA distribution

The `forge-roots` Secret (containing `site-root`'s `ca.crt`) must be present in every namespace where NCX workloads run so pods can verify each other's SPIFFE certificates.

```
site-root Secret (cert-manager ns)
  → ClusterSecretStore "cert-manager-ns-secretstore" (Kubernetes provider)
    → ClusterExternalSecret "forge-roots-eso"
      → ExternalSecret in every namespace labeled carbide.nvidia.com/managed=true
        → Secret "forge-roots" (ca.crt)
```

`creationPolicy: Orphan` prevents Kubernetes GC from cascading a delete to `forge-roots` if the ExternalSecret is recreated on helm upgrade.

---

## PostgreSQL architecture

PostgreSQL is deployed as a production-grade 3-node HA cluster managed by the **Zalando PostgreSQL Operator** (`acid.zalan.do`). NCX REST also deploys its own simpler postgres StatefulSet in the same `postgres` namespace, which holds the Temporal, Keycloak, and NCX REST databases.

```
postgres-operator (postgres ns)
  └── forge-pg-cluster postgresql CRD (postgres ns)        ← NCX Core
        ├── forge-pg-cluster-0  (Patroni leader)
        ├── forge-pg-cluster-1  (Patroni replica)
        └── forge-pg-cluster-2  (Patroni replica)
              each pod: postgres + postgres-exporter sidecar

postgres StatefulSet (postgres ns, service: postgres)      ← NCX REST
  └── Databases: forge, temporal, temporal_visibility, keycloak, elektratest
```

### Credential flow (NCX Core)

The operator automatically creates a per-user credential Secret in the `postgres` namespace:
```
forge-system.carbide.forge-pg-cluster.credentials.postgresql.acid.zalan.do
  username: forge-system.carbide
  password: <operator-generated>
```

ESO's `carbide-db-eso` ClusterExternalSecret mirrors this into `forge-system` as:
```
forge-system.carbide.forge-pg-cluster.credentials
  username: forge-system.carbide
  password: <same>
```
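
To confirm the mirrored Secret stays in sync with the operator's source of truth, compare the decoded passwords:

```bash
src=$(kubectl get secret \
    "forge-system.carbide.forge-pg-cluster.credentials.postgresql.acid.zalan.do" \
    -n postgres -o jsonpath='{.data.password}' | base64 -d)
dst=$(kubectl get secret "forge-system.carbide.forge-pg-cluster.credentials" \
    -n forge-system -o jsonpath='{.data.password}' | base64 -d)
[ "${src}" = "${dst}" ] && echo "credentials in sync" || echo "MISMATCH - check carbide-db-eso"
```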

### forge-pg-cluster-env ConfigMap

The operator injects the `forge-pg-cluster-env` ConfigMap (in the `postgres` namespace) into every postgres pod as environment variables. Currently provides:

```
TMP_SITE = <Values.siteName>
```

The ConfigMap is rendered by the `carbide-prereqs` chart (from `Values.siteName`) so it flows in at install time and can be overridden per-site with `--set siteName=<name>`.
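
To check the rendered value on a live cluster:

```bash
kubectl get configmap forge-pg-cluster-env -n postgres \
    -o jsonpath='{.data.TMP_SITE}'
```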

### ssh-host-key format

`ssh-console-rs` requires the SSH host key in **OpenSSH PEM format** (`-----BEGIN OPENSSH PRIVATE KEY-----`). Helm's `genPrivateKey "ed25519"` produces PKCS#8 format, which the binary rejects at startup. `bootstrap_ssh_host_key.sh` pre-creates the secret using `ssh-keygen` before `helmfile sync -l name=carbide-prereqs` runs. The `lookup` in `templates/_helpers.tpl` detects the existing secret and reuses it, so Helm never overwrites it.
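
To verify the stored key really is OpenSSH PEM — note the `ssh-host-key` data key inside the Secret is an assumption here; check `kubectl describe secret ssh-host-key -n forge-system` for the actual key name:

```bash
# NOTE: the data key name 'ssh-host-key' is assumed; adjust to what
# `kubectl describe secret` actually shows
kubectl get secret ssh-host-key -n forge-system \
    -o jsonpath='{.data.ssh-host-key}' | base64 -d | head -1
# Expect: -----BEGIN OPENSSH PRIVATE KEY-----
```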

---

## Secrets reference

All Secrets created during setup are listed below. The Vault unseal keys (`vault-cluster-keys`) are the most sensitive — back them up to a secure location after the first install.

| Secret | Namespace | Created by | Purpose |
|--------|-----------|------------|---------|
| `site-root` | `cert-manager` | cert-manager (selfsigned-bootstrap) | Self-signed root CA cert + key. Trust anchor for all PKI. |
| `forgeca-vault-client` | `vault` | cert-manager (site-issuer) | Vault port 8200 TLS listener cert |
| `vault-raft-tls` | `vault` | cert-manager (site-issuer) | Vault Raft port 8201 TLS peer cert |
| `vault-cluster-keys` | `vault` | `unseal_vault.sh` | Full Vault init JSON (5 unseal keys + root token). **Back this up.** |
| `vaultunsealkeys` | `vault` | `unseal_vault.sh` | Individual unseal keys (0-4) for automated re-unseal |
| `vaultroottoken` | `vault` | `unseal_vault.sh` | Vault root token. Limit use after setup. |
| `forge-system.carbide.forge-pg-cluster.credentials.postgresql.acid.zalan.do` | `postgres` | Zalando operator | Operator-generated DB credentials (source of truth) |
| `carbide-vault-token` | `forge-system` | `unseal_vault.sh` | Root token copy for `vault-pki-config` Job |
| `carbide-vault-approle-tokens` | `forge-system` | `vault-pki-config` Job | AppRole role-id and secret-id for NCX Core services |
| `nvcr-carbide-dev` | `forge-system` | `carbide-prereqs` chart | Image pull secret for NCX Core registry |
| `ssh-host-key` | `forge-system` | `bootstrap_ssh_host_key.sh` | ed25519 host key for `carbide-ssh-console-rs` in OpenSSH format |
| `forge-roots` | `forge-system` | ESO (forge-roots-eso) | Site-root CA cert (`ca.crt`) for SPIFFE cert verification |
| `forge-system.carbide.forge-pg-cluster.credentials` | `forge-system` | ESO (carbide-db-eso) | DB credentials mirrored from `postgres` ns for `carbide-api` |
| `ca-signing-secret` | `carbide-rest` | `gen-site-ca.sh` | NCX REST internal CA for Temporal mTLS |
| `core-grpc-client-site-agent-certs` | `carbide-rest` | cert-manager (vault-forge-issuer) | Site-agent mTLS client cert for carbide-api gRPC |

### ClusterIssuers

| Name | Backed by | Issues |
|------|-----------|--------|
| `selfsigned-bootstrap` | cert-manager selfSigned | `site-root` CA only |
| `site-issuer` | `site-root` CA Secret | Vault TLS certs (`forgeca-vault-client`, `vault-raft-tls`) |
| `vault-forge-issuer` | Vault PKI engine (`forgeca/sign/forge-cluster`) | All NCX Core SPIFFE certs + site-agent gRPC client cert |
| `carbide-rest-ca-issuer` | `ca-signing-secret` | Temporal mTLS certs |

### ClusterSecretStores

| Name | Reads from | Used for |
|------|------------|---------|
| `cert-manager-ns-secretstore` | `cert-manager` namespace | Syncing `site-root` CA to `forge-roots` |
| `postgres-ns-secretstore` | `postgres` namespace | Syncing operator DB credentials to `forge-system` |

## Troubleshooting

### carbide-api CrashLoopBackOff — siteConfig parse error

If `carbide-api` crashes immediately after Phase 6 with a config parse error, the most common cause is empty required fields in the `carbideApiSiteConfig` TOML block. Fields that must be non-empty:

- `[networks.admin]` — `prefix` and `gateway` (empty string crashes the binary)
- `[pools.lo-ip]`, `[pools.vlan-id]`, `[pools.vni]` — `ranges` must have at least one entry

Check the pod logs for the specific field:
```bash
kubectl logs -n forge-system -l app.kubernetes.io/name=carbide-api --previous
```

Fix the value in `values/ncx-core.yaml` and re-run:
```bash
helm upgrade carbide ./helm --namespace forge-system -f helm-prereqs/values/ncx-core.yaml \
    --set global.image.repository="${NCX_IMAGE_REGISTRY}/nvmetal-carbide" \
    --set global.image.tag="${NCX_CORE_IMAGE_TAG}"
```

### DNS resolution failing in pods

On **Kubespray clusters**, setup.sh deploys the NodeLocal DNSCache DaemonSet automatically. If it is not ready:
```bash
kubectl get daemonset nodelocaldns -n kube-system
kubectl apply -f operators/nodelocaldns-daemonset.yaml
kubectl rollout status daemonset/nodelocaldns -n kube-system
```

On **kubeadm clusters**, NodeLocal DNSCache is not used — setup.sh checks CoreDNS readyReplicas instead:
```bash
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl rollout restart deployment/coredns -n kube-system
```

### Vault TLS bootstrap certificates not Ready

```bash
kubectl get certificate -n cert-manager
kubectl get certificate -n vault
kubectl describe certificate forgeca-vault-client -n vault
```

Common cause: cert-manager webhook not ready yet. Wait 30 seconds and re-run Phase 2.

### Vault pods stuck in Init or CrashLoop

```bash
kubectl get secret forgeca-vault-client vault-raft-tls -n vault
kubectl logs vault-0 -n vault -c vault
```

### vault-pki-config Job failing

```bash
kubectl logs -n forge-system job/vault-pki-config -c wait-vault
kubectl logs -n forge-system job/vault-pki-config -c configure
```

Common causes:
- Vault still sealed — `kubectl exec -n vault vault-0 -c vault -- vault status`
- `carbide-vault-token` missing — re-run `./unseal_vault.sh`
- `site-root` Secret not readable by the Job's service account

### forge-pg-cluster not reaching Running state

```bash
kubectl get postgresql forge-pg-cluster -n postgres
kubectl describe postgresql forge-pg-cluster -n postgres
kubectl get pods -n postgres
kubectl logs -n postgres forge-pg-cluster-0 -c postgres
```

Common causes:
- `local-path-persistent` StorageClass missing — re-run Phase 1
- `forge-pg-cluster-env` ConfigMap missing in `postgres` namespace — re-run Phase 5
- Insufficient node resources — tune `postgresql.resources` in `values.yaml`

### DB credentials not appearing in forge-system

```bash
kubectl get clustersecretstore postgres-ns-secretstore
kubectl get clusterexternalsecret carbide-db-eso
kubectl describe externalsecret -n forge-system
```

The source secret (`forge-system.carbide.forge-pg-cluster.credentials.postgresql.acid.zalan.do`) is created by the operator only after the cluster reaches `Running` state. If the ClusterSecretStore shows `Invalid`, check that the `eso-postgres-ns` ServiceAccount token exists in the `postgres` namespace:
```bash
kubectl get secret eso-postgres-ns-token -n postgres
```

### forge-roots Secret not appearing

```bash
kubectl get clustersecretstore cert-manager-ns-secretstore
kubectl get clusterexternalsecret forge-roots-eso
kubectl get namespace forge-system --show-labels
# Should include: carbide.nvidia.com/managed=true
```

If the label is missing:
```bash
kubectl label namespace forge-system carbide.nvidia.com/managed=true
```

### Site-agent gRPC connection to carbide-api failing (nil CarbideClient)

The site-agent connects to carbide-api at startup with a 5-second deadline. If the connection fails, the `CarbideClient` stays nil permanently and all inventory activities panic with a nil-pointer dereference. setup.sh detects this and restarts the StatefulSet automatically, but you can also diagnose manually:

```bash
# Check which pods connected successfully
kubectl logs -n carbide-rest -l app.kubernetes.io/name=carbide-rest-site-agent --prefix \
    | grep -E "CarbideClient: (successfully connected|failed to get version)"

# Check mTLS cert was issued
kubectl get certificate core-grpc-client-site-agent-certs -n carbide-rest

# Check the cert was projected into the pod
kubectl exec -n carbide-rest carbide-rest-site-agent-0 -- ls /etc/carbide-certs/

# Check DNS resolution of carbide-api from the pod
kubectl exec -n carbide-rest carbide-rest-site-agent-0 -- \
    nslookup carbide-api.forge-system.svc.cluster.local
```

Common causes and fixes:

| Symptom | Cause | Fix |
|---------|-------|-----|
| `DeadlineExceeded` in pod logs | DNS cold cache on the node at startup | `kubectl rollout restart statefulset/carbide-rest-site-agent -n carbide-rest` |
| `certificate signed by unknown authority` | Site-agent cert issued by wrong CA | Check `values/ncx-site-agent.yaml` — `global.certificate.issuerRef.name` must be `vault-forge-issuer` |
| `Unauthenticated` from carbide-api | SPIFFE URI does not match `InternalRBACRules` | Check `values/ncx-site-agent.yaml` — `certificate.uris` must be `spiffe://forge.local/forge-system/sa/elektra-site-agent` |
| `transport: error while dialing` | Wrong `CARBIDE_SEC_OPT` | Check `envConfig.CARBIDE_SEC_OPT: "2"` in `ncx-site-agent.yaml` (2 = MutualTLS) |
| cert secret missing at pod start | Race: StatefulSet started before cert was issued | Re-run Phase 7h — the pre-apply Certificate step ensures the cert exists before the pod starts |

### Temporal namespace not found (site-agent startup panic)

If the site-agent panics on startup with a nil pointer in `RegisterCron`:
```bash
kubectl exec -n temporal deploy/temporal-admintools -- \
    sh -c "temporal operator namespace list --address temporal-frontend.temporal:7233 \
        --tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \
        --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \
        --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \
        --tls-server-name interservice.server.temporal.local"
```

If the namespace for the site UUID is missing, create it manually:
```bash
kubectl exec -n temporal deploy/temporal-admintools -- \
    sh -c "temporal operator namespace create -n '<site-uuid>' \
        --address temporal-frontend.temporal:7233 ..."
```
Then restart the site-agent.

### MetalLB LoadBalancer services stuck in `<pending>`

If NCX Core services never get an external IP:

```bash
kubectl get pods -n metallb-system
kubectl get ipaddresspool -n metallb-system
kubectl get bgppeer -n metallb-system
kubectl describe bgppeer -n metallb-system
kubectl logs -n metallb-system -l app=metallb,component=speaker --tail=50
kubectl get svc -n forge-system -l app.kubernetes.io/name=carbide-api
```

Common causes:

| Symptom | Cause | Fix |
|---------|-------|-----|
| `IPAddressPool` not found | `values/metallb-config.yaml` was not applied | Re-run `kubectl apply -f values/metallb-config.yaml` |
| BGP session `Idle` / never establishes | Wrong `peerAddress` or ASN, or firewall blocking TCP 179 | Verify with your network team |
| BGP session up but no IP assigned | IP pool addresses exhausted or CIDR is wrong | Check `kubectl describe ipaddresspool -n metallb-system` |
| All services pending after MetalLB looks healthy | FRR speaker not running | Set `speaker.frr.enabled: true` in `operators/values/metallb.yaml` and re-run Phase 1c |

### Checking overall health after setup

```bash
kubectl get clusterissuer
kubectl get clustersecretstore
kubectl get pods -n metallb-system
kubectl get ipaddresspool,bgppeer -n metallb-system
kubectl get pods -n postgres
kubectl get pods -n forge-system
kubectl get jobs -n forge-system
kubectl get secret forge-roots -n forge-system
kubectl get secret forge-system.carbide.forge-pg-cluster.credentials -n forge-system
kubectl get pods -n carbide-rest
kubectl get pods -n temporal
kubectl get certificate core-grpc-client-site-agent-certs -n carbide-rest
```