Quick Start Guide


This guide walks through deploying NICo end-to-end: from building containers to discovering your first managed host. The core deployment is orchestrated by setup.sh in the helm-prereqs/ directory, which installs all prerequisites and NICo components in the correct order.

Before starting, review the Prerequisites for hardware, networking, software, and BMC/OOB requirements.

Step 1 — Build NICo Containers

Build all NICo container images from source on Ubuntu 24.04. This produces images for Infra Controller Core, DPU BFB artifacts, and the admin CLI.

Refer to the Building NICo Containers manual for full build instructions, including x86_64 and aarch64 cross-compilation steps.

Push the built images to your container registry before proceeding.

Step 2 — Prepare the Kubernetes Cluster

NICo requires a Kubernetes cluster with at least three schedulable nodes (Ready, not tainted NoSchedule/NoExecute) for HA Vault and PostgreSQL. NICo does not provision the cluster itself—operators are expected to provision their own Kubernetes cluster that meets the requirements below using their preferred tooling (kubeadm, Kubespray, managed K8s, etc.).
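A quick way to confirm the node count, readiness, and taints before proceeding (plain kubectl, no NICo tooling assumed):

$ kubectl get nodes
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].effect}{"\n"}{end}' # an empty second column means no taints

Any node reporting NoSchedule or NoExecute here does not count toward the three required schedulable nodes.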

Validated baseline:

Component    | Version
Kubernetes   | v1.30.4
kubelet      | v1.26.15
containerd   | 1.7.1
CNI (Calico) | v3.28.1
OS           | Ubuntu 24.04.1 LTS

The cluster must have:

  • net.bridge.bridge-nf-call-iptables=1 and net.ipv4.ip_forward=1 on every node (see the sketch after this list).
  • DNS resolution working (kubernetes.default.svc.cluster.local resolves on every node).
  • Network connectivity to your container registry.
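One way to set the two kernel parameters persistently on each node; this is a minimal sketch, and the drop-in file name is an arbitrary choice:

$ cat <<'EOF' | sudo tee /etc/sysctl.d/99-nico-k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
$ sudo modprobe br_netfilter # bridge-nf-call-iptables requires the br_netfilter module
$ sudo sysctl --system       # apply and echo back the effective values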

Required tools (local machine)

The following tools must be installed on the machine that you will use to run setup.sh—not on the Kubernetes cluster itself.

Tool | Min version | Mac | Linux
kubectl | 1.26 | brew install kubectl | snap install kubectl --classic or binary
helm | 3.12 | brew install helm | curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helmfile | 0.162 | brew install helmfile | binary from GitHub releases
helm-diff plugin | any | helm plugin install https://github.com/databus23/helm-diff | same
jq | 1.6 | brew install jq | apt install jq / yum install jq
ssh-keygen | any | built-in | built-in

The helmfile tool requires the helm-diff plugin. Install it as follows:

$ helm plugin install https://github.com/databus23/helm-diff

Step 3 — Configure the Site

Everything in this step must be done before running setup.sh. Skipping any item will either cause setup to fail or result in a deployment with incorrect site configuration that is hard to fix after the fact.

3a. Set Required Environment Variables

$ export KUBECONFIG=/path/to/kubeconfig # your cluster kubeconfig
$ export REGISTRY_PULL_SECRET=<pull-secret-or-api-key> # your registry pull credential
$ export NCX_IMAGE_REGISTRY=my-registry.example.com/ncx # base registry for all NCX images
$ export NCX_CORE_IMAGE_TAG=<ncx-core-image-tag> # e.g. v2025.12.30-rc1
$ export NCX_REST_IMAGE_TAG=<ncx-rest-image-tag> # e.g. v1.0.4

NCX_IMAGE_REGISTRY is used for both NCX Core (<registry>/nvmetal-carbide) and NCX REST (<registry>/carbide-rest-*). Push all images to this registry before running setup.
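As an illustration of how the built images line up with this registry layout (a sketch only; the local image names and tags depend on your Step 1 build, and docker is assumed as the push client):

$ docker tag nvmetal-carbide:$NCX_CORE_IMAGE_TAG "$NCX_IMAGE_REGISTRY/nvmetal-carbide:$NCX_CORE_IMAGE_TAG"
$ docker push "$NCX_IMAGE_REGISTRY/nvmetal-carbide:$NCX_CORE_IMAGE_TAG"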

Obtain an NGC API key at ngc.nvidia.com under API Keys → Generate Personal Key.

Variable | Required | Description
REGISTRY_PULL_SECRET | Yes | Pull secret and API key for your image registry. Used to create the image pull secret for both Infra Controller Core and Infra Controller REST.
NCX_IMAGE_REGISTRY | Yes | Base image registry for all Infra Controller images (e.g. my-registry.example.com/ncx). Used for Infra Controller Core (<registry>/nvmetal-carbide) and Infra Controller REST (<registry>/carbide-rest-*).
NCX_CORE_IMAGE_TAG | Yes | Infra Controller Core (infra-controller-core) image tag (e.g. v2025.12.30).
NCX_REST_IMAGE_TAG | Yes | Infra Controller REST (infra-controller-rest) image tag (e.g. v1.0.4).
KUBECONFIG | Yes | Path to your cluster kubeconfig.
NCX_REPO | No | Path to a local clone of infra-controller-rest. Auto-detected from sibling directories; preflight.sh offers to clone it if not found.
NCX_SITE_UUID | No | Stable UUID for this site. Defaults to a1b2c3d4-e5f6-4000-8000-000000000001.

3b. Set your Site Name

Open helm-prereqs/values.yaml and change siteName from the placeholder to your actual site identifier:

siteName: "mysite" # ← replace the "TMP_SITE" placeholder with your site name (e.g. "examplesite", "prod-us-east")

This value is injected into every postgres pod as the TMP_SITE environment variable. It must match the sitename in the NCX Core siteConfig block below.

To tune PostgreSQL resources for your node capacity (the defaults are conservative for dev), edit the following values:

postgresql:
  instances: 3
  volumeSize: "10Gi"
  resources:
    limits:
      cpu: "4"
      memory: "4Gi"
    requests:
      cpu: "500m"
      memory: "1Gi"

3c. Configure NCX Core Site Deployment

Open helm-prereqs/values/ncx-core.yaml and update the following values:

  • API hostname: The external DNS name for the Infra Controller Core API:

    carbide-api:
      hostname: "carbide.mysite.example.com" # ← must resolve to your cluster's ingress/LB
  • siteConfig TOML block: The site identity, network topology, and resource pools. These fields are most likely to differ per site:

    Field | What to set
    sitename | Short identifier matching siteName in values.yaml
    initial_domain_name | Base DNS domain for the site (e.g. mysite.example.com)
    dhcp_servers | List of DHCP server IPs reachable from bare-metal hosts, or []
    site_fabric_prefixes | CIDRs that are part of the site fabric (instance-to-instance traffic)
    deny_prefixes | CIDRs instances must not reach (OOB, control plane, management)
    [pools.lo-ip] ranges | Loopback IP range allocated to bare-metal hosts
    [pools.vlan-id] ranges | VLAN ID allocation range
    [pools.vni] ranges | VXLAN Network Identifier range
    [networks.admin] | Admin network CIDR, gateway, and MTU
    [networks.<underlay>] | Underlay data-plane network(s) — one block per L3 segment

All fields are documented with inline comments in the file; an illustrative sketch of the block also follows the notes below.

  • Required fields (do not leave empty): the prefix and gateway under [networks.admin] must be set to real values; carbide-api crashes at startup with a parse error if these are empty strings. Similarly, the [pools.lo-ip], [pools.vlan-id], and [pools.vni] ranges must be non-empty.

    These fields are safe to leave as empty arrays: dhcp_servers, site_fabric_prefixes, deny_prefixes. Do not delete any field from the TOML block; missing keys cause a different crash than empty ones.
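For orientation only, a minimal sketch of what a siteConfig block along these lines might look like, using the fields from the table above; the inline comments in the shipped file are authoritative for exact key names, range syntax, and values:

sitename            = "mysite"            # must match siteName in values.yaml
initial_domain_name = "mysite.example.com"
dhcp_servers         = []                 # safe to leave empty
site_fabric_prefixes = []                 # safe to leave empty
deny_prefixes        = []                 # safe to leave empty

[pools.lo-ip]
ranges = ["10.180.200.1-10.180.200.254"]  # illustrative range syntax

[pools.vlan-id]
ranges = ["100-199"]                      # illustrative

[pools.vni]
ranges = ["10000-19999"]                  # illustrative

[networks.admin]
prefix  = "10.180.126.0/24"               # required: real CIDR, not ""
gateway = "10.180.126.1"                  # required: real gateway, not ""
mtu     = 1500

[networks.underlay-1]
prefix  = "10.180.10.0/24"
gateway = "10.180.10.1"
mtu     = 9000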

3d. Get the NCX REST Repository

NCX REST (infra-controller-rest) is a separate repository that contains the Helm chart, kustomize bases, and helper scripts that setup.sh uses for Phase 7. It is not bundled inside this repo—you need a local clone before running setup.

Option 1: Let setup.sh handle it automatically (recommended)

setup.sh looks for the repo in these locations in order:

  1. NCX_REPO env var (explicit path—use this if you cloned it somewhere non-standard)
  2. Sibling directories next to this repo: ../carbide-rest, ../ncx-infra-controller-rest, ../ncx
  3. If not found anywhere, preflight.sh offers to clone it for you before setup proceeds

If you place the clone next to this repo (the recommended layout), no env var is needed:

your-workspace/
ncx-infra-controller-core/ ← this repo
ncx-infra-controller-rest/ ← NCX REST repo (clone here)

Option 2: Clone it manually

Use the following commands to clone the repository:

$ git clone https://github.com/NVIDIA/infra-controller-rest.git
$ # Then either place it as a sibling, or:
$ export NCX_REPO=/path/to/infra-controller-rest

3e. Configure NCX REST Authentication

The default configuration uses the dev Keycloak instance that setup.sh deploys automatically. No changes are needed if you’re running a dev/test environment.

For production, or if you are using your own IdP, edit the helm-prereqs/values/ncx-rest.yaml file as follows:

Option 1: Use your own Keycloak or OIDC-compatible IdP

carbide-rest-api:
  config:
    keycloak:
      enabled: true
      baseURL: "https://keycloak.mysite.example.com"
      externalBaseURL: "https://keycloak.mysite.example.com"
      realm: "your-realm"
      clientID: "carbide-api"

Option 2: Disable Keycloak and use a generic OIDC issuer

carbide-rest-api:
  config:
    keycloak:
      enabled: false
    issuers:
      - issuer: "https://your-oidc-provider.example.com"
        audience: "carbide-api"

When keycloak.enabled: false, the Keycloak deployment is still created by setup.sh, but carbide-rest-api will not use it for token validation.

3f. Review site-agent Config

The defaults in helm-prereqs/values/ncx-site-agent.yaml should match the dev postgres instance deployed by setup.sh.

DB_USER and DB_PASSWORD are injected at runtime from the db-creds Kubernetes Secret (created by the carbide-rest-common sub-chart during Phase 7g). The Secret is referenced via secrets.dbCreds in the site-agent values.
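To confirm the Secret exists and carries both keys once Phase 7g has run, a quick check can be used (the carbide-rest namespace is assumed here, matching the verification commands later in this guide):

$ kubectl get secret db-creds -n carbide-rest -o json | jq '.data | keys' # expect ["DB_PASSWORD","DB_USER"]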

For production or a different database, override the Secret name and connection config:

secrets:
  dbCreds: my-site-agent-db-secret # Secret must have DB_USER and DB_PASSWORD keys

envConfig:
  DB_DATABASE: "my-database"
  DB_ADDR: "my-postgres.my-namespace.svc.cluster.local"

3g. Configure MetalLB

MetalLB provides LoadBalancer IPs for NCX Core services (carbide-api, DHCP, DNS, PXE, SSH console). Without it, those services stay in <pending> state and the site is unreachable.

NTP note: NICo does not run a standalone NTP service. Instead, NTP server addresses are provided to managed hosts via DHCP option 42—configured in the carbide-dhcp chart Kea hook parameters (carbide-ntpserver). Point this to your enterprise NTP servers.

Edit helm-prereqs/values/metallb-config.yaml—this file ships pre-populated with example values. Replace all values labeled # EXAMPLE with your site-specific configuration before running setup.sh.

Field | Example value in file | What to put for your site
IPAddressPool.spec.addresses (internal) | 10.180.126.160/28 | Your internal VIP CIDR
IPAddressPool.spec.addresses (external) | 10.180.126.176/28 | Your external VIP CIDR
BGPPeer.spec.myASN | 4244766850 | Your cluster-side ASN (same for all nodes)
BGPPeer.spec.peerASN | 4244766851/852/853 | TOR ASN per node (unique per node)
BGPPeer.spec.peerAddress | 10.180.248.80/82/84 | TOR switch IP reachable from each node
BGPPeer.spec.nodeSelectors hostnames | rno1-m04-d04-cpu-{1,2,3} | Your actual node hostnames (kubectl get nodes)

Add or remove BGPPeer blocks to match your node count, with one block per worker node.

If your environment does not use BGP (local dev, flat network), comment out the BGPPeer and BGPAdvertisement sections and uncomment the L2Advertisement section at the bottom of the file.
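For reference, the resources in that file look roughly like the sketch below, built from the example values above (the shipped metallb-config.yaml remains the source of truth for resource names and layout):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: internal-pool            # EXAMPLE name
  namespace: metallb-system
spec:
  addresses:
    - 10.180.126.160/28          # EXAMPLE: your internal VIP CIDR
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: tor-node-1               # one BGPPeer block per worker node
  namespace: metallb-system
spec:
  myASN: 4244766850              # EXAMPLE: cluster-side ASN, same in every block
  peerASN: 4244766851            # EXAMPLE: TOR ASN for this node
  peerAddress: 10.180.248.80     # EXAMPLE: TOR switch IP reachable from this node
  nodeSelectors:
    - matchLabels:
        kubernetes.io/hostname: rno1-m04-d04-cpu-1 # EXAMPLE hostname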

3h. Assign Service VIPs

Each NCX Core service that exposes a LoadBalancer needs a specific, stable IP from your MetalLB pool. Without explicit assignments, MetalLB picks IPs randomly on each install, which means your DHCP relay, DNS records, PXE config, and API hostname cannot be pre-configured and will break on redeploy.

Open helm-prereqs/values/ncx-core.yaml and update the VIP for each service (a sketch follows the table and notes below):

Service | Values key | Pool to use
carbide-api external API | carbide-api.externalService.annotations | External (client-facing)
carbide-dhcp | carbide-dhcp.externalService.annotations | Internal (cluster-facing)
carbide-dns instance-0 | carbide-dns.externalService.perPodAnnotations[0] | Internal or External
carbide-dns instance-1 | carbide-dns.externalService.perPodAnnotations[1] | Internal or External
carbide-pxe | carbide-pxe.externalService.annotations | Internal (cluster-facing)
carbide-ssh-console-rs | carbide-ssh-console-rs.externalService.annotations | Internal (cluster-facing)

All IPs must be within the IPAddressPool ranges you defined in values/metallb-config.yaml and must be unique across services.

  • carbide-dhcp Note: externalService.enabled: true must be set explicitly; it defaults to false in the chart.
  • carbide-dns Note: Use perPodAnnotations (a list) rather than annotations because each replica gets its own VIP.
  • carbide-api IP and DNS Note: The carbide-api VIP must resolve in external DNS to the hostname you set in Step 3c.
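As a sketch of how a static VIP can be pinned in values/ncx-core.yaml (the metallb.universe.tf/loadBalancerIPs annotation is MetalLB's standard mechanism for requesting a specific address; confirm the exact annotation key and value layout against the shipped values file):

carbide-api:
  externalService:
    annotations:
      metallb.universe.tf/loadBalancerIPs: "10.180.126.177" # EXAMPLE: from the external pool

carbide-dhcp:
  externalService:
    enabled: true # must be set explicitly; defaults to false
    annotations:
      metallb.universe.tf/loadBalancerIPs: "10.180.126.161" # EXAMPLE: from the internal pool

carbide-dns:
  externalService:
    perPodAnnotations: # one entry per DNS replica
      - metallb.universe.tf/loadBalancerIPs: "10.180.126.162"
      - metallb.universe.tf/loadBalancerIPs: "10.180.126.163"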

3i. (Optional) Set a Stable Site UUID

If you want a specific site UUID instead of the default placeholder, set the NCX_SITE_UUID environment variable:

$ export NCX_SITE_UUID=<your-uuid> # must be a valid UUID v4

This UUID is used as the Temporal namespace for the site and as the CLUSTER_ID passed to the site-agent. Once set and deployed, changing it requires redeploying the site-agent and re-registering the site.
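If you do not already have a UUID for the site, one way to generate a value is with uuidgen, assuming it is available on your machine:

$ export NCX_SITE_UUID=$(uuidgen | tr '[:upper:]' '[:lower:]') # lower-cased for consistency; verify the result is a v4 UUID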

3j. Validate Configuration

Run the pre-flight check to catch issues before deployment:

$ cd helm-prereqs/
$ source ./preflight.sh

The preflight.sh script is also run automatically at the start of every setup.sh invocation.

The preflight.sh script checks the following:

Category | Checks
Environment variables | Required vars are set; no https:// prefix on registry; version tags start with v; UUID is valid if set; KUBECONFIG path exists if set
Required tools | helm, helmfile, kubectl, jq, ssh-keygen are in PATH
values/metallb-config.yaml | File exists; YAML is valid; at least one IPAddressPool defined; exactly one advertisement mode active (BGP or L2, not both); example placeholder hostnames not still present
Cluster reachability | kubectl can reach the API server
Node resources | At least three schedulable nodes
Per-node: kernel parameters | net.bridge.bridge-nf-call-iptables=1 and net.ipv4.ip_forward=1 on every node
Per-node: DNS | kubernetes.default.svc.cluster.local resolves on every node
Registry connectivity | The registry host responds to an HTTPS probe
NCX REST repo | Resolves the repo from NCX_REPO env var or sibling directories, or offers to clone from GitHub

For air-gapped clusters, the per-node checks pull busybox:1.36 by default. If your cluster cannot reach Docker Hub, set PREFLIGHT_CHECK_IMAGE to a local mirror:

$ export PREFLIGHT_CHECK_IMAGE=my-registry.example.com/busybox:1.36

Step 4 — Run the Setup Script

Run the setup.sh script as follows:

$ cd helm-prereqs/
$ ./setup.sh # interactive — prompts before deploying NCX Core and NCX REST
$ ./setup.sh -y # non-interactive — deploys everything

The setup.sh script installs all prerequisites and NICo components in sequential phases:

Phase | What it installs
0 | DNS check (NodeLocal DNSCache or CoreDNS)
1 | local-path-provisioner + StorageClasses
1b | postgres-operator (Zalando)
1c | MetalLB + site BGP/L2 config
2 | cert-manager + Vault TLS bootstrap (PKI chain)
3 | HashiCorp Vault (3-node HA Raft)
4 | Vault init + unseal + SSH host key
5 | external-secrets + carbide-prereqs + forge-pg-cluster
6 | NCX Core (carbide helm release)
7a-7h | NCX REST full stack (postgres, Keycloak, Temporal, carbide-rest, site-agent)

The following components are deployed:

local-path-provisioner (raw manifest - StorageClasses for Vault + PostgreSQL PVCs)
metallb (metallb/metallb 0.14.5 - LoadBalancer IPs via BGP or L2)
postgres-operator (zalando/postgres-operator 1.10.1 - manages forge-pg-cluster)
cert-manager (jetstack/cert-manager v1.17.1)
vault (hashicorp/vault 0.25.0, 3-node HA Raft, TLS)
external-secrets (external-secrets/external-secrets 0.14.3)
carbide-prereqs (this Helm chart - forge-system namespace)
NCX Core (../helm - ncx-core.yaml values)
NCX REST (ncx-infra-controller-rest/helm/charts/carbide-rest)
├── carbide-rest-ca-issuer ClusterIssuer (cert-manager.io)
├── postgres StatefulSet (temporal + keycloak + NCX databases)
├── keycloak (dev OIDC IdP, carbide-dev realm)
├── temporal (temporal-helm/temporal, mTLS)
├── carbide-rest (API, cert-manager, workflow, site-manager)
└── carbide-rest-site-agent (StatefulSet, bootstrap via site-manager)

For manual phase-by-phase installation, re-running individual phases, or debugging failures, refer to the Reference Installation guide.

Step 5 — Verify the Site Controller

Before ingesting hosts, verify that all site controller components are healthy.

Check That All Pods Are Running

$ kubectl get pods -n forge-system # NCX Core
$ kubectl get pods -n carbide-rest # NCX REST
$ kubectl get pods -n temporal # Temporal

Verify That the Site-agent Is Connected

$ kubectl logs -n carbide-rest -l app.kubernetes.io/name=carbide-rest-site-agent --prefix \
> | grep "CarbideClient"

Look for the “successfully connected to server” message in the logs.

Verify That the LoadBalancer IPs Are Assigned

$ kubectl get svc -n forge-system | grep LoadBalancer

All LoadBalancer services should have an external IP from your IPAddressPool ranges. If any show <pending>, MetalLB has not assigned an IP. Check BGP session status on your TOR switches and verify values/metallb-config.yaml has correct peer addresses.
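The MetalLB speaker logs usually show why an address is not being announced; the label selector below assumes the labels applied by the upstream MetalLB chart:

$ kubectl logs -n metallb-system -l app.kubernetes.io/component=speaker --tail=50
$ kubectl get ipaddresspool,bgppeer -n metallb-system # confirm pools and peers were actually created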

Verify That DHCP and PXE Are Serving

$ kubectl get svc carbide-dhcp carbide-pxe -n forge-system

Both external IPs should be within your internal VIP pool range.

Acquire a Keycloak Access Token

This section only applies if keycloak.enabled: true in values/ncx-rest.yaml (the default). If you disabled the bundled Keycloak and pointed carbide-rest-api at your own IdP, obtain tokens from that IdP instead.

The setup.sh script deploys a dev Keycloak instance with a carbide realm pre-loaded with the ncx-service client (M2M / client_credentials).

Value | Setting
Token endpoint | http://keycloak.carbide-rest:8082/realms/carbide/protocol/openid-connect/token
grant_type | client_credentials
client_id | ncx-service
client_secret | carbide-local-secret

Fetch tokens from inside the cluster only. Do not port-forward Keycloak and request tokens against localhost. The resulting JWT iss claim will not match what carbide-rest-api expects, and the token will be rejected.

Use the helper script, which runs curl from a throw-away in-cluster pod:

$ TOKEN=$(helm-prereqs/keycloak/get-token.sh)

Verify the token against carbide-rest-api:

$ kubectl run -i --rm --restart=Never --image=curlimages/curl curl-test \
> -n carbide-rest --quiet -- \
> -sf http://carbide-rest-api.carbide-rest:8388/v2/org/ncx/carbide/user/current \
> -H "Authorization: Bearer $TOKEN"

Set up carbidecli and Create your First Site

NICo has two CLIs that serve different purposes:

CLI | Communicates with | Used for
carbidecli | NICo REST (REST API) | Site management, org bootstrap, instance operations
carbide-admin-cli | NICo Core (gRPC API) | Host ingestion, credentials, expected machines, TPM approval

carbidecli is built from the NCX REST repo. carbide-admin-cli is built from the NCX Core repo (crates/admin-cli).

1. Build and Install the CLI

$cd "$NCX_REPO"
$make carbide-cli # installs to $(go env GOPATH)/bin/carbidecli

2. Generate the Default Config File

$ carbidecli init # writes ~/.carbide/config.yaml

3. Port-forward carbide-rest-api to localhost

$ kubectl port-forward -n carbide-rest svc/carbide-rest-api 8388:8388

4. Edit ~/.carbide/config.yaml

api:
  base: http://localhost:8388
  org: ncx
  name: carbide
auth:
  token: <paste value of $TOKEN here>

5. Bootstrap the Org (Required One-Time Call)

This GET endpoint lazily initializes the org on first call as follows:

  1. Checks if service account is enabled in the auth config
  2. Creates an InfrastructureProvider for the org if one doesn’t exist
  3. Creates a Tenant with targeted instance creation enabled if one doesn’t exist
  4. Creates a TenantAccount linking the provider and tenant if one doesn’t exist
  5. Returns the service account status with the provider and tenant IDs

Without this call, site create returns 404. Subsequent calls are read-only.

$ TOKEN=$(helm-prereqs/keycloak/get-token.sh)
$
$ curl -sS -H "Authorization: Bearer $TOKEN" \
> http://localhost:8388/v2/org/ncx/carbide/service-account/current \
> | python3 -m json.tool

6. Create your First Site

$ carbidecli site create --name mysite --description 'first site'
$ carbidecli site list

Overall Health Check

Run the following commands to verify that all components are healthy:

$ kubectl get clusterissuer
$ kubectl get clustersecretstore
$ kubectl get pods -n metallb-system
$ kubectl get ipaddresspool,bgppeer -n metallb-system
$ kubectl get pods -n postgres
$ kubectl get pods -n forge-system
$ kubectl get jobs -n forge-system
$ kubectl get secret forge-roots -n forge-system
$ kubectl get secret forge-system.carbide.forge-pg-cluster.credentials -n forge-system
$ kubectl get pods -n carbide-rest
$ kubectl get pods -n temporal
$ kubectl get certificate core-grpc-client-site-agent-certs -n carbide-rest

For troubleshooting common issues, refer to the Reference Installation — Troubleshooting guide.

Step 6 — Connect the OOB Network

Configure the out-of-band network to relay BMC DHCP requests to the NICo DHCP service.

  1. Configure the DHCP relay on your OOB switches to forward DHCP requests to the carbide-dhcp LoadBalancer VIP (assigned in Step 3h).

  2. Verify DHCP requests are reaching NICo by checking the DHCP service logs:

    $ kubectl logs -n forge-system -l app.kubernetes.io/name=carbide-dhcp --tail=20

For detailed OOB network requirements, refer to the BMC and Out-of-Band Setup guide.

Step 7 — Discover Your First Host

This step uses carbide-admin-cli, the gRPC CLI for NICo Core. Build it from the NCX Core repo:

$ cd ncx-infra-controller-core/
$ cargo build --release -p carbide-admin-cli
$ # Binary: target/release/carbide-admin-cli

Alternatively, use the containerized version bundled in the carbide-api pod (available at /opt/carbide/forge-admin-cli inside the container).

The <api-url> in the commands below is the NICo Core gRPC API endpoint. This is the carbide-api hostname configured in Step 3c, not the REST API used in Step 5. The format is typically https://api-<ENVIRONMENT_NAME>.<SITE_DOMAIN_NAME>. You can also retrieve it from the LoadBalancer VIP:

$ kubectl get svc carbide-api -n forge-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

Set Site-wide Credentials

Configure the credentials NICo will apply to BMCs and UEFI after ingestion:

$ carbide-admin-cli -c <api-url> credential add-bmc --kind=site-wide-root --password='<password>'
$ carbide-admin-cli -c <api-url> host generate-host-uefi-password
$ carbide-admin-cli -c <api-url> credential add-uefi --kind=host --password='<password>'

Upload the Expected Machines Manifest

Prepare an expected_machines.json with the BMC MAC address, factory default credentials, and chassis serial number for each host:

{
  "expected_machines": [
    {
      "bmc_mac_address": "C4:5A:B1:C8:38:0D",
      "bmc_username": "root",
      "bmc_password": "default-password",
      "chassis_serial_number": "SERIAL-1"
    }
  ]
}

Upload the manifest:

$ carbide-admin-cli -c <api-url> em replace-all --filename expected_machines.json

Approve the Host for Ingestion

NICo uses Measured Boot with TPM v2.0 to enforce cryptographic identity:

$ carbide-admin-cli -c <api-url> mb site trusted-machine approve \* persist --pcr-registers="0,3,5,6"

NICo will now discover the host via Redfish, pair it with its DPU(s), provision the DPU, and bring the host to a ready state. For more details, refer to the Ingesting Hosts guide.

Monitor Host Discovery

$ kubectl logs -n forge-system -l app.kubernetes.io/name=carbide-api --tail=50 \
> | grep -i "site explorer\|bmc\|discovery"

Teardown

To perform teardown, run the following commands:

$ cd helm-prereqs/
$ ./clean.sh

This removes NCX REST, NCX Core, all helmfile releases, cluster-scoped resources, namespaces, and released PersistentVolumes. For details on what clean.sh does and the removal order, refer to the Reference Installation guide.