Diagnostic Tools


Use this page as a command reference while investigating a stuck object or site operation incident.

CLI Setup

nico-admin-cli is the primary operator CLI for NICo site state. Build it from the source tree; the Cargo package is named carbide-admin-cli:

$ cargo build -p carbide-admin-cli

Common connection options:

| Option | Meaning |
| --- | --- |
| -c <url> | NICo API gRPC endpoint. |
| -f json | JSON output for scripting. |
| API_URL | Environment variable for the API URL. |
| https_proxy=socks5://... | SOCKS5 proxy when reaching the site from off-site. |
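
The options above can be combined in a small shell wrapper so that every call made during an incident uses the same endpoint and proxy. This is a minimal sketch, not part of the CLI: the function name nico and the variables NICO_CLI, NICO_API_URL, and NICO_PROXY are hypothetical names chosen for illustration.

```shell
# Sketch: run nico-admin-cli with a fixed endpoint and an optional SOCKS5 proxy.
# NICO_CLI, NICO_API_URL, and NICO_PROXY are hypothetical variables for this example.
nico() {
  cli="${NICO_CLI:-nico-admin-cli}"
  if [ -z "${NICO_API_URL:-}" ]; then
    echo "NICO_API_URL is not set" >&2
    return 1
  fi
  if [ -n "${NICO_PROXY:-}" ]; then
    # Pass https_proxy to this one CLI invocation.
    https_proxy="$NICO_PROXY" "$cli" -c "$NICO_API_URL" "$@"
  else
    "$cli" -c "$NICO_API_URL" "$@"
  fi
}
```

With NICO_API_URL exported, running nico -f json machine show <machine-id> expands to nico-admin-cli -c <api-url> -f json machine show <machine-id>.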

Common Commands

| Need | Command |
| --- | --- |
| API version or reachability | nico-admin-cli version, nico-admin-cli ping |
| All managed hosts | nico-admin-cli managed-host show --all |
| One managed host | nico-admin-cli managed-host show <host-machine-id> |
| Machine event history | nico-admin-cli -f json machine show <machine-id> |
| Debug bundle | nico-admin-cli managed-host debug-bundle <machine-id> --start-time <time> |
| Maintenance mode | nico-admin-cli managed-host maintenance on --host <host-machine-id> --reference "INC-123" |
| Health reports | nico-admin-cli machine health-report show <machine-id> |
| Site Explorer reports | nico-admin-cli site-explorer get-report all |
| Redfish browse | nico-admin-cli redfish browse --address <bmc-ip> <uri> |
| Network segments | nico-admin-cli network-segment show |
| InfiniBand partitions | nico-admin-cli ib-partition show |
| NVLink partitions | nico-admin-cli nvl-partition show |
| Compute allocation | nico-admin-cli compute-allocation show |

Query State History

$ nico-admin-cli -c <api-url> -f json machine show <machine-id>

Use this to inspect state transitions, timestamps, and handler outcomes.

Query Health

Aggregate state:

$ nico-admin-cli managed-host show <host-machine-id>

Per-source health reports:

$ nico-admin-cli machine health-report show <machine-id>

JSON output:

$ nico-admin-cli -f json machine health-report show <machine-id>
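
During an incident it can help to keep the JSON output of the state history and health queries as timestamped files, so that later changes can be compared against the first look. The following is a sketch under stated assumptions: snapshot_machine and NICO_CLI are hypothetical names, the file naming is arbitrary, and the endpoint is assumed to be supplied through the API_URL environment variable.

```shell
# Sketch: save state history and health reports for one machine as JSON files.
# snapshot_machine and NICO_CLI are hypothetical names; the subcommands are the ones shown above.
snapshot_machine() {
  machine_id="$1"
  out_dir="${2:-.}"
  cli="${NICO_CLI:-nico-admin-cli}"
  # UTC timestamp keeps snapshots from different operators sortable.
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$out_dir" || return 1
  "$cli" -f json machine show "$machine_id" \
    > "$out_dir/$machine_id-machine-$stamp.json" || return 1
  "$cli" -f json machine health-report show "$machine_id" \
    > "$out_dir/$machine_id-health-$stamp.json" || return 1
  # Print the directory so callers can list or archive the files.
  echo "$out_dir"
}
```

For example, snapshot_machine <machine-id> ./INC-123 writes two files into ./INC-123.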

Add or Remove Health Overrides

Mark a host healthy for allocation when an alert is a false positive:

$ nico-admin-cli machine health-override add <machine-id> \
>   --template mark-healthy \
>   --message "false positive INC-123"

Hold a host out of allocation:

$ nico-admin-cli machine health-override add <machine-id> \
>   --template out-for-repair \
>   --message "INC-123"

Remove an override:

$ nico-admin-cli machine health-override remove <machine-id> <source-name>
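
Because an override changes allocation behavior, it is worth making the incident reference mandatory. The sketch below wraps the out-for-repair command shown above and refuses to run without a reference. The names hold_for_repair and NICO_CLI are hypothetical, and the INC-<number> pattern is an assumed convention taken from the example messages on this page.

```shell
# Sketch: add an out-for-repair override, refusing to run without an incident reference.
# hold_for_repair and NICO_CLI are hypothetical; INC-<number> is an assumed naming convention.
hold_for_repair() {
  machine_id="$1"
  incident="$2"
  cli="${NICO_CLI:-nico-admin-cli}"
  case "$incident" in
    INC-[0-9]*) ;;  # looks like an incident reference, continue
    *) echo "usage: hold_for_repair <machine-id> INC-<number>" >&2; return 2 ;;
  esac
  "$cli" machine health-override add "$machine_id" \
    --template out-for-repair \
    --message "$incident"
}
```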

Kubernetes Logs

Namespace names vary by site and deployment generation. Confirm the namespace before copying commands.

$ kubectl get ns
$ kubectl -n <nico-namespace> get pods
$ kubectl -n <nico-namespace> logs deploy/nico-api --tail=500 | grep <machine-id>
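
The log search above can be wrapped so the machine ID is matched literally and an empty result is reported instead of silently printing nothing. This is a sketch: machine_logs is a hypothetical helper, the default window of 1h is arbitrary, and deploy/nico-api is the deployment name used in the example above.

```shell
# Sketch: print nico-api log lines that mention one machine ID.
# machine_logs is a hypothetical helper; the 1h default window is an arbitrary choice.
machine_logs() {
  ns="$1"
  machine_id="$2"
  since="${3:-1h}"
  # grep -F treats the ID as a literal string; -- protects IDs that start with a dash.
  kubectl -n "$ns" logs deploy/nico-api --since="$since" --timestamps \
    | grep -F -- "$machine_id" \
    || echo "no lines mention $machine_id in the last $since" >&2
}
```

Note that kubectl logs deploy/<name> reads from a single pod of the deployment, so a deployment with several replicas may need the same search repeated per pod.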

Common log sources:

| Component | What to look for |
| --- | --- |
| nico-api | State transitions, Redfish errors, Vault failures, health reports, gRPC errors. |
| nico-dhcp | DHCP lease and discovery issues. |
| nico-pxe | PXE and HTTP boot artifact requests. |
| Site Explorer | BMC endpoint discovery and scrape failures. |
| DPF operator | DPU provisioning custom resources and operator status. |
| nico-dpu-agent | DPU heartbeat, BGP, HBN, DHCP relay, and applied network config. |

Loki and Grafana

Use a debug bundle when possible:

$ GRAFANA_AUTH_TOKEN=<token> \
>   nico-admin-cli managed-host debug-bundle <machine-id> \
>   --start-time <time> \
>   --grafana-url https://<grafana-host>

Use logcli directly when a bundle is not enough. The example assumes Loki is reachable at localhost:3100, for example through a port-forward:

$ logcli --addr=http://localhost:3100 \
>   --org-id=<org-id> \
>   query \
>   --timezone=UTC \
>   --from="<YYYY-MM-DDTHH:MM:SSZ>" \
>   --to="<YYYY-MM-DDTHH:MM:SSZ>" \
>   --limit 0 \
>   --forward \
>   '{k8s_container_name="<container-name>"}'
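
Typing the --from and --to timestamps by hand is error-prone. The sketch below prints both values in the UTC format used above for a window ending now. It assumes GNU date (the -d option is not available in BSD date), and loki_window is a hypothetical helper name.

```shell
# Sketch: print UTC --from/--to values covering the last N minutes (default 30).
# Assumes GNU date for the -d option; loki_window is a hypothetical helper.
loki_window() {
  minutes="${1:-30}"
  from="$(date -u -d "$minutes minutes ago" +%Y-%m-%dT%H:%M:%SZ)" || return 1
  to="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf '%s\n' "--from=$from --to=$to"
}
```

The output can be substituted unquoted into the query above in place of the two timestamp options, for example logcli --addr=http://localhost:3100 --org-id=<org-id> query --timezone=UTC $(loki_window 60) --limit 0 --forward '{k8s_container_name="<container-name>"}'.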

On-Metal Host and DPU Logs

| Location | Use |
| --- | --- |
| /var/log/nico/nico-scout.log | Host discovery scout during ingestion. |
| journalctl -u nico-dpu-agent | DPU agent heartbeat, network config, BGP, HBN, and service health. |
| DPU BMC or rshim console | Use when SSH to the DPU fails. |

Metrics

Metric names may retain the historical carbide_* prefix even when the service name is now NICo.

| Metric | Use |
| --- | --- |
| carbide_machines_per_state | Count hosts by state. |
| carbide_machines_time_in_state_seconds | Average time in each state. |
| carbide_machines_per_state_above_sla | Hosts past state SLA. |
| carbide_hosts_health_status_count | Healthy vs alerting hosts. |
| carbide_hosts_health_overrides_count | Active overrides. |
| carbide_dpus_up_count / carbide_dpus_healthy_count | DPU agent presence and health. |
| carbide_endpoint_exploration_* | BMC discovery health. |
| carbide_available_ips_count | DHCP or IP pool pressure. |
| carbide_gpus_usable_count | GPU capacity for allocation. |
| carbide_api_vault_requests_failed_total | Credential pipeline failures. |
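
If these metrics are scraped into a Prometheus-compatible server, they can be queried from the command line through its HTTP API. This page does not say where the metrics are stored, so treat the following as a sketch: PROM_URL and prom_query are hypothetical names, and the query uses only a metric name from the table above with no assumptions about labels.

```shell
# Sketch: run an instant query against a Prometheus-compatible HTTP API.
# PROM_URL and prom_query are hypothetical; the metrics backend is an assumption.
prom_query() {
  if [ -z "${PROM_URL:-}" ]; then
    echo "PROM_URL is not set" >&2
    return 1
  fi
  # -G sends the URL-encoded query as GET parameters.
  curl -sS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}
```

For example, prom_query 'carbide_machines_per_state_above_sla > 0' would return only the series that currently report hosts past their state SLA.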