AIStore Observability: Prometheus

AIStore (AIS) exposes metrics in Prometheus format via HTTP endpoints.

This integration enables comprehensive monitoring of AIS clusters, performance tracking, capacity planning, and long-term trend analysis.

Overview

AIS tracks a comprehensive set of performance metrics including:

  • Operation counters (GET/PUT/DELETE/etc.)
  • Resource utilization (CPU, memory, disk)
  • Latencies and throughput
  • Network and peer-to-peer streaming statistics
  • Extended actions (xactions)
  • Error counters and node state

AIS supports observability through several complementary tools:

  • Node logs (fine-grained operational events)

  • CLI for interactive monitoring (e.g., ais show cluster stats)

  • Monitoring backends:

    • Prometheus (recommended)
    • Grafana for dashboards & alerting

For load testing and benchmarking metrics, see AIS Load Generator and How To Benchmark AIStore.

A complete catalog of AIS metrics is available at: Monitoring Metrics Reference


Monitoring Stack

Typical Prometheus deployment:

┌────────────────┐       ┌────────────────┐
│                │ scrape│                │
│   Prometheus   │◄──────┤  AIStore Node  │
│                │       │    /metrics    │
└───────┬────────┘       └────────────────┘
        │
        │ query
┌───────┴────────┐
│                │
│    Grafana     │
│                │
└────────────────┘

This stack provides:

  • Direct metric collection from AIS nodes
  • Centralized metric retention
  • Grafana visualization & alerting
  • Long-term performance & cost analysis

Prometheus Integration

Native Exporter

AIS acts as a first-class Prometheus exporter. Every node automatically:

  1. Registers all metrics at startup
  2. Exposes /metrics for Prometheus to scrape
  3. Uses Prometheus native formatting and metadata
  4. Works with both HTTP and HTTPS clusters

No configuration is required to “enable” Prometheus — it is always on.

Internal AIS metric names (put.size, get.ns, etc.) are exported using the following naming convention:

ais_target_<metric_name>{node_id="T1"} <value>

This document primarily uses the exported Prometheus names.


Viewing Raw Metrics

View metrics directly:

$ curl http://<node>:<port>/metrics
# or
$ curl https://<node>:<port>/metrics
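The same request can be made from a script. The following is a minimal sketch using only the Python standard library; the helper names `metrics_url` and `fetch_metrics` are hypothetical, and the host and port passed to them are placeholders for a real node address.

```python
from urllib.request import urlopen


def metrics_url(host: str, port: int, https: bool = False) -> str:
    """Build the /metrics URL for one AIS node (hypothetical helper)."""
    scheme = "https" if https else "http"
    return f"{scheme}://{host}:{port}/metrics"


def fetch_metrics(host: str, port: int, https: bool = False,
                  timeout: float = 5.0) -> str:
    """Return the raw exposition text; raises on network or HTTP errors."""
    with urlopen(metrics_url(host, port, https), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (needs a reachable node, so it is left commented out):
#   text = fetch_metrics("hostname", 8081)
#   print(text[:500])
```

For HTTPS clusters that use self-signed certificates, `urlopen` would additionally need a suitable `ssl.SSLContext`; that is omitted here.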

Example:

# HELP ais_target_put_bytes total bytes served via PUT
# TYPE ais_target_put_bytes counter
ais_target_put_bytes{node_id="ClCt8081"} 1.721761792e+10
# HELP ais_target_put_ns_total total PUT latency (nanoseconds)
# TYPE ais_target_put_ns_total counter
ais_target_put_ns_total{node_id="ClCt8081"} 9.44367232e+09
# HELP ais_target_state_flags node state and alert flags
# TYPE ais_target_state_flags gauge
ais_target_state_flags{node_id="ClCt8081"} 6
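Output in this format is straightforward to parse for ad hoc checks. The sketch below is a simplified parser, not a full implementation of the Prometheus text format: it assumes one sample per line, skips `# HELP` and `# TYPE` lines, and tolerates an optional trailing timestamp. The function name `parse_samples` is hypothetical.

```python
import re

# metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> list:
    """Return (name, labels, value) tuples for every sample line in text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or HELP/TYPE metadata
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a line this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


example = """\
# TYPE ais_target_put_bytes counter
ais_target_put_bytes{node_id="ClCt8081"} 1.721761792e+10
# TYPE ais_target_state_flags gauge
ais_target_state_flags{node_id="ClCt8081"} 6
"""
for name, labels, value in parse_samples(example):
    print(name, labels, value)
```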

To watch GET rates without Prometheus:

$ for i in {1..99999}; do
    curl -s http://hostname:8081/metrics | grep "ais_target_get_count"
    sleep 1
  done
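The loop prints the raw counter, which only ever grows; the rate is the difference between two consecutive readings divided by the time between them. A minimal sketch of that arithmetic follows. `counter_rate` is a hypothetical helper, and treating a decrease as a counter reset (for example after a node restart) is an assumption modeled on how Prometheus' `rate()` accounts for resets.

```python
def counter_rate(prev: float, curr: float, elapsed_s: float) -> float:
    """Per-second rate between two readings of a monotonic counter."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta = curr - prev
    if delta < 0:
        # The counter went backwards: assume it was reset to zero
        # and count only what has accumulated since then.
        delta = curr
    return delta / elapsed_s


# Two readings of ais_target_get_count taken 2 seconds apart (made-up values).
print(counter_rate(100, 160, 2.0))
```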

Key Metric Groups

AIS organizes metrics into four major groups, reflected in the codebase and Prometheus exporter:

| Group | Description | Examples |
| --- | --- | --- |
| 1. Datapath | GET/PUT counters, sizes, latencies, rate-limiting, I/O errors | ais_target_get_count, ais_target_put_bps, ais_target_ratelim_retry_get_n |
| 2. Metadata (in-memory) | Lcache activity (evictions, collisions) | ais_target_lcache_evicted_count |
| 3. Extended Actions (xactions) | Background & multi-object jobs: LRU, EC, rebalance, ETL, Download, DSort, GetBatch | ais_target_lru_evict_n, ais_target_getbatch_n |
| 4. Streams | Long-lived peer-to-peer (SharedDM) streaming channels | ais_target_streams_out_obj_n |

For GetBatch observability, see Monitoring GetBatch.


Metric Labels

AIS exposes labels for filtering and aggregation:

| Label | Usage |
| --- | --- |
| node_id | Node identity (target or gateway) |
| disk | Disk name for per-disk metrics |
| bucket | Source/destination bucket |
| xaction | Xaction UUID for multi-object jobs |
| slice | For erasure coding slice metrics |
| archpath | For per-file shard extraction (GetBatch) |

Labels enable PromQL queries such as:

sum by (node_id)(rate(ais_target_put_bytes[5m]))
sum by (disk)(ais_target_disk_util)
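Queries like these can also be run programmatically through Prometheus' HTTP API, whose instant-query endpoint is `/api/v1/query`. The sketch below builds the request URL and reduces an instant-vector response to a per-label dictionary. The helper names and the server address `http://prom:9090` are hypothetical, and the sample response is hand-written in the documented response shape with illustrative numbers.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def vector_by_label(payload: dict, label: str) -> dict:
    """Map one label's value to the sample value of an instant vector."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    out = {}
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        out[series["metric"].get(label, "")] = float(value)
    return out


# Hand-written response; the numbers are illustrative only.
SAMPLE_RESPONSE = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"node_id": "T1"}, "value": [1700000000, "1048576"]},
            {"metric": {"node_id": "T2"}, "value": [1700000000, "2097152"]}]}}
""")

print(instant_query_url("http://prom:9090",
                        "sum by (node_id)(rate(ais_target_put_bytes[5m]))"))
print(vector_by_label(SAMPLE_RESPONSE, "node_id"))
```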

Essential PromQL Queries

GET operations per second

sum(rate(ais_target_get_count[5m]))

Average GET latency (ms)

sum(rate(ais_target_get_ns_total[5m]))
/ sum(rate(ais_target_get_count[5m]))
/ 1e6 # convert ns → ms
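The query divides accumulated latency by the number of operations over the same window, then converts nanoseconds to milliseconds. A small worked example of that arithmetic, using made-up counter deltas (`avg_latency_ms` is a hypothetical helper):

```python
def avg_latency_ms(ns_total_delta: float, count_delta: float):
    """Average per-operation latency in ms over one window, or None if idle."""
    if count_delta == 0:
        return None  # no operations in the window, so the average is undefined
    return ns_total_delta / count_delta / 1e6  # ns per op, then ns -> ms


# 3000 GETs that together took 9e9 ns (9 seconds) average 3 ms each.
print(avg_latency_ms(9.0e9, 3000))
```

Returning `None` for an idle window mirrors the PromQL behavior in spirit: a division by a zero rate there yields NaN rather than a meaningful latency.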

Disk utilization

ais_target_disk_util{disk="nvme0n1"}

GET error percentage

sum(rate(ais_target_err_get_count[5m]))
/ sum(rate(ais_target_get_count[5m])) * 100

Total cluster capacity usage

sum(ais_target_capacity_used)
/
sum(ais_target_capacity_total)
* 100
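Note that this query takes the ratio of two sums, which weights each node by its capacity; averaging per-node percentages would give a different answer whenever nodes differ in size. The sketch below illustrates the difference with made-up numbers (`cluster_capacity_pct` is a hypothetical helper, and the units are arbitrary as long as used and total match).

```python
def cluster_capacity_pct(nodes: list) -> float:
    """Cluster-wide used capacity in percent from (used, total) pairs."""
    used = sum(u for u, _ in nodes)
    total = sum(t for _, t in nodes)
    return 100.0 * used / total


# Two nodes of different sizes: 50 of 100 used, and 300 of 400 used.
nodes = [(50, 100), (300, 400)]
ratio_of_sums = cluster_capacity_pct(nodes)
mean_of_ratios = sum(100.0 * u / t for u, t in nodes) / len(nodes)
print(ratio_of_sums, mean_of_ratios)  # the two figures differ
```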

GetBatch (x-moss) Queries

GetBatch is AIStore’s high-performance multi-object retrieval pipeline. Metrics describe throughput, composition (objects vs files), Rx stalls, throttling, and error behavior.

Work items per second

sum(rate(ais_target_getbatch_n[5m]))

Logical payload throughput

sum(rate(ais_target_getbatch_obj_size[5m]))
+
sum(rate(ais_target_getbatch_file_size[5m]))

Stall breakdown (RxWait vs Throttle)

sum(rate(ais_target_getbatch_rxwait_ns[5m]))
/
(
  sum(rate(ais_target_getbatch_rxwait_ns[5m])) +
  sum(rate(ais_target_getbatch_throttle_ns[5m]))
)

Soft vs Hard Error Rates

# soft errors (run as a separate query)
rate(ais_target_err_soft_getbatch_n[5m])

# hard errors (run as a separate query)
rate(ais_target_err_getbatch_n[5m])

Full details and operational guidance: Monitoring GetBatch


Node Alerts

AIS nodes expose operational alerts and states via ais_target_state_flags. Flags indicate:

Red (critical)

  • OOS — Out of space
  • OOM — Out of memory
  • DiskFault — Disk failures
  • NumGoroutines — excessive goroutines
  • CertificateExpired — TLS expiry
  • KeepAliveErrors — peer connectivity issues

Warning

  • Rebalancing, Resilvering
  • RebalanceInterrupted, ResilverInterrupted
  • LowCapacity, LowMemory, LowCPU
  • NodeRestarted
  • MaintenanceMode
  • CertWillSoonExpire

Informational

  • ClusterStarted
  • NodeStarted
  • VoteInProgress

CLI Monitoring

$ ais show cluster

Prometheus Queries

Critical:

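# PromQL has no bitwise AND; (flags % (2 * bit)) >= bit tests a single bit
ais_target_state_flags % 16384 >= 8192       # OOS (bit 8192)
or ais_target_state_flags % 32768 >= 16384   # OOM (bit 16384)
or ais_target_state_flags % 131072 >= 65536  # DiskFault (bit 65536)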

Warnings:

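# single-bit test: (flags % (2 * bit)) >= bit
ais_target_state_flags % 8192 >= 4096       # LowCapacity (bit 4096)
or ais_target_state_flags % 16384 >= 8192   # LowMemory (bit 8192 is also listed for OOS above; confirm the value against the Metrics Reference)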
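Because PromQL provides arithmetic but no bitwise operators, a single-bit test on `ais_target_state_flags` can be written as `flags % (2 * bit) >= bit`. The sketch below checks that this arithmetic form agrees with a bitwise AND and decodes a flags value into names. The helper names are hypothetical, and the bit-to-name table holds only values quoted in this section; it is an illustrative subset, not the authoritative flag definition.

```python
def bit_is_set(flags: int, bit: int) -> bool:
    """Arithmetic single-bit test; bit must be a power of two."""
    return flags % (2 * bit) >= bit


def decode_flags(flags: int, names: dict) -> list:
    """Return the names of all bits from the table that are set in flags."""
    return [name for bit, name in sorted(names.items()) if bit_is_set(flags, bit)]


# Illustrative subset using bit values quoted in this section (assumption:
# verify against the actual node-state definitions before relying on them).
EXAMPLE_BITS = {4096: "LowCapacity", 16384: "OOM", 65536: "DiskFault"}

# The arithmetic test matches a bitwise AND for every value tried.
assert all(bit_is_set(f, b) == bool(f & b)
           for f in range(0, 1 << 18, 37) for b in EXAMPLE_BITS)

print(decode_flags(16384 | 65536, EXAMPLE_BITS))
print(decode_flags(6, EXAMPLE_BITS))  # the value 6 from the earlier sample output
```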

Grafana Alert Example

ais_target_state_flags{node_id=~"$node"} % 16384 >= 8192   # bit 8192 set (PromQL has no bitwise AND)

Best Practices

  1. Prometheus retention. Plan retention around performance analysis needs (14–30 days recommended).

  2. Dashboard segmentation. Maintain dashboards for:

    • Cluster overview
    • Node-level performance
    • Resource utilization
    • Error monitoring
    • Extended actions (GetBatch, rebalance, ETL)

  3. Alerts on critical states. Monitor:

    • Node state flags
    • Error spikes
    • Disk utilization
    • High throttle or rxwait stalls (GetBatch)

  4. Scrape frequency. Use 5–15 seconds for critical workloads; 30 seconds or more for low-traffic clusters.
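Retention and scrape frequency together determine how much disk Prometheus needs. The sketch below applies the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The function name, the series count, and the 2-bytes-per-sample default are assumptions for illustration; real usage depends on the number of active series a cluster actually exposes.

```python
def estimate_disk_bytes(active_series: int, scrape_interval_s: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough Prometheus TSDB size estimate for a given retention window."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86400
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster: 10,000 active series, 10 s scrapes, 30 days retention.
size = estimate_disk_bytes(10_000, 10, 30)
print(f"{size / 1e9:.2f} GB")
```

Halving the scrape interval doubles the estimate, which is one reason to reserve short intervals for critical workloads.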


| Document | Description |
| --- | --- |
| Overview | AIS observability introduction |
| CLI | CLI monitoring and commands |
| Logs | Log-based observability |
| Metrics Reference | Full AIS metric catalog |
| Grafana | Grafana dashboards |
| Kubernetes | K8s deployment monitoring |
| GetBatch Monitoring | Multi-object retrieval metrics and analysis |

Separately, Prometheus references: