AIStore Observability: Prometheus
AIStore (AIS) exposes metrics in Prometheus format via HTTP endpoints.
This integration supports monitoring of AIS clusters, performance tracking, capacity planning, and long-term trend analysis.
AIS tracks a comprehensive set of performance metrics including:
AIS supports observability through several complementary tools:
Node logs (fine-grained operational events)
CLI for interactive monitoring (e.g., ais show cluster stats)
Monitoring backends:
For load testing and benchmarking metrics, see AIS Load Generator and How To Benchmark AIStore.
A complete catalog of AIS metrics is available at: Monitoring Metrics Reference
Typical Prometheus deployment:
This stack provides:
AIS acts as a first-class Prometheus exporter. Every node automatically:
/metrics for Prometheus to scrape
No configuration is required to “enable” Prometheus; it is always on.
AIS source metrics (put.size, get.ns, etc.) are exported with AIS naming conventions:
This document primarily uses the exported Prometheus names.
View metrics directly:
Example:
To watch GET rates without Prometheus:
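One way to approximate this without a Prometheus server is to scrape a node's /metrics endpoint twice and divide the counter increase by the elapsed time. The following is a minimal sketch, not an AIS tool: the helper names (`parse_metrics`, `counter_rate`, `watch`) are hypothetical, and the counter name used in the usage note is an assumption that should be checked against the metrics reference.

```python
import time
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition into {series: value}.

    `series` is the metric name plus its label set, exactly as printed.
    Comment lines (# HELP, # TYPE) and blank lines are skipped; an optional
    trailing timestamp is ignored.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Label values may contain spaces, so cut at the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        if not rest:
            continue
        samples[series] = float(rest[0])
    return samples


def counter_rate(before, after, elapsed_s, prefix):
    """Per-second increase of every series whose name starts with `prefix`.

    Series that decreased (for example after a node restart) are skipped.
    """
    return {
        s: (after[s] - before[s]) / elapsed_s
        for s in after
        if s.startswith(prefix) and s in before and after[s] >= before[s]
    }


def watch(url, prefix, interval_s=5.0):
    """Scrape `url` twice, `interval_s` apart, and return counter rates."""
    def scrape():
        with urllib.request.urlopen(url, timeout=10) as resp:
            return parse_metrics(resp.read().decode("utf-8"))

    first = scrape()
    time.sleep(interval_s)
    return counter_rate(first, scrape(), interval_s, prefix)
```

A call such as `watch("http://<node-host>:<port>/metrics", "ais_target_get_count")` would return GETs per second for each matching series; the host, port, and counter name are placeholders.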
AIS organizes metrics into four major groups, reflected in the codebase and Prometheus exporter:
For GetBatch observability, see Monitoring GetBatch.
AIS exposes labels for filtering and aggregation:
Labels enable PromQL queries such as:
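To illustrate what label-based filtering and aggregation means, the sketch below mimics a PromQL `sum by (...)` with an optional label matcher over a handful of in-memory samples. The metric name, the label names (`node_id`, `bucket`), and the sample values are all assumptions made for the example; they are not taken from an actual AIS scrape.

```python
from collections import defaultdict

# Hypothetical samples: (metric name, labels, value).
SAMPLES = [
    ("ais_target_get_count", {"node_id": "t1", "bucket": "images"}, 120.0),
    ("ais_target_get_count", {"node_id": "t1", "bucket": "logs"}, 30.0),
    ("ais_target_get_count", {"node_id": "t2", "bucket": "images"}, 70.0),
    ("ais_target_put_count", {"node_id": "t1", "bucket": "images"}, 9.0),
]


def sum_by(samples, metric, label, **match):
    """Sum `metric` grouped by `label`, keeping only samples whose labels
    equal every key/value pair in `match`.

    Roughly analogous to: sum by (<label>) (<metric>{<match>})
    """
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name != metric:
            continue
        if any(labels.get(k) != v for k, v in match.items()):
            continue
        totals[labels.get(label, "")] += value
    return dict(totals)


per_node = sum_by(SAMPLES, "ais_target_get_count", "node_id")
per_node_images = sum_by(SAMPLES, "ais_target_get_count", "node_id", bucket="images")
```

Here `per_node` groups all GET samples by node, while `per_node_images` first restricts them to a single bucket, which is the same filter-then-aggregate pattern a PromQL query applies on the server side.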
GetBatch is AIStore’s high-performance multi-object retrieval pipeline. Metrics describe throughput, composition (objects vs files), Rx stalls, throttling, and error behavior.
Full details and operational guidance: Monitoring GetBatch
AIS nodes expose operational alerts and states via ais_target_state_flags.
Flags indicate:
OOS: Out of space
OOM: Out of memory
DiskFault: Disk failures
NumGoroutines: excessive goroutines
CertificateExpired: TLS expiry
KeepAliveErrors: peer connectivity issues
Rebalancing, Resilvering
RebalanceInterrupted, ResilverInterrupted
LowCapacity, LowMemory, LowCPU
NodeRestarted
MaintenanceMode
CertWillSoonExpire
ClusterStarted
NodeStarted
VoteInProgress
Critical:
Warnings:
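A sampled value of ais_target_state_flags can be turned back into flag names for dashboards or alert messages. The sketch below assumes the gauge is a bitwise OR of individual flags. The bit positions are placeholders chosen for illustration, not the values AIS uses, and treating OOS, OOM, and DiskFault as the critical set is likewise an assumption; both should be replaced with the definitions from the AIS source before real use.

```python
from enum import IntFlag


class NodeState(IntFlag):
    # ASSUMPTION: these bit positions are illustrative placeholders only.
    OOS = 1 << 0
    OOM = 1 << 1
    DiskFault = 1 << 2
    Rebalancing = 1 << 3
    LowCapacity = 1 << 4


# ASSUMPTION: which flags count as critical is chosen here for the example.
CRITICAL_MASK = int(NodeState.OOS | NodeState.OOM | NodeState.DiskFault)


def decode(value):
    """Return the names of all flags set in a sampled gauge value."""
    bits = int(value)
    return [flag.name for flag in NodeState if bits & flag.value]


def is_critical(value):
    """True if any flag in CRITICAL_MASK is set."""
    return bool(int(value) & CRITICAL_MASK)
```

Prometheus delivers gauge values as floats, which is why both helpers convert to `int` before testing bits.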
Prometheus retention: Plan retention around performance analysis needs (14–30 days recommended).
Dashboard segmentation: Maintain dashboards for:
Alerts on critical states: Monitor:
Scrape frequency: 5–15 seconds for critical workloads; 30 seconds or more for low-traffic clusters.
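Retention and scrape frequency together determine how much disk Prometheus needs. As a rough planning aid, the sketch below applies the rule of thumb from the Prometheus storage documentation (retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The function name and the series count in the example are hypothetical; the real number of active series depends on cluster size and label cardinality.

```python
def prometheus_disk_bytes(active_series, scrape_interval_s, retention_days,
                          bytes_per_sample=2.0):
    """Estimate Prometheus TSDB disk usage in bytes.

    active_series:     number of time series scraped (assumed constant)
    scrape_interval_s: seconds between scrapes
    retention_days:    how long samples are kept
    bytes_per_sample:  compressed size per sample; 2.0 is the pessimistic
                       end of the commonly quoted 1-2 byte range
    """
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86400
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical cluster exporting 30,000 series, kept for 30 days.
at_15s = prometheus_disk_bytes(30_000, 15, 30)
at_5s = prometheus_disk_bytes(30_000, 5, 30)
print(f"15s scrape: {at_15s / 1e9:.1f} GB, 5s scrape: {at_5s / 1e9:.1f} GB")
```

Because the estimate scales inversely with the scrape interval, moving from 15-second to 5-second scrapes triples the storage needed for the same retention window.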
Separately, Prometheus references: