AIStore Observability: Prometheus
AIStore (AIS) exposes metrics in Prometheus format via HTTP endpoints.
This integration enables comprehensive monitoring of AIS clusters, performance tracking, capacity planning, and long-term trend analysis.
Table of Contents
- Overview
- Monitoring Stack
- Prometheus Integration
- Node Alerts
- Best Practices
- References
- Related Documentation
Overview
AIS tracks a comprehensive set of performance metrics including:
- Operation counters (GET/PUT/DELETE/etc.)
- Resource utilization (CPU, memory, disk)
- Latencies and throughput
- Network and peer-to-peer streaming statistics
- Extended actions (xactions)
- Error counters and node state
AIS supports observability through several complementary tools:
- Node logs (fine-grained operational events)
- CLI for interactive monitoring (e.g., ais show cluster stats)
- Monitoring backends:
  - Prometheus (recommended)
  - Grafana for dashboards & alerting
For load testing and benchmarking metrics, see AIS Load Generator and How To Benchmark AIStore.
A complete catalog of AIS metrics is available in the Monitoring Metrics Reference.
Monitoring Stack
Typical Prometheus deployment:
This stack provides:
- Direct metric collection from AIS nodes
- Centralized metric retention
- Grafana visualization & alerting
- Long-term performance & cost analysis
Prometheus Integration
Native Exporter
AIS acts as a first-class Prometheus exporter. Every node automatically:
- Registers all metrics at startup
- Exposes /metrics for Prometheus to scrape
- Uses Prometheus native formatting and metadata
- Works with both HTTP and HTTPS clusters
No configuration is required to “enable” Prometheus — it is always on.
AIS source metrics (put.size, get.ns, etc.) are exported with AIS naming conventions:
This document primarily uses the exported Prometheus names.
Viewing Raw Metrics
View metrics directly:
Example:
To watch GET rates without Prometheus:
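One way to approximate a GET rate without a Prometheus server is to take two snapshots of a node's /metrics output and divide the counter delta by the elapsed time. The following is a minimal sketch under stated assumptions: the series name `ais_target_get_count` and the `node_id` label are illustrative and should be checked against the actual /metrics output, and the two snapshots are inlined as strings rather than fetched over HTTP.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition into {series: value}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    An optional trailing timestamp is ignored.
    """
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labeled series: everything up to the last '}' is the series key.
            idx = line.rindex("}") + 1
            series, rest = line[:idx], line[idx:]
        else:
            series, _, rest = line.partition(" ")
        samples[series] = float(rest.split()[0])
    return samples


def counter_rate(before: Dict[str, float], after: Dict[str, float],
                 series: str, interval_s: float) -> float:
    """Per-second rate of a monotonically increasing counter."""
    delta = after[series] - before[series]
    if delta < 0:
        # The counter went backwards, e.g. after a node restart:
        # treat the new value as the increase since the reset.
        delta = after[series]
    return delta / interval_s


# Hypothetical snapshots taken 10 seconds apart (names are assumptions).
SNAP_T0 = """# TYPE ais_target_get_count counter
ais_target_get_count{node_id="t1"} 1000
"""
SNAP_T1 = """# TYPE ais_target_get_count counter
ais_target_get_count{node_id="t1"} 1600
"""

series = 'ais_target_get_count{node_id="t1"}'
rate = counter_rate(parse_metrics(SNAP_T0), parse_metrics(SNAP_T1), series, 10.0)
print(f"GET/s: {rate}")  # GET/s: 60.0
```

In practice the snapshot text would come from the node's /metrics endpoint, for example via `urllib.request.urlopen`; the parsing and rate logic stay the same.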
Key Metric Groups
AIS organizes metrics into four major groups, reflected in the codebase and Prometheus exporter:
For GetBatch observability, see Monitoring GetBatch.
Metric Labels
AIS exposes labels for filtering and aggregation:
Labels enable PromQL queries such as:
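When queries are generated by scripts (for dashboards or ad hoc checks), label matchers can be composed programmatically. The sketch below builds a PromQL selector and wraps it in a `sum(rate(...))` expression. The metric name `ais_target_get_count` and the label names `node_id` and `bucket` are assumptions used only for illustration; substitute the names that appear in your cluster's /metrics output.

```python
from typing import Mapping


def _escape(value: str) -> str:
    """Escape a label value for use inside a double-quoted PromQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def selector(metric: str, labels: Mapping[str, str]) -> str:
    """Build an instant-vector selector such as metric{a="1",b="2"}."""
    if not labels:
        return metric
    body = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{body}}}"


def rate_sum(metric: str, labels: Mapping[str, str], window: str = "5m") -> str:
    """Wrap a selector in sum(rate(...[window])) for a per-second aggregate."""
    return f"sum(rate({selector(metric, labels)}[{window}]))"


# Hypothetical metric and label names.
print(rate_sum("ais_target_get_count", {"node_id": "t1", "bucket": "data"}))
```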
Essential PromQL Queries
GET operations per second
Average GET latency (ms)
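The arithmetic behind an average-latency query can be sketched as follows, assuming GET latency is exported as a cumulative nanosecond total (corresponding to the internal `get.ns` metric) alongside a cumulative GET count. Both inputs are deltas over the same time window; the sample numbers are made up.

```python
from typing import Optional


def avg_latency_ms(ns_delta: float, count_delta: float) -> Optional[float]:
    """Average latency in milliseconds over a window.

    ns_delta:    increase in the cumulative latency total (nanoseconds)
    count_delta: increase in the operation count over the same window
    Returns None when no operations completed in the window.
    """
    if count_delta <= 0:
        return None
    return ns_delta / count_delta / 1e6  # ns per op, then ns -> ms


# Example: 3 seconds of cumulative GET time spread over 1500 GETs.
print(avg_latency_ms(3_000_000_000, 1500))  # 2.0
```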
Disk utilization
GET error percentage
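As a worked illustration of the ratio involved, the sketch below computes an error percentage from two counter deltas taken over the same window. It assumes the total counts every GET request, including failed ones; if the GET counter in your deployment tracks only successful requests, the denominator should be the sum of successes and errors instead.

```python
from typing import Optional


def error_percent(error_delta: float, total_delta: float) -> Optional[float]:
    """Percentage of requests that failed within a window.

    Returns None when there was no traffic, so that an idle
    window is not reported as 0% errors.
    """
    if total_delta <= 0:
        return None
    return 100.0 * error_delta / total_delta


# Example: 5 failed GETs out of 2000 in the window.
print(error_percent(5, 2000))  # 0.25
```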
Total cluster capacity usage
GetBatch (x-moss) Queries
GetBatch is AIStore’s high-performance multi-object retrieval pipeline. Metrics describe throughput, composition (objects vs files), Rx stalls, throttling, and error behavior.
Work items per second
Logical payload throughput
Stall breakdown (RxWait vs Throttle)
Soft vs Hard Error Rates
For full details and operational guidance, see Monitoring GetBatch.
Node Alerts
AIS nodes expose operational alerts and states via ais_target_state_flags.
Flags indicate:
Red (critical)
- OOS: out of space
- OOM: out of memory
- DiskFault: disk failures
- NumGoroutines: excessive goroutines
- CertificateExpired: TLS expiry
- KeepAliveErrors: peer connectivity issues
Warning
- Rebalancing, Resilvering
- RebalanceInterrupted, ResilverInterrupted
- LowCapacity, LowMemory, LowCPU
- NodeRestarted
- MaintenanceMode
- CertWillSoonExpire
Informational
- ClusterStarted
- NodeStarted
- VoteInProgress
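The three severity tiers above can be turned into a simple triage helper that takes the flag names reported for a node and returns the highest applicable severity. This is a sketch only: it works on flag names, not on the numeric value of ais_target_state_flags, because the bit layout of that value is not described here. The function name and the returned strings are hypothetical.

```python
from typing import Iterable

# Flag names grouped by the severity tiers listed in this section.
RED = {"OOS", "OOM", "DiskFault", "NumGoroutines",
       "CertificateExpired", "KeepAliveErrors"}
WARNING = {"Rebalancing", "Resilvering", "RebalanceInterrupted",
           "ResilverInterrupted", "LowCapacity", "LowMemory", "LowCPU",
           "NodeRestarted", "MaintenanceMode", "CertWillSoonExpire"}
INFO = {"ClusterStarted", "NodeStarted", "VoteInProgress"}


def severity(flags: Iterable[str]) -> str:
    """Return the highest severity among the given flag names.

    Returns "red", "warning", "info", or "ok" (no flags set).
    Raises ValueError for names outside the three tiers, so that
    a newly introduced flag is not silently ignored.
    """
    names = set(flags)
    unknown = names - RED - WARNING - INFO
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    if names & RED:
        return "red"
    if names & WARNING:
        return "warning"
    if names & INFO:
        return "info"
    return "ok"


print(severity(["NodeStarted", "LowMemory"]))  # warning
print(severity(["ClusterStarted", "OOS"]))     # red
```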
CLI Monitoring
Prometheus Queries
Critical:
Warnings:
Grafana Alert Example
Best Practices
- Prometheus retention. Plan retention around performance analysis needs (14–30 days recommended).
- Dashboard segmentation. Maintain dashboards for:
  - Cluster overview
  - Node-level performance
  - Resource utilization
  - Error monitoring
  - Extended actions (GetBatch, rebalance, ETL)
- Alerts on critical states. Monitor:
  - Node state flags
  - Error spikes
  - Disk utilization
  - High throttle or rxwait stalls (GetBatch)
- Scrape frequency. 5–15 seconds for critical workloads; 30s+ for low-traffic clusters.
Related Documentation
Separately, Prometheus references: