AIStore Observability: Metrics Reference
AIStore (AIS) exposes a comprehensive set of metrics that provide insights into system performance, resource utilization, and operational status. This reference catalogs available metrics with descriptions and usage guidance.
Table of Contents
- Prometheus: major changes in v3.26
- Variable labels
- Common metrics: AIS targets and gateways
- Target metrics
- Backend metrics
- Related Documentation
Prometheus: major changes in v3.26
- So-called default go_* counters and gauges (go_gc_*, go_memstats_*, etc.) are completely gone
- Metrics are now updated directly in real time
  - Previously: periodically via prometheus.Collect interface
  - See related note in stats/prom.go
- AIS is no longer publishing internally computed latencies and throughputs
  - Use *.ns.total (nanoseconds) and *.size (bytes) metrics to compute latency and throughput, respectively
    - Based on user-controlled time intervals - for reference, see CLI performance throughput and performance latency
  - Note: for Prometheus client, internal .ns.total suffix becomes _ns_total, and .size, respectively, _bytes
- In addition to total aggregated numbers there are now separately computed per-backend latency and throughput numbers
  - Those with aws. prefix, for instance.
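As an illustration of computing latency and throughput from the cumulative counters over a user-chosen interval, here is a minimal Python sketch. The Snapshot type, its field names, and the helper functions are hypothetical and not part of AIS; the sketch only assumes that a request counter, the matching *_ns_total counter, and the matching *_bytes counter have been scraped twice for the same label set.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One scrape of three cumulative counters (hypothetical container)."""
    timestamp_s: float  # wall-clock time of the scrape, in seconds
    count: int          # cumulative number of requests
    ns_total: int       # cumulative time spent, in nanoseconds
    bytes_total: int    # cumulative payload size, in bytes


def avg_latency_ms(prev: Snapshot, curr: Snapshot) -> float:
    """Average per-request latency (ms) between two scrapes."""
    requests = curr.count - prev.count
    if requests <= 0:
        # No traffic, or a counter reset (e.g. node restart): nothing to report.
        return 0.0
    return (curr.ns_total - prev.ns_total) / requests / 1e6


def throughput_mib_s(prev: Snapshot, curr: Snapshot) -> float:
    """Average throughput (MiB/s) between two scrapes."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    moved = curr.bytes_total - prev.bytes_total
    if elapsed <= 0 or moved < 0:
        return 0.0
    return moved / elapsed / (1024 * 1024)


# Made-up numbers: 500 requests, 2.5 s of cumulative time, 500 MiB in 10 s.
prev = Snapshot(0.0, 1000, 5_000_000_000, 1 << 30)
curr = Snapshot(10.0, 1500, 7_500_000_000, (1 << 30) + 500 * (1 << 20))
print(avg_latency_ms(prev, curr), throughput_mib_s(prev, curr))
```

The interval is whatever separates the two scrapes, which is what makes the result user-controlled rather than tied to an internal averaging period.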
Variable labels
Each AIS metric carries node_id - a static label in Prometheus terminology.
Starting with v3.26, the majority of metrics also carry variable labels:
- Variable Labels:
  - bucket: Name of the associated bucket.
  - xkind: Job kind.
  - mountpath: Mountpath.
- All I/O metrics now carry the bucket name (or Cname, to be precise) as a Prometheus variable label
- All in-cluster writes generated by xactions (jobs) now also carry the xaction label: the respective kind
  - One major side effect of the above is that we will now see more PUT metrics, and not only those that result from user PUT requests
- All GET, PUT, and DELETE errors also have the bucket label
- All FSHC-related errors (the so-called IO errors) carry the mountpath (i.e., faulty disk) label.
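To show how the variable labels can be used to separate, say, job-generated PUTs from the rest, here is a rough Python sketch that parses samples in the Prometheus text exposition format and sums one metric by a chosen label. The metric name ais_target_put_count, the label values, and the assumption that user PUTs carry an empty xkind are all illustrative and not taken from AIS; the parser is deliberately simple and does not handle label values containing a closing brace.

```python
import re
from collections import defaultdict

# Matches: metric_name{label="value",...} number
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value), or None for comments and other lines."""
    m = _SAMPLE_RE.match(line.strip())
    if m is None:
        return None
    labels = dict(_LABEL_RE.findall(m.group("labels")))
    return m.group("name"), labels, float(m.group("value"))


def sum_by_label(lines, metric, label):
    """Sum all samples of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for line in lines:
        parsed = parse_sample(line)
        if parsed is None:
            continue
        name, labels, value = parsed
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical scrape output; names and values are made up for illustration.
SAMPLE = """
# HELP ais_target_put_count total number of executed PUT(object) requests
ais_target_put_count{node_id="t1",bucket="ais://images",xkind=""} 120
ais_target_put_count{node_id="t1",bucket="ais://images",xkind="copy-bck"} 30
ais_target_put_count{node_id="t1",bucket="ais://logs",xkind=""} 50
""".strip().splitlines()

print(sum_by_label(SAMPLE, "ais_target_put_count", "xkind"))
print(sum_by_label(SAMPLE, "ais_target_put_count", "bucket"))
```

In a real deployment the same grouping would normally be done in the monitoring backend's query language; the sketch only makes the label semantics concrete.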
Common metrics: AIS targets and gateways
- Request Metrics:
  - GetCount: Total number of executed GET(object) requests.
    - Variable Labels: bucket
  - PutCount: Total number of executed PUT(object) requests.
    - Variable Labels: bucket, xkind
  - HeadCount: Total number of executed HEAD(object) requests (currently only remote HEAD).
    - Variable Labels: bucket
  - AppendCount: Total number of executed APPEND(object) requests.
    - Variable Labels: bucket
  - DeleteCount: Total number of executed DELETE(object) requests.
    - Variable Labels: bucket
  - RenameCount: Total number of executed rename(object) requests.
    - Variable Labels: bucket
  - ListCount: Total number of executed list-objects requests.
    - Variable Labels: bucket
Common Error Counters
- Error Metrics:
  - ErrGetCount: Total number of GET(object) errors.
    - Variable Labels: bucket
  - ErrPutCount: Total number of PUT(object) errors.
    - Variable Labels: bucket, xkind
  - ErrHeadCount: Total number of HEAD(object) errors.
    - Variable Labels: bucket
  - ErrAppendCount: Total number of APPEND(object) errors.
    - Variable Labels: bucket
  - ErrDeleteCount: Total number of DELETE(object) errors.
    - Variable Labels: bucket
  - ErrRenameCount: Total number of rename(object) errors.
    - Variable Labels: bucket
  - ErrListCount: Total number of list-objects errors.
    - Variable Labels: bucket
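Because each error counter has a matching request counter with the same bucket label, a per-bucket error ratio over an interval can be derived from two scrapes. The Python sketch below is one way to do that; the function name and the dictionary layout are hypothetical, and it is an assumption (not stated in this reference) that dividing the error delta by the request delta gives a meaningful ratio, since that depends on whether failed requests are also included in the request counter.

```python
def error_ratios(prev, curr):
    """Per-bucket errors per executed request between two scrapes.

    `prev` and `curr` map a bucket name to a (request_count, error_count)
    tuple of cumulative counters. Buckets with no new requests are skipped.
    """
    ratios = {}
    for bucket, (req_now, err_now) in curr.items():
        # A bucket seen for the first time is treated as starting from zero.
        req_before, err_before = prev.get(bucket, (0, 0))
        requests = req_now - req_before
        errors = err_now - err_before
        if requests <= 0 or errors < 0:
            # No traffic in the interval, or a counter reset: skip.
            continue
        ratios[bucket] = errors / requests
    return ratios


# Made-up counters, e.g. (GetCount, ErrGetCount) per bucket.
prev = {"ais://a": (100, 1), "ais://b": (50, 0)}
curr = {"ais://a": (300, 5), "ais://b": (50, 0), "ais://c": (10, 1)}
print(error_ratios(prev, curr))
```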
Common Latencies
- Latency Metrics:
  - GetLatency: GET average time (milliseconds) over the last periodic.stats_time interval.
    - Variable Labels: bucket
  - GetLatencyTotal: GET total cumulative time (nanoseconds).
    - Variable Labels: bucket
  - ListLatency: List-objects average time (milliseconds) over the last periodic.stats_time interval.
    - Variable Labels: bucket
For convenience, we also include here a (somewhat redundant) table that summarizes common metrics.
Target metrics
- Out-of-Band Metrics:
  - VerChangeCount: Number of out-of-band updates (by a 3rd party performing remote PUTs from outside this cluster).
    - Variable Labels: bucket
  - VerChangeSize: Total cumulative size (bytes) of objects updated out-of-band across all backends combined.
    - Variable Labels: bucket
  - RemoteDeletedDelCount: Number of out-of-band deletes (by a 3rd party remote DELETE(object) from outside this cluster).
    - Variable Labels: bucket
- PUT Latency Metrics:
  - PutLatency: PUT average time (milliseconds) over the last periodic.stats_time interval.
    - Variable Labels: bucket, xkind
  - PutLatencyTotal: PUT total cumulative time (nanoseconds).
    - Variable Labels: bucket, xkind
- HEAD Latency Metrics:
  - HeadLatencyTotal: HEAD total cumulative time (nanoseconds).
    - Variable Labels: bucket
- APPEND Latency Metrics:
  - AppendLatency: APPEND average time (milliseconds) over the last periodic.stats_time interval.
    - Variable Labels: bucket
- Throughput Metrics:
  - GetThroughput: GET average throughput (MB/s) over the last periodic.stats_time interval.
    - Variable Labels: bucket
  - PutThroughput: PUT average throughput (MB/s) over the last periodic.stats_time interval.
    - Variable Labels: bucket, xkind
- Size Metrics:
  - GetSize: GET total cumulative size (bytes).
    - Variable Labels: bucket
  - PutSize: PUT total cumulative size (bytes).
    - Variable Labels: bucket, xkind
- Error Metrics:
  - ErrPutCksumCount: PUT number of checksum errors.
    - Variable Labels: bucket, xkind
  - ErrFSHCCount: Number of times filesystem health checker (FSHC) was triggered by an I/O error or errors.
    - Variable Labels: mountpath
  - IOErrGetCount: GET number of I/O errors (excluding remote backend and network errors).
    - Variable Labels: bucket
  - IOErrDeleteCount: DELETE(object) number of I/O errors (excluding remote backend and network errors).
    - Variable Labels: bucket
For convenience, a table that summarizes target metrics follows below.
Backend metrics
- GET Metrics:
  - remote_get_count: Total number of executed remote GET requests.
    - Variable Labels: bucket
  - remote_get_ns_total: Total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects.
    - Variable Labels: bucket
  - remote_get_bytes_total: Total cumulative size (bytes) of all remote GET transactions.
    - Variable Labels: bucket
- PUT Metrics:
  - remote_put_count: Total number of executed remote PUT requests to a given backend.
    - Variable Labels: bucket, xkind
  - remote_put_ns_total: Total cumulative time (nanoseconds) to execute remote PUT requests and store new object versions in-cluster.
    - Variable Labels: bucket, xkind
  - remote_e2e_put_ns_total: Total end-to-end time (nanoseconds) servicing remote PUT requests (includes receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object).
    - Variable Labels: bucket, xkind
  - remote_e2e_put_bytes_total: Total cumulative size (bytes) of all PUTs to a given remote backend.
    - Variable Labels: bucket, xkind
- HEAD Metrics:
  - remote_head_count: Total number of executed remote HEAD requests to a given backend.
    - Variable Labels: bucket
  - remote_head_ns_total: Total cumulative time (nanoseconds) to execute remote HEAD requests.
    - Variable Labels: bucket
- Out-of-Band Updates:
  - remote_ver_change_count: Number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster).
    - Variable Labels: bucket
  - remote_ver_change_bytes_total: Total cumulative size (bytes) of objects that were updated out-of-band.
    - Variable Labels: bucket
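The backend PUT counters listed above (remote_put_count, remote_put_ns_total, remote_e2e_put_ns_total) can be combined to estimate how much of the average end-to-end PUT time is not accounted for by the remote PUT counter. The following Python sketch does this over an interval between two scrapes. The PutTotals type and the function name are hypothetical, and treating the difference between the two timers as "everything else" is an approximation that rests on the descriptions above rather than on AIS internals.

```python
from typing import NamedTuple, Optional


class PutTotals(NamedTuple):
    """Cumulative backend PUT counters for one label set (hypothetical)."""
    count: int      # remote_put_count
    remote_ns: int  # remote_put_ns_total
    e2e_ns: int     # remote_e2e_put_ns_total


def put_breakdown_ms(prev: PutTotals, curr: PutTotals) -> Optional[dict]:
    """Average per-PUT times (ms) between two scrapes, or None if no PUTs."""
    n = curr.count - prev.count
    if n <= 0:
        # No remote PUTs in the interval, or a counter reset.
        return None
    remote_ms = (curr.remote_ns - prev.remote_ns) / n / 1e6
    e2e_ms = (curr.e2e_ns - prev.e2e_ns) / n / 1e6
    return {
        "remote_ms": remote_ms,
        "e2e_ms": e2e_ms,
        # Portion of end-to-end time not covered by the remote PUT timer.
        "other_ms": e2e_ms - remote_ms,
    }


# Made-up numbers: 200 PUTs, 8 s on the remote timer, 11 s end to end.
prev = PutTotals(count=100, remote_ns=2_000_000_000, e2e_ns=3_000_000_000)
curr = PutTotals(count=300, remote_ns=10_000_000_000, e2e_ns=14_000_000_000)
print(put_breakdown_ms(prev, curr))
```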