AIStore Observability: Metrics Reference

AIStore (AIS) exposes a comprehensive set of metrics that provide insights into system performance, resource utilization, and operational status. This reference catalogs available metrics with descriptions and usage guidance.

Prometheus: major changes in v3.26

- The so-called default go_* counters and gauges (go_gc*, go_memstats*, etc.) are completely gone.
- Metrics are now updated directly, in real time.
  - Previously, they were updated periodically via the prometheus.Collect interface.
  - See the related note in stats/prom.go.
- AIS no longer publishes internally computed latencies and throughputs.
- Use the *.ns.total (nanoseconds) and *.size (bytes) metrics to compute latency and throughput, respectively.
  - Compute them over user-controlled time intervals; for reference, see the CLI commands performance throughput and performance latency.
  - Note: for the Prometheus client, the internal suffix .ns.total becomes _ns_total, and .size becomes _bytes.
- In addition to the total aggregated numbers, there are now separately computed per-backend latency and throughput numbers.
  - Those with the aws. prefix, for instance.
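
The latency and throughput computation described above can be sketched as follows. This is a minimal illustration, not AIS code: the `Sample` type and the `interval_stats` function are hypothetical names, and the fields stand for values of a count metric, a `*_ns_total` metric, and a `*_bytes` metric scraped at two user-chosen points in time. Reporting throughput in MiB/s is a choice made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One scrape of three cumulative counters (hypothetical container)."""
    unix_time: float  # when the scrape was taken, in seconds
    count: int        # e.g. get_count
    ns_total: int     # e.g. get_ns_total (nanoseconds)
    size_bytes: int   # e.g. get_bytes


def interval_stats(prev: Sample, curr: Sample) -> tuple[float, float]:
    """Return (average latency in ms, throughput in MiB/s) between two scrapes."""
    requests = curr.count - prev.count
    elapsed = curr.unix_time - prev.unix_time
    if requests <= 0 or elapsed <= 0:
        # No requests in the interval, or the counters were reset.
        return 0.0, 0.0
    latency_ms = (curr.ns_total - prev.ns_total) / requests / 1e6
    throughput = (curr.size_bytes - prev.size_bytes) / elapsed / (1024 * 1024)
    return latency_ms, throughput


MIB = 1024 * 1024
prev = Sample(unix_time=1000.0, count=100, ns_total=2_000_000_000, size_bytes=10 * MIB)
curr = Sample(unix_time=1010.0, count=300, ns_total=5_000_000_000, size_bytes=110 * MIB)
latency_ms, mib_per_s = interval_stats(prev, curr)
print(f"{latency_ms:.1f} ms, {mib_per_s:.1f} MiB/s")  # prints "15.0 ms, 10.0 MiB/s"
```

Because both inputs are cumulative, the interval is whatever the caller picks: two scrapes ten seconds apart and two scrapes an hour apart go through the same arithmetic.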

Variable labels

Each AIS metric carries node_id - a static label in Prometheus terminology.

Starting with v3.26, the majority of metrics also carry variable labels:
- bucket: Name of the associated bucket.
- xkind: Job kind.
- mountpath: Mountpath.

- All I/O metrics now carry the bucket name (or Cname, to be precise) as a Prometheus variable label.
- All in-cluster writes generated by xactions (jobs) now also carry the xaction label: the respective kind (xkind).
  - One major side effect is that more PUT metrics now appear, not only those that result from user PUT requests.
- All GET, PUT, and DELETE errors also carry the bucket label.
- All FSHC-related errors (the so-called I/O errors) carry the mountpath (i.e., faulty disk) label.
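
To illustrate how the variable labels can be used, here is a small sketch that sums a PUT counter by bucket and then by bucket and job kind. The sample data is made up: the node IDs, bucket names, and job kinds are placeholders, and treating an empty `xkind` as "user request" is an assumption of this example rather than something this reference states.

```python
from collections import defaultdict

# Hypothetical scraped samples of a PUT counter: (labels, value).
samples = [
    ({"node_id": "t1", "bucket": "ais://train", "xkind": ""}, 120),
    ({"node_id": "t1", "bucket": "ais://train", "xkind": "copy-bck"}, 40),
    ({"node_id": "t2", "bucket": "ais://train", "xkind": ""}, 80),
    ({"node_id": "t2", "bucket": "s3://logs", "xkind": "rebalance"}, 15),
]


def sum_by(samples, *label_names):
    """Sum sample values, grouping by the given label names."""
    totals = defaultdict(int)
    for labels, value in samples:
        # A label missing from a sample is treated as the empty string.
        key = tuple(labels.get(name, "") for name in label_names)
        totals[key] += value
    return dict(totals)


per_bucket = sum_by(samples, "bucket")
per_bucket_and_kind = sum_by(samples, "bucket", "xkind")
print(per_bucket)
print(per_bucket_and_kind)
```

Grouping by `bucket` alone mixes user and job traffic; adding `xkind` to the grouping key separates them again.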

Common metrics: AIS targets and gateways

Request Metrics:
- GetCount: Total number of executed GET(object) requests.
  - Variable Labels: bucket
- PutCount: Total number of executed PUT(object) requests.
  - Variable Labels: bucket, xkind
- HeadCount: Total number of executed HEAD(object) requests (currently only remote HEAD).
  - Variable Labels: bucket
- AppendCount: Total number of executed APPEND(object) requests.
  - Variable Labels: bucket
- DeleteCount: Total number of executed DELETE(object) requests.
  - Variable Labels: bucket
- RenameCount: Total number of executed rename(object) requests.
  - Variable Labels: bucket
- ListCount: Total number of executed list-objects requests.
  - Variable Labels: bucket

Common Error Counters

Error Metrics:
- ErrGetCount: Total number of GET(object) errors.
  - Variable Labels: bucket
- ErrPutCount: Total number of PUT(object) errors.
  - Variable Labels: bucket, xkind
- ErrHeadCount: Total number of HEAD(object) errors.
  - Variable Labels: bucket
- ErrAppendCount: Total number of APPEND(object) errors.
  - Variable Labels: bucket
- ErrDeleteCount: Total number of DELETE(object) errors.
  - Variable Labels: bucket
- ErrRenameCount: Total number of rename(object) errors.
  - Variable Labels: bucket
- ErrListCount: Total number of list-objects errors.
  - Variable Labels: bucket
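
Each error counter above has a matching request counter, so a per-bucket error ratio is a simple division. The sketch below assumes the two counters are sampled over the same interval and reports errors per executed request; whether the request counters include failed requests is not stated in this reference, so the result is an approximation. The function name and the sample numbers are hypothetical.

```python
def error_ratio(request_count: int, error_count: int) -> float:
    """Errors per executed request; 0.0 when there were no requests."""
    if request_count <= 0:
        return 0.0
    return error_count / request_count


# Hypothetical per-bucket deltas over one interval: (GetCount, ErrGetCount).
deltas = {
    "ais://train": (2000, 4),
    "s3://logs": (0, 0),
}
ratios = {bucket: error_ratio(n, errs) for bucket, (n, errs) in deltas.items()}
print(ratios)
```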

Common Latencies

Latency Metrics:
- GetLatency: GET average time (milliseconds) over the last periodic.stats_time interval.
  - Variable Labels: bucket
- GetLatencyTotal: GET total cumulative time (nanoseconds).
  - Variable Labels: bucket
- ListLatency: List-objects average time (milliseconds) over the last periodic.stats_time interval.
  - Variable Labels: bucket

For convenience, we also include here a (somewhat redundant) table that summarizes common metrics.

Internal name	Public name	Internal Type	Description (Prometheus help)	Prometheus labels
`get.n`	`get_count`	counter	total number of executed GET(object) requests	default
`put.n`	`put_count`	counter	total number of executed PUT(object) requests	default
`head.n`	`head_count`	counter	total number of executed HEAD(object) requests	default
`append.n`	`append_count`	counter	total number of executed APPEND(object) requests	default
`del.n`	`del_count`	counter	total number of executed DELETE(object) requests	default
`ren.n`	`ren_count`	counter	total number of executed rename(object) requests	default
`lst.n`	`lst_count`	counter	total number of executed list-objects requests	default
`err.get.n`	`err_get_count`	counter	total number of GET(object) errors	default
`err.put.n`	`err_put_count`	counter	total number of PUT(object) errors	default
`err.head.n`	`err_head_count`	counter	total number of HEAD(object) errors	default
`err.append.n`	`err_append_count`	counter	total number of APPEND(object) errors	default
`err.del.n`	`err_del_count`	counter	total number of DELETE(object) errors	default
`err.ren.n`	`err_ren_count`	counter	total number of rename(object) errors	default
`err.lst.n`	`err_lst_count`	counter	total number of list-objects errors	default
`err.http.write.n`	`err_http_write_count`	counter	total number of HTTP write-response errors	default
`err.dl.n`	`err_dl_count`	counter	downloader: number of download errors	default
`err.put.mirror.n`	`err_put_mirror_count`	counter	number of n-way mirroring errors	default
`get.ns`	`get_ms`	latency	GET: average time (milliseconds) over the last periodic.stats_time interval	default
`get.ns.total`	`get_ns_total`	total	GET: total cumulative time (nanoseconds)	default
`lst.ns`	`lst_ms`	latency	list-objects: average time (milliseconds) over the last periodic.stats_time interval	default
`kalive.ns`	`kalive_ms`	latency	in-cluster keep-alive (heartbeat): average time (milliseconds) over the last periodic.stats_time interval	default
`up.ns.time`	`uptime`	special	this node’s uptime since its startup (seconds)	default
`state.flags`	`state_flags`	gauge	bitwise 64-bit value that carries enumerated node-state flags, including warnings and alerts; see https://github.com/NVIDIA/aistore/blob/main/cmn/cos/node_state.go	default
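
The relationship between internal and public names in the table above follows a few suffix rules. The sketch below encodes only the rules that can be read off the table rows and the Prometheus notes earlier in this document; it is not how AIS derives the names, and `public_name` is a hypothetical helper. Some rows do not follow these rules (for example, `up.ns.time` is published as `uptime`), so treat the result as a guess to verify against the table.

```python
# Suffix rewrites inferred from the tables; longer suffixes are checked first.
SUFFIX_RULES = [
    (".ns.total", "_ns_total"),
    (".size", "_bytes"),
    (".ns", "_ms"),
    (".n", "_count"),
]


def public_name(internal: str) -> str:
    """Guess the Prometheus-facing name for an internal metric name."""
    for suffix, replacement in SUFFIX_RULES:
        if internal.endswith(suffix):
            stem = internal[: -len(suffix)]
            return stem.replace(".", "_") + replacement
    # No known suffix: only replace the dots.
    return internal.replace(".", "_")


for name in ("get.n", "err.put.n", "get.ns", "get.ns.total", "kalive.ns"):
    print(name, "->", public_name(name))
```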

Target metrics

Out-of-Band Metrics:
- VerChangeCount: Number of out-of-band updates (by a 3rd party performing remote PUTs from outside this cluster).
  - Variable Labels: bucket
- VerChangeSize: Total cumulative size (bytes) of objects updated out-of-band across all backends combined.
  - Variable Labels: bucket
- RemoteDeletedDelCount: Number of out-of-band deletes (by a 3rd party remote DELETE(object) from outside this cluster).
  - Variable Labels: bucket
PUT Latency Metrics:
- PutLatency: PUT average time (milliseconds) over the last periodic.stats_time interval.
  - Variable Labels: bucket, xkind
- PutLatencyTotal: PUT total cumulative time (nanoseconds).
  - Variable Labels: bucket, xkind
HEAD Latency Metrics:
- HeadLatencyTotal: HEAD total cumulative time (nanoseconds).
  - Variable Labels: bucket
APPEND Latency Metrics:
- AppendLatency: APPEND average time (milliseconds) over the last periodic.stats_time interval.
  - Variable Labels: bucket
Throughput Metrics:
- GetThroughput: GET average throughput (MB/s) over the last periodic.stats_time interval.
  - Variable Labels: bucket
- PutThroughput: PUT average throughput (MB/s) over the last periodic.stats_time interval.
  - Variable Labels: bucket, xkind
Size Metrics:
- GetSize: GET total cumulative size (bytes).
  - Variable Labels: bucket
- PutSize: PUT total cumulative size (bytes).
  - Variable Labels: bucket, xkind
Error Metrics:
- ErrPutCksumCount: PUT number of checksum errors.
  - Variable Labels: bucket, xkind
- ErrFSHCCount: Number of times filesystem health checker (FSHC) was triggered by an I/O error or errors.
  - Variable Labels: mountpath
- IOErrGetCount: GET number of I/O errors (excluding remote backend and network errors).
  - Variable Labels: bucket
- IOErrDeleteCount: DELETE(object) number of I/O errors (excluding remote backend and network errors).
  - Variable Labels: bucket

For convenience, a table that summarizes target metrics follows below.

Internal name	Public name	Internal Type	Description (Prometheus help)	Prometheus labels
`disk.<DISK-NAME>.read.bps`	`disk_read_mbps`	computed-bandwidth	read bandwidth (MB/s)	map[disk:`<DISK-NAME>` node_id:`<AIS-NODE-ID>`]
`disk.<DISK-NAME>.avg.rsize`	`disk_avg_rsize`	gauge	average read size (bytes)	map[disk:`<DISK-NAME>` node_id:`<AIS-NODE-ID>`]
`disk.<DISK-NAME>.write.bps`	`disk_write_mbps`	computed-bandwidth	write bandwidth (MB/s)	map[disk:`<DISK-NAME>` node_id:`<AIS-NODE-ID>`]
`disk.<DISK-NAME>.avg.wsize`	`disk_avg_wsize`	gauge	average write size (bytes)	map[disk:`<DISK-NAME>` node_id:`<AIS-NODE-ID>`]
`disk.<DISK-NAME>.util`	`disk_util`	gauge	disk utilization (%)	map[disk:`<DISK-NAME>` node_id:`<AIS-NODE-ID>`]
`lru.evict.n`	`lru_evict_count`	counter	number of LRU evictions	default
`lru.evict.size`	`lru_evict_bytes`	size	total cumulative size (bytes) of LRU evictions	default
`cleanup.store.n`	`cleanup_store_count`	counter	space cleanup: number of removed misplaced objects and old work files	default
`cleanup.store.size`	`cleanup_store_bytes`	size	space cleanup: total size (bytes) of all removed misplaced objects and old work files (not including removed deleted objects)	default
`ver.change.n`	`ver_change_count`	counter	number of out-of-band updates (by a 3rd party performing remote PUTs from outside this cluster)	default
`ver.change.size`	`ver_change_bytes`	size	total cumulative size (bytes) of objects that were updated out-of-band across all backends combined	default
`remote.deleted.del.n`	`remote_deleted_del_count`	counter	number of out-of-band deletes (by a 3rd party remote DELETE(object) from outside this cluster)	default
`put.ns`	`put_ms`	latency	PUT: average time (milliseconds) over the last periodic.stats_time interval	default
`put.ns.total`	`put_ns_total`	total	PUT: total cumulative time (nanoseconds)	default
`append.ns`	`append_ms`	latency	APPEND(object): average time (milliseconds) over the last periodic.stats_time interval	default
`get.redir.ns`	`get_redir_ms`	latency	GET: average gateway-to-target HTTP redirect latency (milliseconds) over the last periodic.stats_time interval	default
`put.redir.ns`	`put_redir_ms`	latency	PUT: average gateway-to-target HTTP redirect latency (milliseconds) over the last periodic.stats_time interval	default
`ratelim.retry.get.n`	`ratelim_retry_get_n`	counter	GET: number of rate-limited retries triggered by remote backends returning 409 and 503 status codes	default
`ratelim.retry.get.ns.total`	`ratelim_retry_get_ns_total`	total	GET: total retrying time (nanoseconds) caused by remote backends returning 409 and 503 status codes	default
`ratelim.retry.put.n`	`ratelim_retry_put_n`	counter	PUT: number of rate-limited retries triggered by remote backends returning 409 and 503 status codes	default
`ratelim.retry.put.ns.total`	`ratelim_retry_put_ns_total`	total	PUT: total retrying time (nanoseconds) caused by remote backends returning 409 and 503 status codes	default
`get.bps`	`get_mbps`	bandwidth	GET: average throughput (MB/s) over the last periodic.stats_time interval	default
`put.bps`	`put_mbps`	bandwidth	PUT: average throughput (MB/s) over the last periodic.stats_time interval	default
`get.size`	`get_bytes`	size	GET: total cumulative size (bytes)	default
`put.size`	`put_bytes`	size	PUT: total cumulative size (bytes)	default
`err.cksum.n`	`err_cksum_count`	counter	PUT: number of checksum errors	default
`err.fshc.n`	`err_fshc_count`	counter	number of times filesystem health checker (FSHC) was triggered by an I/O error or errors	default
`err.io.get.n`	`err_io_get_count`	counter	GET: number of I/O errors not including remote backend and network errors	default
`err.io.put.n`	`err_io_put_count`	counter	PUT: number of I/O errors not including remote backend and network errors	default
`err.io.del.n`	`err_io_del_count`	counter	DELETE(object): number of I/O errors not including remote backend and network errors	default
`stream.out.n`	`stream_out_count`	counter	intra-cluster streaming communications: number of sent objects	default
`stream.out.size`	`stream_out_bytes`	size	intra-cluster streaming communications: total cumulative size (bytes) of all transmitted objects	default
`stream.in.n`	`stream_in_count`	counter	intra-cluster streaming communications: number of received objects	default
`stream.in.size`	`stream_in_bytes`	size	intra-cluster streaming communications: total cumulative size (bytes) of all received objects	default
`dl.size`	`dl_bytes`	size	total downloaded size (bytes)	default
`dl.ns.total`	`dl_ns_total`	total	total downloading time (nanoseconds)	default
`dsort.creation.req.n`	`dsort_creation_req_count`	counter	dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics	default
`dsort.creation.resp.n`	`dsort_creation_resp_count`	counter	dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics	default
`dsort.creation.resp.ns`	`dsort_creation_resp_ms`	latency	dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics	default
`dsort.extract.shard.dsk.n`	`dsort_extract_shard_dsk_count`	counter	dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics	default
`dsort.extract.shard.mem.n`	`dsort_extract_shard_mem_count`	counter	dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics	default
`dsort.extract.shard.size`	`dsort_extract_shard_bytes`	size	dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics	default
`lcache.collision.n`	`lcache_collision_count`	counter	number of LOM cache collisions (core, internal)	default
`lcache.evicted.n`	`lcache_evicted_count`	counter	number of LOM cache evictions (core, internal)	default
`lcache.flush.cold.n`	`lcache_flush_cold_count`	counter	number of times a LOM from cache was written to stable storage (core, internal)	default
`remais.get.n`	`remote_get_count`	counter	GET: total number of executed remote requests	map[backend:remais node_id:`<AIS-NODE-ID>`]
`remais.get.ns.total`	`remote_get_ns_total`	total	GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects	map[backend:remais node_id:`<AIS-NODE-ID>`]
`remais.get.size`	`remote_get_bytes_total`	size	GET: total cumulative size (bytes) of all remote GET transactions	map[backend:remais node_id:`<AIS-NODE-ID>`]
`remais.head.n`	`remote_head_count`	counter	HEAD: total number of executed remote requests to a given backend	map[backend:remais node_id:`<AIS-NODE-ID>`]
`remais.put.n`	`remote_put_count`	counter	PUT: total number of executed remote requests to a given backend	map[backend:remais node_id:`<AIS-NODE-ID>`]
`remais.put.ns.total`	`remote_put_ns_total`	total	PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster	map[backend:remais node_id:`<AIS-NODE-ID>`]
`remais.e2e.put.ns.total`	`remote_e2e_put_ns_total`	total	PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object	map[backend:remais node_id:`<AIS-NODE-ID>`]
`remais.put.size`	`remote_e2e_put_bytes_total`	size	PUT: total cumulative size (bytes) of all PUTs to a given remote backend	map[backend:remais node_id:`<AIS-NODE-ID>`]
`remais.ver.change.n`	`remote_ver_change_count`	counter	number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster)	map[backend:remais node_id:`<AIS-NODE-ID>`]
`remais.ver.change.size`	`remote_ver_change_bytes_total`	size	total cumulative size of objects that were updated out-of-band	map[backend:remais node_id:`<AIS-NODE-ID>`]
`gcp.get.n`	`remote_get_count`	counter	GET: total number of executed remote requests	map[backend:gcp node_id:`<AIS-NODE-ID>`]
`gcp.get.ns.total`	`remote_get_ns_total`	total	GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects	map[backend:gcp node_id:`<AIS-NODE-ID>`]
`gcp.get.size`	`remote_get_bytes_total`	size	GET: total cumulative size (bytes) of all remote transactions	map[backend:gcp node_id:`<AIS-NODE-ID>`]
`gcp.head.n`	`remote_head_count`	counter	HEAD: total number of executed remote requests to a given backend	map[backend:gcp node_id:`<AIS-NODE-ID>`]
`gcp.put.n`	`remote_put_count`	counter	PUT: total number of executed remote requests to a given backend	map[backend:gcp node_id:`<AIS-NODE-ID>`]
`gcp.put.ns.total`	`remote_put_ns_total`	total	PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster	map[backend:gcp node_id:`<AIS-NODE-ID>`]
`gcp.e2e.put.ns.total`	`remote_e2e_put_ns_total`	total	PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object	map[backend:gcp node_id:`<AIS-NODE-ID>`]
`gcp.put.size`	`remote_e2e_put_bytes_total`	size	PUT: total cumulative size (bytes) of all PUTs to a given remote backend	map[backend:gcp node_id:`<AIS-NODE-ID>`]
`gcp.ver.change.n`	`remote_ver_change_count`	counter	number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster)	map[backend:gcp node_id:`<AIS-NODE-ID>`]
`gcp.ver.change.size`	`remote_ver_change_bytes_total`	size	total cumulative size of objects that were updated out-of-band	map[backend:gcp node_id:`<AIS-NODE-ID>`]
`aws.get.n`	`remote_get_count`	counter	GET: total number of executed remote requests	map[backend:aws node_id:`<AIS-NODE-ID>`]
`aws.get.ns.total`	`remote_get_ns_total`	total	GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects	map[backend:aws node_id:`<AIS-NODE-ID>`]
`aws.get.size`	`remote_get_bytes_total`	size	GET: total cumulative size (bytes) of all remote transactions	map[backend:aws node_id:`<AIS-NODE-ID>`]
`aws.head.n`	`remote_head_count`	counter	HEAD: total number of executed remote requests to a given backend	map[backend:aws node_id:`<AIS-NODE-ID>`]
`aws.put.n`	`remote_put_count`	counter	PUT: total number of executed remote requests to a given backend	map[backend:aws node_id:`<AIS-NODE-ID>`]
`aws.put.ns.total`	`remote_put_ns_total`	total	PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster	map[backend:aws node_id:`<AIS-NODE-ID>`]
`aws.e2e.put.ns.total`	`remote_e2e_put_ns_total`	total	PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object	map[backend:aws node_id:`<AIS-NODE-ID>`]
`aws.put.size`	`remote_e2e_put_bytes_total`	size	PUT: total cumulative size (bytes) of all PUTs to a given remote backend	map[backend:aws node_id:`<AIS-NODE-ID>`]
`aws.ver.change.n`	`remote_ver_change_count`	counter	number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster)	map[backend:aws node_id:`<AIS-NODE-ID>`]
`aws.ver.change.size`	`remote_ver_change_bytes_total`	size	total cumulative size of objects that were updated out-of-band	map[backend:aws node_id:`<AIS-NODE-ID>`]
`azure.get.n`	`remote_get_count`	counter	GET: total number of executed remote requests	map[backend:azure node_id:`<AIS-NODE-ID>`]
`azure.get.ns.total`	`remote_get_ns_total`	total	GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects	map[backend:azure node_id:`<AIS-NODE-ID>`]
`azure.get.size`	`remote_get_bytes_total`	size	GET: total cumulative size (bytes) of all remote transactions	map[backend:azure node_id:`<AIS-NODE-ID>`]
`azure.head.n`	`remote_head_count`	counter	HEAD: total number of executed remote requests to a given backend	map[backend:azure node_id:`<AIS-NODE-ID>`]
`azure.put.n`	`remote_put_count`	counter	PUT: total number of executed remote requests to a given backend	map[backend:azure node_id:`<AIS-NODE-ID>`]
`azure.put.ns.total`	`remote_put_ns_total`	total	PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster	map[backend:azure node_id:`<AIS-NODE-ID>`]
`azure.e2e.put.ns.total`	`remote_e2e_put_ns_total`	total	PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object	map[backend:azure node_id:`<AIS-NODE-ID>`]
`azure.put.size`	`remote_e2e_put_bytes_total`	size	PUT: total cumulative size (bytes) of all PUTs to a given remote backend	map[backend:azure node_id:`<AIS-NODE-ID>`]
`azure.ver.change.n`	`remote_ver_change_count`	counter	number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster)	map[backend:azure node_id:`<AIS-NODE-ID>`]
`azure.ver.change.size`	`remote_ver_change_bytes_total`	size	total cumulative size of objects that were updated out-of-band	map[backend:azure node_id:`<AIS-NODE-ID>`]

Backend metrics

GET Metrics:
- remote_get_count: Total number of executed remote GET requests.
  - Variable Labels: bucket
- remote_get_ns_total: Total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects.
  - Variable Labels: bucket
- remote_get_bytes_total: Total cumulative size (bytes) of all remote GET transactions.
  - Variable Labels: bucket
PUT Metrics:
- remote_put_count: Total number of executed remote PUT requests to a given backend.
  - Variable Labels: bucket, xkind
- remote_put_ns_total: Total cumulative time (nanoseconds) to execute remote PUT requests and store new object versions in-cluster.
  - Variable Labels: bucket, xkind
- remote_e2e_put_ns_total: Total end-to-end time (nanoseconds) servicing remote PUT requests (includes receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object).
  - Variable Labels: bucket, xkind
- remote_e2e_put_bytes_total: Total cumulative size (bytes) of all PUTs to a given remote backend.
  - Variable Labels: bucket, xkind
HEAD Metrics:
- remote_head_count: Total number of executed remote HEAD requests to a given backend.
  - Variable Labels: bucket
- remote_head_ns_total: Total cumulative time (nanoseconds) to execute remote HEAD requests.
  - Variable Labels: bucket
Out-of-Band Updates:
- remote_ver_change_count: Number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster).
  - Variable Labels: bucket
- remote_ver_change_bytes_total: Total cumulative size (bytes) of objects that were updated out-of-band.
  - Variable Labels: bucket
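
Since the backend metrics are cumulative totals, per-backend averages come from dividing one total by the matching count. The following sketch shows this for remote GETs; the `scraped` dictionary, its numbers, and the `remote_get_summary` helper are hypothetical, and the result is an average over the whole lifetime of the counters rather than over a recent interval.

```python
# Hypothetical scraped values, keyed by (metric name, backend label).
scraped = {
    ("remote_get_count", "aws"): 500,
    ("remote_get_ns_total", "aws"): 25_000_000_000,
    ("remote_get_bytes_total", "aws"): 500 * 4 * 1024 * 1024,
    ("remote_get_count", "gcp"): 0,
    ("remote_get_ns_total", "gcp"): 0,
    ("remote_get_bytes_total", "gcp"): 0,
}


def remote_get_summary(scraped, backend):
    """Return (average latency in ms, average size in bytes), or None if no GETs."""
    count = scraped.get(("remote_get_count", backend), 0)
    if count == 0:
        return None
    avg_ms = scraped[("remote_get_ns_total", backend)] / count / 1e6
    avg_bytes = scraped[("remote_get_bytes_total", backend)] / count
    return avg_ms, avg_bytes


for backend in ("aws", "gcp"):
    print(backend, remote_get_summary(scraped, backend))
```

To get interval-based numbers instead, apply the same division to the differences between two scrapes.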

Related Documentation

Document	Description
Overview	Introduction to AIS observability
CLI	Command-line monitoring tools
Logs	Log-based observability
Prometheus	Configuring Prometheus with AIS
Grafana	Visualizing AIS metrics with Grafana
Kubernetes	Working with Kubernetes monitoring stacks
