AIStore Observability: Metrics Reference

AIStore (AIS) exposes a comprehensive set of metrics that provide insights into system performance, resource utilization, and operational status. This reference catalogs available metrics with descriptions and usage guidance.

Prometheus: major changes in v3.26

  • So-called default go_* counters and gauges (go_gc*, go_memstats*, etc.) are completely gone
  • Metrics are now updated directly in real time
    • Previously: periodically via prometheus.Collect interface
    • See related note in stats/prom.go
  • AIS is no longer publishing internally computed latencies and throughputs
  • Use *.ns.total (nanoseconds) and *.size (bytes) metrics to compute latency and throughput, respectively
  • In addition to total aggregated numbers, there are now separately computed per-backend latency and throughput numbers
    • For instance, those with the aws. prefix.
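
Since AIS no longer publishes computed latency and throughput, they can be derived from two scrapes of the cumulative counters. The following is a minimal sketch of that arithmetic, not AIS code: the Sample type, its field names, and the choice of 1 MB = 10^6 bytes are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One scrape of the cumulative counters for a single operation (e.g. GET)."""
    timestamp_s: float  # time of the scrape, in seconds
    count: int          # cumulative number of requests (e.g. get.n)
    ns_total: int       # cumulative time in nanoseconds (e.g. get.ns.total)
    size_bytes: int     # cumulative size in bytes (e.g. get.size)


def avg_latency_ms(prev: Sample, curr: Sample) -> float:
    """Average per-request latency (milliseconds) between two scrapes."""
    requests = curr.count - prev.count
    if requests <= 0:
        return 0.0  # no traffic in the window, or the counter was reset
    return (curr.ns_total - prev.ns_total) / requests / 1e6


def throughput_mb_s(prev: Sample, curr: Sample) -> float:
    """Average throughput (MB/s, with 1 MB = 10^6 bytes) between two scrapes."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        return 0.0
    return (curr.size_bytes - prev.size_bytes) / elapsed / 1e6


prev = Sample(timestamp_s=0.0, count=1_000, ns_total=5_000_000_000, size_bytes=0)
curr = Sample(timestamp_s=10.0, count=1_500, ns_total=7_500_000_000, size_bytes=500_000_000)
print(avg_latency_ms(prev, curr))   # 5.0
print(throughput_mb_s(prev, curr))  # 50.0
```

Working on deltas rather than raw totals gives the average over the scrape window instead of the average since node startup.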

Variable labels

Each AIS metric carries node_id - a static label in Prometheus terminology.

Starting with v3.26, the majority of metrics also carry variable labels:

  • Variable Labels:
    • bucket: Name of the associated bucket.
    • xkind: Job kind.
    • mountpath: Mountpath.
  • All I/O metrics now carry the bucket name (or Cname, to be precise) as a Prometheus variable label
  • All in-cluster writes generated by xactions (jobs) also carry the xkind label: the kind of the respective xaction
    • One major side-effect of the above is that we will now see more PUT metrics, and not only those that result from user PUT requests
  • All GET, PUT, and DELETE errors also have the bucket label
  • All FSHC-related errors (the so-called I/O errors) carry the mountpath (i.e., faulty disk) label.
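
To illustrate how the variable labels can be used, the sketch below groups a counter by bucket and by xkind. It parses a few lines in the Prometheus text exposition format with a deliberately simple regular expression (no escape handling). The metric name ais_target_put_count, the label values, and the empty xkind for user-initiated PUTs are all assumptions made for the example, not confirmed AIS output.

```python
import re
from collections import defaultdict

# Hypothetical scrape output: metric name and label values are made up.
SAMPLE = """
# TYPE ais_target_put_count counter
ais_target_put_count{bucket="ais://nnn",node_id="t1",xkind=""} 10
ais_target_put_count{bucket="ais://nnn",node_id="t1",xkind="copy-bck"} 4
ais_target_put_count{bucket="s3://abc",node_id="t1",xkind=""} 7
"""

LINE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Yield (name, labels, value) for each labeled sample line."""
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # skips blank lines, comments, and unlabeled samples
            name, labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(labels)), float(value)


def sum_by(text, metric, label):
    """Sum one metric's samples, grouped by the value of a single label."""
    totals = defaultdict(float)
    for name, labels, value in parse(text):
        if name == metric:
            totals[labels.get(label, "")] += value
    return dict(totals)


print(sum_by(SAMPLE, "ais_target_put_count", "bucket"))
# {'ais://nnn': 14.0, 's3://abc': 7.0}
print(sum_by(SAMPLE, "ais_target_put_count", "xkind"))
# {'': 17.0, 'copy-bck': 4.0}
```

Grouping by xkind separates PUTs generated by jobs from the rest, which is the side effect described in the list above.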

Common metrics: AIS targets and gateways

  • Request Metrics:
    • GetCount: Total number of executed GET(object) requests.
      • Variable Labels: bucket
    • PutCount: Total number of executed PUT(object) requests.
      • Variable Labels: bucket, xkind
    • HeadCount: Total number of executed HEAD(object) requests (currently only remote HEAD).
      • Variable Labels: bucket
    • AppendCount: Total number of executed APPEND(object) requests.
      • Variable Labels: bucket
    • DeleteCount: Total number of executed DELETE(object) requests.
      • Variable Labels: bucket
    • RenameCount: Total number of executed rename(object) requests.
      • Variable Labels: bucket
    • ListCount: Total number of executed list-objects requests.
      • Variable Labels: bucket

Common Error Counters

  • Error Metrics:
    • ErrGetCount: Total number of GET(object) errors.
      • Variable Labels: bucket
    • ErrPutCount: Total number of PUT(object) errors.
      • Variable Labels: bucket, xkind
    • ErrHeadCount: Total number of HEAD(object) errors.
      • Variable Labels: bucket
    • ErrAppendCount: Total number of APPEND(object) errors.
      • Variable Labels: bucket
    • ErrDeleteCount: Total number of DELETE(object) errors.
      • Variable Labels: bucket
    • ErrRenameCount: Total number of rename(object) errors.
      • Variable Labels: bucket
    • ErrListCount: Total number of list-objects errors.
      • Variable Labels: bucket
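
Request and error counters can be combined into a per-bucket error ratio. The sketch below assumes that the request counter includes failed requests (this reference does not say either way) and that both counters were sampled at the same two points in time. The function name and input layout are hypothetical.

```python
def error_ratios(prev, curr):
    """Per-bucket error ratio over the window between two scrapes.

    prev and curr map bucket name -> (request_count, error_count),
    where both values are cumulative counters.
    """
    ratios = {}
    for bucket, (reqs, errs) in curr.items():
        prev_reqs, prev_errs = prev.get(bucket, (0, 0))
        d_reqs, d_errs = reqs - prev_reqs, errs - prev_errs
        if d_reqs > 0 and d_errs >= 0:  # skip idle buckets and counter resets
            ratios[bucket] = d_errs / d_reqs
    return ratios


prev = {"ais://nnn": (1_000, 10), "s3://abc": (500, 0)}
curr = {"ais://nnn": (1_200, 20), "s3://abc": (500, 0)}
print(error_ratios(prev, curr))  # {'ais://nnn': 0.05}
```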

Common Latencies

  • Latency Metrics:
    • GetLatency: GET average time (milliseconds) over the last periodic.stats_time interval.
      • Variable Labels: bucket
    • GetLatencyTotal: GET total cumulative time (nanoseconds).
      • Variable Labels: bucket
    • ListLatency: List-objects average time (milliseconds) over the last periodic.stats_time interval.
      • Variable Labels: bucket

For convenience, we also include here a (somewhat redundant) table that summarizes common metrics.

| Internal name | Public name | Internal type | Description (Prometheus help) | Prometheus labels |
| --- | --- | --- | --- | --- |
| get.n | get_count | counter | total number of executed GET(object) requests | default |
| put.n | put_count | counter | total number of executed PUT(object) requests | default |
| head.n | head_count | counter | total number of executed HEAD(object) requests | default |
| append.n | append_count | counter | total number of executed APPEND(object) requests | default |
| del.n | del_count | counter | total number of executed DELETE(object) requests | default |
| ren.n | ren_count | counter | total number of executed rename(object) requests | default |
| lst.n | lst_count | counter | total number of executed list-objects requests | default |
| err.get.n | err_get_count | counter | total number of GET(object) errors | default |
| err.put.n | err_put_count | counter | total number of PUT(object) errors | default |
| err.head.n | err_head_count | counter | total number of HEAD(object) errors | default |
| err.append.n | err_append_count | counter | total number of APPEND(object) errors | default |
| err.del.n | err_del_count | counter | total number of DELETE(object) errors | default |
| err.ren.n | err_ren_count | counter | total number of rename(object) errors | default |
| err.lst.n | err_lst_count | counter | total number of list-objects errors | default |
| err.http.write.n | err_http_write_count | counter | total number of HTTP write-response errors | default |
| err.dl.n | err_dl_count | counter | downloader: number of download errors | default |
| err.put.mirror.n | err_put_mirror_count | counter | number of n-way mirroring errors | default |
| get.ns | get_ms | latency | GET: average time (milliseconds) over the last periodic.stats_time interval | default |
| get.ns.total | get_ns_total | total | GET: total cumulative time (nanoseconds) | default |
| lst.ns | lst_ms | latency | list-objects: average time (milliseconds) over the last periodic.stats_time interval | default |
| kalive.ns | kalive_ms | latency | in-cluster keep-alive (heartbeat): average time (milliseconds) over the last periodic.stats_time interval | default |
| up.ns.time | uptime | special | this node’s uptime since its startup (seconds) | default |
| state.flags | state_flags | gauge | bitwise 64-bit value that carries enumerated node-state flags, including warnings and alerts; see https://github.com/NVIDIA/aistore/blob/main/cmn/cos/node_state.go | |
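
The state.flags gauge in the table above is a bitmask, so reading it means testing individual bits. The sketch below shows only the mechanics. The flag names and bit positions are hypothetical placeholders; the actual definitions are in the node_state.go file linked from the table.

```python
from enum import IntFlag


class NodeState(IntFlag):
    # HYPOTHETICAL flags for illustration only; see node_state.go
    # for the actual names and bit positions.
    EXAMPLE_WARNING = 1 << 0
    EXAMPLE_LOW_CAPACITY = 1 << 1
    EXAMPLE_DISK_FAULT = 1 << 2


def decode(value: int) -> list[str]:
    """Return the names of all flags whose bit is set in value."""
    return [flag.name for flag in NodeState if value & flag]


# Prometheus sample values are floats, so convert before testing bits.
scraped = 5.0
print(decode(int(scraped)))  # ['EXAMPLE_WARNING', 'EXAMPLE_DISK_FAULT']
```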

Target metrics

  • Out-of-Band Metrics:

    • VerChangeCount: Number of out-of-band updates (by a 3rd party performing remote PUTs from outside this cluster).
      • Variable Labels: bucket
    • VerChangeSize: Total cumulative size (bytes) of objects updated out-of-band across all backends combined.
      • Variable Labels: bucket
    • RemoteDeletedDelCount: Number of out-of-band deletes (by a 3rd party remote DELETE(object) from outside this cluster).
      • Variable Labels: bucket
  • PUT Latency Metrics:

    • PutLatency: PUT average time (milliseconds) over the last periodic.stats_time interval.
      • Variable Labels: bucket, xkind
    • PutLatencyTotal: PUT total cumulative time (nanoseconds).
      • Variable Labels: bucket, xkind
  • HEAD Latency Metrics:

    • HeadLatencyTotal: HEAD total cumulative time (nanoseconds).
      • Variable Labels: bucket
  • APPEND Latency Metrics:

    • AppendLatency: APPEND average time (milliseconds) over the last periodic.stats_time interval.
      • Variable Labels: bucket
  • Throughput Metrics:

    • GetThroughput: GET average throughput (MB/s) over the last periodic.stats_time interval.
      • Variable Labels: bucket
    • PutThroughput: PUT average throughput (MB/s) over the last periodic.stats_time interval.
      • Variable Labels: bucket, xkind
  • Size Metrics:

    • GetSize: GET total cumulative size (bytes).
      • Variable Labels: bucket
    • PutSize: PUT total cumulative size (bytes).
      • Variable Labels: bucket, xkind
  • Error Metrics:

    • ErrPutCksumCount: PUT number of checksum errors.
      • Variable Labels: bucket, xkind
    • ErrFSHCCount: Number of times filesystem health checker (FSHC) was triggered by an I/O error or errors.
      • Variable Labels: mountpath
    • IOErrGetCount: GET number of I/O errors (excluding remote backend and network errors).
      • Variable Labels: bucket
    • IOErrDeleteCount: DELETE(object) number of I/O errors (excluding remote backend and network errors).
      • Variable Labels: bucket

For convenience, a table that summarizes target metrics follows below.

| Internal name | Public name | Internal type | Description (Prometheus help) | Prometheus labels |
| --- | --- | --- | --- | --- |
| disk.<DISK-NAME>.read.bps | disk_read_mbps | computed-bandwidth | read bandwidth (MB/s) | map[disk:<DISK-NAME> node_id:<AIS-NODE-ID>] |
| disk.<DISK-NAME>.avg.rsize | disk_avg_rsize | gauge | average read size (bytes) | map[disk:<DISK-NAME> node_id:<AIS-NODE-ID>] |
| disk.<DISK-NAME>.write.bps | disk_write_mbps | computed-bandwidth | write bandwidth (MB/s) | map[disk:<DISK-NAME> node_id:<AIS-NODE-ID>] |
| disk.<DISK-NAME>.avg.wsize | disk_avg_wsize | gauge | average write size (bytes) | map[disk:<DISK-NAME> node_id:<AIS-NODE-ID>] |
| disk.<DISK-NAME>.util | disk_util | gauge | disk utilization (%) | map[disk:<DISK-NAME> node_id:<AIS-NODE-ID>] |
| lru.evict.n | lru_evict_count | counter | number of LRU evictions | default |
| lru.evict.size | lru_evict_bytes | size | total cumulative size (bytes) of LRU evictions | default |
| cleanup.store.n | cleanup_store_count | counter | space cleanup: number of removed misplaced objects and old work files | default |
| cleanup.store.size | cleanup_store_bytes | size | space cleanup: total size (bytes) of all removed misplaced objects and old work files (not including removed deleted objects) | default |
| ver.change.n | ver_change_count | counter | number of out-of-band updates (by a 3rd party performing remote PUTs from outside this cluster) | default |
| ver.change.size | ver_change_bytes | size | total cumulative size (bytes) of objects that were updated out-of-band across all backends combined | default |
| remote.deleted.del.n | remote_deleted_del_count | counter | number of out-of-band deletes (by a 3rd party remote DELETE(object) from outside this cluster) | default |
| put.ns | put_ms | latency | PUT: average time (milliseconds) over the last periodic.stats_time interval | default |
| put.ns.total | put_ns_total | total | PUT: total cumulative time (nanoseconds) | default |
| append.ns | append_ms | latency | APPEND(object): average time (milliseconds) over the last periodic.stats_time interval | default |
| get.redir.ns | get_redir_ms | latency | GET: average gateway-to-target HTTP redirect latency (milliseconds) over the last periodic.stats_time interval | default |
| put.redir.ns | put_redir_ms | latency | PUT: average gateway-to-target HTTP redirect latency (milliseconds) over the last periodic.stats_time interval | default |
| ratelim.retry.get.n | ratelim_retry_get_n | counter | GET: number of rate-limited retries triggered by remote backends returning 409 and 503 status codes | default |
| ratelim.retry.get.ns.total | ratelim_retry_get_ns_total | total | GET: total retrying time (nanoseconds) caused by remote backends returning 409 and 503 status codes | default |
| ratelim.retry.put.n | ratelim_retry_put_n | counter | PUT: number of rate-limited retries triggered by remote backends returning 409 and 503 status codes | default |
| ratelim.retry.put.ns.total | ratelim_retry_put_ns_total | total | PUT: total retrying time (nanoseconds) caused by remote backends returning 409 and 503 status codes | default |
| get.bps | get_mbps | bandwidth | GET: average throughput (MB/s) over the last periodic.stats_time interval | default |
| put.bps | put_mbps | bandwidth | PUT: average throughput (MB/s) over the last periodic.stats_time interval | default |
| get.size | get_bytes | size | GET: total cumulative size (bytes) | default |
| put.size | put_bytes | size | PUT: total cumulative size (bytes) | default |
| err.cksum.n | err_cksum_count | counter | PUT: number of checksum errors | default |
| err.fshc.n | err_fshc_count | counter | number of times filesystem health checker (FSHC) was triggered by an I/O error or errors | default |
| err.io.get.n | err_io_get_count | counter | GET: number of I/O errors not including remote backend and network errors | default |
| err.io.put.n | err_io_put_count | counter | PUT: number of I/O errors not including remote backend and network errors | default |
| err.io.del.n | err_io_del_count | counter | DELETE(object): number of I/O errors not including remote backend and network errors | default |
| stream.out.n | stream_out_count | counter | intra-cluster streaming communications: number of sent objects | default |
| stream.out.size | stream_out_bytes | size | intra-cluster streaming communications: total cumulative size (bytes) of all transmitted objects | default |
| stream.in.n | stream_in_count | counter | intra-cluster streaming communications: number of received objects | default |
| stream.in.size | stream_in_bytes | size | intra-cluster streaming communications: total cumulative size (bytes) of all received objects | default |
| dl.size | dl_bytes | size | total downloaded size (bytes) | default |
| dl.ns.total | dl_ns_total | total | total downloading time (nanoseconds) | default |
| dsort.creation.req.n | dsort_creation_req_count | counter | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
| dsort.creation.resp.n | dsort_creation_resp_count | counter | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
| dsort.creation.resp.ns | dsort_creation_resp_ms | latency | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
| dsort.extract.shard.dsk.n | dsort_extract_shard_dsk_count | counter | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
| dsort.extract.shard.mem.n | dsort_extract_shard_mem_count | counter | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
| dsort.extract.shard.size | dsort_extract_shard_bytes | size | dsort: see https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md#metrics | default |
| lcache.collision.n | lcache_collision_count | counter | number of LOM cache collisions (core, internal) | default |
| lcache.evicted.n | lcache_evicted_count | counter | number of LOM cache evictions (core, internal) | default |
| lcache.flush.cold.n | lcache_flush_cold_count | counter | number of times a LOM from cache was written to stable storage (core, internal) | default |
| remais.get.n | remote_get_count | counter | GET: total number of executed remote requests | map[backend:remais node_id:<AIS-NODE-ID>] |
| remais.get.ns.total | remote_get_ns_total | total | GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects | map[backend:remais node_id:<AIS-NODE-ID>] |
| remais.get.size | remote_get_bytes_total | size | GET: total cumulative size (bytes) of all remote GET transactions | map[backend:remais node_id:<AIS-NODE-ID>] |
| remais.head.n | remote_head_count | counter | HEAD: total number of executed remote requests to a given backend | map[backend:remais node_id:<AIS-NODE-ID>] |
| remais.put.n | remote_put_count | counter | PUT: total number of executed remote requests to a given backend | map[backend:remais node_id:<AIS-NODE-ID>] |
| remais.put.ns.total | remote_put_ns_total | total | PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster | map[backend:remais node_id:<AIS-NODE-ID>] |
| remais.e2e.put.ns.total | remote_e2e_put_ns_total | total | PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object | map[backend:remais node_id:<AIS-NODE-ID>] |
| remais.put.size | remote_e2e_put_bytes_total | size | PUT: total cumulative size (bytes) of all PUTs to a given remote backend | map[backend:remais node_id:<AIS-NODE-ID>] |
| remais.ver.change.n | remote_ver_change_count | counter | number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster) | map[backend:remais node_id:<AIS-NODE-ID>] |
| remais.ver.change.size | remote_ver_change_bytes_total | size | total cumulative size of objects that were updated out-of-band | map[backend:remais node_id:<AIS-NODE-ID>] |
| gcp.get.n | remote_get_count | counter | GET: total number of executed remote requests | map[backend:gcp node_id:<AIS-NODE-ID>] |
| gcp.get.ns.total | remote_get_ns_total | total | GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects | map[backend:gcp node_id:<AIS-NODE-ID>] |
| gcp.get.size | remote_get_bytes_total | size | GET: total cumulative size (bytes) of all remote transactions | map[backend:gcp node_id:<AIS-NODE-ID>] |
| gcp.head.n | remote_head_count | counter | HEAD: total number of executed remote requests to a given backend | map[backend:gcp node_id:<AIS-NODE-ID>] |
| gcp.put.n | remote_put_count | counter | PUT: total number of executed remote requests to a given backend | map[backend:gcp node_id:<AIS-NODE-ID>] |
| gcp.put.ns.total | remote_put_ns_total | total | PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster | map[backend:gcp node_id:<AIS-NODE-ID>] |
| gcp.e2e.put.ns.total | remote_e2e_put_ns_total | total | PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object | map[backend:gcp node_id:<AIS-NODE-ID>] |
| gcp.put.size | remote_e2e_put_bytes_total | size | PUT: total cumulative size (bytes) of all PUTs to a given remote backend | map[backend:gcp node_id:<AIS-NODE-ID>] |
| gcp.ver.change.n | remote_ver_change_count | counter | number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster) | map[backend:gcp node_id:<AIS-NODE-ID>] |
| gcp.ver.change.size | remote_ver_change_bytes_total | size | total cumulative size of objects that were updated out-of-band | map[backend:gcp node_id:<AIS-NODE-ID>] |
| aws.get.n | remote_get_count | counter | GET: total number of executed remote requests | map[backend:aws node_id:<AIS-NODE-ID>] |
| aws.get.ns.total | remote_get_ns_total | total | GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects | map[backend:aws node_id:<AIS-NODE-ID>] |
| aws.get.size | remote_get_bytes_total | size | GET: total cumulative size (bytes) of all remote transactions | map[backend:aws node_id:<AIS-NODE-ID>] |
| aws.head.n | remote_head_count | counter | HEAD: total number of executed remote requests to a given backend | map[backend:aws node_id:<AIS-NODE-ID>] |
| aws.put.n | remote_put_count | counter | PUT: total number of executed remote requests to a given backend | map[backend:aws node_id:<AIS-NODE-ID>] |
| aws.put.ns.total | remote_put_ns_total | total | PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster | map[backend:aws node_id:<AIS-NODE-ID>] |
| aws.e2e.put.ns.total | remote_e2e_put_ns_total | total | PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object | map[backend:aws node_id:<AIS-NODE-ID>] |
| aws.put.size | remote_e2e_put_bytes_total | size | PUT: total cumulative size (bytes) of all PUTs to a given remote backend | map[backend:aws node_id:<AIS-NODE-ID>] |
| aws.ver.change.n | remote_ver_change_count | counter | number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster) | map[backend:aws node_id:<AIS-NODE-ID>] |
| aws.ver.change.size | remote_ver_change_bytes_total | size | total cumulative size of objects that were updated out-of-band | map[backend:aws node_id:<AIS-NODE-ID>] |
| azure.get.n | remote_get_count | counter | GET: total number of executed remote requests | map[backend:azure node_id:<AIS-NODE-ID>] |
| azure.get.ns.total | remote_get_ns_total | total | GET: total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects | map[backend:azure node_id:<AIS-NODE-ID>] |
| azure.get.size | remote_get_bytes_total | size | GET: total cumulative size (bytes) of all remote transactions | map[backend:azure node_id:<AIS-NODE-ID>] |
| azure.head.n | remote_head_count | counter | HEAD: total number of executed remote requests to a given backend | map[backend:azure node_id:<AIS-NODE-ID>] |
| azure.put.n | remote_put_count | counter | PUT: total number of executed remote requests to a given backend | map[backend:azure node_id:<AIS-NODE-ID>] |
| azure.put.ns.total | remote_put_ns_total | total | PUT: total cumulative time (nanoseconds) to execute remote requests and store new object versions in-cluster | map[backend:azure node_id:<AIS-NODE-ID>] |
| azure.e2e.put.ns.total | remote_e2e_put_ns_total | total | PUT: total end-to-end time (nanoseconds) servicing remote requests; includes: receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object | map[backend:azure node_id:<AIS-NODE-ID>] |
| azure.put.size | remote_e2e_put_bytes_total | size | PUT: total cumulative size (bytes) of all PUTs to a given remote backend | map[backend:azure node_id:<AIS-NODE-ID>] |
| azure.ver.change.n | remote_ver_change_count | counter | number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster) | map[backend:azure node_id:<AIS-NODE-ID>] |
| azure.ver.change.size | remote_ver_change_bytes_total | size | total cumulative size of objects that were updated out-of-band | map[backend:azure node_id:<AIS-NODE-ID>] |
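
Many rows in the table above come in pairs: a count (.n) next to a cumulative total (.size or .ns.total). Dividing one by the other gives a per-event average. The sketch below uses the public names from the table, while the snapshot values and the per_event helper are hypothetical. Because it divides raw totals, the results are averages since node startup; use differences between two scrapes for a time window.

```python
def per_event(total, count):
    """Average of a cumulative total per counted event; None if nothing was counted."""
    return total / count if count else None


# Hypothetical counter values keyed by the public names from the table.
snapshot = {
    "ratelim_retry_get_n": 40,
    "ratelim_retry_get_ns_total": 2_000_000_000,
    "stream_out_count": 1_000,
    "stream_out_bytes": 4_096_000,
    "lru_evict_count": 0,
    "lru_evict_bytes": 0,
}

# Average time spent per rate-limited GET retry, in milliseconds.
retry_ns = per_event(snapshot["ratelim_retry_get_ns_total"], snapshot["ratelim_retry_get_n"])
print(retry_ns / 1e6)  # 50.0

# Average size (bytes) of an object sent over intra-cluster streams.
print(per_event(snapshot["stream_out_bytes"], snapshot["stream_out_count"]))  # 4096.0

# No evictions yet, so there is no average to report.
print(per_event(snapshot["lru_evict_bytes"], snapshot["lru_evict_count"]))  # None
```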

Backend metrics

  • GET Metrics:

    • remote_get_count: Total number of executed remote GET requests.
      • Variable Labels: bucket
    • remote_get_ns_total: Total cumulative time (nanoseconds) to execute remote requests and store, copy, or transform objects.
      • Variable Labels: bucket
    • remote_get_bytes_total: Total cumulative size (bytes) of all remote GET transactions.
      • Variable Labels: bucket
  • PUT Metrics:

    • remote_put_count: Total number of executed remote PUT requests to a given backend.
      • Variable Labels: bucket, xkind
    • remote_put_ns_total: Total cumulative time (nanoseconds) to execute remote PUT requests and store new object versions in-cluster.
      • Variable Labels: bucket, xkind
    • remote_e2e_put_ns_total: Total end-to-end time (nanoseconds) servicing remote PUT requests (includes receiving PUT payload, storing it in-cluster, executing remote PUT, finalizing new in-cluster object).
      • Variable Labels: bucket, xkind
    • remote_e2e_put_bytes_total: Total cumulative size (bytes) of all PUTs to a given remote backend.
      • Variable Labels: bucket, xkind
  • HEAD Metrics:

    • remote_head_count: Total number of executed remote HEAD requests to a given backend.
      • Variable Labels: bucket
    • remote_head_ns_total: Total cumulative time (nanoseconds) to execute remote HEAD requests.
      • Variable Labels: bucket
  • Out-of-Band Updates:

    • remote_ver_change_count: Number of out-of-band updates (by a 3rd party performing remote PUTs outside this cluster).
      • Variable Labels: bucket
    • remote_ver_change_bytes_total: Total cumulative size (bytes) of objects that were updated out-of-band.
      • Variable Labels: bucket
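
The backend counters above can be reduced to per-request averages for each backend. The sketch below is illustrative only: the input layout, the summarize helper, and the sample values are hypothetical. It also assumes that remote_e2e_put_ns_total and remote_put_ns_total cover the same set of requests counted by remote_put_count, which this reference does not state explicitly.

```python
def avg_ms(ns_total, count):
    """Average time per request in milliseconds; None if there were no requests."""
    return ns_total / count / 1e6 if count else None


def summarize(counters):
    """Per-request averages for one backend's cumulative counters."""
    puts = counters["remote_put_count"]
    return {
        "get_ms": avg_ms(counters["remote_get_ns_total"], counters["remote_get_count"]),
        "put_ms": avg_ms(counters["remote_put_ns_total"], puts),
        "e2e_put_ms": avg_ms(counters["remote_e2e_put_ns_total"], puts),
    }


# Hypothetical values, keyed by the value of the backend label.
backends = {
    "aws": {
        "remote_get_count": 200, "remote_get_ns_total": 8_000_000_000,
        "remote_put_count": 50, "remote_put_ns_total": 5_000_000_000,
        "remote_e2e_put_ns_total": 6_000_000_000,
    },
    "gcp": {
        "remote_get_count": 0, "remote_get_ns_total": 0,
        "remote_put_count": 0, "remote_put_ns_total": 0,
        "remote_e2e_put_ns_total": 0,
    },
}

for name, counters in backends.items():
    print(name, summarize(counters))
# aws {'get_ms': 40.0, 'put_ms': 100.0, 'e2e_put_ms': 120.0}
# gcp {'get_ms': None, 'put_ms': None, 'e2e_put_ms': None}
```

Comparing put_ms with e2e_put_ms gives a rough sense of how much of the end-to-end PUT time is spent outside the remote call, though the two descriptions overlap and the difference should be read as an indicator rather than an exact breakdown.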

| Document | Description |
| --- | --- |
| Overview | Introduction to AIS observability |
| CLI | Command-line monitoring tools |
| Logs | Log-based observability |
| Prometheus | Configuring Prometheus with AIS |
| Grafana | Visualizing AIS metrics with Grafana |
| Kubernetes | Working with Kubernetes monitoring stacks |