AIStore Observability: Prometheus

AIStore (AIS) exposes metrics in Prometheus format via HTTP endpoints.

This integration enables comprehensive monitoring of AIS clusters, performance tracking, capacity planning, and long-term trend analysis.

Table of Contents

  • Overview
  • Monitoring Stack
  • Prometheus Integration
    • Native Exporter
    • Viewing Raw Metrics
    • Key Metric Groups
    • Metric Labels
    • Essential PromQL Queries
    • GetBatch (x-moss) Queries
  • Node Alerts
    • CLI Monitoring
    • Prometheus Queries
    • Grafana Alert Example
  • Best Practices
  • References
  • Related Documentation

Overview

AIS tracks a comprehensive set of performance metrics including:

  • Operation counters (GET/PUT/DELETE/etc.)
  • Resource utilization (CPU, memory, disk)
  • Latencies and throughput
  • Network and peer-to-peer streaming statistics
  • Extended actions (xactions)
  • Error counters and node state

AIS supports observability through several complementary tools:

  • Node logs (fine-grained operational events)

  • CLI for interactive monitoring (e.g., ais show cluster stats)

  • Monitoring backends:

    • Prometheus (recommended)
    • Grafana for dashboards & alerting

For load testing and benchmarking metrics, see AIS Load Generator and How To Benchmark AIStore.

A complete catalog of AIS metrics is available at: Monitoring Metrics Reference


Monitoring Stack

Typical Prometheus deployment:

┌────────────────┐       ┌────────────────┐
│                │ scrape│                │
│   Prometheus   │◄──────┤  AIStore Node  │
│                │       │    /metrics    │
└───────┬────────┘       └────────────────┘
        │
        │ query
        ▼
┌────────────────┐
│                │
│    Grafana     │
│                │
└────────────────┘

This stack provides:

  • Direct metric collection from AIS nodes
  • Centralized metric retention
  • Grafana visualization & alerting
  • Long-term performance & cost analysis

Prometheus Integration

Native Exporter

AIS acts as a first-class Prometheus exporter. Every node automatically:

  1. Registers all metrics at startup
  2. Exposes /metrics for Prometheus to scrape
  3. Uses Prometheus native formatting and metadata
  4. Works with both HTTP and HTTPS clusters

No configuration is required to “enable” Prometheus — it is always on.

AIS source metrics (put.size, get.ns, etc.) are exported with AIS naming conventions:

ais_target_<metric_name>{node_id="T1"} <value>

This document primarily uses the exported Prometheus names.


Viewing Raw Metrics

View metrics directly:

$ curl http://<node>:<port>/metrics
# or
$ curl https://<node>:<port>/metrics

Example:

# HELP ais_target_put_bytes total bytes served via PUT
# TYPE ais_target_put_bytes counter
ais_target_put_bytes{node_id="ClCt8081"} 1.721761792e+10
# HELP ais_target_put_ns_total total PUT latency (nanoseconds)
# TYPE ais_target_put_ns_total counter
ais_target_put_ns_total{node_id="ClCt8081"} 9.44367232e+09
# HELP ais_target_state_flags node state and alert flags
# TYPE ais_target_state_flags gauge
ais_target_state_flags{node_id="ClCt8081"} 6
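The raw output above is plain text, so it can be read without a Prometheus server. The following is a minimal sketch of a parser for it; `parse_metrics`, `SAMPLE_RE`, and `LABEL_RE` are illustrative names, not part of AIS or of any Prometheus client library, and the regular expressions cover only simple sample lines like the ones shown (no timestamps, no `}` inside label values).

```python
import re

# A sample line has the form: name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# The example output shown above, copied verbatim.
RAW = """\
# HELP ais_target_put_bytes total bytes served via PUT
# TYPE ais_target_put_bytes counter
ais_target_put_bytes{node_id="ClCt8081"} 1.721761792e+10
# HELP ais_target_put_ns_total total PUT latency (nanoseconds)
# TYPE ais_target_put_ns_total counter
ais_target_put_ns_total{node_id="ClCt8081"} 9.44367232e+09
# HELP ais_target_state_flags node state and alert flags
# TYPE ais_target_state_flags gauge
ais_target_state_flags{node_id="ClCt8081"} 6
"""

samples = parse_metrics(RAW)
for name, labels, value in samples:
    print(name, labels["node_id"], value)
```

For anything beyond quick inspection, a maintained parser for the Prometheus text format is a safer choice than hand-written regular expressions.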

To watch GET rates without Prometheus:

for i in {1..99999}; do
  curl -s http://hostname:8081/metrics | grep "ais_target_get_count"
  sleep 1
done
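The loop above prints the raw counter once per second; the GET rate is the difference between consecutive readings divided by the time between them. A minimal sketch of that calculation follows. `per_second_rate` is a hypothetical helper, the readings are made-up numbers, and treating a decreasing counter as a restart from zero is an assumption modeled on how counters are usually handled, not behavior documented by AIS.

```python
def per_second_rate(prev_value, curr_value, interval_seconds):
    """Approximate a counter's per-second rate between two readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards: assume the node restarted and the
        # counter began again from zero, so only the new value counts.
        delta = curr_value
    return delta / interval_seconds


# Hypothetical readings of ais_target_get_count taken one second apart.
print(per_second_rate(1000, 1250, 1.0))  # normal case
print(per_second_rate(1250, 40, 1.0))    # reading after an assumed restart
```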

Key Metric Groups

AIS organizes metrics into four major groups, reflected in the codebase and Prometheus exporter:

| Group | Description | Examples |
|---|---|---|
| 1. Datapath | GET/PUT counters, sizes, latencies, rate-limiting, I/O errors | ais_target_get_count, ais_target_put_bps, ais_target_ratelim_retry_get_n |
| 2. Metadata (in-memory) | Lcache activity (evictions, collisions) | ais_target_lcache_evicted_count |
| 3. Extended Actions (xactions) | Background & multi-object jobs: LRU, EC, rebalance, ETL, Download, DSort, GetBatch | ais_target_lru_evict_n, ais_target_getbatch_n |
| 4. Streams | Long-lived peer-to-peer (SharedDM) streaming channels | ais_target_streams_out_obj_n |

For GetBatch observability, see Monitoring GetBatch.


Metric Labels

AIS exposes labels for filtering and aggregation:

| Label | Usage |
|---|---|
| node_id | Node identity (target or gateway) |
| disk | Disk name for per-disk metrics |
| bucket | Source/destination bucket |
| xaction | Xaction UUID for multi-object jobs |
| slice | For erasure coding slice metrics |
| archpath | For per-file shard extraction (GetBatch) |

Labels enable PromQL queries such as:

sum by (node_id)(rate(ais_target_put_bytes[5m]))
sum by (disk)(ais_target_disk_util)
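To make the aggregation concrete, here is a small Python sketch of what `sum by (<label>)` does to a set of samples: it groups them by one label and adds the values in each group. `sum_by` is an illustrative helper and the sample values are invented for the example.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Group (labels, value) pairs by one label and sum each group."""
    totals = defaultdict(float)
    for labels, value in samples:
        # A sample without the label is grouped under the empty string.
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical per-disk utilization samples from two targets.
samples = [
    ({"node_id": "T1", "disk": "nvme0n1"}, 40.0),
    ({"node_id": "T1", "disk": "nvme1n1"}, 10.0),
    ({"node_id": "T2", "disk": "nvme0n1"}, 25.0),
]

print(sum_by(samples, "disk"))
print(sum_by(samples, "node_id"))
```

Grouping by `disk` merges same-named disks across nodes, while grouping by `node_id` yields one total per node; which one is appropriate depends on the question being asked.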

Essential PromQL Queries

GET operations per second

sum(rate(ais_target_get_count[5m]))

Average GET latency (ms)

sum(rate(ais_target_get_ns_total[5m]))
/ sum(rate(ais_target_get_count[5m]))
/ 1e6 # convert ns → ms
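The query above divides two rates taken over the same window, so the window length cancels and the result is simply the growth in total latency divided by the growth in request count. A worked sketch with made-up numbers follows; `average_latency_ms` is a hypothetical helper, not an AIS API.

```python
def average_latency_ms(ns_delta, count_delta):
    """Average latency in milliseconds from two counter increases."""
    if count_delta <= 0:
        return None  # no requests in the window, so no meaningful average
    return ns_delta / count_delta / 1e6  # ns per request, then ns -> ms


# Hypothetical 5-minute window: ais_target_get_ns_total grew by 6e11 ns
# while ais_target_get_count grew by 300,000 requests.
print(average_latency_ms(6.0e11, 300_000))
```

With these numbers each GET took 2,000,000 ns on average, which is 2 ms.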

Disk utilization

ais_target_disk_util{disk="nvme0n1"}

GET error percentage

sum(rate(ais_target_err_get_count[5m]))
/ sum(rate(ais_target_get_count[5m])) * 100

Total cluster capacity usage

sum(ais_target_capacity_used)
/
sum(ais_target_capacity_total)
* 100

GetBatch (x-moss) Queries

GetBatch is AIStore’s high-performance multi-object retrieval pipeline. Metrics describe throughput, composition (objects vs files), Rx stalls, throttling, and error behavior.

Work items per second

sum(rate(ais_target_getbatch_n[5m]))

Logical payload throughput

sum(rate(ais_target_getbatch_obj_size[5m]))
+
sum(rate(ais_target_getbatch_file_size[5m]))

Stall breakdown (RxWait vs Throttle)

sum(rate(ais_target_getbatch_rxwait_ns[5m]))
/
(
  sum(rate(ais_target_getbatch_rxwait_ns[5m])) +
  sum(rate(ais_target_getbatch_throttle_ns[5m]))
)
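The stall breakdown is a share: time spent in RxWait divided by the total stall time (RxWait plus throttle). The sketch below reproduces the arithmetic with invented numbers. `rxwait_share` is a hypothetical helper, and returning `None` when there were no stalls is a choice made for this example.

```python
def rxwait_share(rxwait_ns, throttle_ns):
    """Fraction of total stall time that was spent waiting on Rx."""
    total = rxwait_ns + throttle_ns
    if total == 0:
        return None  # no stalls at all in the window
    return rxwait_ns / total


# Hypothetical window: 3 s of RxWait and 1 s of throttling.
print(rxwait_share(3e9, 1e9))
```

A value near 1 points at waiting on peers as the dominant stall, while a value near 0 points at throttling.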

Soft vs Hard Error Rates

rate(ais_target_err_soft_getbatch_n[5m])
rate(ais_target_err_getbatch_n[5m])

Full details and operational guidance: → Monitoring GetBatch


Node Alerts

AIS nodes expose operational alerts and states via ais_target_state_flags. Flags indicate:

Red (critical)

  • OOS — Out of space
  • OOM — Out of memory
  • DiskFault — Disk failures
  • NumGoroutines — excessive goroutines
  • CertificateExpired — TLS expiry
  • KeepAliveErrors — peer connectivity issues

Warning

  • Rebalancing, Resilvering
  • RebalanceInterrupted, ResilverInterrupted
  • LowCapacity, LowMemory, LowCPU
  • NodeRestarted
  • MaintenanceMode
  • CertWillSoonExpire

Informational

  • ClusterStarted
  • NodeStarted
  • VoteInProgress

CLI Monitoring

$ ais show cluster

Prometheus Queries

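PromQL has no bitwise operators, so each query below tests a single flag bit with modulo arithmetic: for a power-of-two mask, flags % (2 * mask) >= mask holds exactly when that bit is set.

Critical:

ais_target_state_flags % 16384 >= 8192 # OOS
or ais_target_state_flags % 32768 >= 16384 # OOM
or ais_target_state_flags % 131072 >= 65536 # DiskFault

Warnings:

ais_target_state_flags % 8192 >= 4096 # LowCapacity
or ais_target_state_flags % 16384 >= 8192 # LowMemory

Grafana Alert Example

ais_target_state_flags{node_id=~"$node"} % 16384 >= 8192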
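Each alert query is a single-bit mask test against `ais_target_state_flags`. The sketch below shows that test in Python, both with bitwise AND and with an equivalent modulo form that is useful where bitwise operators are unavailable. The helper names are illustrative, and the sketch deliberately does not map bits to flag names; only the value 6 from the raw metrics example earlier in this document is used.

```python
def flag_is_set(flags, mask):
    """Bitwise test: true when the mask bit is set in flags."""
    return (flags & mask) != 0


def flag_is_set_mod(flags, mask):
    """Same test using only modulo; mask must be a single power of two."""
    return flags % (2 * mask) >= mask


def set_bits(flags):
    """List the power-of-two masks that are set in flags."""
    return [1 << i for i in range(flags.bit_length()) if flags & (1 << i)]


# Prometheus reports samples as floats, so convert before bit testing.
value = int(6.0)
print(set_bits(value))               # [2, 4]
print(flag_is_set(value, 4))         # True
print(flag_is_set_mod(value, 8192))  # False
```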

Best Practices

  1. Prometheus retention. Plan retention around performance analysis needs (14–30 days recommended).

  2. Dashboard segmentation. Maintain dashboards for:

    • Cluster overview
    • Node-level performance
    • Resource utilization
    • Error monitoring
    • Extended actions (GetBatch, rebalance, ETL)

  3. Alerts on critical states. Monitor:

    • Node state flags
    • Error spikes
    • Disk utilization
    • High throttle or rxwait stalls (GetBatch)

  4. Scrape frequency. Use 5–15 seconds for critical workloads and 30 seconds or more for low-traffic clusters.


Related Documentation

| Document | Description |
|---|---|
| Overview | AIS observability introduction |
| CLI | CLI monitoring and commands |
| Logs | Log-based observability |
| Metrics Reference | Full AIS metric catalog |
| Grafana | Grafana dashboards |
| Kubernetes | K8s deployment monitoring |
| GetBatch Monitoring | Multi-object retrieval metrics and analysis |

Separately, Prometheus references:

  • Prometheus Exporters
  • Prometheus Data Model
  • Prometheus Metric Types