AIStore Observability

This document provides an overview of AIStore (AIS) observability features, tools, and practices. AIS offers comprehensive observability through logs, metrics, and a CLI interface, enabling users to monitor, debug, and optimize their deployments.

Observability Architecture

AIS provides multiple layers of observability:

```text
┌─────────────────────────────────┐
│       Visualization Layer       │
│  ┌───────────┐   ┌───────────┐  │
│  │  Grafana  │   │  Custom   │  │
│  │ Dashboard │   │    UIs    │  │
│  └───────────┘   └───────────┘  │
├─────────────────────────────────┤
│         Collection Layer        │
│          ┌───────────┐          │
│          │ Prometheus│          │
│          │           │          │
│          └───────────┘          │
├─────────────────────────────────┤
│      Instrumentation Layer      │
│  ┌───────────┐   ┌───────────┐  │
│  │  Metrics  │   │   Logs    │  │
│  │ Endpoints │   │           │  │
│  └───────────┘   └───────────┘  │
├─────────────────────────────────┤
│           Access Layer          │
│  ┌───────────┐   ┌───────────┐  │
│  │    CLI    │   │   REST    │  │
│  │ Interface │   │   APIs    │  │
│  └───────────┘   └───────────┘  │
└─────────────────────────────────┘
```

Metrics Backend

AIStore exposes metrics via Prometheus.

StatsD was deprecated in v3.28 (Spring 2025) and completely removed in v4.0 (September 2025).
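Each AIS node serves its metrics over HTTP in the standard Prometheus text exposition format. The sketch below parses that format into a name-to-value map; in a live deployment the payload would come from an HTTP GET of a node's metrics endpoint, while here a sample payload keeps the snippet self-contained. The metric names in the sample are illustrative assumptions, not actual AIS output.

```python
# Minimal parser for the Prometheus text exposition format.
# The sample metric names below are illustrative, not actual AIS metrics.

def parse_exposition(text: str) -> dict[str, float]:
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        # Last space separates the (possibly labeled) name from the value.
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP ais_target_get_count number of GET requests (illustrative name)
# TYPE ais_target_get_count counter
ais_target_get_count{node_id="T1"} 3200
ais_target_get_bytes{node_id="T1"} 12000000000
"""

parsed = parse_exposition(sample)
print(parsed['ais_target_get_count{node_id="T1"}'])  # 3200.0
```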

Observability Methods

| Method | Description | Use Cases | Documentation |
|--------|-------------|-----------|---------------|
| CLI | Command-line tools for monitoring and troubleshooting | Quick checks, diagnostics, interactive troubleshooting | Observability: CLI |
| Logs | Detailed event logs with configurable verbosity | Debugging, audit trails, understanding system behavior | Observability: Logs |
| Prometheus | Time-series metrics exposed via HTTP endpoints | Performance monitoring, alerting, trend analysis | Observability: Prometheus |
| Metrics Reference | Metric groups, names, and descriptions | Quick search for a specific metric | Observability: Metrics Reference |
| Grafana | Visualization dashboards for AIS metrics | Visual monitoring, sharing operational status | Observability: Grafana |
| Kubernetes | Observability for Kubernetes deployments | Working with Kubernetes monitoring stacks | Observability: Kubernetes |

Kubernetes Integration

For Kubernetes deployments, AIS provides additional observability features designed to integrate with Kubernetes monitoring stacks.

A dedicated, separate GitHub repository provides, among other things, Helm charts for AIS cluster monitoring.

See the Kubernetes Observability document for details.

Key Metrics Categories

AIS exposes metrics across several categories:

  • Cluster Health: Node status, membership changes
  • Resource Usage: CPU, memory, disk utilization
  • Performance: Throughput, latency, error counts
  • Storage Operations: GET/PUT rates, object counts, error counts
  • Errors: Network errors (“broken pipe”, “connection reset”), timeouts (“deadline exceeded”), retries (“too-many-requests”), disk faults, OOM, out-of-space, and more

In addition, all supported jobs that read or write data report their progress as object and byte counts.
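Operation and progress counts like these are monotonic counters, so a throughput or request rate is derived by sampling a counter twice and dividing the increase by the elapsed time; this is, roughly, the per-second increase that Prometheus's `rate()` function approximates per series. A minimal sketch (the sampled values are made up):

```python
# Derive a per-second rate from two samples of a monotonic counter.
def rate(prev_count: float, cur_count: float, elapsed_sec: float) -> float:
    """Counter increase per second over the sampling interval."""
    return (cur_count - prev_count) / elapsed_sec

# A bytes-read counter sampled 10 seconds apart (illustrative values):
throughput = rate(12_000_000_000, 12_600_000_000, 10.0)
print(f"{throughput / 1e6:.0f} MB/s")  # 60 MB/s
```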

Briefly, two CLI examples:

Cluster performance: operation counts and latency

```console
$ ais performance latency --refresh 10 --regex get
```

| TARGET | AWS-GET(n) | AWS-GET(t) | GET(n) | GET(t) | GET(total/avg size) | RATELIM-RETRY-GET(n) | RATELIM-RETRY-GET(t) |
|:------:|:----------:|:----------:|:------:|:------:|:-------------------:|:--------------------:|:--------------------:|
| T1 | 800 | 180ms | 3200 | 25ms | 12GB / 3.75MB | 50 | 240ms |
| T2 | 1000 | 150ms | 4000 | 28ms | 15GB / 3.75MB | 70 | 230ms |
| T3 | 700 | 200ms | 2800 | 32ms | 10GB / 3.57MB | 40 | 215ms |

- **AWS-GET(n)** / **AWS-GET(t)**: Number and average latency of GET requests that actually hit the AWS backend.
- **GET(n)** / **GET(t)**: Number and average latency of *all* GET requests (including those served from local cache or in-cluster data).
- **GET(total/avg size)**: Approximate total data read and corresponding average object size.
- **RATELIM-RETRY-GET(n)** / **RATELIM-RETRY-GET(t)**: Number and average latency of GET requests retried due to hitting the rate limit.
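The avg-size column is simply total bytes read divided by the GET count. Checking T1's row from the sample table (decimal units assumed, 1 GB = 10^9 bytes):

```python
# Recompute T1's average object size from the sample table above.
total_bytes = 12e9   # GET(total) for T1
get_count = 3200     # GET(n) for T1
avg_size = total_bytes / get_count
print(f"{avg_size / 1e6:.2f} MB")  # 3.75 MB
```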

Batch job: Prefetch

```console
$ ais show job prefetch --refresh 10

prefetch-objects[MV4ex8u6h] (ctl: prefix:10, workers: 16, parallelism: w[16] chan-full[8,32])
NODE       ID          KIND                 BUCKET             OBJECTS   BYTES       START      END   STATE
KactABCD   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  27        27.00MiB    18:28:55   -     Running
XXytEFGH   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  23        23.00MiB    18:28:55   -     Running
YMjtIJKL   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  41        41.00MiB    18:28:55   -     Running
oJXtMNOP   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  34        34.00MiB    18:28:55   -     Running
vWrtQRST   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  23        23.00MiB    18:28:55   -     Running
ybTtUVWX   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  31        31.00MiB    18:28:55   -     Running
                                   Total:                      179       179.00MiB              ✓
```
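The Total line is the sum of the per-node rows; a quick aggregation sketch over the same numbers:

```python
# Aggregate per-node prefetch progress (from the sample output above)
# into the cluster-wide total.
per_node = {
    "KactABCD": (27, 27.00),  # (objects, MiB)
    "XXytEFGH": (23, 23.00),
    "YMjtIJKL": (41, 41.00),
    "oJXtMNOP": (34, 34.00),
    "vWrtQRST": (23, 23.00),
    "ybTtUVWX": (31, 31.00),
}
total_objects = sum(objs for objs, _ in per_node.values())
total_mib = sum(mib for _, mib in per_node.values())
print(total_objects, f"{total_mib:.2f}MiB")  # 179 179.00MiB
```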

Best Practices

  • Configure appropriate log levels based on your deployment stage (development or production).
  • Set up alerting for critical metrics using Prometheus AlertManager to proactively monitor system health.
  • Implement regular dashboard reviews to analyze short- and long-term statistics and identify performance trends.
  • View or download logs via Loki. You can also use the CLI commands ais log or ais cluster download-logs (use --help for details) to access logs for troubleshooting and analysis.

Further Reading