AIStore Observability
This document provides an overview of AIStore (AIS) observability features, tools, and practices. AIS offers comprehensive observability through logs, metrics, and a CLI, enabling users to monitor, debug, and optimize their deployments.
Observability Architecture
AIS provides multiple layers of observability:
Metrics Backend
AIStore exposes metrics via Prometheus.
StatsD was deprecated in v3.28 (Spring 2025) and completely removed in v4.0 (September 2025).
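Because the metrics are served in the Prometheus text exposition format, they can also be inspected without a full Prometheus server. The following is a minimal sketch of parsing such a payload with the Python standard library. The metric and label names in `SAMPLE` (`example_get_count`, `node_id`) are hypothetical placeholders, not actual AIS metric names, and the helpers `parse_metrics` and `total` are illustrative.

```python
import re

# Sample payload in the Prometheus text exposition format.
# NOTE: metric and label names are hypothetical, for illustration only.
SAMPLE = """\
# HELP example_get_count Total number of GET requests
# TYPE example_get_count counter
example_get_count{node_id="t1"} 1200
example_get_count{node_id="t2"} 800
example_put_count{node_id="t1"} 300
"""

# metric_name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# label="value" pairs inside the braces
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        m = LINE_RE.match(line)
        if m is None:
            continue  # ignore lines that do not look like samples
        name, raw_labels, value = m.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples, name):
    """Sum one metric across all label sets (for example, across nodes)."""
    return sum(v for n, _, v in samples if n == name)


if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE)
    print(total(parsed, "example_get_count"))
```

In a live deployment the text would be fetched from a node's metrics endpoint (assumed here to follow the conventional `/metrics` path) rather than from an inline string.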
Observability Methods
Kubernetes Integration
For Kubernetes deployments, AIS provides additional observability features designed to integrate with Kubernetes monitoring stacks.
A separate, dedicated GitHub repository provides, among other things, Helm charts for AIS cluster monitoring.
See the Kubernetes Observability document for details.
Key Metrics Categories
AIS exposes metrics across several categories:
- Cluster Health: Node status, membership changes
- Resource Usage: CPU, memory, disk utilization
- Performance: Throughput, latency, error counts
- Storage Operations: GET/PUT rates, object counts, error counts
- Errors: Network errors (“broken pipe”, “connection reset”), timeouts (“deadline exceeded”), retries (“too-many-requests”), disk faults, OOM, out-of-space, and more
In addition, all supported jobs that read or write data report their progress as object and byte counts.
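Since job progress is reported as cumulative object and byte counts, throughput can be derived from two samples taken some time apart. The sketch below shows that arithmetic. The `Snapshot` type and `progress_rate` function are hypothetical helpers, not part of any AIS API, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One progress sample of a job (hypothetical structure)."""
    ts: float        # sample time, in seconds
    objects: int     # cumulative objects processed
    bytes: int       # cumulative bytes processed


def progress_rate(prev, curr):
    """Return (objects/sec, bytes/sec) between two cumulative samples."""
    dt = curr.ts - prev.ts
    if dt <= 0:
        raise ValueError("snapshots must be in increasing time order")
    return (
        (curr.objects - prev.objects) / dt,
        (curr.bytes - prev.bytes) / dt,
    )


if __name__ == "__main__":
    first = Snapshot(ts=0.0, objects=100, bytes=1_000_000)
    second = Snapshot(ts=10.0, objects=600, bytes=51_000_000)
    objs_per_sec, bytes_per_sec = progress_rate(first, second)
    print(objs_per_sec, bytes_per_sec)
```

The same calculation is what a Prometheus `rate()` query performs over a counter, so this is mainly useful for ad hoc checks or scripts.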
Briefly, two CLI examples:
Cluster performance: operation counts and latency
Batch job: Prefetch
Best Practices
- Configure appropriate log levels based on your deployment stage (development or production).
- Set up alerting for critical metrics using Prometheus AlertManager to proactively monitor system health.
- Implement regular dashboard reviews to analyze short- and long-term statistics and identify performance trends.
- View or download logs via Loki. You can also use the CLI commands ais log or ais cluster download-logs (use --help for details) to access logs for troubleshooting and analysis.