AIStore Observability

This document provides an overview of AIStore (AIS) observability features, tools, and practices. AIS offers comprehensive observability through logs, metrics, and a CLI interface, enabling users to monitor, debug, and optimize their deployments.

Observability Architecture

AIS provides multiple layers of observability:

```text
┌─────────────────────────────────┐
│       Visualization Layer       │
│  ┌───────────┐   ┌───────────┐  │
│  │  Grafana  │   │  Custom   │  │
│  │ Dashboard │   │    UIs    │  │
│  └───────────┘   └───────────┘  │
├─────────────────────────────────┤
│         Collection Layer        │
│          ┌───────────┐          │
│          │ Prometheus│          │
│          │           │          │
│          └───────────┘          │
├─────────────────────────────────┤
│      Instrumentation Layer      │
│  ┌───────────┐   ┌───────────┐  │
│  │  Metrics  │   │   Logs    │  │
│  │ Endpoints │   │           │  │
│  └───────────┘   └───────────┘  │
├─────────────────────────────────┤
│           Access Layer          │
│  ┌───────────┐   ┌───────────┐  │
│  │    CLI    │   │   REST    │  │
│  │ Interface │   │   APIs    │  │
│  └───────────┘   └───────────┘  │
└─────────────────────────────────┘
```

Metrics Backend

AIStore exposes metrics via Prometheus.

StatsD was deprecated in v3.28 (Spring 2025) and completely removed in v4.0 (September 2025).
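Each AIS node serves its metrics over HTTP in the standard Prometheus text exposition format. The sketch below parses that format into a name-to-value map; in a live deployment the payload would come from an HTTP GET of a node's metrics endpoint, while here a sample payload keeps the snippet self-contained. The metric names in the sample are illustrative assumptions, not actual AIS output.

```python
# Minimal parser for the Prometheus text exposition format.
# The sample metric names below are illustrative, not actual AIS metrics.

def parse_exposition(text: str) -> dict[str, float]:
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        # Last space separates the (possibly labeled) name from the value.
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP ais_target_get_count number of GET requests (illustrative name)
# TYPE ais_target_get_count counter
ais_target_get_count{node_id="T1"} 3200
ais_target_get_bytes{node_id="T1"} 12000000000
"""

parsed = parse_exposition(sample)
print(parsed['ais_target_get_count{node_id="T1"}'])  # 3200.0
```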

Observability Methods

| Method | Description | Use Cases | Documentation |
|--------|-------------|-----------|---------------|
| CLI | Command-line tools for monitoring and troubleshooting | Quick checks, diagnostics, interactive troubleshooting | Observability: CLI |
| Logs | Detailed event logs with configurable verbosity | Debugging, audit trails, understanding system behavior | Observability: Logs |
| Prometheus | Time-series metrics exposed via HTTP endpoints | Performance monitoring, alerting, trend analysis | Observability: Prometheus |
| Metrics Reference | Metric groups, names, and descriptions | Quick search for a specific metric | Observability: Metrics Reference |
| Grafana | Visualization dashboards for AIS metrics | Visual monitoring, sharing operational status | Observability: Grafana |
| Kubernetes | Observability for Kubernetes deployments | Working with Kubernetes monitoring stacks | Observability: Kubernetes |

Kubernetes Integration

For Kubernetes deployments, AIS provides additional observability features designed to integrate with Kubernetes monitoring stacks.

A dedicated, separate GitHub repository provides, among other things, Helm charts for AIS cluster monitoring.

See the Kubernetes Observability document for details.

Key Metrics Categories

AIS exposes metrics across several categories:

  • Cluster Health: Node status, membership changes
  • Resource Usage: CPU, memory, disk utilization
  • Performance: Throughput, latency, error counts
  • Storage Operations: GET/PUT rates, object counts, error counts
  • Errors: Network errors (“broken pipe”, “connection reset”), timeouts (“deadline exceeded”), retries (“too-many-requests”), disk faults, OOM, out-of-space, and more

In addition, all supported jobs that read or write data report their progress as object and byte counts.
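Operation and progress counts like these are monotonic counters, so a throughput or request rate is derived by sampling a counter twice and dividing the increase by the elapsed time; this is, roughly, the per-second increase that Prometheus's `rate()` function approximates per series. A minimal sketch (the sampled values are made up):

```python
# Derive a per-second rate from two samples of a monotonic counter.
def rate(prev_count: float, cur_count: float, elapsed_sec: float) -> float:
    """Counter increase per second over the sampling interval."""
    return (cur_count - prev_count) / elapsed_sec

# A bytes-read counter sampled 10 seconds apart (illustrative values):
throughput = rate(12_000_000_000, 12_600_000_000, 10.0)
print(f"{throughput / 1e6:.0f} MB/s")  # 60 MB/s
```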

Briefly, two CLI examples:

Cluster performance: operation counts and latency

```console
$ ais performance latency --refresh 10 --regex get
```

| TARGET | AWS-GET(n) | AWS-GET(t) | GET(n) | GET(t) | GET(total/avg size) | RATELIM-RETRY-GET(n) | RATELIM-RETRY-GET(t) |
|:------:|:----------:|:----------:|:------:|:------:|:-------------------:|:--------------------:|:--------------------:|
| T1 | 800 | 180ms | 3200 | 25ms | 12GB / 3.75MB | 50 | 240ms |
| T2 | 1000 | 150ms | 4000 | 28ms | 15GB / 3.75MB | 70 | 230ms |
| T3 | 700 | 200ms | 2800 | 32ms | 10GB / 3.57MB | 40 | 215ms |

- **AWS-GET(n)** / **AWS-GET(t)**: Number and average latency of GET requests that actually hit the AWS backend.
- **GET(n)** / **GET(t)**: Number and average latency of *all* GET requests (including those served from local cache or in-cluster data).
- **GET(total/avg size)**: Approximate total data read and corresponding average object size.
- **RATELIM-RETRY-GET(n)** / **RATELIM-RETRY-GET(t)**: Number and average latency of GET requests retried due to hitting the rate limit.
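The avg-size column is simply total bytes read divided by the GET count. Checking T1's row from the sample table (decimal units assumed, 1 GB = 10^9 bytes):

```python
# Recompute T1's average object size from the sample table above.
total_bytes = 12e9   # GET(total) for T1
get_count = 3200     # GET(n) for T1
avg_size = total_bytes / get_count
print(f"{avg_size / 1e6:.2f} MB")  # 3.75 MB
```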

Batch job: Prefetch

```console
$ ais show job prefetch --refresh 10

prefetch-objects[MV4ex8u6h] (ctl: prefix:10, workers: 16, parallelism: w[16] chan-full[8,32])
NODE       ID          KIND                 BUCKET             OBJECTS   BYTES       START      END   STATE
KactABCD   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  27        27.00MiB    18:28:55   -     Running
XXytEFGH   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  23        23.00MiB    18:28:55   -     Running
YMjtIJKL   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  41        41.00MiB    18:28:55   -     Running
oJXtMNOP   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  34        34.00MiB    18:28:55   -     Running
vWrtQRST   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  23        23.00MiB    18:28:55   -     Running
ybTtUVWX   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket  31        31.00MiB    18:28:55   -     Running
                                   Total:                      179       179.00MiB              ✓
```
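The Total line is the sum of the per-node rows; a quick aggregation sketch over the same numbers:

```python
# Aggregate per-node prefetch progress (from the sample output above)
# into the cluster-wide total.
per_node = {
    "KactABCD": (27, 27.00),  # (objects, MiB)
    "XXytEFGH": (23, 23.00),
    "YMjtIJKL": (41, 41.00),
    "oJXtMNOP": (34, 34.00),
    "vWrtQRST": (23, 23.00),
    "ybTtUVWX": (31, 31.00),
}
total_objects = sum(objs for objs, _ in per_node.values())
total_mib = sum(mib for _, mib in per_node.values())
print(total_objects, f"{total_mib:.2f}MiB")  # 179 179.00MiB
```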

Best Practices

  • Configure appropriate log levels based on your deployment stage (development or production).
  • Set up alerting for critical metrics using Prometheus AlertManager to proactively monitor system health.
  • Implement regular dashboard reviews to analyze short- and long-term statistics and identify performance trends.
  • View or download logs via Loki. You can also use the CLI commands ais log or ais cluster download-logs (use --help for details) to access logs for troubleshooting and analysis.

Further Reading