
Copyright © 2026, NVIDIA Corporation.

AIStore Observability

This document gives an overview of AIStore (AIS) observability features, tools, and practices. AIS provides observability through logs, metrics, and the CLI, enabling users to monitor, debug, and optimize their deployments.

Observability Architecture

AIS provides multiple layers of observability:

┌─────────────────────────────────┐
│       Visualization Layer       │
│  ┌───────────┐  ┌───────────┐   │
│  │  Grafana  │  │  Custom   │   │
│  │ Dashboard │  │    UIs    │   │
│  └───────────┘  └───────────┘   │
├─────────────────────────────────┤
│        Collection Layer         │
│  ┌───────────┐                  │
│  │ Prometheus│                  │
│  │           │                  │
│  └───────────┘                  │
├─────────────────────────────────┤
│      Instrumentation Layer      │
│  ┌───────────┐  ┌───────────┐   │
│  │  Metrics  │  │   Logs    │   │
│  │ Endpoints │  │           │   │
│  └───────────┘  └───────────┘   │
├─────────────────────────────────┤
│          Access Layer           │
│  ┌───────────┐  ┌───────────┐   │
│  │    CLI    │  │   REST    │   │
│  │ Interface │  │   APIs    │   │
│  └───────────┘  └───────────┘   │
└─────────────────────────────────┘

Metrics Backend

AIStore exposes metrics via Prometheus.

StatsD was deprecated in v3.28 (Spring 2025) and completely removed in v4.0 (September 2025).
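
Prometheus collects metrics by scraping an HTTP endpoint that returns samples in a plain-text exposition format. As a rough illustration of what a consumer of that format does, the Python sketch below parses a small payload and sums one metric across nodes. The sample payload, the metric names (`example_get_count`, `example_put_count`), and the `node_id` label are hypothetical placeholders, not names taken from the AIS metrics reference, and the parser is deliberately simplified: it ignores timestamps and escaped characters in label values.

```python
import re

# Hypothetical sample payload in the Prometheus text exposition format.
# Metric and label names are illustrative only, not actual AIS metric names.
SAMPLE = """\
# HELP example_get_count Total number of GET requests
# TYPE example_get_count counter
example_get_count{node_id="T1"} 3200
example_get_count{node_id="T2"} 4000
example_put_count{node_id="T1"} 150
"""

# name, optional {labels}, then a single value (no timestamp support)
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples.

    Simplified: skips comment lines and any line it cannot match.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # not understood by this simplified parser
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples, name):
    """Sum one metric across all label sets (for example, across nodes)."""
    return sum(v for n, _, v in samples if n == name)


if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE)
    print(total(parsed, "example_get_count"))
```

In a real deployment this parsing is done by Prometheus itself; the sketch is only meant to make the data model (metric name, labels, numeric value) concrete.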

Observability Methods

| Method | Description | Use Cases | Documentation |
| --- | --- | --- | --- |
| CLI | Command-line tools for monitoring and troubleshooting | Quick checks, diagnostics, interactive troubleshooting | Observability: CLI |
| Logs | Detailed event logs with configurable verbosity | Debugging, audit trails, understanding system behavior | Observability: Logs |
| Prometheus | Time-series metrics exposed via HTTP endpoints | Performance monitoring, alerting, trend analysis | Observability: Prometheus |
| Metrics Reference | Metric groups, names, and descriptions | Quick search for specific metric | Observability: Metrics Reference |
| Grafana | Visualization dashboards for AIS metrics | Visual monitoring, sharing operational status | Observability: Grafana |
| Kubernetes | Kubernetes deployments | Working with Kubernetes monitoring stacks | Observability: Kubernetes |

Kubernetes Integration

For Kubernetes deployments, AIS provides additional observability features designed to integrate with Kubernetes monitoring stacks.

A separate, dedicated GitHub repository provides, among other things, Helm charts for AIS cluster monitoring.

See the Kubernetes Observability document for details.

Key Metrics Categories

AIS exposes metrics across several categories:

  • Cluster Health: Node status, membership changes
  • Resource Usage: CPU, memory, disk utilization
  • Performance: Throughput, latency, error counts
  • Storage Operations: GET/PUT rates, object counts, error counts
  • Errors: Network errors (“broken pipe”, “connection reset”), timeouts (“deadline exceeded”), retries (“too-many-requests”), disk faults, OOM, out-of-space, and more

In addition, all supported jobs that read or write data report their progress as object and byte counts.

Briefly, two CLI examples:

Cluster performance: operation counts and latency

$ ais performance latency --refresh 10 --regex get

| TARGET | AWS-GET(n) | AWS-GET(t) | GET(n) | GET(t) | GET(total/avg size) | RATELIM-RETRY-GET(n) | RATELIM-RETRY-GET(t) |
|:------:|:----------:|:----------:|:------:|:------:|:--------------------:|:---------------------:|:---------------------:|
| T1 | 800 | 180ms | 3200 | 25ms | 12GB / 3.75MB | 50 | 240ms |
| T2 | 1000 | 150ms | 4000 | 28ms | 15GB / 3.75MB | 70 | 230ms |
| T3 | 700 | 200ms | 2800 | 32ms | 10GB / 3.57MB | 40 | 215ms |

- **AWS-GET(n)** / **AWS-GET(t)**: Number and average latency of GET requests that actually hit the AWS backend.
- **GET(n)** / **GET(t)**: Number and average latency of *all* GET requests (including those served from local cache or in-cluster data).
- **GET(total/avg size)**: Approximate total data read and corresponding average object size.
- **RATELIM-RETRY-GET(n)** / **RATELIM-RETRY-GET(t)**: Number and average latency of GET requests retried due to hitting the rate limit.
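
The columns above can be combined into derived figures. The Python sketch below takes the three sample rows and computes, per target, the share of GETs that did not reach the AWS backend and the average object size. It rests on two assumptions that the output does not state explicitly: that every backend GET counted in AWS-GET(n) is also counted in GET(n), and that sizes use decimal units (1 GB = 1000 MB). Under the second assumption the computed averages agree with the sizes shown in the table.

```python
# Rows copied from the sample `ais performance latency` output above:
# (target, AWS-GET(n), GET(n), total GET size in GB)
ROWS = [
    ("T1", 800, 3200, 12),
    ("T2", 1000, 4000, 15),
    ("T3", 700, 2800, 10),
]


def in_cluster_share(aws_gets, all_gets):
    """Fraction of GETs that did not reach the AWS backend.

    Assumes AWS-GET(n) is a subset of GET(n).
    """
    return (all_gets - aws_gets) / all_gets


def avg_object_size_mb(total_gb, all_gets):
    """Average object size in MB, assuming decimal units (1 GB = 1000 MB)."""
    return total_gb * 1000 / all_gets


if __name__ == "__main__":
    for target, aws_n, get_n, total_gb in ROWS:
        share = in_cluster_share(aws_n, get_n)
        size = avg_object_size_mb(total_gb, get_n)
        print(f"{target}: {share:.0%} served in-cluster, avg {size:.2f} MB")
```

For these sample numbers, each target serves three quarters of its GETs without going to the backend, which is consistent with the much lower GET(t) latency compared to AWS-GET(t).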

Batch job: Prefetch

$ ais show job prefetch --refresh 10

prefetch-objects[MV4ex8u6h] (ctl: prefix:10, workers: 16, parallelism: w[16] chan-full[8,32])
NODE       ID          KIND                 BUCKET              OBJECTS   BYTES       START      END   STATE
KactABCD   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket   27        27.00MiB    18:28:55   -     Running
XXytEFGH   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket   23        23.00MiB    18:28:55   -     Running
YMjtIJKL   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket   41        41.00MiB    18:28:55   -     Running
oJXtMNOP   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket   34        34.00MiB    18:28:55   -     Running
vWrtQRST   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket   23        23.00MiB    18:28:55   -     Running
ybTtUVWX   MV4ex8u6h   prefetch-listrange   s3://cloud-bucket   31        31.00MiB    18:28:55   -     Running
                                            Total:              179       179.00MiB   ✓
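
The Total row is the sum of the per-node rows. As a small check, the Python sketch below re-adds the object and byte counts from the sample output; the rows are transcribed by hand from the listing above, and the tuple layout is just a convenience for this example.

```python
# Per-node rows copied from the sample `ais show job prefetch` output above:
# (node, objects, size in MiB)
ROWS = [
    ("KactABCD", 27, 27.00),
    ("XXytEFGH", 23, 23.00),
    ("YMjtIJKL", 41, 41.00),
    ("oJXtMNOP", 34, 34.00),
    ("vWrtQRST", 23, 23.00),
    ("ybTtUVWX", 31, 31.00),
]


def totals(rows):
    """Return (total objects, total MiB) across all reporting nodes."""
    objects = sum(row[1] for row in rows)
    mib = sum(row[2] for row in rows)
    return objects, mib


if __name__ == "__main__":
    objects, mib = totals(ROWS)
    print(f"Total: {objects} objects, {mib:.2f}MiB")
```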

Best Practices

  • Configure appropriate log levels based on your deployment stage (development or production).
  • Set up alerting for critical metrics using Prometheus AlertManager to proactively monitor system health.
  • Implement regular dashboard reviews to analyze short- and long-term statistics and identify performance trends.
  • View or download logs via Loki. For troubleshooting and analysis, you can also access logs with the CLI commands ais log or ais cluster download-logs (use --help for details).

Further Reading

  • Performance Tuning
  • AIS K8s Playbooks: host configuration
  • Prometheus Documentation
  • Grafana Documentation