
Copyright © 2026, NVIDIA Corporation.


AIStore Grafana Integration


This document explains how to integrate AIStore with Grafana for visualization and monitoring of AIStore metrics. Grafana provides powerful visualization capabilities for Prometheus metrics collected from AIStore nodes.

Table of Contents

  • Architecture
  • Prerequisites
  • Deployment Options
    • Kubernetes Deployment
    • Standalone Deployment
  • Grafana Setup
    • Adding Prometheus Data Source
    • Importing AIStore Dashboards
  • Available Dashboards
    • AIStore Cluster Dashboard
    • AIStore Kubernetes Dashboard
  • Creating Custom Dashboards
    • Key Metrics to Visualize
    • Example PromQL Queries
    • Variable Templates
  • Alert Configuration
    • Node State Alerts
    • Performance Alerts
    • Resource Alerts
  • Production Best Practices
  • Troubleshooting
  • Further Resources

Architecture

The integration architecture follows a standard monitoring pattern:

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│             │      │             │      │             │
│   AIStore   │scrape│ Prometheus  │query │   Grafana   │
│    Nodes    │─────▶│   Server    │─────▶│ Dashboards  │
│             │      │             │      │             │
└─────────────┘      └─────────────┘      └─────────────┘

  1. AIStore nodes expose metrics endpoints (via the /metrics HTTP endpoint)
  2. Prometheus scrapes metrics at regular intervals
  3. Grafana queries Prometheus and visualizes the data through customizable dashboards
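
To make the first step concrete, the following Python sketch parses a few lines of the Prometheus text exposition format, which is what a /metrics endpoint returns. The sample metric lines and the node ID are illustrative assumptions, not output captured from a real AIStore node, and the parser only handles simple `name{labels} value` lines.

```python
import re

# Matches simple exposition lines such as: metric_name{label="v"} 123
# Assumption: no escaped quotes inside label values and no trailing timestamps.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot handle
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical sample; real node output will differ.
SAMPLE = """\
# TYPE ais_target_get_count counter
ais_target_get_count{node_id="t1"} 1500
ais_target_put_count{node_id="t1"} 300
"""

parsed = parse_metrics(SAMPLE)
```

Prometheus performs this parsing itself on every scrape; the sketch is only meant to show what the scraped payload looks like.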

Prerequisites

  • AIStore cluster with Prometheus metrics enabled (default in current versions)
  • Prometheus server configured to scrape AIStore metrics
  • Grafana server (v9.0+ recommended)

Deployment Options

Kubernetes Deployment

For production deployments, AIStore is typically deployed on Kubernetes using the ais-k8s operator.

The github.com/NVIDIA/ais-k8s repository includes monitoring components that set up Prometheus and Grafana with preconfigured dashboards in the monitoring directory.

To deploy AIStore with monitoring on Kubernetes:

  1. Clone the ais-k8s repository:

    $ git clone https://github.com/NVIDIA/ais-k8s.git
    $ cd ais-k8s
  2. Deploy AIStore with the operator, which includes the monitoring stack:

    # Follow the operator deployment instructions in the ais-k8s documentation
  3. The monitoring stack includes:

    • Prometheus for metrics collection
    • Grafana for visualization
    • AlertManager for alert routing
    • Preconfigured dashboards for AIStore

Standalone Deployment

For non-Kubernetes deployments or development environments, you can set up Grafana separately.

Grafana Installation

Docker Installation

$ docker run -d -p 3000:3000 --name grafana grafana/grafana-oss

Standard Installation

# Debian/Ubuntu (apt-key is deprecated, so the repository key goes into a dedicated keyring)
$ sudo apt-get install -y apt-transport-https software-properties-common wget
$ sudo mkdir -p /etc/apt/keyrings/
$ wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
$ echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
$ sudo apt-get update
$ sudo apt-get install grafana

# Start Grafana
$ sudo systemctl start grafana-server

For other platforms, see the Grafana installation documentation.
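
Once Grafana is running, its /api/health endpoint can confirm that the server is up. The sketch below assumes the default address used on this page (http://localhost:3000) and a JSON response containing a `database` field; `check_grafana` is a hypothetical helper name and is defined but not called here, since it needs a live server.

```python
import json
import urllib.request


def is_healthy(body):
    """Return True if a /api/health response body reports a working database."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


def check_grafana(base_url="http://localhost:3000", timeout=5):
    """Fetch /api/health from a running Grafana; base_url is an assumption."""
    with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
        return is_healthy(resp.read().decode("utf-8"))
```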

Grafana Setup

Adding Prometheus Data Source

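  1. Log in to Grafana (default URL: http://localhost:3000; default credentials: admin/admin)
  2. Go to Configuration > Data Sources > Add data source (in Grafana 10 and later: Connections > Data sources > Add new data source)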
  3. Select Prometheus
  4. Set the URL to your Prometheus server (e.g., http://prometheus-server:9090)
  5. Click “Save & Test” to verify the connection
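
The same data source can also be created through Grafana's HTTP API (POST /api/datasources) instead of the UI. The sketch below only builds the request body; the data source name is an assumption, the URL is the example from the steps above, and authentication and the HTTP call itself are left out.

```python
import json


def prometheus_datasource(name, url, is_default=True):
    """Build the JSON body for creating a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",  # Grafana's backend, not the browser, queries Prometheus
        "isDefault": is_default,
    }


# Name is a hypothetical choice; URL is the example value from the steps above.
payload = prometheus_datasource("Prometheus", "http://prometheus-server:9090")
body = json.dumps(payload)
```

Scripting the data source this way can help keep development and production Grafana instances consistent.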

Importing AIStore Dashboards

AIStore provides pre-built Grafana dashboards for monitoring:

Method 1: Import from JSON

  1. Download the AIStore dashboard JSON file:

    • AIStore Kubernetes Dashboard
  2. In Grafana, go to Dashboards > Import

  3. Upload the JSON file or paste its contents

  4. Select your Prometheus data source

  5. Click Import
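
For scripted setups, a dashboard can also be pushed through Grafana's HTTP API (POST /api/dashboards/db). The sketch below wraps an exported dashboard model in the body that endpoint expects. The dashboard dictionary is a hypothetical, heavily trimmed stand-in for a real export, and the sketch assumes any data source placeholders in the export have already been resolved, which the UI import does for you but the API may not.

```python
import copy


def dashboard_payload(dashboard, overwrite=False):
    """Wrap an exported dashboard model for Grafana's create/update endpoint."""
    model = copy.deepcopy(dashboard)  # leave the caller's dictionary untouched
    model["id"] = None  # let Grafana assign a new numeric id
    return {"dashboard": model, "overwrite": overwrite}


# Hypothetical, trimmed dashboard model for illustration only.
exported = {"id": 42, "uid": "ais-example", "title": "AIStore (example)", "panels": []}
payload = dashboard_payload(exported)
```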

Method 2: Manual Dashboard Creation

If you prefer to build dashboards from scratch:

  1. In Grafana, create a new dashboard (+ > Create > Dashboard; in Grafana 10 and later: Dashboards > New > New dashboard)
  2. Add panels with AIStore-specific metrics using the PromQL queries provided in the Example PromQL Queries section
  3. Organize panels into logical sections (cluster overview, node details, operations, etc.)
  4. Save the dashboard

Available Dashboards

AIStore Cluster Dashboard

The main AIStore dashboard provides:

  • Cluster health overview
  • Storage capacity and usage
  • Throughput and latency metrics
  • Operation rates by type
  • Error rates
  • Resource usage (CPU, memory)

[Figure: AIStore Grafana dashboard]

Key panels include:

  • Cluster Health: Node status, rebalance status
  • Operations: GET/PUT/DELETE throughput, latency statistics
  • Storage: Capacity, utilization, growth trends
  • Resources: CPU, memory, network usage by node
  • Errors: Error rates, error distribution by type

AIStore Kubernetes Dashboard

For Kubernetes deployments, the specialized dashboard includes:

  • Pod status and health
  • Node resource utilization by pod
  • Storage performance by pod
  • Network metrics
  • Rebalance monitoring
  • Kubernetes-specific resource allocation

Creating Custom Dashboards

Key Metrics to Visualize

When creating custom dashboards, consider including:

  1. Cluster Health

    • Number of online nodes
    • Storage capacity and usage
    • Error rates
    • Node states and alerts
  2. Performance

    • Throughput (GET/PUT/DELETE)
    • Operation latency
    • Request rates
    • Cache hit ratios
  3. Resources

    • CPU utilization
    • Memory usage
    • Disk I/O
    • Network traffic

Example PromQL Queries

Throughput Panels

# GET throughput (bytes/sec)
sum(rate(ais_target_get_bytes[5m]))
# PUT throughput (bytes/sec)
sum(rate(ais_target_put_bytes[5m]))
# DELETE operations/sec
sum(rate(ais_target_delete_count[5m]))

Latency Panels

# Average GET latency (milliseconds)
sum(rate(ais_target_get_ns_total[5m])) / sum(rate(ais_target_get_count[5m])) / 1000000
# Average PUT latency (milliseconds)
sum(rate(ais_target_put_ns_total[5m])) / sum(rate(ais_target_put_count[5m])) / 1000000
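
The latency queries divide the growth of a cumulative nanosecond counter by the growth of the matching operation counter, so the window length roughly cancels out and the result is an average per-operation latency. The Python sketch below reproduces that arithmetic for two hypothetical samples taken at the start and end of a window; the sample values are invented for illustration.

```python
def avg_latency_ms(ns_start, ns_end, count_start, count_end):
    """Average latency in ms between two samples of cumulative counters."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        return None  # no completed operations in the window, or a counter reset
    return (ns_end - ns_start) / delta_count / 1_000_000  # ns -> ms


# Hypothetical samples: 6e9 ns spent on 2000 operations during the window.
latency = avg_latency_ms(ns_start=4_000_000_000, ns_end=10_000_000_000,
                         count_start=1_000, count_end=3_000)
```

Note that this is a mean over the window, not a percentile, so short latency spikes can be hidden by a long window such as 5m.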

Storage Usage Panel

# Cluster storage utilization percentage
100 * sum(ais_target_capacity_used) / sum(ais_target_capacity_total)
# Storage used by node
ais_target_capacity_used{node_id="$node"}

Node State Panels

# Node state flags (alert conditions)
ais_target_state_flags{node_id="$node"}
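# Nodes with red alerts (OOS - out of space)
# PromQL has no bitwise operators; floor(x / 8192) % 2 == 1 tests whether bit value 8192 is set.
# Verify the bit value against the node state table for your AIStore version.
floor(ais_target_state_flags / 8192) % 2 == 1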
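
PromQL offers arithmetic operators and functions such as floor(), but no bitwise operators, so a flag test like the one above is usually expressed as floor(flags / bit) % 2 == 1. The Python sketch below checks that this arithmetic form agrees with a native bitwise AND; `bit_is_set` is a hypothetical helper, and 8192 is simply the bit value used on this page.

```python
def bit_is_set(flags, bit_value):
    """Test one flag bit using only integer division and modulo."""
    return (flags // bit_value) % 2 == 1


BIT = 8192  # bit value used in the node state query on this page

# Cross-check the arithmetic form against Python's bitwise AND.
mismatches = [v for v in range(65536) if bit_is_set(v, BIT) != bool(v & BIT)]
```

The check only holds for a `bit_value` that is a power of two, which is the case for every individual state flag.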

Variable Templates

Create dashboard variables to make your dashboard more interactive:

  1. Dashboard Settings > Variables > New
  2. Create a variable for node selection:
    • Name: node
    • Type: Query
    • Data source: Prometheus
    • Query: label_values(ais_target_uptime, node_id)

Use the variable in your queries: {node_id="$node"}
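
To see which values the variable will offer, a comparable lookup can be made directly against the Prometheus HTTP API, which exposes label values at /api/v1/label/<label_name>/values and accepts `match[]` series selectors. The sketch below only builds the URL; the Prometheus address is the example used earlier on this page, and `label_values_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def label_values_url(base_url, label, metric):
    """Build a Prometheus label-values URL restricted to one metric."""
    query = urlencode({"match[]": metric})  # brackets are percent-encoded
    return f"{base_url.rstrip('/')}/api/v1/label/{label}/values?{query}"


url = label_values_url("http://prometheus-server:9090", "node_id", "ais_target_uptime")
```

If the response contains no values, the metric or label name is likely different in your deployment.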

Alert Configuration

Grafana can be configured to send alerts based on metric thresholds:

Node State Alerts

Set up alerts based on the AIStore node state flags (refer to Node Alerts in AIStore Prometheus docs for all available states):

  1. Create an alert for red alert conditions:

    • Condition: floor(ais_target_state_flags / 8192) % 2 == 1 or floor(ais_target_state_flags / 16384) % 2 == 1
    • Description: “Critical node state alert detected”
    • Severity: Critical
  2. Create an alert for warning conditions:

    • Condition: floor(ais_target_state_flags / 128) % 2 == 1 or floor(ais_target_state_flags / 4096) % 2 == 1
    • Description: “Warning node state alert detected”
    • Severity: Warning

PromQL has no bitwise operators, so each flag is tested arithmetically: floor(x / b) % 2 == 1 holds exactly when bit value b is set in x. Grafana alert rules do not support dashboard template variables, so the conditions omit the $node selector; add a literal node_id matcher to scope an alert to specific nodes.
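
The two-tier classification can be sketched in Python to sanity-check flag values pulled from Prometheus. The bit values below are copied from the conditions on this page and should be verified against the node state table for your AIStore version; `severity` is a hypothetical helper.

```python
RED_BITS = (8192, 16384)    # bit values from the critical condition on this page
WARNING_BITS = (128, 4096)  # bit values from the warning condition on this page


def severity(flags):
    """Map a state-flags value to 'critical', 'warning', or 'ok'."""
    if any(flags & bit for bit in RED_BITS):
        return "critical"
    if any(flags & bit for bit in WARNING_BITS):
        return "warning"
    return "ok"
```

Critical takes precedence, so a node with both a red and a warning bit set is reported once, at the higher severity.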

Performance Alerts

  1. High latency alert:

    • Condition: sum(rate(ais_target_get_ns_total[5m])) / sum(rate(ais_target_get_count[5m])) / 1000000 > 200
    • Description: “GET operation latency exceeds 200ms”
  2. Error rate alert:

    • Condition: sum(rate(ais_target_err_get_count[5m])) / sum(rate(ais_target_get_count[5m])) * 100 > 1
    • Description: “GET error rate exceeds 1%”

Resource Alerts

Common alert thresholds:

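| Metric       | Warning     | Critical    | Description                         |
|--------------|-------------|-------------|-------------------------------------|
| Disk Usage   | 85%         | 95%         | Storage capacity utilization        |
| Error Rate   | 1%          | 5%          | Operation error percentage          |
| Node Count   | n-1         | n-2         | Number of online nodes vs. expected |
| CPU Usage    | 70%         | 90%         | CPU utilization                     |
| Memory Usage | 80%         | 95%         | Memory utilization                  |
| Latency      | 2x baseline | 5x baseline | Operation latency increase          |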
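
The percentage thresholds in the table can be expressed as a small lookup, sketched below in Python. The dictionary keys are hypothetical names, and treating a value equal to a threshold as a breach is an assumption, since the table does not say whether the bounds are inclusive.

```python
# (warning, critical) thresholds in percent, taken from the table above.
THRESHOLDS = {
    "disk_usage": (85, 95),
    "error_rate": (1, 5),
    "cpu_usage": (70, 90),
    "memory_usage": (80, 95),
}


def classify(metric, value):
    """Return 'critical', 'warning', or 'ok' for a percentage value."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:  # assumption: thresholds are inclusive
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"
```

Node count and latency are left out because their thresholds are relative to the cluster size and a measured baseline rather than fixed percentages.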

Production Best Practices

  1. Retention and Sampling

    • Configure appropriate retention periods in Prometheus
    • Use recording rules for complex queries
    • Consider downsampling for long-term storage
  2. Dashboard Organization

    • Group related metrics on the same dashboard
    • Use row dividers to organize panels
    • Add documentation links and descriptions
  3. Performance Considerations

    • Limit the number of panels per dashboard
    • Use appropriate time ranges
    • Avoid overly complex queries that could impact Prometheus
  4. High Availability

    • Deploy Prometheus with high availability for production
    • Consider using Grafana Enterprise for critical environments
    • Set up redundant alerting paths
  5. Security

    • Configure proper authentication
    • Set up appropriate user roles
    • Share dashboards with read-only permissions
    • Use TLS for all communications

Troubleshooting

No Data in Grafana

  1. Verify Prometheus data source connection:

    • Check Prometheus server health
    • Ensure connectivity between Grafana and Prometheus
    • Test connection in Data Sources section
  2. Check metrics collection:

    • Verify AIStore nodes are exposing metrics
    • Test direct access to metrics endpoint: curl http://<aistore-node>:<port>/metrics
    • Check Prometheus targets status
  3. Validate PromQL queries:

    • Test queries directly in Prometheus UI
    • Check for typos in metric names
    • Verify label selectors match your deployment
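
Checking the Prometheus targets status can be scripted against the Prometheus HTTP API (GET /api/v1/targets). The sketch below filters a response body for active targets whose health is not "up"; the sample body, host names, and port are hypothetical and trimmed to the fields the function reads.

```python
import json


def unhealthy_targets(targets_json):
    """List (scrape URL, last error) for active targets that are not 'up'."""
    data = json.loads(targets_json)["data"]
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in data.get("activeTargets", [])
        if t.get("health") != "up"
    ]


# Hypothetical, trimmed response body for illustration.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"scrapeUrl": "http://t1:8081/metrics", "health": "up", "lastError": ""},
        {"scrapeUrl": "http://t2:8081/metrics", "health": "down",
         "lastError": "connection refused"},
    ]},
})

down = unhealthy_targets(SAMPLE)
```

An empty result with no data in Grafana usually points away from scraping and toward the data source connection or the queries themselves.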

Incomplete Metrics

  1. Ensure all AIStore nodes are being scraped:

    • Check Prometheus targets
    • Verify scrape configurations
    • Check for connectivity issues
  2. Check for missing or incomplete metrics:

    • Review AIStore logs for metrics-related messages
    • Verify AIStore version compatibility with dashboards
    • Check for misconfiguration in Prometheus scrape settings

Dashboard Performance Issues

  1. Optimize queries:

    • Simplify complex queries
    • Add appropriate time ranges
    • Use recording rules for frequently used queries
  2. Reduce load:

    • Decrease dashboard refresh rate
    • Limit the number of panels per dashboard
    • Consider splitting complex dashboards

Further Resources

  • AIStore Metrics Reference
  • AIStore Prometheus Integration
  • AIStore K8s Operator
  • AIStore K8s Monitoring
  • Grafana Documentation
  • Prometheus and Grafana Best Practices
  • Advanced PromQL Queries