AIStore Kubernetes Observability
This document explains how to implement and use observability features for AIStore deployments in Kubernetes environments. Kubernetes provides additional tools and patterns for monitoring that complement AIStore’s built-in observability features.
Table of Contents
- Kubernetes Monitoring Architecture
- Prerequisites
- Deployment Methods
- Configuring AIStore for Kubernetes Monitoring
- Kubernetes-specific Metrics
- Grafana Dashboards for Kubernetes
- Alerting in Kubernetes
- Log Management in Kubernetes
- Operational Best Practices
- Troubleshooting AIStore in Kubernetes
- Further Reading
- Related Observability Documentation
Kubernetes Monitoring Architecture
When deployed in Kubernetes, AIStore observability typically follows a layered architecture that combines application-level metrics with Kubernetes infrastructure monitoring.
Key components in this architecture:
- AIStore Pods: Expose Prometheus metrics via the `/metrics` endpoint
- Prometheus Operator: Manages Prometheus instances and monitoring configurations
- Grafana: Provides visualization for both AIStore and Kubernetes metrics
- AlertManager: Handles alert routing and notifications
- Kubernetes Metrics: Standard metrics from the Kubernetes API
- Persistent Storage: For long-term metrics retention and Grafana state
Prerequisites
Before setting up AIStore observability in Kubernetes, ensure you have:
- A functional Kubernetes cluster (v1.30+)
- AIStore deployed on the cluster
- kube-prometheus-stack or its individual components:
- Prometheus Operator
- Prometheus Server
- AlertManager
- Grafana
- `kubectl` configured to access your cluster
- Helm v3+ (for chart-based installations)
- Storage classes configured for persistent volumes (if using persistent storage)
NOTE: The YAML examples provided in this document are intended as reference templates that demonstrate the structure and key components required for AIStore observability in Kubernetes. These examples should be reviewed and validated by Kubernetes experts before applying to production environments. They may require adjustments based on your specific Kubernetes version, monitoring stack configuration, and AIStore deployment. API versions and specific field formats can vary between Kubernetes releases.
Deployment Methods
Method 1: Using the AIS-K8s Repository
The AIS-K8s repository provides pre-configured monitoring for AIStore. This is the recommended approach for production deployments.
For more detailed deployment options, refer to the AIS-K8s repository documentation.
The AIS-K8s deployment includes:
- Properly configured ServiceMonitors for AIStore components
- Pre-built Grafana dashboards
- Default AlertManager rules
- Persistent storage configuration (optional)
Method 2: Manual Configuration with Prometheus Operator
If you’re using a custom deployment or need fine-grained control, you can manually configure the monitoring stack:
- Deploy Prometheus Operator using Helm:
- Create a ServiceMonitor for AIStore:
- Apply the ServiceMonitor:
- Import AIStore dashboards to Grafana:
- Download dashboard JSONs from the AIS-K8s repository
- Navigate to Grafana UI > Dashboards > Import
- Upload the dashboard JSON files
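As a reference for steps 1-3 above, the Helm installation and a ServiceMonitor might look like the following. The release name `prometheus`, the `monitoring` and `ais` namespaces, and the `app: ais` label are assumptions; match them to your Prometheus Operator installation and AIStore deployment:

```shell
# Install kube-prometheus-stack from the prometheus-community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace
```

```yaml
# ServiceMonitor sketch -- selectors and namespaces are assumptions;
# adjust to match your AIStore Services and Prometheus Operator release.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ais-servicemonitor
  namespace: monitoring
  labels:
    release: prometheus      # must match the Operator's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - ais                  # namespace where AIStore is deployed (assumption)
  selector:
    matchLabels:
      app: ais               # label on the AIStore Services (assumption)
  endpoints:
    - port: metrics          # named metrics port in the Service
      interval: 30s
      path: /metrics
```

Apply it with `kubectl apply -f servicemonitor.yaml` and confirm the new target appears under Status > Targets in the Prometheus UI.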
Configuring AIStore for Kubernetes Monitoring
Note: The following YAML examples demonstrate the general structure but may need adjustments for your specific environment. JSON configuration within ConfigMaps should use proper JSON formatting without comments.
To ensure AIStore exposes metrics properly in Kubernetes:
- Verify AIStore ConfigMap includes Prometheus configuration:
- Check that AIStore Service definitions expose the metrics port:
- Verify metrics are being exposed by checking directly:
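For steps 2-3 above, a Service exposing a named metrics port, and a direct check, might look like the following. This sketch assumes AIStore serves `/metrics` on its regular HTTP port and uses illustrative port numbers and labels; match them to your deployment:

```yaml
# Service sketch -- port number and selector labels are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: ais-target
  namespace: ais
spec:
  selector:
    app: ais
    component: target
  ports:
    - name: metrics          # ServiceMonitors reference this port by name
      port: 51081            # AIStore target port (assumption)
      targetPort: 51081
```

```shell
# Port-forward to one AIStore pod and fetch metrics directly
kubectl -n ais port-forward <ais-target-pod-name> 8080:51081 &
curl -s http://localhost:8080/metrics | head
```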
Kubernetes-specific Metrics
When monitoring AIStore in Kubernetes, you should track both AIStore-specific metrics and Kubernetes infrastructure metrics:
Pod Metrics
Volume Metrics
Network Metrics
Node Metrics
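Assuming a standard kube-prometheus-stack installation (cAdvisor, kubelet, and node-exporter metrics available, AIStore in an `ais` namespace), representative PromQL queries for each category might look like:

```promql
# Pod metrics -- CPU and memory for AIStore pods
sum(rate(container_cpu_usage_seconds_total{namespace="ais"}[5m])) by (pod)
sum(container_memory_working_set_bytes{namespace="ais"}) by (pod)

# Volume metrics -- PVC usage ratio
kubelet_volume_stats_used_bytes{namespace="ais"}
  / kubelet_volume_stats_capacity_bytes{namespace="ais"}

# Network metrics -- per-pod receive/transmit throughput
sum(rate(container_network_receive_bytes_total{namespace="ais"}[5m])) by (pod)
sum(rate(container_network_transmit_bytes_total{namespace="ais"}[5m])) by (pod)

# Node metrics -- CPU utilization per node
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
```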
Grafana Dashboards for Kubernetes
For effective AIStore monitoring in Kubernetes, use a combination of specialized dashboards:
1. AIStore Application Dashboard
Focus on AIStore-specific metrics:
- Throughput and latency metrics (GET, PUT, DELETE operations)
- Operation rates and error counts
- Rebalance and resilver status
- Cache hit ratios and storage utilization
2. Kubernetes Resource Dashboard
Focus on Kubernetes infrastructure:
- Pod resource usage (CPU, memory) for AIStore components
- Network traffic between pods and nodes
- Volume usage and I/O performance
- Pod restarts and health status
- Node resource utilization

3. Combined Operational Dashboard
Correlate application and infrastructure metrics:
- AIStore performance versus underlying resource usage
- Impact of pod scheduling and restarts on operations
- Storage latency versus Kubernetes volume metrics
- Network throughput correlation with operation rates
Example Dashboard Import
Import the AIStore Kubernetes dashboard:
- Download the dashboard JSON from the ais-k8s repository
- In Grafana, navigate to Dashboards > Import
- Upload the JSON file or paste its contents
- Select your Prometheus data source
- Customize dashboard variables if needed
- Click Import
Create dashboard variables for better filtering:
- `namespace`: AIStore namespace
- `pod`: AIStore pod selector
- `node`: Kubernetes node selector
- `interval`: Time range for rate calculations
Alerting in Kubernetes
Configure Prometheus AlertManager rules for proactive monitoring:
Note: The following AlertManager configurations are examples that demonstrate structure and common alert patterns. You’ll need to customize thresholds and selectors for your environment and ensure compatibility with your Prometheus Operator version.
AlertManager Rules
Create a PrometheusRule resource for Kubernetes-specific alerts; note that each rule's `expr` field is a PromQL query:
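A PrometheusRule sketch is shown below. The `release: prometheus` label, the `ais` namespace, the job selector, and all thresholds are assumptions to adapt to your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ais-alerts
  namespace: monitoring
  labels:
    release: prometheus      # must match the Operator's ruleSelector (assumption)
spec:
  groups:
    - name: aistore.rules
      rules:
        - alert: AISEndpointDown
          expr: up{job=~".*ais.*"} == 0        # job label is an assumption
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "AIStore endpoint {{ $labels.instance }} is down"
        - alert: AISPodRestarting
          expr: increase(kube_pod_container_status_restarts_total{namespace="ais"}[1h]) > 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"
        - alert: AISVolumeFillingUp
          expr: >
            kubelet_volume_stats_used_bytes{namespace="ais"}
              / kubelet_volume_stats_capacity_bytes{namespace="ais"} > 0.85
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is above 85% capacity"
```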
Alert Routing
Configure AlertManager to route notifications through appropriate channels:
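A minimal Alertmanager routing sketch follows; the receiver names, Slack channel, and credential placeholders are illustrative and must be replaced with your own integrations:

```yaml
# Alertmanager config sketch -- receivers and credentials are placeholders.
route:
  receiver: default
  group_by: ['alertname', 'namespace']
  routes:
    - matchers:
        - severity = "critical"
      receiver: oncall       # critical alerts page the on-call rotation
receivers:
  - name: default
    slack_configs:
      - channel: '#ais-alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'
  - name: oncall
    pagerduty_configs:
      - service_key: 'REPLACE_ME'
```

Routing sends everything to the default Slack channel while escalating `severity="critical"` alerts to PagerDuty; adjust the matchers to your own severity labels.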
Log Management in Kubernetes
Centralized logging is essential for troubleshooting AIStore in Kubernetes environments.
Centralized Logging Options
- ELK Stack (Elasticsearch, Logstash, Kibana):
  - Comprehensive but resource-intensive solution
  - Deploy using the Elastic Operator or Helm charts
  - Configure Filebeat as a DaemonSet to collect container logs
- Loki Stack:
  - Lighter weight than ELK
  - Designed to work with Prometheus and Grafana
  - Uses Promtail for log collection
- Fluent Bit / Fluentd:
  - Flexible log collectors that can send to various backends
  - Lower resource footprint (especially Fluent Bit)
  - Configure as a DaemonSet
AIStore-specific Logging Configuration
Configure log parsing for AIStore:
- For Fluent Bit, add AIStore parsing rules:
- For Promtail, add AIStore-specific pipeline stages:
- Create Grafana dashboard for AIStore logs with useful queries:
For Loki:
For Elasticsearch:
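Example queries for an AIStore logs dashboard might look like the following; the `namespace="ais"` label, container names, and field names are assumptions that depend on how your log collector labels AIStore output:

```
# Loki (LogQL) -- recent errors from AIStore targets
{namespace="ais", container=~"ais-target.*"} |= "error"

# Loki (LogQL) -- per-pod error rate over 5 minutes
sum(rate({namespace="ais"} |= "error" [5m])) by (pod)

# Elasticsearch (Lucene query string, e.g. in Kibana)
kubernetes.namespace: "ais" AND log: *error*
```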
Operational Best Practices
Resource Allocation
Ensure proper resource allocation for AIStore components in Kubernetes:
- Set appropriate resource requests and limits for predictable performance:
- Regularly monitor actual usage versus requested resources to optimize allocation:
  - Use `kubectl top pod` to view current resource usage
  - Analyze trends in Grafana dashboards
  - Adjust resource requests based on historical usage patterns
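The resource requests and limits described above can be sketched as follows for an AIStore target container; the values are illustrative only and should be sized from your own usage data:

```yaml
# Container resources sketch -- values are illustrative, not recommendations.
resources:
  requests:
    cpu: "4"
    memory: 16Gi
  limits:
    cpu: "8"
    memory: 32Gi
```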
Horizontal Pod Autoscaling
While AIStore targets don’t typically use HPA, proxy nodes can benefit from autoscaling:
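An HPA for the proxy tier might look like the following sketch; the workload kind, name, and utilization target are assumptions to match to your deployment:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ais-proxy-hpa
  namespace: ais
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet        # or Deployment, depending on how proxies are deployed
    name: ais-proxy
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```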
Pod Disruption Budgets
Protect AIStore availability during cluster maintenance operations:
Create separate PDBs for proxy and target components to ensure cluster stability.
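A PDB for the proxy tier might look like the sketch below (repeat with a matching selector for targets); the label selector is an assumption:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ais-proxy-pdb
  namespace: ais
spec:
  maxUnavailable: 1          # allow at most one proxy down during voluntary disruptions
  selector:
    matchLabels:
      component: proxy       # label used by your AIStore deployment (assumption)
```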
Readiness and Liveness Probes
Configure appropriate health checks for AIStore components:
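A probe sketch is shown below. It assumes AIStore's `/v1/health` endpoint is reachable on the container port; the port number and timing values are assumptions to tune for your cluster:

```yaml
# Probe sketch -- port and timings are assumptions.
readinessProbe:
  httpGet:
    path: /v1/health?readiness=true
    port: 51080
  initialDelaySeconds: 10
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /v1/health
    port: 51080
  initialDelaySeconds: 60    # allow time for startup and mountpath initialization
  periodSeconds: 30
  failureThreshold: 3
```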
Storage Configuration
For production deployments:
- Use local storage for mountpaths:
  - Provides better performance than network storage
  - Avoids network congestion for data access
- Use separate storage for metadata:
  - Consider SSD/NVMe storage for metadata
  - Improves metadata operations performance
- Configure storage affinity:
  - Keep pods on nodes with their persistent storage
  - Use node selectors or pod affinity rules
Troubleshooting AIStore in Kubernetes
Common Issues and Solutions
Collecting Debug Information
Script to collect comprehensive debug information:
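A minimal collection script might look like the following sketch; the `ais` namespace is an assumption, and you may want to add cluster-wide resources (nodes, storage classes) relevant to your deployment:

```shell
#!/usr/bin/env bash
# Debug-collection sketch -- namespace "ais" is an assumption.
set -u
NS=ais
OUT="ais-debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"

kubectl -n "$NS" get pods -o wide                     > "$OUT/pods.txt"
kubectl -n "$NS" describe pods                        > "$OUT/describe-pods.txt"
kubectl -n "$NS" get events --sort-by=.lastTimestamp  > "$OUT/events.txt"
kubectl -n "$NS" get pvc,pv                           > "$OUT/storage.txt"

# Collect recent logs from every AIStore pod
for pod in $(kubectl -n "$NS" get pods -o name); do
  kubectl -n "$NS" logs "$pod" --tail=1000 > "$OUT/${pod#pod/}.log" 2>&1 || true
done

tar czf "$OUT.tar.gz" "$OUT"
echo "Debug bundle written to $OUT.tar.gz"
```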
Analyzing Issues with AIStore Metrics
To correlate issues with metrics:
- Check related Prometheus metrics:
- Search for specific metrics:
  - Target metrics: `ais_target_*`
  - Proxy metrics: `ais_proxy_*`
  - Node state: `ais_target_state_flags`
  - Error counters: `ais_target_err_*`
- Use PromQL for advanced analysis:
Further Reading
- AIStore K8s Repository
- Prometheus Operator Documentation
- Kubernetes Monitoring Best Practices
- Grafana Loki Documentation
- Fluent Bit Kubernetes Documentation
- AIStore K8s Helm Chart Documentation