This document explains how to implement and use observability features for AIStore deployments in Kubernetes environments. Kubernetes provides additional tools and patterns for monitoring that complement AIStore’s built-in observability features.
Storage classes configured for persistent volumes (if using persistent storage)
NOTE: The YAML examples provided in this document are intended as reference templates that demonstrate the structure and key components required for AIStore observability in Kubernetes. These examples should be reviewed and validated by Kubernetes experts before applying to production environments. They may require adjustments based on your specific Kubernetes version, monitoring stack configuration, and AIStore deployment. API versions and specific field formats can vary between Kubernetes releases.
Deployment Methods
Method 1: Using the AIS-K8s Repository
The AIS-K8s repository provides pre-configured monitoring for AIStore. This is the recommended approach for production deployments.
    release: prometheus  # Match the release label used by your Prometheus instance
spec:
  selector:
    matchLabels:
      app: ais
  namespaceSelector:
    matchNames:
    - ais-namespace  # Replace with your AIStore namespace
  endpoints:
  - port: metrics  # Must match the service port name in AIStore service definition
    interval: 15s
    path: /metrics
    relabelings:
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - sourceLabels: [__meta_kubernetes_pod_name]
      action: replace
      targetLabel: instance
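To make the two relabeling rules above concrete, the following is a minimal Python sketch of what they do to a scraped target's labels. It is a simplified model rather than Prometheus's actual implementation: it assumes the default replacement (`$1`), ignores separators and empty-value handling, and the pod name and label values are hypothetical.

```python
import re


def labelmap(labels, regex, replacement=r"\1"):
    """Copy each label whose name fully matches `regex` to a new name.

    Prometheus anchors relabeling regexes, hence fullmatch().
    """
    pattern = re.compile(regex)
    out = dict(labels)
    for name, value in labels.items():
        match = pattern.fullmatch(name)
        if match:
            out[match.expand(replacement)] = value
    return out


def replace_label(labels, source, target):
    """Simplified 'replace' action: copy the source label's value to `target`."""
    out = dict(labels)
    out[target] = labels.get(source, "")
    return out


# Hypothetical labels attached by Kubernetes service discovery.
discovered = {
    "__meta_kubernetes_pod_label_app": "ais",
    "__meta_kubernetes_pod_label_component": "target",
    "__meta_kubernetes_pod_name": "ais-target-0",
}

step1 = labelmap(discovered, r"__meta_kubernetes_pod_label_(.+)")
step2 = replace_label(step1, "__meta_kubernetes_pod_name", "instance")

# Labels starting with "__" are dropped once relabeling has finished.
final = {k: v for k, v in step2.items() if not k.startswith("__")}
print(final)
```

In this model every pod label becomes a metric label, and `instance` carries the pod name instead of the default address.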
Apply the ServiceMonitor:
$ kubectl apply -f servicemonitor.yaml
Import AIStore dashboards to Grafana:
Download dashboard JSONs from the AIS-K8s repository
Navigate to Grafana UI > Dashboards > Import
Upload the dashboard JSON files
Configuring AIStore for Kubernetes Monitoring
Note: The following YAML examples demonstrate the general structure but may need adjustments for your specific environment. JSON configuration within ConfigMaps should use proper JSON formatting without comments.
To ensure AIStore exposes metrics properly in Kubernetes:
Verify AIStore ConfigMap includes Prometheus configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ais-config
data:
  # "pushgateway" is left empty for pull-based metrics.
  # Other AIStore configuration keys are omitted here.
  # Comments must stay outside the JSON body: JSON itself does not allow them.
  ais.json: |
    {
      "prometheus": {
        "enabled": true,
        "pushgateway": ""
      }
    }
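Because the `ais.json` value must be strict JSON, it can be worth checking the embedded text before applying the ConfigMap. The sketch below uses only Python's standard `json` module. The helper name `validate_embedded_json` is hypothetical, and the two snippets are trimmed-down illustrations rather than a complete AIStore configuration.

```python
import json


def validate_embedded_json(text):
    """Return (True, parsed) for strict JSON, else (False, error description)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"


# A '#' comment inside the JSON body makes it invalid.
with_comment = """{
  "prometheus": {
    "enabled": true,
    "pushgateway": ""  # Leave empty for pull-based metrics
  }
}"""

# The same content without the comment parses cleanly.
strict = """{
  "prometheus": {
    "enabled": true,
    "pushgateway": ""
  }
}"""

ok_bad, detail_bad = validate_embedded_json(with_comment)
ok_good, parsed = validate_embedded_json(strict)
print(ok_bad, detail_bad)
print(ok_good, parsed["prometheus"]["enabled"])
```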
Check that AIStore Service definitions expose the metrics port:
apiVersion: v1
kind: Service
metadata:
  name: ais-targets
  labels:
    app: ais
    component: target
spec:
  ports:
  - name: metrics  # This name must match the port in ServiceMonitor
    port: 8081
    targetPort: 8081
  selector:
    app: ais
    component: target
---
apiVersion: v1
kind: Service
metadata:
  name: ais-proxies
  labels:
    app: ais
    component: proxy
spec:
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
  selector:
    app: ais
    component: proxy
Verify metrics are being exposed by checking directly:
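One possible way to run this check is sketched below in Python. It assumes the target Service has been made reachable locally, for example with `kubectl port-forward svc/ais-targets 8081:8081 -n ais-namespace`; the namespace and local port are assumptions taken from the examples in this document. The metric names in the sample text are hypothetical placeholders, not actual AIStore metric names.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Fetch the raw Prometheus text exposition from a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(exposition_text):
    """Collect the distinct metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(None, 1)[0])
    return names


# Hypothetical sample standing in for a real scrape.
sample = """# HELP example_requests_total Hypothetical counter.
# TYPE example_requests_total counter
example_requests_total{node="t1"} 42
example_up 1
"""

names = metric_names(sample)
print(sorted(names))

# Against a live, port-forwarded endpoint (assumed URL):
#   text = fetch_metrics("http://localhost:8081/metrics")
#   print(len(metric_names(text)), "metric families exposed")
```

An empty result, or a connection error from `fetch_metrics`, suggests the metrics port is not exposed as expected.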
Configure Prometheus AlertManager rules for proactive monitoring:
Note: The following AlertManager configurations are examples that demonstrate structure and common alert patterns. You’ll need to customize thresholds and selectors for your environment and ensure compatibility with your Prometheus Operator version.
AlertManager Rules
Create a PrometheusRule resource for Kubernetes-specific alerts. Each alert condition (`expr`) is a PromQL query:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ais-k8s-alerts
  namespace: monitoring
spec:
  groups:
  - name: ais.kubernetes.rules
    rules:
    - alert: AIStorePodRestartingFrequently
      expr: rate(kube_pod_container_status_restarts_total{namespace="ais-namespace"}[15m]) > 0.2
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "AIStore pod restarting frequently"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"

    - alert: AIStoreVolumeNearlyFull
      expr: kubelet_volume_stats_available_bytes{namespace="ais-namespace"} / kubelet_volume_stats_capacity_bytes{namespace="ais-namespace"} < 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "AIStore volume nearly full"
        description: "Volume {{ $labels.persistentvolumeclaim }} has only {{ $value | humanizePercentage }} of its capacity available"

    - alert: AIStoreHighNetworkTraffic
      expr: sum(rate(container_network_transmit_bytes_total{namespace="ais-namespace"}[5m])) > 1e9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High network traffic"
        description: "Network traffic exceeds 1GB/s for 10 minutes"
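When customizing the thresholds, it helps to translate them into concrete quantities. PromQL's `rate()` returns a per-second average, so the example thresholds are per-second values. The Python sketch below works through that arithmetic; the helper names and the volume sizes are hypothetical and chosen only for illustration.

```python
def events_in_window(per_second_rate, window_seconds):
    """Number of events implied by a per-second rate sustained over a window."""
    return per_second_rate * window_seconds


def available_fraction(available_bytes, capacity_bytes):
    """Fraction of a volume that is still free (what the volume alert compares to 0.1)."""
    return available_bytes / capacity_bytes


# Restart alert: rate(...[15m]) > 0.2 means more than 0.2 restarts per second.
restarts = events_in_window(0.2, 15 * 60)
print(f"restart threshold ~ {restarts:.0f} restarts per 15 minutes")

# Volume alert: fires when less than 10% of capacity is available.
GIB = 1024 ** 3
free = available_fraction(8 * GIB, 100 * GIB)  # hypothetical volume
print(f"available fraction = {free:.2f}, alert fires: {free < 0.1}")

# Network alert: 1e9 bytes per second expressed in gigabits per second.
gbit_per_s = 1e9 * 8 / 1e9
print(f"network threshold = {gbit_per_s:.0f} Gbit/s")
```

Read this way, the `> 0.2` restart threshold corresponds to roughly 180 restarts in a 15-minute window. If the intent is to alert on a handful of restarts, a much lower threshold, or an expression based on `increase()` over the same window, may be closer to what is wanted.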