AIStore Kubernetes Observability


This document explains how to implement and use observability features for AIStore deployments in Kubernetes environments. Kubernetes provides additional tools and patterns for monitoring that complement AIStore’s built-in observability features.

Table of Contents

  • Kubernetes Monitoring Architecture
  • Prerequisites
  • Deployment Methods
  • Configuring AIStore for Kubernetes Monitoring
  • Kubernetes-specific Metrics
  • Grafana Dashboards for Kubernetes
  • Alerting in Kubernetes
  • Log Management in Kubernetes
  • Operational Best Practices
  • Troubleshooting AIStore in Kubernetes
  • Further Reading

Kubernetes Monitoring Architecture

When deployed in Kubernetes, AIStore observability typically follows this architecture:

```
┌──────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                    │
│                                                          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │   AIStore   │    │ Prometheus  │    │   Grafana   │   │
│  │    Pods     │───▶│  Operator   │───▶│    Pods     │   │
│  │             │    │             │    │             │   │
│  └─────────────┘    └─────────────┘    └─────────────┘   │
│         │                  │                  │          │
│         ▼                  ▼                  ▼          │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │ Kubernetes  │    │ AlertManager│    │ Persistent  │   │
│  │   Metrics   │    │    Pods     │    │   Storage   │   │
│  └─────────────┘    └─────────────┘    └─────────────┘   │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

Key components in this architecture:

  • AIStore Pods: Expose Prometheus metrics via the /metrics endpoint
  • Prometheus Operator: Manages Prometheus instances and monitoring configurations
  • Grafana: Provides visualization for both AIStore and Kubernetes metrics
  • AlertManager: Handles alert routing and notifications
  • Kubernetes Metrics: Standard metrics from the Kubernetes API
  • Persistent Storage: For long-term metrics retention and Grafana state

Prerequisites

Before setting up AIStore observability in Kubernetes, ensure you have:

  • A functional Kubernetes cluster (v1.30+)
  • AIStore deployed on the cluster
  • kube-prometheus-stack or its individual components:
    • Prometheus Operator
    • Prometheus Server
    • AlertManager
    • Grafana
  • kubectl configured to access your cluster
  • Helm v3+ (for chart-based installations)
  • Storage classes configured for persistent volumes (if using persistent storage)
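
Most of these prerequisites can be sanity-checked from a workstation before starting; a quick sketch (assumes `kubectl` and `helm` are on the PATH and the kubeconfig points at the target cluster):

```bash
# Verify client tooling and cluster reachability
kubectl version        # server version should be v1.30+
helm version           # should report v3.x

# Confirm the Prometheus Operator CRDs from kube-prometheus-stack are installed
kubectl get crd servicemonitors.monitoring.coreos.com prometheusrules.monitoring.coreos.com

# Confirm storage classes exist (if using persistent storage)
kubectl get storageclass
```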

NOTE: The YAML examples provided in this document are intended as reference templates that demonstrate the structure and key components required for AIStore observability in Kubernetes. These examples should be reviewed and validated by Kubernetes experts before applying to production environments. They may require adjustments based on your specific Kubernetes version, monitoring stack configuration, and AIStore deployment. API versions and specific field formats can vary between Kubernetes releases.

Deployment Methods

Method 1: Using the AIS-K8s Repository

The AIS-K8s repository provides pre-configured monitoring for AIStore. This is the recommended approach for production deployments.

```bash
# Clone the repository
git clone https://github.com/NVIDIA/ais-k8s
cd ais-k8s

# Deploy AIStore with monitoring
helm install ais-deployment ./helm/ais --set monitoring.enabled=true
```

For more detailed deployment options:

```bash
# View all available monitoring configuration options
helm show values ./helm/ais | grep -A20 monitoring

# Deploy with customized monitoring settings
helm install ais-deployment ./helm/ais \
  --set monitoring.enabled=true \
  --set monitoring.grafana.persistence.enabled=true \
  --set monitoring.prometheus.retention=15d
```

The AIS-K8s deployment includes:

  • Properly configured ServiceMonitors for AIStore components
  • Pre-built Grafana dashboards
  • Default AlertManager rules
  • Persistent storage configuration (optional)
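
For repeatable deployments, the same settings can be kept in a values file instead of `--set` flags. An illustrative sketch — the exact keys depend on the chart version, so verify them against `helm show values ./helm/ais`:

```yaml
# monitoring-values.yaml -- illustrative; key names assumed from the --set flags above
monitoring:
  enabled: true
  grafana:
    persistence:
      enabled: true
  prometheus:
    retention: 15d
```

Then install with `helm install ais-deployment ./helm/ais -f monitoring-values.yaml`.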

Method 2: Manual Configuration with Prometheus Operator

If you’re using a custom deployment or need fine-grained control, you can manually configure the monitoring stack:

  1. Deploy Prometheus Operator using Helm:

```bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create a namespace for monitoring
kubectl create namespace monitoring

# Install the kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
```

  2. Create a ServiceMonitor for AIStore:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ais-monitors
  namespace: monitoring
  labels:
    release: prometheus  # Match the release label used by your Prometheus instance
spec:
  selector:
    matchLabels:
      app: ais
  namespaceSelector:
    matchNames:
      - ais-namespace  # Replace with your AIStore namespace
  endpoints:
    - port: metrics  # Must match the service port name in AIStore service definition
      interval: 15s
      path: /metrics
      relabelings:
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - sourceLabels: [__meta_kubernetes_pod_name]
          action: replace
          targetLabel: instance
```

  3. Apply the ServiceMonitor:

```bash
kubectl apply -f servicemonitor.yaml
```

  4. Import AIStore dashboards to Grafana:
    • Download dashboard JSONs from the AIS-K8s repository
    • Navigate to Grafana UI > Dashboards > Import
    • Upload the dashboard JSON files
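
Once the ServiceMonitor is applied, it is worth confirming that Prometheus actually discovered the AIStore endpoints; a quick check (service name as used by kube-prometheus-stack):

```bash
# Confirm the ServiceMonitor exists
kubectl get servicemonitor ais-monitors -n monitoring

# Port-forward to Prometheus and list discovered scrape targets
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
curl -s localhost:9090/api/v1/targets | grep -o '"job":"[^"]*"' | sort -u
```

If no AIStore jobs appear, the usual culprits are a label mismatch between the ServiceMonitor selector and the Service, or a missing `release` label on the ServiceMonitor.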

Configuring AIStore for Kubernetes Monitoring

Note: The following YAML examples demonstrate the general structure but may need adjustments for your specific environment. JSON configuration within ConfigMaps should use proper JSON formatting without comments.

To ensure AIStore exposes metrics properly in Kubernetes:

  1. Verify AIStore ConfigMap includes Prometheus configuration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ais-config
data:
  ais.json: |
    {
      "prometheus": {
        "enabled": true,
        "pushgateway": ""   # Leave empty for pull-based metrics
      },
      # Other AIStore configuration...
    }
```

  2. Check that AIStore Service definitions expose the metrics port:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ais-targets
  labels:
    app: ais
    component: target
spec:
  ports:
    - name: metrics  # This name must match the port in ServiceMonitor
      port: 8081
      targetPort: 8081
  selector:
    app: ais
    component: target
---
apiVersion: v1
kind: Service
metadata:
  name: ais-proxies
  labels:
    app: ais
    component: proxy
spec:
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
  selector:
    app: ais
    component: proxy
```

  3. Verify metrics are being exposed by checking directly:

```bash
# Forward a port to an AIStore target pod
kubectl port-forward -n ais-namespace pod/ais-target-0 8081:8081

# In another terminal, check the metrics endpoint
curl localhost:8081/metrics
```

Kubernetes-specific Metrics

When monitoring AIStore in Kubernetes, you should track both AIStore-specific metrics and Kubernetes infrastructure metrics:

Pod Metrics

| Metric | Description | Use Case |
| --- | --- | --- |
| `kube_pod_container_resource_usage_cpu_cores` | CPU usage by AIStore pods | Capacity planning, detect overloads |
| `kube_pod_container_resource_requests_cpu_cores` | CPU requested by pods | Resource allocation analysis |
| `kube_pod_container_resource_limits_cpu_cores` | CPU limits for pods | Resource constraint checks |
| `kube_pod_container_resource_usage_memory_bytes` | Memory usage by AIStore pods | Detect memory issues |
| `kube_pod_container_status_restarts_total` | Container restart count | Identify stability issues |
| `kube_pod_status_phase` | Pod status (running, pending, failed) | Monitor pod lifecycle |

Volume Metrics

| Metric | Description | Use Case |
| --- | --- | --- |
| `kubelet_volume_stats_available_bytes` | Available volume space | Capacity planning |
| `kubelet_volume_stats_capacity_bytes` | Total volume capacity | Storage provisioning |
| `kubelet_volume_stats_used_bytes` | Used volume space | Detect storage constraints |
| `kubelet_volume_stats_inodes_free` | Available inodes | Detect inode exhaustion |
| `volume_manager_total_volumes` | Volume count | Resource monitoring |

Network Metrics

| Metric | Description | Use Case |
| --- | --- | --- |
| `container_network_receive_bytes_total` | Network bytes received | Traffic analysis |
| `container_network_transmit_bytes_total` | Network bytes transmitted | Bandwidth usage |
| `container_network_receive_packets_total` | Network packets received | Network troubleshooting |
| `container_network_transmit_packets_total` | Network packets transmitted | Network troubleshooting |
| `container_network_receive_packets_dropped_total` | Dropped incoming packets | Detect network issues |
| `container_network_transmit_packets_dropped_total` | Dropped outgoing packets | Detect network issues |

Node Metrics

| Metric | Description | Use Case |
| --- | --- | --- |
| `node_cpu_seconds_total` | CPU usage by node | Node performance |
| `node_memory_MemAvailable_bytes` | Available memory | Resource management |
| `node_filesystem_avail_bytes` | Available filesystem space | Storage health |
| `node_network_transmit_bytes_total` | Node network usage | Network capacity planning |
| `node_disk_io_time_seconds_total` | Disk I/O time | Storage performance analysis |
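
The tables above translate directly into PromQL. A few illustrative queries scoped to AIStore pods (assuming the namespace is `ais-namespace`):

```
# Container restarts over the last hour, per pod
increase(kube_pod_container_status_restarts_total{namespace="ais-namespace"}[1h])

# Volume usage as a fraction of capacity, per PVC
kubelet_volume_stats_used_bytes{namespace="ais-namespace"}
  / kubelet_volume_stats_capacity_bytes{namespace="ais-namespace"}

# Aggregate network transmit throughput (bytes/s) across all AIStore pods
sum(rate(container_network_transmit_bytes_total{namespace="ais-namespace"}[5m]))
```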

Grafana Dashboards for Kubernetes

For effective AIStore monitoring in Kubernetes, use a combination of specialized dashboards:

1. AIStore Application Dashboard

Focus on AIStore-specific metrics:

  • Throughput and latency metrics (GET, PUT, DELETE operations)
  • Operation rates and error counts
  • Rebalance and resilver status
  • Cache hit ratios and storage utilization
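
For example, the latency and throughput panels can be driven by queries like the following; the `ais_target_*` metric names here match those referenced in the troubleshooting section, but you should verify them against your deployment's `/metrics` output:

```
# Average GET latency (ns per request) over 5m
rate(ais_target_get_ns_total[5m]) / rate(ais_target_get_count[5m])

# GET request rate, per target pod
sum by (instance) (rate(ais_target_get_count[5m]))
```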

2. Kubernetes Resource Dashboard

Focus on Kubernetes infrastructure:

  • Pod resource usage (CPU, memory) for AIStore components
  • Network traffic between pods and nodes
  • Volume usage and I/O performance
  • Pod restarts and health status
  • Node resource utilization

*(Screenshot: AIStore Kubernetes dashboard)*

3. Combined Operational Dashboard

Correlate application and infrastructure metrics:

  • AIStore performance versus underlying resource usage
  • Impact of pod scheduling and restarts on operations
  • Storage latency versus Kubernetes volume metrics
  • Network throughput correlation with operation rates

Example Dashboard Import

Import the AIStore Kubernetes dashboard:

  1. Download the dashboard JSON from the ais-k8s repository
  2. In Grafana, navigate to Dashboards > Import
  3. Upload the JSON file or paste its contents
  4. Select your Prometheus data source
  5. Customize dashboard variables if needed
  6. Click Import

Create dashboard variables for better filtering:

  • namespace: AIStore namespace
  • pod: AIStore pod selector
  • node: Kubernetes node selector
  • interval: Time range for rate calculations
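
With a Prometheus data source, these variables are typically defined with `label_values()` template queries. A sketch, assuming kube-state-metrics is installed (as in the kube-prometheus-stack):

```
namespace: label_values(kube_pod_info, namespace)
pod:       label_values(kube_pod_info{namespace="$namespace"}, pod)
node:      label_values(kube_node_info, node)
```

The `interval` variable is usually a Grafana "Interval" type variable (e.g. `1m,5m,15m,1h`) rather than a query.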

Alerting in Kubernetes

Configure Prometheus AlertManager rules for proactive monitoring:

Note: The following AlertManager configurations are examples that demonstrate structure and common alert patterns. You’ll need to customize thresholds and selectors for your environment and ensure compatibility with your Prometheus Operator version.

AlertManager Rules

Create a PrometheusRule resource for Kubernetes-specific alerts (note the PromQL expression in each rule):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ais-k8s-alerts
  namespace: monitoring
spec:
  groups:
    - name: ais.kubernetes.rules
      rules:
        - alert: AIStorePodRestartingFrequently
          expr: rate(kube_pod_container_status_restarts_total{namespace="ais-namespace"}[15m]) > 0.2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "AIStore pod restarting frequently"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"

        - alert: AIStoreVolumeNearlyFull
          expr: kubelet_volume_stats_available_bytes{namespace="ais-namespace"} / kubelet_volume_stats_capacity_bytes{namespace="ais-namespace"} < 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "AIStore volume nearly full"
            description: "Volume {{ $labels.persistentvolumeclaim }} has only {{ $value | humanizePercentage }} of its capacity available"

        - alert: AIStoreHighNetworkTraffic
          expr: sum(rate(container_network_transmit_bytes_total{namespace="ais-namespace"}[5m])) > 1e9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High network traffic"
            description: "Network traffic exceeds 1GB/s for 10 minutes"

        - alert: AIStorePodNotReady
          expr: sum by(namespace, pod, phase) (kube_pod_status_phase{namespace="ais-namespace", phase=~"Pending|Unknown|Failed"}) > 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "AIStore pod not ready"
            description: "Pod {{ $labels.pod }} is in {{ $labels.phase }} state for more than 10 minutes"

        - alert: AIStoreNodeHighLoad
          expr: node_load5 / on(instance) count by(instance) (node_cpu_seconds_total{mode="system"}) > 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High node load affecting AIStore"
            description: "Node {{ $labels.instance }} has a high load average, which may affect AIStore performance"
```
Alert Routing

Configure AlertManager to route notifications through appropriate channels:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: ais-alert-routing
  namespace: monitoring
spec:
  route:
    receiver: 'ais-team-slack'
    groupBy: ['alertname', 'namespace']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    routes:
      - matchers:
          - name: severity
            value: critical
        receiver: 'ais-team-pagerduty'
        continue: true
  receivers:
    - name: 'ais-team-slack'
      slackConfigs:
        - apiURL:
            key: slack-url
            name: alertmanager-slack-secret
          channel: '#ais-alerts'
          sendResolved: true
    - name: 'ais-team-pagerduty'
      pagerdutyConfigs:
        - routingKey:
            key: pagerduty-key
            name: alertmanager-pagerduty-secret
          sendResolved: true
```
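
The Slack and PagerDuty receivers reference Kubernetes Secrets by name and key, so those Secrets must exist in the same namespace for the routing to work. A sketch (replace the placeholders with real credentials; the secret and key names match those used in the AlertmanagerConfig above):

```bash
kubectl create secret generic alertmanager-slack-secret \
  --namespace monitoring \
  --from-literal=slack-url='<your-slack-webhook-url>'

kubectl create secret generic alertmanager-pagerduty-secret \
  --namespace monitoring \
  --from-literal=pagerduty-key='<your-pagerduty-routing-key>'
```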

Log Management in Kubernetes

Centralized logging is essential for troubleshooting AIStore in Kubernetes environments.

Centralized Logging Options

  1. ELK Stack (Elasticsearch, Logstash, Kibana):

    • Comprehensive but resource-intensive solution
    • Deploy using the Elastic Operator or Helm charts
    • Configure Filebeat as a DaemonSet to collect container logs
```bash
# Add Elastic Helm repo
helm repo add elastic https://helm.elastic.co
helm repo update

# Install Elasticsearch
helm install elasticsearch elastic/elasticsearch \
  --namespace logging \
  --create-namespace

# Install Kibana
helm install kibana elastic/kibana \
  --namespace logging

# Install Filebeat
helm install filebeat elastic/filebeat \
  --namespace logging \
  --set daemonset.enabled=true
```
  2. Loki Stack:

    • Lighter weight than ELK
    • Designed to work with Prometheus and Grafana
    • Uses Promtail for log collection
```bash
# Add Grafana Helm repo
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki Stack with Promtail
helm install loki grafana/loki-stack \
  --namespace logging \
  --create-namespace \
  --set promtail.enabled=true \
  --set grafana.enabled=true
```
  3. Fluent Bit / Fluentd:

    • Flexible log collectors that can send to various backends
    • Lower resource footprint (especially Fluent Bit)
    • Configure as a DaemonSet
```bash
# Install Fluent Bit
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  --create-namespace
```

AIStore-specific Logging Configuration

Configure log parsing for AIStore:

  1. For Fluent Bit, add AIStore parsing rules:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-parsers
  namespace: logging
data:
  parsers.conf: |
    [PARSER]
        Name         ais_json
        Format       json
        Time_Key     time
        Time_Format  %Y-%m-%dT%H:%M:%S.%L
```

  2. For Promtail, add AIStore-specific pipeline stages:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  promtail.yaml: |
    scrape_configs:
      - job_name: kubernetes-pods
        pipeline_stages:
          - json:
              expressions:
                level: level
                msg: msg
                time: time
          - labels:
              level:
```

  3. Create a Grafana dashboard for AIStore logs with useful queries:

For Loki:

```
{namespace="ais-namespace"} |= "error" | json | level="error"
```

For Elasticsearch:

```
kubernetes.namespace:"ais-namespace" AND message:error AND level:error
```

Operational Best Practices

Resource Allocation

Ensure proper resource allocation for AIStore components in Kubernetes:

  • Set appropriate resource requests and limits for predictable performance:
```yaml
resources:
  requests:
    cpu: 1000m
    memory: 4Gi
  limits:
    cpu: 2000m
    memory: 8Gi
```
  • Regularly monitor actual usage versus requested resources to optimize allocation:
    • Use kubectl top pod to view current resource usage
    • Analyze trends in Grafana dashboards
    • Adjust resource requests based on historical usage patterns
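
The usage-versus-requests comparison can also be expressed directly in PromQL — a sketch, assuming the kube-state-metrics v2 label conventions (`resource="memory"`):

```
# Memory working set as a fraction of the memory request, per AIStore pod
sum by (pod) (container_memory_working_set_bytes{namespace="ais-namespace", container!=""})
  / sum by (pod) (kube_pod_container_resource_requests{namespace="ais-namespace", resource="memory"})
```

Values consistently well below 1 suggest over-provisioned requests; values approaching 1 (with limits nearby) indicate OOM risk.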

Horizontal Pod Autoscaling

While AIStore targets don’t typically use HPA, proxy nodes can benefit from autoscaling:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ais-proxy-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ais-proxy
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

Pod Disruption Budgets

Protect AIStore availability during cluster maintenance operations:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ais-target-pdb
spec:
  minAvailable: 60%
  selector:
    matchLabels:
      app: ais
      component: target
```

Create separate PDBs for proxy and target components to ensure cluster stability.
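
A matching PDB for the proxies could look like the following sketch (illustrative threshold; since proxies are stateless, an absolute `minAvailable` count is often sufficient):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ais-proxy-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ais
      component: proxy
```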

Readiness and Liveness Probes

Configure appropriate health checks for AIStore components:

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 15
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 3
```

Storage Configuration

For production deployments:

  1. Use local storage for mountpaths:

    • Provides better performance than network storage
    • Avoids network congestion for data access
  2. Use separate storage for metadata:

    • Consider SSD/NVMe storage for metadata
    • Improves metadata operations performance
  3. Configure storage affinity:

    • Keep pods on nodes with their persistent storage
    • Use node selectors or pod affinity rules
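
A minimal sketch of storage affinity using a `nodeAffinity` rule that pins target pods to labeled storage nodes (the `ais-storage-node` label is a hypothetical example; any node label applied to the machines hosting local disks works):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: ais-storage-node   # hypothetical label on nodes with local disks
              operator: In
              values:
                - "true"
```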

Troubleshooting AIStore in Kubernetes

Common Issues and Solutions

| Issue | Symptoms | Troubleshooting Commands |
| --- | --- | --- |
| Pod won’t start | Pod stuck in Pending state | `kubectl describe pod <pod-name>` |
| Configuration issues | Pod starts but AIStore service fails | `kubectl logs <pod-name>` |
| Performance degradation | High latency, low throughput | `kubectl top pod` |
| Network connectivity | Transport errors in logs | `kubectl exec <pod-name> -- ping <target>` |
| Storage issues | I/O errors, disk full | `kubectl exec <pod-name> -- df -h` |
| Service discovery | Health check failures | `kubectl exec <pod-name> -- curl -v localhost:8080/health` |
| Resource starvation | OOMKilled, CPU throttling | `kubectl describe pod <pod-name>`; `kubectl top pod <pod-name>` |

Collecting Debug Information

Script to collect comprehensive debug information:

```bash
#!/bin/bash
NAMESPACE="ais-namespace"
OUTPUT_DIR="ais-debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUTPUT_DIR"

echo "Collecting AIStore debug information from namespace $NAMESPACE..."

# Get all resources
kubectl get all -n "$NAMESPACE" -o wide > "$OUTPUT_DIR/resources.txt"

# Get pod descriptions
for pod in $(kubectl get pods -n "$NAMESPACE" -o name); do
  kubectl describe "$pod" -n "$NAMESPACE" > "$OUTPUT_DIR/$(echo "$pod" | cut -d/ -f2)-describe.txt"
  echo "Collected description for $pod"
done

# Get logs with timestamps
for pod in $(kubectl get pods -n "$NAMESPACE" -o name); do
  kubectl logs --timestamps=true "$pod" -n "$NAMESPACE" > "$OUTPUT_DIR/$(echo "$pod" | cut -d/ -f2)-logs.txt"
  echo "Collected logs for $pod"
done

# Get configmaps
kubectl get configmaps -n "$NAMESPACE" -o yaml > "$OUTPUT_DIR/configmaps.yaml"

# Get secrets (without revealing values)
kubectl get secrets -n "$NAMESPACE" -o yaml | grep -v "data:" > "$OUTPUT_DIR/secrets-metadata.yaml"

# Get PVCs and PVs
kubectl get pvc -n "$NAMESPACE" -o yaml > "$OUTPUT_DIR/pvcs.yaml"
kubectl get pv -o yaml > "$OUTPUT_DIR/pvs.yaml"

# Get services and endpoints
kubectl get services -n "$NAMESPACE" -o yaml > "$OUTPUT_DIR/services.yaml"
kubectl get endpoints -n "$NAMESPACE" -o yaml > "$OUTPUT_DIR/endpoints.yaml"

# Get metrics
kubectl top pods -n "$NAMESPACE" > "$OUTPUT_DIR/pod-metrics.txt"
kubectl top nodes > "$OUTPUT_DIR/node-metrics.txt"

# Get events sorted by time
kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' > "$OUTPUT_DIR/events.txt"

# Get node information
kubectl describe nodes > "$OUTPUT_DIR/nodes.txt"

# Collect Prometheus metrics if available
for pod in $(kubectl get pods -n "$NAMESPACE" -l app=ais -o name); do
  podname=$(echo "$pod" | cut -d/ -f2)
  port=$(kubectl get pod "$podname" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].ports[?(@.name=="metrics")].containerPort}')
  if [ -n "$port" ]; then
    kubectl port-forward -n "$NAMESPACE" "$pod" "$port:$port" > /dev/null 2>&1 &
    pid=$!
    sleep 2
    curl -s "localhost:$port/metrics" > "$OUTPUT_DIR/$podname-metrics.txt"
    kill $pid
    echo "Collected metrics for $pod"
  fi
done

echo "Debug information collected in $OUTPUT_DIR"
```

Analyzing Issues with AIStore Metrics

To correlate issues with metrics:

  1. Check related Prometheus metrics:

```bash
# Port forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090

# Open in browser: http://localhost:9090
```
  2. Search for specific metrics:

    • Target metrics: ais_target_*
    • Proxy metrics: ais_proxy_*
    • Node state: ais_target_state_flags
    • Error counters: ais_target_err_*
  3. Use PromQL for advanced analysis:

```
# Check for correlated spikes
rate(ais_target_get_ns_total[5m]) / rate(ais_target_get_count[5m])

# Compare against node metrics
rate(node_cpu_seconds_total{mode="user"}[5m])
```

Further Reading

| Document | Description |
| --- | --- |
| Overview | Introduction to AIS observability |
| CLI | Command-line monitoring tools |
| Logs | Configuring, accessing, and utilizing AIS logs |
| Prometheus | Configuring Prometheus with AIS |
| Metrics Reference | Complete metrics catalog |
| Grafana | Visualizing AIS metrics with Grafana |