Storage for Model Caching on AKS

To implement tiered storage on AKS, you can combine the storage options that Azure provides. This guide covers choosing the right storage for each Dynamo cache type and configuring the corresponding PersistentVolumeClaims (PVCs).

Available Storage Options

| Storage Option                | Performance    | Best For                                |
|-------------------------------|----------------|-----------------------------------------|
| Local CSI (Ephemeral Disk)    | Very high      | Fast model caching, warm restarts       |
| Azure Managed Lustre          | Extremely high | Large multi-node models, shared cache   |
| Azure Disk (Managed Disk)     | High           | Persistent single-writer model cache    |
| Azure Files                   | Medium         | Shared small/medium models              |
| Azure Blob (via Fuse or init) | Low-Medium     | Cold model storage, bootstrap downloads |

Azure Managed Lustre and Local CSI (ephemeral disk) are not installed by default in AKS and require additional setup before use. Azure Disk, Azure Files, and Azure Blob CSI drivers are available out of the box. See the Azure Lustre CSI Driver guide for Lustre setup, or the AKS CSI storage options documentation for a full overview of built-in drivers.

Recommendations by Cache Type

  • Model Cache: raw model artifacts, configuration files, tokenizers, etc.
    • Persistence: Required to avoid repeated downloads and reduce cold-start latency.
    • Recommended storage: Azure Managed Lustre (shared, high throughput) or Azure Disk (single-replica, persistent).
  • Compilation Cache: backend-specific compiled artifacts (e.g., TensorRT engines).
    • Persistence: Optional.
    • Recommended storage: Local CSI (fast, node-local) or Azure Disk (persistent when GPU configuration is fixed).
  • Performance Cache: runtime tuning and profiling data.
    • Persistence: Not required.
    • Recommended storage: Local CSI (or other ephemeral storage).
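As an illustration of the Azure Disk recommendation for a single-replica model cache, a claim might look like the sketch below. It assumes the built-in `managed-csi-premium` storage class is present in your cluster; the claim name `model-cache-disk` and the 100Gi size are hypothetical. An Azure Disk volume attaches to one node at a time, so this sketch requests ReadWriteOnce instead of ReadWriteMany.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-disk          # hypothetical name, not taken from the recipes
spec:
  accessModes:
    - ReadWriteOnce               # Azure Disk is single-writer: one node mounts it at a time
  resources:
    requests:
      storage: 100Gi              # illustrative size; match it to your model artifacts
  storageClassName: managed-csi-premium   # assumed to exist; confirm with kubectl get storageclass
```

If several replicas on different nodes need the same cache, a shared option such as Azure Managed Lustre or Azure Files is the better fit.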

Check Available Storage Classes

List the storage classes available in your AKS cluster:

$ kubectl get storageclass

NAME                           PROVISIONER                 RECLAIMPOLICY
azureblob-csi                  blob.csi.azure.com          Delete
azurefile                      file.csi.azure.com          Delete
azurefile-csi                  file.csi.azure.com          Delete
azurefile-csi-premium          file.csi.azure.com          Delete
azurefile-premium              file.csi.azure.com          Delete
default                        disk.csi.azure.com          Delete
managed                        disk.csi.azure.com          Delete
managed-csi                    disk.csi.azure.com          Delete
managed-csi-premium            disk.csi.azure.com          Delete
managed-premium                disk.csi.azure.com          Delete
sc.azurelustre.csi.azure.com   azurelustre.csi.azure.com   Retain

Example PVC Configuration

In each recipe's cache.yaml, set storageClassName to a storage class that is available in your AKS cluster:

# Model cache: shared, high-throughput storage (Azure Managed Lustre)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: "sc.azurelustre.csi.azure.com"
---
# Compilation cache: shared storage on Azure Files
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: "azurefile-csi"
---
# Performance cache: node-local ephemeral storage.
# "local-ephemeral" is not a built-in class; use the name created by your Local CSI setup
# and check that the driver supports the requested access mode.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perf-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: "local-ephemeral"
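To show how a workload might consume these claims, here is a minimal sketch of a Pod that mounts the model-cache PVC defined above. The Pod name, container name, image, and mount path are placeholders chosen for illustration; the recipes may wire the volume in through their own deployment manifests.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-mount-example        # hypothetical name
spec:
  containers:
    - name: worker                 # hypothetical container name
      image: example.com/dynamo-worker:latest   # placeholder image, replace with your own
      volumeMounts:
        - name: model-cache
          mountPath: /model-cache  # assumed path; point your model download directory here
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: model-cache     # must match the PVC name in cache.yaml
```

The Pod and the PVC must be in the same namespace, and the Pod stays Pending until the claim is bound.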

See Also