Model Caching

Download models once and share them across all pods in a Kubernetes cluster

Large language models can take minutes to download. Without caching, every pod downloads the full model independently, wasting bandwidth and delaying startup. Dynamo supports a simple shared-storage path and a ModelExpress path for faster weight distribution across larger clusters.

Option 1: Shared PVC + Download Job

The simplest approach: create a shared PVC, run a one-time Job to download the model, then mount the PVC in your DynamoGraphDeployment.

This is the pattern used by all Dynamo recipes today.

Step 1: Create a Shared PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi

ReadWriteMany access mode is required so multiple pods can mount the PVC simultaneously. Ensure your storage class supports RWX (e.g., NFS, CephFS, or cloud-provider shared file systems).
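
If your default storage class is not RWX-capable, you can pin the PVC to one that is. A minimal sketch, assuming an NFS-backed class named nfs-client exists in your cluster (the class name is an assumption; substitute your own):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client   # assumption: any RWX-capable class in your cluster
  resources:
    requests:
      storage: 100Gi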

Step 2: Download the Model

apiVersion: batch/v1
kind: Job
metadata:
  name: model-download
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: downloader
          image: python:3.12-slim
          command: ["sh", "-c"]
          args:
            - |
              pip install huggingface_hub hf_transfer
              HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
                $MODEL_NAME --revision $MODEL_REVISION
          env:
            - name: MODEL_NAME
              value: "Qwen/Qwen3-0.6B"
            - name: MODEL_REVISION
              value: "main"
            - name: HF_HOME
              value: /cache/huggingface
          envFrom:
            - secretRef:
                name: hf-token-secret
          volumeMounts:
            - name: model-cache
              mountPath: /cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
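
Assuming you saved the manifests above as pvc.yaml and download-job.yaml, you can apply them and block until the download finishes; size the timeout to your model:

$ kubectl apply -f pvc.yaml -f download-job.yaml
$ kubectl wait --for=condition=complete job/model-download --timeout=60m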

Find the Snapshot Path

After the Job completes, the model is stored in HuggingFace's cache layout under HF_HOME (the PVC root in this setup):

hub/models--<org>--<model>/snapshots/<commit-hash>/

For example, meta-llama/Llama-3.1-70B-Instruct becomes:

hub/models--meta-llama--Llama-3.1-70B-Instruct/snapshots/9d3b8e0f71f8c1e0f9b7c2a3d4e5f6a7b8c9d0e1/

To find the exact commit hash after the download Job completes:

$ kubectl run find-snapshot --rm -it --image=busybox --restart=Never \
> --overrides='{
> "spec": {
> "volumes": [{"name": "c", "persistentVolumeClaim": {"claimName": "model-cache"}}],
> "containers": [{
> "name": "f", "image": "busybox",
> "command": ["find", "/c/hub", "-mindepth", "3", "-maxdepth", "3", "-type", "d"],
> "volumeMounts": [{"name": "c", "mountPath": "/c"}]
> }]
> }
> }'

Alternatively, look up the commit hash on the HuggingFace Hub model page under Files and versions.
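
If you prefer to query it programmatically, one option is the huggingface_hub API; the model ID below is the same example used in the Job above, and a HF token may be required for gated repositories:

$ python3 -c "from huggingface_hub import HfApi; print(HfApi().model_info('Qwen/Qwen3-0.6B', revision='main').sha)"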

You need this path for the pvcModelPath field in a DGDR spec (see Model Deployment Guide — Model Caching).

Step 3: Mount in DynamoGraphDeployment

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  pvcs:
    - create: false
      name: model-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface

All VllmWorker pods that mount model-cache now read from the shared cache, avoiding per-pod worker downloads. If you also want the frontend to reuse tokenizer and config files, mount the same PVC there too.
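
For example, assuming your graph also defines a Frontend service (the service name must match your deployment), the same mount can be added there:

services:
  Frontend:
    volumeMounts:
      - name: model-cache
        mountPoint: /home/dynamo/.cache/huggingface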

Compilation Cache

For vLLM, you can also cache compiled artifacts (CUDA graphs, etc.) with a second PVC:

spec:
  pvcs:
    - create: false
      name: model-cache
    - create: false
      name: compilation-cache
  services:
    VllmWorker:
      volumeMounts:
        - name: model-cache
          mountPoint: /home/dynamo/.cache/huggingface
        - name: compilation-cache
          mountPoint: /home/dynamo/.cache/vllm
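
Because create: false is set, the compilation-cache PVC must already exist. One way to create it, mirroring Step 1 (the size is a rough guess; compiled artifacts are much smaller than model weights):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compilation-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi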

Option 2: ModelExpress (P2P Distribution)

ModelExpress is a model weight distribution service that integrates with vLLM’s weight loading pipeline. It can publish model weights from one worker and let later workers pull those tensors from GPU memory over NIXL/RDMA instead of repeating a full storage download.

ModelExpress can also use ModelStreamer as a loading strategy. ModelStreamer streams safetensors directly from object storage or a local filesystem path into GPU memory through the runai-model-streamer package. In that setup, the first worker can stream from storage and then publish ModelExpress metadata so later workers can use the P2P path.

Use this path when startup time or fleet-wide model rollout time matters more than the simplicity of a shared PVC.

How It Works

  1. A ModelExpress server runs in the cluster and stores metadata for available sources.
  2. vLLM workers use the ModelExpress loader (--load-format mx on newer ModelExpress images, or mx-source / mx-target on older split-loader images).
  3. If a compatible source is already serving the model, a new worker pulls model tensors from that source over NIXL/RDMA.
  4. If no source is available, the worker falls back to storage. When MX_MODEL_URI is set, ModelStreamer can stream safetensors from S3, GCS, Azure Blob Storage, or a local path.
  5. The Kubernetes operator can inject MODEL_EXPRESS_URL into all Dynamo pods from the platform modelExpressURL setting.

What To Configure

| Layer | What to configure | Notes |
| --- | --- | --- |
| Runtime image | Include the modelexpress Python package and, for ModelStreamer, runai-model-streamer plus the object-storage dependencies (see the sketch after this table). | Dynamo or vLLM raises an import error if the worker uses a ModelExpress load format but the package is missing. |
| ModelExpress server | Deploy the server with Redis or Kubernetes CRD metadata backend. | See the ModelExpress deployment guide. |
| Dynamo platform | Set dynamo-operator.modelExpressURL. | The operator injects MODEL_EXPRESS_URL into pods. |
| vLLM worker | Set the ModelExpress load format and point at the server. | Newer ModelExpress images use --load-format mx; older Dynamo images may use mx-source / mx-target. |
| ModelStreamer | Set MX_MODEL_URI to the storage location. | Supported URI forms include s3://..., gs://..., az://..., an absolute local path, or a Hugging Face model ID resolved from the local cache. |
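
For the Runtime image row, a minimal sketch of extending an existing runtime image is shown below. The base image placeholder and installing straight from PyPI are assumptions; use the package versions and object-storage extras that match your backend and runtime image.

# Sketch only: add ModelExpress and ModelStreamer dependencies to a runtime image.
FROM <vllm-runtime-image>
# Package names follow the table above; also add the object-storage dependency
# (S3, GCS, or Azure) that matches where your weights live.
RUN pip install --no-cache-dir modelexpress runai-model-streamer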

Setup

Install with Dynamo Platform:

$ helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
> --namespace ${NAMESPACE} \
> --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
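
To confirm the value reached the operator, one option is to inspect the installed release values (release name and namespace follow the command above):

$ helm get values dynamo-platform --namespace ${NAMESPACE} | grep modelExpressURL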

Configure workers to use ModelExpress:

services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image: <vllm-runtime-image-with-modelexpress>
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
          - --model-express-url
          - http://model-express-server.model-express.svc.cluster.local:8080
        env:
          - name: VLLM_PLUGINS
            value: modelexpress

When MODEL_EXPRESS_URL is configured in the operator, it is automatically injected as an environment variable into all component pods. Passing --model-express-url explicitly is still useful in examples because the worker validates that a server URL is available when using the older mx-source / mx-target load formats.
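
One way to verify the injection on a running worker is to print the variable from inside the pod; the deployment name below is illustrative, and the check assumes printenv is available in the image:

$ kubectl exec deploy/<vllm-worker-deployment> -- printenv MODEL_EXPRESS_URL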

Use the load format supported by your runtime image. ModelExpress v0.3 and newer document the unified mx loader. Some Dynamo images still expose the older split mx-source and mx-target loader names; those require the same server URL but separate source and target roles.

ModelStreamer From Object Storage

Set MX_MODEL_URI when the first worker should stream safetensors directly from storage instead of reading a PVC or relying on a prior source worker.

services:
  VllmWorker:
    extraPodSpec:
      mainContainer:
        image: <vllm-runtime-image-with-modelexpress-and-modelstreamer>
        command: ["python3", "-m", "dynamo.vllm"]
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --load-format
          - mx
          - --model-express-url
          - http://model-express-server.model-express.svc.cluster.local:8080
        env:
          - name: VLLM_PLUGINS
            value: modelexpress
          - name: MX_MODEL_URI
            value: s3://my-model-bucket/meta-llama/Llama-3.1-70B-Instruct
          - name: RUNAI_STREAMER_CONCURRENCY
            value: "8"

ModelStreamer relies on the underlying cloud SDK credentials:

| Storage backend | MX_MODEL_URI example | Credential options |
| --- | --- | --- |
| S3 or S3-compatible storage | s3://bucket/path/to/model | IRSA / workload identity, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_DEFAULT_REGION, and optional AWS_ENDPOINT_URL |
| Google Cloud Storage | gs://bucket/path/to/model | GKE Workload Identity, Application Default Credentials, or GOOGLE_APPLICATION_CREDENTIALS |
| Azure Blob Storage | az://container/path/to/model | Managed Identity, service principal env vars, or AZURE_ACCOUNT_NAME / AZURE_ACCOUNT_KEY |
| Local filesystem or PVC | /models/meta-llama/Llama-3.1-70B-Instruct | Mount the path into the worker pod |

Credentials are consumed by the storage SDKs in the worker pod. They do not flow through the ModelExpress server.
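
For example, static S3 credentials could be kept in a Kubernetes Secret; the secret name is an assumption, and the variable names match the table above. Prefer IRSA or workload identity where available.

apiVersion: v1
kind: Secret
metadata:
  name: s3-model-credentials   # assumed name
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<access-key-id>"
  AWS_SECRET_ACCESS_KEY: "<secret-access-key>"
  AWS_DEFAULT_REGION: "us-east-1"

Then reference the secret from the worker container, for example with envFrom and a secretRef (the same pattern the download Job uses in Step 2), so the storage SDK inside the worker picks the values up.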

Relationship To Shadow Engine Failover

ModelExpress and ModelStreamer are model loading and distribution paths. They are not required for Shadow Engine Failover, and enabling them does not create standby engines.

Use Shadow Engine Failover only when you specifically need an active/shadow recovery topology backed by GPU Memory Service (GMS), DRA, and a backend load format such as --load-format gms. Keep the ModelExpress / ModelStreamer configuration separate unless you have validated a combined workflow for your runtime image and cluster.

When to Use ModelExpress

| Scenario | Recommended approach |
| --- | --- |
| Small cluster, simple setup | PVC + Download Job |
| Large cluster, many nodes | ModelExpress P2P |
| Models already on shared storage (NFS) | PVC |
| Models in S3, GCS, Azure Blob Storage, or local safetensors paths | ModelExpress + ModelStreamer |
| Frequent model updates across fleet | ModelExpress P2P, optionally seeded by ModelStreamer |

See Also