
Copyright © 2026, NVIDIA Corporation.


ModelExpress

Speed up model weight distribution across Kubernetes workers

ModelExpress is a model weight distribution service for faster worker startup in larger Dynamo clusters. Instead of every worker downloading the full model from storage, one worker can publish model weight availability and later workers can pull compatible tensors from that source over NIXL/RDMA. ModelExpress can also pair with ModelStreamer to stream safetensors directly from object storage into GPU memory.

Use ModelExpress when model rollout time, autoscale cold start, or fleet-wide model updates matter more than the simplicity of a shared PVC. For smaller clusters, start with Model Caching.

When to Use It

| Scenario | Recommended path |
|---|---|
| Small cluster or first deployment | Model Caching with PVC + download Job |
| Large cluster with many replicas | ModelExpress P2P distribution |
| Models already on shared storage | PVC or shared filesystem path |
| Models in S3, GCS, Azure Blob Storage, or local safetensors paths | ModelExpress + ModelStreamer |
| Frequent model updates across a fleet | ModelExpress P2P, optionally seeded by ModelStreamer |
| ModelExpress server has non-shared storage | ModelExpress with MODEL_EXPRESS_NO_SHARED_STORAGE=1 |

How It Works

  1. A ModelExpress server runs in the cluster and stores metadata for available model sources.
  2. vLLM workers use the ModelExpress loader (--load-format mx on newer images, or mx-source / mx-target on older split-loader images).
  3. If a compatible source worker is already serving the model, a new worker pulls model tensors from that source over NIXL/RDMA.
  4. If no source is available, the worker falls back to storage. With ModelStreamer, the first worker can stream safetensors from s3://, gs://, az://, or a local path.
  5. The Kubernetes operator can inject MODEL_EXPRESS_URL into all Dynamo pods from the platform modelExpressURL setting.
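
The source-first, storage-fallback order in steps 3 and 4 can be sketched as a small decision function. This is an illustrative sketch only, not ModelExpress code: `SourceWorker`, `choose_load_path`, and the returned labels are hypothetical names, and "compatible" is reduced here to a plain model-name match, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SourceWorker:
    """Hypothetical record of a worker that already serves a model."""
    worker_id: str
    model: str


def choose_load_path(model: str,
                     sources: Sequence[SourceWorker],
                     model_uri: Optional[str] = None) -> str:
    """Return a label describing where a new worker would load weights from."""
    # Step 3: prefer a source worker that already serves the same model.
    # Name equality is a stand-in (assumption) for the real compatibility check.
    for source in sources:
        if source.model == model:
            return f"p2p:{source.worker_id}"
    # Step 4: no source is available, so fall back to storage.
    # With a model URI set, the first worker would stream from that location.
    if model_uri:
        return f"stream:{model_uri}"
    return "storage"
```

Called with an empty source list and a URI such as `s3://bucket/path/to/model`, the function returns the streaming label; once a worker serving the same model appears in `sources`, later calls return the P2P label instead.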

Configure the Platform

Set the ModelExpress server URL when installing the Dynamo platform:

    helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
      --namespace ${NAMESPACE} \
      --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"

If the ModelExpress server is installed separately, point dynamo-operator.modelExpressURL at that service. The operator injects the value into worker pods as MODEL_EXPRESS_URL.
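
The URL in the install command follows the usual in-cluster Service DNS form, `<service>.<namespace>.svc.cluster.local`. When the server lives under a different service name or namespace, a small helper like the one below can assemble the value to pass to `dynamo-operator.modelExpressURL`. The function name `model_express_url` is hypothetical, and the sketch assumes the default `cluster.local` cluster domain and plain HTTP on port 8080, as in the example above.

```python
def model_express_url(service: str, namespace: str, port: int = 8080) -> str:
    """Build an in-cluster URL for a ModelExpress server Service.

    Assumes the default cluster domain (cluster.local); adjust it if your
    cluster is configured with a different one.
    """
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"


def helm_set_flag(url: str) -> str:
    """Format the --set argument used when installing the Dynamo platform."""
    return f"dynamo-operator.modelExpressURL={url}"
```

With `service="model-express-server"` and `namespace="model-express"`, the helper reproduces the URL shown in the install command.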

Configure vLLM Workers

Use a runtime image that includes the modelexpress Python package. For ModelStreamer, the image also needs runai-model-streamer and the relevant object-storage SDK dependencies.

    services:
      VllmWorker:
        extraPodSpec:
          mainContainer:
            image: <vllm-runtime-image-with-modelexpress>
            command: ["python3", "-m", "dynamo.vllm"]
            args:
              - --model
              - meta-llama/Llama-3.1-70B-Instruct
              - --load-format
              - mx
            env:
              - name: VLLM_PLUGINS
                value: modelexpress

Use the load format that your runtime image supports. ModelExpress v0.3 and newer document the unified mx loader; some older Dynamo images expose the split mx-source and mx-target loader names instead.

Stream Without Shared Storage

If the ModelExpress server cache is on a non-shared volume, workers cannot read the server’s local cache path. Set MODEL_EXPRESS_NO_SHARED_STORAGE=1 on worker pods so the client streams model files from the server over gRPC:

    services:
      VllmWorker:
        extraPodSpec:
          mainContainer:
            env:
              - name: VLLM_PLUGINS
                value: modelexpress
              - name: MODEL_EXPRESS_NO_SHARED_STORAGE
                value: "1"

Use this path when the server uses a ReadWriteOnce (RWO) PVC, when it runs in a different namespace from the workers, or when the cluster has no RDMA fabric. Shared-filesystem mode is still faster when it is available.
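
The three conditions above can be captured in a helper that decides whether a worker's env list needs the extra variable. This is a hedged sketch for generating manifests, not part of ModelExpress: the function `worker_env` and its boolean parameters are hypothetical, and only the two environment variable names come from this page.

```python
from typing import Dict, List


def worker_env(server_pvc_is_rwo: bool,
               same_namespace: bool,
               rdma_available: bool) -> List[Dict[str, str]]:
    """Build the env entries for a vLLM worker container."""
    env = [{"name": "VLLM_PLUGINS", "value": "modelexpress"}]
    # Any one of the three conditions is enough to switch the client to
    # streaming model files from the server instead of reading a shared path.
    if server_pvc_is_rwo or not same_namespace or not rdma_available:
        # Kubernetes env values are strings, so the flag is "1", not 1.
        env.append({"name": "MODEL_EXPRESS_NO_SHARED_STORAGE", "value": "1"})
    return env
```

The returned list has the same shape as the `env` block in the manifest above, so it can be serialized straight into the `mainContainer` spec.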

Stream From Object Storage

Set MX_MODEL_URI when the first worker should stream safetensors directly from object storage or from a locally mounted path:

    services:
      VllmWorker:
        extraPodSpec:
          mainContainer:
            image: <vllm-runtime-image-with-modelexpress-and-modelstreamer>
            command: ["python3", "-m", "dynamo.vllm"]
            args:
              - --model
              - meta-llama/Llama-3.1-70B-Instruct
              - --load-format
              - mx
            env:
              - name: VLLM_PLUGINS
                value: modelexpress
              - name: MX_MODEL_URI
                value: s3://my-model-bucket/meta-llama/Llama-3.1-70B-Instruct
              - name: RUNAI_STREAMER_CONCURRENCY
                value: "8"

| Storage backend | MX_MODEL_URI example | Credential options |
|---|---|---|
| S3 or S3-compatible storage | s3://bucket/path/to/model | IRSA / workload identity, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_DEFAULT_REGION, optional AWS_ENDPOINT_URL |
| Google Cloud Storage | gs://bucket/path/to/model | GKE Workload Identity, Application Default Credentials, or GOOGLE_APPLICATION_CREDENTIALS |
| Azure Blob Storage | az://container/path/to/model | Managed Identity, service principal env vars, or AZURE_ACCOUNT_NAME / AZURE_ACCOUNT_KEY |
| Local filesystem or PVC | /models/meta-llama/Llama-3.1-70B-Instruct | Mount the path into the worker pod |

Credentials are consumed by the storage SDKs in the worker pod. They do not flow through the ModelExpress server.
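
Because the credential options depend on the URI scheme, it can help to check an MX_MODEL_URI value before rolling out a manifest. The sketch below maps a URI to the backend rows of the table using only the Python standard library. The function `classify_model_uri` is a hypothetical pre-flight helper, not a ModelExpress or ModelStreamer API, and treating any absolute path as a local mount is an assumption.

```python
from urllib.parse import urlsplit

# URI schemes listed on this page, mapped to the backend names in the table.
BACKENDS = {
    "s3": "S3 or S3-compatible storage",
    "gs": "Google Cloud Storage",
    "az": "Azure Blob Storage",
}


def classify_model_uri(uri: str) -> str:
    """Return the storage backend implied by an MX_MODEL_URI value."""
    scheme = urlsplit(uri).scheme
    if scheme in BACKENDS:
        return BACKENDS[scheme]
    # No scheme and an absolute path: treat it as a path mounted in the pod.
    if scheme == "" and uri.startswith("/"):
        return "Local filesystem or PVC"
    raise ValueError(f"unsupported MX_MODEL_URI: {uri!r}")
```

A check like this only validates the shape of the URI. It does not confirm that the bucket exists or that the worker pod has credentials for it.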

See Also

  • Model Caching - simple PVC-based model caching and the longer ModelExpress background.
  • ModelExpress deployment guide - server, P2P, and ModelStreamer configuration.
  • Installation Guide - Dynamo platform install options, including modelExpressURL.