⚠️ Experimental Feature: Shadow Engine Failover is an opt-in preview feature. It depends on GPU Memory Service (GMS), Dynamic Resource Allocation (DRA), and backend-specific support. Its API shape and behavior may change, and the failover state machine is still settling. Use it only for non-production evaluation unless you have validated the exact backend, topology, and failure mode in your cluster.
Use Shadow Engine Failover when you want a standby engine to take over after an unknown backend-engine or software-process failure while the GPU and node remain healthy. The goal is to avoid the cost of a full model weight reload after a same-node process failure.
Shadow Engine Failover is the Kubernetes workflow. GPU Memory Service is the enabling mechanism underneath it: GMS owns the GPU-resident model weights, and the active and standby engines attach to those weights through DRA.
This is separate from Dynamo Snapshot. Snapshot captures and
restores a process image with CRIU and cuda-checkpoint. Shadow Engine Failover
keeps model weights resident in GPU memory so a standby or replacement engine
can attach after selected process-level failures. They both target recovery
latency, but they solve different problems and are not interchangeable.
The following diagram illustrates same-node process-level recovery:
How it works:
GMS moves ownership of GPU-resident model weights out of the engine process and into a separate GPU memory service. In the failover workflow, this lets the active and standby engines share the same weight memory boundary instead of loading independent copies.
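The ownership split described above can be pictured with a small toy model. This is only an illustrative sketch in plain Python, not the GMS API: `WeightService`, `Engine`, `get_or_load`, and `attach` are hypothetical names, and a Python list stands in for GPU-resident tensors. It shows the one property that matters here: two engines attaching to service-owned weights trigger a single load and see the same memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WeightService:
    """Hypothetical stand-in for GMS: owns the weights and counts loads."""
    load_count: int = 0
    weights_by_model: Dict[str, List[float]] = field(default_factory=dict)

    def get_or_load(self, model: str) -> List[float]:
        # Load only on the first request; later callers get the same object.
        if model not in self.weights_by_model:
            self.load_count += 1
            self.weights_by_model[model] = [0.0] * 4  # placeholder for GPU tensors
        return self.weights_by_model[model]


@dataclass
class Engine:
    """Hypothetical engine that attaches to weights instead of owning them."""
    name: str
    weights: Optional[List[float]] = None

    def attach(self, service: WeightService, model: str) -> None:
        # Attach to the service-owned weights rather than loading a private copy.
        self.weights = service.get_or_load(model)


service = WeightService()
active, standby = Engine("active"), Engine("standby")
active.attach(service, "demo-model")
standby.attach(service, "demo-model")
```

After both calls, `service.load_count` is 1 and `active.weights is standby.weights` holds, which mirrors the shared weight memory boundary in spirit. The real mechanism works across processes and GPUs, which this in-process sketch does not attempt to model.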
Direct GMS enablement is useful for backend integration testing and
sleep/wake-style lifecycle experiments. By itself, it does not configure
active/passive failover; use the failover field for the shadow engine flow.
DeviceClass, defaulting to gpu.nvidia.com.
--load-format gms.
experimental.

For v1alpha1 DynamoGraphDeployment, GMS and failover are service-level
fields:
For v1beta1, preview fields are grouped under experimental to make the
stability contract explicit:
See the API reference for the exact schema supported by your CRD version.
Failover builds on GMS. In intra-pod mode, the operator clones the worker’s main container into active and standby engine containers that share GPUs through DRA and the GMS sidecar. The standby engine takes over when the active engine fails.
See the vLLM failover example for the full manifest.
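As a rough mental model of the takeover described above, the following toy simulation tracks which engine is serving and how many times weights were loaded. It is a sketch under stated assumptions, not the operator's state machine: `WeightStore`, `EnginePair`, `fail`, and `ensure_loaded` are hypothetical names, and the real failover logic is still settling.

```python
class WeightStore:
    """Hypothetical stand-in for weights held outside the engine processes."""

    def __init__(self) -> None:
        self.loads = 0
        self.resident = False

    def ensure_loaded(self) -> None:
        # Pay the load cost only if the weights are not already resident.
        if not self.resident:
            self.loads += 1
            self.resident = True


class EnginePair:
    """Hypothetical active/standby pair sharing one WeightStore."""

    def __init__(self, store: WeightStore) -> None:
        self.store = store
        self.store.ensure_loaded()  # initial load happens once, up front
        self.alive = {"active": True, "standby": True}
        self.serving = "active"

    def fail(self, engine: str) -> None:
        # Simulate a process-level failure; the GPU and node stay healthy.
        self.alive[engine] = False
        if engine == self.serving:
            self._promote()

    def _promote(self) -> None:
        # Hand serving to the first surviving engine, if any.
        for name, ok in self.alive.items():
            if ok:
                self.store.ensure_loaded()  # no-op: weights outlived the process
                self.serving = name
                return
        self.serving = None


pair = EnginePair(WeightStore())
pair.fail("active")
```

After the simulated failure, `pair.serving` is `"standby"` and `pair.store.loads` is still 1, which is the point of the design: the weights outlive the failed engine process, so takeover does not repeat the load.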
The worker must request GPUs through the normal Dynamo service resources, enable
gpuMemoryService, and run a backend command that can load from GMS.
Working GMS-only examples: