Snapshot
⚠️ Experimental Feature: Dynamo Snapshot is currently in preview and may not work in all Kubernetes cluster setups. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.
Dynamo Snapshot is an experimental infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore In Userspace) and NVIDIA's cuda-checkpoint utility. Dynamo Snapshot dramatically reduces cold-start times for large models, from minutes to seconds, by capturing initialized application state and restoring it on demand.
⚠️ Restore time may vary depending on cluster configuration (storage bandwidth, GPU model, etc.)
Prerequisites
- Dynamo Platform/Operator installed on a k8s cluster with x86_64 (amd64) GPU nodes
- NVIDIA driver 580.xx or newer on the target GPU nodes
- ReadWriteMany storage if you need cross-node restore
- vLLM or SGLang backend (TensorRT-LLM is not supported yet)
- Cluster permissions and security policy that allow running a privileged DaemonSet
Quick Start
This guide assumes a normal Dynamo deployment workflow is already present on your Kubernetes cluster.
1. Build and push a placeholder image
Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with the restore tooling. If you do not already have one, build it with the snapshot placeholder target and push it to a registry your cluster can pull from:
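A sketch of that flow; the Make target name, registry, and tag below are assumptions, so confirm the real target against deploy/snapshot/Makefile:

```shell
# Build the placeholder image via the snapshot placeholder target.
# Target name, registry, and tag are illustrative placeholders.
cd deploy/snapshot
make snapshot-placeholder IMAGE=registry.example.com/dynamo/vllm-snapshot-placeholder:v0.1.0

# Push to a registry your cluster can pull from.
docker push registry.example.com/dynamo/vllm-snapshot-placeholder:v0.1.0
```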
This flow is defined in deploy/snapshot/Makefile and deploy/snapshot/Dockerfile. The placeholder image preserves the base runtime entrypoint and command contract, and adds the CRIU, cuda-checkpoint, and nsrestore tooling needed for restore.
2. Enable checkpointing in the platform and verify it
Whether you are installing or upgrading dynamo-platform, the operator must have checkpointing enabled and must point at the same storage that the snapshot chart will use:
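For example, a sketch of the upgrade; the value paths below are assumptions, so confirm the real keys against the operator values file referenced at the end of this step:

```shell
# Value paths are illustrative; see
# deploy/helm/charts/platform/components/operator/values.yaml for the real keys.
helm upgrade --install dynamo-platform ./deploy/helm/charts/platform \
  --namespace dynamo-system \
  --set operator.checkpoint.enabled=true \
  --set operator.checkpoint.pvc.name=snapshot-pvc \
  --set operator.checkpoint.basePath=/checkpoints
```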
If the platform is already installed, verify that the operator config contains the checkpoint block:
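One way to inspect it, assuming the operator renders its config into a ConfigMap in the platform namespace (names are placeholders):

```shell
# Dump operator ConfigMaps and look for the checkpoint block.
kubectl get configmap -n dynamo-system -o yaml | grep -B2 -A6 "checkpoint"
```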
Verify that the rendered config includes enabled: true and the same PVC name and base path you plan to use for the snapshot chart.
For the full platform/operator configuration surface, see deploy/helm/charts/platform/README.md and deploy/helm/charts/platform/components/operator/values.yaml.
3. Install the snapshot chart
Cross-node restore requires ReadWriteMany storage. The chart defaults to that mode.
For better restore times, use a fast ReadWriteMany StorageClass for the checkpoint PVC. If you are reusing an existing checkpoint PVC, do not set storage.pvc.create=true; install the chart with storage.pvc.create=false and point storage.pvc.name at the existing PVC instead.
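A sketch of an install that reuses an existing PVC; the release and namespace names are placeholders, while `storage.pvc.create` and `storage.pvc.name` are the chart values described above:

```shell
helm upgrade --install dynamo-snapshot ./deploy/helm/charts/snapshot \
  --namespace dynamo-system \
  --set storage.pvc.create=false \
  --set storage.pvc.name=snapshot-pvc
```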
Verify that the PVC and DaemonSet are ready:
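For example (namespace and PVC name are placeholders):

```shell
# The PVC should be Bound and the DaemonSet should report Ready pods on every GPU node.
kubectl get pvc snapshot-pvc -n dynamo-system
kubectl get daemonset -n dynamo-system
```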
For the full snapshot chart configuration surface, see deploy/helm/charts/snapshot/README.md and deploy/helm/charts/snapshot/values.yaml.
4. Apply a snapshot-compatible DynamoGraphDeployment
This example is adapted from examples/backends/vllm/deploy/agg.yaml. The worker must use the placeholder image from step 1, and the checkpoint identity must describe the runtime state you want to reuse.
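A trimmed sketch of such a manifest; the `checkpoint` block's field placement and the image tag are assumptions, so confirm the shape against examples/backends/vllm/deploy/agg.yaml and the CRD reference:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-agg
spec:
  services:
    VllmDecodeWorker:
      replicas: 1
      checkpoint:
        mode: Auto            # field placement is an assumption
      extraPodSpec:
        mainContainer:
          # Placeholder image from step 1 (tag is illustrative)
          image: registry.example.com/dynamo/vllm-snapshot-placeholder:v0.1.0
```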
For SGLang, use dynamo.sglang, an SGLang placeholder image, backendFramework: sglang, and the matching CLI flags.
Apply the manifest:
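```shell
# Filename is whatever you saved the manifest as.
kubectl apply -f agg-snapshot.yaml
```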
On the first rollout, the worker cold-starts, the operator creates a DynamoCheckpoint, and the checkpoint Job writes data into snapshot-pvc.
5. Wait for the checkpoint to become ready
Capture the checkpoint name from DGD status, then wait for the DynamoCheckpoint phase to become Ready:
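A sketch of both steps; the deployment and service names are placeholders, and the jsonpath follows the status layout described below:

```shell
# Read the checkpoint name from DGD status.
CKPT=$(kubectl get dynamographdeployment vllm-agg \
  -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}')

# Block until the DynamoCheckpoint phase becomes Ready.
kubectl wait dynamocheckpoint/"$CKPT" \
  --for=jsonpath='{.status.phase}'=Ready --timeout=30m
```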
The DGD status also reports the computed checkpoint hash at .status.checkpoints.VllmDecodeWorker.identityHash.
6. Trigger restore
Once the checkpoint is ready, scale the worker replicas from 1 to 2:
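For example (the deployment name is a placeholder):

```shell
# Scale the worker from 1 to 2; the new pod restores from the ready checkpoint.
kubectl patch dynamographdeployment vllm-agg --type merge \
  -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'
```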
New worker pods for VllmDecodeWorker will restore from the ready checkpoint automatically.
Checkpoint Configuration
Auto Mode (Recommended)
The operator computes the checkpoint identity hash, looks for an existing DynamoCheckpoint with a matching nvidia.com/snapshot-checkpoint-hash label, and creates one if it does not find one:
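A minimal sketch of the service-level setting (the field placement is an assumption):

```yaml
spec:
  services:
    VllmDecodeWorker:
      checkpoint:
        mode: Auto
```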
When a service uses checkpointing, DGD status reports the resolved checkpointName, identityHash, and ready fields under .status.checkpoints.<service-name>.
Manual Management and checkpointRef
Use checkpointRef when you want a service to restore from a specific DynamoCheckpoint CR:
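For example (the CR name and field placement are illustrative):

```yaml
spec:
  services:
    VllmDecodeWorker:
      checkpoint:
        checkpointRef: my-prewarmed-checkpoint  # DynamoCheckpoint CR name
```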
This is useful when:
- You want to pre-warm checkpoints before creating DGDs
- You want explicit control over which checkpoint to use
checkpointRef resolves by DynamoCheckpoint.metadata.name, not by status.identityHash. A manual checkpoint can use any valid Kubernetes resource name.
If you are managing checkpoint CRs yourself, set mode: Manual on the service to prevent the operator from creating a new DynamoCheckpoint when identity-based lookup does not find one.
If you want mode: Auto DGDs to discover a manually created checkpoint by identity, add the label nvidia.com/snapshot-checkpoint-hash=<identity-hash> to that DynamoCheckpoint. Auto-created checkpoints already use that label, and currently use the same hash as the CR name.
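For example (the CR name is a placeholder; `<identity-hash>` is the 16-character hash reported in DGD status):

```shell
kubectl label dynamocheckpoint my-prewarmed-checkpoint \
  nvidia.com/snapshot-checkpoint-hash=<identity-hash>
```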
Checkpoint Identity
Checkpoints are uniquely identified by a 16-character SHA256 hash (64 bits) of configuration that affects runtime state:
Not included in the hash (these do not invalidate a checkpoint):
- replicas
- nodeSelector, affinity, tolerations
- resources (requests/limits)
- Logging/observability config
Example with all fields:
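An illustrative identity block follows; the exact field set that feeds the hash is an assumption here, so treat this as a shape rather than the authoritative list:

```yaml
identity:
  image: registry.example.com/dynamo/vllm-snapshot-placeholder:v0.1.0  # placeholder
  backendFramework: vllm
  command: ["python3", "-m", "dynamo.vllm"]   # illustrative module name
  args: ["--model", "<model-name>"]
```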
DynamoCheckpoint CRD
The DynamoCheckpoint (shortname: dckpt) is a Kubernetes Custom Resource that manages checkpoint lifecycle.
When to create a DynamoCheckpoint directly:
- Pre-warming: Create checkpoints before deploying DGDs for instant startup
- Explicit control: Manage checkpoint lifecycle independently from DGDs
The operator requires spec.identity and spec.job.podTemplateSpec. The pod template should match the worker container you want checkpointed, including image, command, args, secrets, volumes, and resource limits. You do not need to set the checkpoint environment variables manually; the operator injects them for checkpoint jobs and restored pods.
Create a checkpoint:
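A minimal sketch; `spec.identity` and `spec.job.podTemplateSpec` are required per the text above, but the exact subfields, names, and image tag below are assumptions:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: my-prewarmed-checkpoint
spec:
  identity:
    image: registry.example.com/dynamo/vllm-snapshot-placeholder:v0.1.0
    backendFramework: vllm
  job:
    podTemplateSpec:
      spec:
        containers:
          - name: main
            image: registry.example.com/dynamo/vllm-snapshot-placeholder:v0.1.0
            resources:
              limits:
                nvidia.com/gpu: 1
```

Apply it with `kubectl apply -f`; the operator injects the checkpoint environment variables itself.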
You can name the CR however you want if you plan to use checkpointRef. If you want mode: Auto identity lookup to find a manual CR, set the nvidia.com/snapshot-checkpoint-hash label to the computed 16-character identity hash. Using the hash as the CR name is a convenient convention, but it is not required.
Check status:
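```shell
# dckpt is the short name for DynamoCheckpoint; CR name is a placeholder.
kubectl get dckpt my-prewarmed-checkpoint
```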
Phases:
Ready is a value in status.phase, not a Kubernetes condition. The conditions array tracks job lifecycle events:
Other useful status fields are:
Detailed status:
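```shell
kubectl describe dckpt my-prewarmed-checkpoint
# Or dump the full status object:
kubectl get dckpt my-prewarmed-checkpoint -o yaml
```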
Reference from DGD:
Once the checkpoint is Ready, you can reference it by CR name:
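For example (field placement is an assumption, as above):

```yaml
spec:
  services:
    VllmDecodeWorker:
      checkpoint:
        checkpointRef: my-prewarmed-checkpoint
```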
Or use mode: Auto with the same identity and snapshot-hash label, and the operator will reuse it automatically.
Limitations
- LLM workers only: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
- Single-GPU only: Multi-GPU setups may work in very basic hardware configurations, but are not officially supported yet.
- Network state: Active TCP connections cannot be checkpointed.
- Security: Dynamo Snapshot runs as a privileged DaemonSet which is required to run CRIU and cuda-checkpoint. However, workload pods do not need to be privileged.
Troubleshooting
Checkpoint Not Ready
- Check the checkpoint job:
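For example (namespace and job name are placeholders):

```shell
kubectl get jobs -n dynamo-system
kubectl logs job/<checkpoint-job-name> -n dynamo-system
```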
- Check the DaemonSet:
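For example (names are placeholders):

```shell
kubectl get daemonset -n dynamo-system
kubectl logs daemonset/<snapshot-daemonset-name> -n dynamo-system
```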
- Verify that platform and chart storage settings match:
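One way to compare them, assuming both were installed via Helm (release names are placeholders):

```shell
# The operator's checkpoint PVC and base path must match the snapshot chart's storage values.
helm get values dynamo-platform -n dynamo-system | grep -A5 checkpoint
helm get values dynamo-snapshot -n dynamo-system | grep -A5 storage
```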
Restore Failing
- Check pod logs:
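```shell
kubectl logs <worker-pod-name> -n dynamo-system
```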
- Describe the restore target pod:
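```shell
kubectl describe pod <worker-pod-name> -n dynamo-system
```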
- Confirm the referenced checkpoint is still Ready:
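```shell
kubectl get dckpt <checkpoint-name> -o jsonpath='{.status.phase}'
```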
Planned Features
- TensorRT-LLM backend support
- S3/MinIO storage backend
- OCI registry storage backend
- Multi-GPU checkpoints
Related Documentation
- Dynamo Snapshot Helm Chart README - Chart configuration
- Installation Guide - Platform installation
- API Reference - Complete CRD specifications