⚠️ Experimental Feature: Dynamo Snapshot is currently in preview and may work only in some cluster setups. The
snapshot-agent DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.
Dynamo Snapshot is infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in Userspace) and NVIDIA’s cuda-checkpoint utility. The usual flow is:
⚠️ Restore time depends on storage bandwidth, GPU model, and whether the restore stays on the same node.
For more background on the snapshot architecture and startup improvements, see NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes.
amd64) GPU nodes
ReadWriteMany is the safest default for cross-node or concurrent multi-node access, but podMount mode can also use suitable ReadWriteOnce storage for sequential checkpoint/restore workflows.
runtime.type=crio on the snapshot chart (and openshift.enabled=true on OpenShift). Defaults are for containerd; see the chart README for sockets and Helm flags.
DynamoCheckpoint CR
DynamoCheckpoint and wait for it to become ready
DynamoGraphDeployment that restores from the corresponding checkpointRef
Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with restore tooling. If you do not already have one, build it and push it to a registry your cluster can pull from:
The placeholder image preserves the normal runtime entrypoint/command contract and adds the criu, cuda-checkpoint, and nsrestore tooling needed for checkpoint and restore.
To build either snapshot image against a custom CRIU fork or ref, pass
CRIU_REPO and CRIU_REF through make. If they are unset, the Dockerfile
defaults are used.
Whether you are installing or upgrading dynamo-platform, the operator only needs checkpointing enabled:
If the platform is already installed, verify that the operator config contains the checkpoint block:
Verify that the rendered config includes enabled: true.
For the default namespace-local mode, install the snapshot chart in each workload namespace. The chart creates the PVC and the agent in that namespace:
In the default agentMount mode, the snapshot-agent DaemonSet mounts the
checkpoint PVC directly. On a multi-node GPU cluster that means agent pods on
multiple nodes may mount the same PVC, so the PVC generally needs
ReadWriteMany. The chart defaults to that mode. If your cluster does not have
a default storage class, also set storage.pvc.storageClass.
If you are reusing an existing checkpoint PVC, do not set storage.pvc.create=true; install the chart with storage.pvc.create=false and set storage.pvc.name instead.
CRI-O or OpenShift: append, for example, --set runtime.type=crio and, on OpenShift, --set openshift.enabled=true (see deploy/helm/charts/snapshot/README.md).
For clusters that prefer one privileged snapshot agent instead of one DaemonSet per workload namespace, install the chart once in an infrastructure namespace. In this mode the chart does not create workload PVCs; the Dynamo operator either creates each namespace-local PVC or verifies that it already exists:
To let the operator create the workload PVC in each namespace that uses
checkpoint/restore, configure the operator with create: true:
The chart and operator use separate configuration surfaces here: the snapshot
chart PVC name is storage.pvc.name, while the operator config field is
checkpoint.storage.pvc.pvcName.
This is a key difference from agentMount: podMount removes the requirement
that the snapshot-agent DaemonSet mount the checkpoint PVC on every GPU node.
Only the active checkpoint/restore workload pod mounts the PVC, and the agent
reaches it through that pod’s mount namespace. ReadWriteMany remains the
safest operator-managed default, especially when multiple checkpoint/restore
pods may access the same PVC concurrently or when restore scheduling can span
nodes. Suitable ReadWriteOnce storage classes can still be used for
sequential podMount checkpoint/restore flows when the backend can attach the
volume to the node running the active workload pod.
podMount depends on the target container remaining alive while the agent
resolves /host/proc/<pid>/root/<basePath>. If the container exits or restarts
during checkpoint/restore setup, if the runtime cannot expose a stable host PID,
or if node security settings prevent host proc traversal, the agent fails or
skips that attempt and Kubernetes/operator reconciliation must try again after a
fresh container is available.
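The path resolution described above can be sketched as follows. This is a minimal illustration, not the agent's implementation: the function name, the `host_proc` default, and the assumption that `basePath` is an absolute path inside the container are all hypothetical.

```python
from pathlib import PurePosixPath


def pod_mount_checkpoint_path(host_pid: int, base_path: str,
                              host_proc: str = "/host/proc") -> str:
    """Compose /host/proc/<pid>/root/<basePath> for a live container PID.

    Hypothetical helper: it only builds the path string. A real agent also
    has to handle the container exiting while the path is in use.
    """
    if host_pid <= 0:
        # Mirrors the failure case where no stable host PID is available.
        raise ValueError("a stable host PID is required")
    # Drop the leading slash so base_path is appended under root/ instead of
    # replacing the whole path.
    relative = base_path.lstrip("/")
    return str(PurePosixPath(host_proc) / str(host_pid) / "root" / relative)


print(pod_mount_checkpoint_path(4242, "/checkpoints"))
# prints /host/proc/4242/root/checkpoints
```

Because the path is rooted at a PID, it stops being valid as soon as that container process exits, which is why a restart during setup forces a retry against a fresh container.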
To use an already-present PVC instead, omit create or set it to false. The
operator will fail reconciliation with a clear error if the named PVC does not
exist in the workload namespace.
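The two operator behaviors (create the PVC, or only verify it) can be modeled with a small sketch. It is a simplified stand-in for the reconciliation logic, assuming a hypothetical in-memory set of PVC names in place of the Kubernetes API; the function name and return values are illustrative only.

```python
def reconcile_checkpoint_pvc(existing_pvcs: set, pvc_name: str,
                             create: bool = False) -> str:
    """Model the operator's create-or-verify decision for a workload PVC.

    existing_pvcs stands in for the PVCs present in the workload namespace.
    """
    if pvc_name in existing_pvcs:
        return "exists"
    if create:
        # create: true -> the operator provisions the namespace-local PVC.
        existing_pvcs.add(pvc_name)
        return "created"
    # create omitted or false -> a missing PVC fails reconciliation.
    raise LookupError(
        f"PVC {pvc_name!r} not found in workload namespace and create is false"
    )


namespace_pvcs = {"checkpoint-pvc"}
print(reconcile_checkpoint_pvc(namespace_pvcs, "checkpoint-pvc"))
print(reconcile_checkpoint_pvc(namespace_pvcs, "other-pvc", create=True))
```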
Verify that the DaemonSet is ready. After a checkpoint or restore workload is reconciled, verify the workload namespace PVC:
DynamoCheckpoint
The checkpoint Job pod template should match the worker container you want to checkpoint. For a standalone checkpoint, the important parts are the legacy spec.identity metadata, a container named main, and the placeholder image; the rest of the pod template should mirror your normal worker config. Extra containers are allowed, but only main is checkpointed unless spec.job.targetContainerName selects another container.
GMS + Snapshot support is currently disabled.
For a full working example, see deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml.
Apply it:
The useful status fields are:
status.phase: high-level lifecycle (Pending, Creating, Ready, Failed)
status.checkpointID: artifact ID used by the snapshot protocol
status.identityHash: deprecated compatibility alias for status.checkpointID
status.jobName: checkpoint Job name
status.createdAt: timestamp recorded when the checkpoint became ready
status.message: progress or failure detail when available
DynamoGraphDeployment that restores from checkpointRef
Once the checkpoint is Ready, restore a worker from it explicitly:
Apply it:
The VllmDecodeWorker pod should restore from the ready checkpoint instead of creating a new one.
checkpointRef is the most explicit path. If you set it, the DGD uses that
existing DynamoCheckpoint and does not create a new automatic checkpoint for
the component. This is the escape hatch for users who intentionally want to
reuse a pre-existing checkpoint and accept the compatibility risk.
Without checkpointRef, mode: Auto is the DGD-managed path: for each
checkpoint-enabled worker generation, the DGD controller creates a DGD-scoped
DynamoCheckpoint and the checkpoint controller starts a checkpoint Job.
Automatic DGD checkpoints are not reused across DGDs, even when two manifests
are identical.
The automatic checkpoint ID is derived from the DGD namespace/name/UID, component name, and active worker hash. The DGD UID prevents cross-DGD reuse; the worker hash keeps a scale down/up on the same worker generation using the same DGD-scoped checkpoint while creating a new checkpoint for a new worker generation.
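The scoping rules above can be illustrated by treating the inputs to the automatic checkpoint ID as a key. This sketch does not reproduce the controller's actual derivation or encoding, which this guide does not describe; the `CheckpointScope` type and all sample values are hypothetical.

```python
from typing import NamedTuple


class CheckpointScope(NamedTuple):
    """Inputs that scope a DGD-managed automatic checkpoint (illustrative)."""
    namespace: str
    name: str
    uid: str
    component: str
    worker_hash: str


original = CheckpointScope("inference", "my-dgd", "uid-1",
                           "VllmDecodeWorker", "hash-a")

# Scale down/up on the same worker generation: every input is unchanged,
# so the same DGD-scoped checkpoint applies.
after_rescale = original._replace()

# An identical manifest applied as a new DGD gets a new UID: no reuse.
recreated_dgd = original._replace(uid="uid-2")

# A new worker generation changes the worker hash: a new checkpoint.
new_generation = original._replace(worker_hash="hash-b")

print(after_rescale == original)   # True
print(recreated_dgd == original)   # False
print(new_generation == original)  # False
```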
By default, startupPolicy: Immediate starts workers cold while the checkpoint job runs in the background. Once the checkpoint becomes Ready, only newly-created Pods restore from it. Existing Pods are not mutated or restarted just because the checkpoint became ready.
If you want workers to wait for the checkpoint before starting, set startupPolicy: WaitForCheckpoint. That policy keeps normal worker replicas at zero until the checkpoint is Ready, then starts workers from the checkpoint.
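The effect of the two startup policies on a newly created Pod can be summarized in a short decision function. This is a sketch of the behavior described above, not operator code; the function name and the returned labels are assumptions, and the text does not say what WaitForCheckpoint does when a checkpoint ends in Failed, so the sketch simply keeps waiting in that case.

```python
def new_pod_start_mode(startup_policy: str, checkpoint_phase: str) -> str:
    """Return how a newly created worker Pod would start (illustrative)."""
    if startup_policy not in ("Immediate", "WaitForCheckpoint"):
        raise ValueError(f"unknown startupPolicy: {startup_policy!r}")
    if checkpoint_phase == "Ready":
        # Under either policy, Pods created after readiness restore.
        return "restore"
    if startup_policy == "WaitForCheckpoint":
        # Worker replicas stay at zero until the checkpoint is Ready.
        return "wait"
    # Immediate: start cold while the checkpoint Job runs in the background.
    return "cold-start"


for policy in ("Immediate", "WaitForCheckpoint"):
    for phase in ("Creating", "Ready"):
        print(policy, phase, "->", new_pod_start_mode(policy, phase))
```

The function only covers Pods at creation time. Pods that cold-started before the checkpoint became Ready keep running unchanged.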
Inside a DynamoGraphDeployment, it looks like this:
The legacy checkpoint.identity field is ignored for DGD-managed automatic checkpoints. It is retained only for API compatibility and standalone DynamoCheckpoint workflows.
Useful inspection commands:
If you use the default Immediate policy and want to create restored pods after the checkpoint becomes ready, scale the worker:
Failover restore is not yet available. The current Snapshot flow does not support GMS + Snapshot, so do not use failover restore as a supported checkpoint/restore path. For current GMS and active/passive failover guidance, see Shadow Engine Failover.
snapshotctl
It is possible to checkpoint and restore pods without the Dynamo operator via the lower-level snapshotctl utility. However, the snapshot Helm chart must be installed, with a running snapshot-agent DaemonSet in the namespace that has the checkpoint PVC mounted.
snapshotctl is intended for lower-level debugging and validation workflows, not as the primary user-facing checkpoint interface. For command details and manifest requirements, see deploy/snapshot/cmd/snapshotctl/README.md.
The checkpoint manifest must be for a pod and use a placeholder image. --container names the workload container to checkpoint.
If you do not pass --checkpoint-id, snapshotctl generates one and prints it:
This creates a new restore pod and returns after the request is submitted. Observe progress through Kubernetes readiness, events, and logs.
This patches restore metadata onto an existing pod that is already snapshot-compatible and returns after the patch is accepted.
status.checkpointID is the artifact ID used by the snapshot protocol and the
directory name under checkpoint storage. For DGD-managed automatic checkpoints,
this ID is scoped to a single DGD/component worker generation. It is not a
compatibility claim across DGDs, and identical manifests are not treated as
proof that a checkpoint can be reused safely.
The legacy spec.identity shape is still required on standalone
DynamoCheckpoint objects and remains the fallback for explicit/manual
workflows. When a standalone checkpoint does not already have
status.checkpointID or the checkpoint-ID label, the operator computes the
legacy 16-character SHA256 hash (64 bits) from these fields:
Fields that do not change the legacy hash include:
nodeSelector, affinity, tolerations)
DGD-managed automatic checkpoints ignore this legacy identity as a reuse
boundary. The DGD controller creates its own DGD-scoped checkpoint ID and
synthesizes a legacy identity only because the v1alpha1 DynamoCheckpoint API
still requires the field.
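The "16-character SHA256 hash (64 bits)" arithmetic can be checked with a short sketch: each hex character carries 4 bits, so 16 characters hold 64 bits. The serialization used here (canonical JSON with sorted keys) and the field names are assumptions made for illustration; the operator's real input encoding and field set are not shown in this guide.

```python
import hashlib
import json


def legacy_identity_hash(identity: dict) -> str:
    """Truncate a SHA256 digest of the identity to 16 hex characters.

    Assumption: the identity is serialized as key-sorted compact JSON.
    """
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:16]  # 16 hex chars * 4 bits each = 64 bits


# Placeholder fields, not the real spec.identity schema.
example = {"exampleFieldA": "value-1", "exampleFieldB": "value-2"}
short_hash = legacy_identity_hash(example)
print(len(short_hash), int(short_hash, 16) < 2 ** 64)
```

Sorting keys before hashing keeps the result stable when the same fields arrive in a different order, which is a property any identity hash of this kind needs.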
DynamoCheckpoint CRD
The DynamoCheckpoint (shortname: dckpt) is the operator-managed resource for checkpoint lifecycle.
Use it when you want:
DynamoGraphDeployment exists
checkpointRef
The operator requires:
spec.identity
spec.job.podTemplateSpec
spec.job.backoffLimit is deprecated and ignored. Checkpoint Jobs are always single-attempt.
Check status with:
The status block looks like:
With startupPolicy: Immediate, only Pods created after a checkpoint is Ready are restore-shaped. Existing Pods cold-started before checkpoint readiness keep running as-is.
snapshot-agent must run privileged to execute CRIU and cuda-checkpoint. Workload pods do not need to be privileged.
Ready
Snapshot only becomes Ready after snapshot-agent confirms the checkpoint contents. A completed Job is not enough by itself.
If the worker template is wrong, the most common causes are using the raw runtime image instead of the placeholder image, or leaving out normal mounts and secrets that the worker needs to start.
For the default agentMount install, restore discovers checkpoint storage from
the snapshot-agent DaemonSet in the workload namespace. That DaemonSet must be
ready and must mount the checkpoint PVC.
For a shared-agent podMount install, the snapshot-agent DaemonSet can run in
the infrastructure namespace instead. Verify the shared-agent pods there, then
verify that the workload namespace has the checkpoint PVC that the operator
created or validated:
In podMount mode the agent reaches the checkpoint through the workload pod’s
mount namespace rather than by mounting the PVC itself. Check the workload pod’s
checkpoint storage annotations and the snapshot-agent logs to see the actual
resolved checkpoint path. snapshotctl uses the chart’s storage resolution
path, so for lower-level snapshotctl debugging make sure the snapshot chart
configuration matches the access mode you are testing.
snapshotctl manifest is rejected or the restore target is wrong
snapshotctl requires a Pod manifest and a target-container list. Multi-container manifests are supported as long as every name passed via --container or --containers exists in the pod spec.
If the manifest already carries snapshot target metadata, it must agree with the CLI flag; snapshotctl rejects mismatches instead of silently picking one.
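The validation rules above can be sketched as a pre-flight check. This is not snapshotctl's code: the function name is hypothetical, and `manifest_targets` is a stand-in for whatever snapshot target metadata the manifest carries, whose actual key and format are not shown here.

```python
from typing import List, Optional


def validate_targets(pod_manifest: dict, cli_containers: List[str],
                     manifest_targets: Optional[List[str]] = None) -> List[str]:
    """Check a manifest and target list the way the text describes."""
    if pod_manifest.get("kind") != "Pod":
        raise ValueError("manifest must be a Pod")
    if not cli_containers:
        raise ValueError("at least one target container is required")
    names = {c["name"] for c in pod_manifest.get("spec", {}).get("containers", [])}
    missing = [name for name in cli_containers if name not in names]
    if missing:
        raise ValueError(f"containers not in pod spec: {missing}")
    # Existing target metadata must agree with the CLI; never pick one silently.
    if manifest_targets is not None and sorted(manifest_targets) != sorted(cli_containers):
        raise ValueError("manifest target metadata disagrees with CLI flag")
    return list(cli_containers)


pod = {"kind": "Pod",
       "spec": {"containers": [{"name": "main"}, {"name": "sidecar"}]}}
print(validate_targets(pod, ["main"]))
```

Running a check like this before submitting a manifest separates a typo in a container name from a genuine restore failure.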