Snapshot

⚠️ Experimental Feature: Dynamo Snapshot is currently in preview and may work only in some cluster setups. The snapshot-agent DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.

Dynamo Snapshot is infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in Userspace) and NVIDIA’s cuda-checkpoint utility. The usual flow is:

start a worker once and checkpoint its initialized state
store that checkpoint on a namespace-local snapshot volume
restore later workers from that checkpoint instead of cold-starting again

Startup Type	Time	What Happens
Cold Start	~1 min	Download model, load to GPU, initialize engine
Warm Start (restore from checkpoint)	~10 sec	Restore from a ready checkpoint directory

⚠️ Restore time depends on storage bandwidth, GPU model, and whether the restore stays on the same node.

Prerequisites

x86_64 (amd64) GPU nodes
NVIDIA driver 580.xx or newer on the target GPU nodes (590.xx or newer if testing multi-GPU snapshots)
vLLM backend today, with limited preview support
ReadWriteMany storage for cross-node restore
CRI-O / OpenShift: set runtime.type=crio on the snapshot chart (and openshift.enabled=true on OpenShift). Defaults are for containerd; see the chart README for sockets and Helm flags.

Quick Start via `DynamoCheckpoint` CR

Build a placeholder image
Install the snapshot chart
Create a DynamoCheckpoint and wait for it to become ready
Deploy a DynamoGraphDeployment that restores from the corresponding checkpointRef

1. Build and push a placeholder image

Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with restore tooling. If you do not already have one, build it and push it to a registry your cluster can pull from:

$ export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
$ export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0
$ 
$ cd deploy/snapshot
$ 
$ make docker-build-placeholder \
>   PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
>   PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
$ 
$ make docker-push-placeholder \
>   PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"

The placeholder image preserves the normal runtime entrypoint/command contract and adds the criu, cuda-checkpoint, and nsrestore tooling needed for checkpoint and restore.

To build either snapshot image against a custom CRIU fork or ref, pass CRIU_REPO and CRIU_REF through make. If they are unset, the Dockerfile defaults are used.

$ make docker-build-agent \
>   IMG=registry.example.com/dynamo/snapshot-agent:1.0.0 \
>   CRIU_REPO="${YOUR_CRIU_REPO}" \
>   CRIU_REF="branch-or-sha"
$ 
$ make docker-build-placeholder \
>   PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
>   PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}" \
>   CRIU_REPO="${YOUR_CRIU_REPO}" \
>   CRIU_REF="branch-or-sha"

2. Enable checkpointing in the platform and verify it

Whether you are installing or upgrading dynamo-platform, the operator only needs checkpointing enabled:

dynamo-operator:
  checkpoint:
    enabled: true

If the platform is already installed, verify that the operator config contains the checkpoint block:

$ OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
>   -l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
>   -o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')
$ 
$ kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
>   -o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'

Verify that the rendered config includes enabled: true.
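The manual check above can also be scripted. The following is a minimal sketch, assuming the checkpoint block is a top-level key in config.yaml (as the sed expression above implies); the function name checkpoint_enabled is hypothetical and not part of Dynamo.

```shell
# Reads the operator config.yaml on stdin and succeeds only when the
# top-level "checkpoint:" block contains a line "enabled: true".
# Caveat: any indented "enabled: true" inside that block matches, including
# one nested under a sub-key.
checkpoint_enabled() {
  sed -n '/^checkpoint:/,/^[^[:space:]]/p' \
    | grep -E '^[[:space:]]+enabled:[[:space:]]*true[[:space:]]*$' >/dev/null
}

# Usage against a live cluster (OPERATOR_CONFIG resolved as shown earlier):
#   kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
#     -o jsonpath='{.data.config\.yaml}' | checkpoint_enabled && echo "checkpointing enabled"
```

The function returns a non-zero exit status when the block is missing or disabled, so it can gate later steps in a setup script.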

3. Install the snapshot chart in the workload namespace

$ helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
>   --namespace ${NAMESPACE} \
>   --create-namespace \
>   --set storage.pvc.create=true

Cross-node restore requires shared ReadWriteMany storage. The chart defaults to that mode. If your cluster does not have a default storage class, also set storage.pvc.storageClass.

If you are reusing an existing checkpoint PVC, do not set storage.pvc.create=true; install the chart with storage.pvc.create=false and set storage.pvc.name instead.
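As a sketch of the existing-PVC case, the install command can be wrapped as follows. The function name and the example PVC name are hypothetical; the chart path and the storage.pvc.* values are the ones named above.

```shell
# Installs or upgrades the snapshot chart against a pre-existing checkpoint PVC.
# $1 = workload namespace, $2 = name of the existing PVC.
install_snapshot_with_existing_pvc() {
  helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
    --namespace "$1" \
    --create-namespace \
    --set storage.pvc.create=false \
    --set storage.pvc.name="$2"
}

# Example (my-checkpoint-pvc is a placeholder name):
#   install_snapshot_with_existing_pvc "${NAMESPACE}" my-checkpoint-pvc
```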

CRI-O or OpenShift: append, for example, --set runtime.type=crio and, on OpenShift, --set openshift.enabled=true (see deploy/helm/charts/snapshot/README.md).

Verify that the PVC and DaemonSet are ready:

$ kubectl get pvc snapshot-pvc -n ${NAMESPACE}
$ kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
$ kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/component=snapshot-agent -o wide

4. Create a `DynamoCheckpoint`

The checkpoint Job pod template should match the worker container you want to checkpoint. For the snapshot flow, the important parts are the checkpoint identity, a container named main, and the placeholder image; the rest of the pod template should mirror your normal worker config. Extra containers are allowed, but only main is checkpointed.

apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: qwen3-06b-bf16
spec:
  identity:
    model: Qwen/Qwen3-0.6B
    backendFramework: vllm
    tensorParallelSize: 1
    dtype: bfloat16
    maxModelLen: 2048

  job:
    activeDeadlineSeconds: 3600
    podTemplateSpec:
      spec:
        ...
        containers:
          - name: main
            image: registry.example.com/dynamo/vllm-placeholder:1.0.0
            ...

Leave spec.gpuMemoryService.enabled unset or false. Snapshot plus GPU Memory Service is not yet available, and admission rejects DynamoCheckpoint objects with spec.gpuMemoryService.enabled: true. See Shadow Engine Failover for the current GMS support status.

For a full working example, see deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml.
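Because admission can reject a DynamoCheckpoint, it may help to validate the manifest before creating it. The sketch below uses kubectl's server-side dry run. Whether the Dynamo admission checks take part in dry-run requests depends on how the webhook is registered, which is an assumption here; the function name is hypothetical.

```shell
# Sends the manifest through server-side validation without persisting it.
# $1 = manifest path, $2 = namespace.
dry_run_checkpoint() {
  kubectl apply --dry-run=server -f "$1" -n "$2"
}

# Example:
#   dry_run_checkpoint qwen3-checkpoint.yaml "${NAMESPACE}"
```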

Apply it:

$ kubectl apply -f qwen3-checkpoint.yaml -n ${NAMESPACE}

5. Wait for the checkpoint to become ready

$ kubectl get dckpt -n ${NAMESPACE} \
>   -o custom-columns=NAME:.metadata.name,HASH:.status.identityHash,PHASE:.status.phase
$ 
$ kubectl wait \
>   --for=jsonpath='{.status.phase}'=Ready \
>   dynamocheckpoint/qwen3-06b-bf16 \
>   -n ${NAMESPACE} \
>   --timeout=30m

The useful status fields are:

status.phase: high-level lifecycle (Pending, Creating, Ready, Failed)
status.identityHash: deterministic hash of spec.identity
status.jobName: checkpoint Job name
status.createdAt: timestamp recorded when the checkpoint became ready
status.message: progress or failure detail when available
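A kubectl wait on Ready keeps waiting until its timeout even if the checkpoint has already moved to Failed. The following sketch polls status.phase instead and stops early on failure. The function name, exit codes, and default intervals are illustrative choices, not part of Dynamo.

```shell
# Polls status.phase until Ready (exit 0), Failed (exit 1), or timeout (exit 2).
# $1 = checkpoint name, $2 = namespace,
# $3 = timeout in seconds (default 1800), $4 = poll interval in seconds (default 10).
# The body runs in a subshell, so "exit" only ends the function.
wait_for_checkpoint() (
  name=$1; ns=$2; timeout=${3:-1800}; interval=${4:-10}
  elapsed=0
  while :; do
    phase=$(kubectl get dynamocheckpoint "$name" -n "$ns" \
      -o jsonpath='{.status.phase}' 2>/dev/null)
    case "$phase" in
      Ready)
        echo "checkpoint $name is Ready"
        exit 0 ;;
      Failed)
        msg=$(kubectl get dynamocheckpoint "$name" -n "$ns" \
          -o jsonpath='{.status.message}' 2>/dev/null)
        echo "checkpoint $name Failed: $msg" >&2
        exit 1 ;;
    esac
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for $name (last phase: ${phase:-unknown})" >&2
      exit 2
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
)

# Example:
#   wait_for_checkpoint qwen3-06b-bf16 "${NAMESPACE}" 1800 10
```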

6. Deploy a `DynamoGraphDeployment` that restores from `checkpointRef`

Once the checkpoint is Ready, restore a worker from it explicitly:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-checkpointref-demo
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-runtime:1.0.0

    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      checkpoint:
        enabled: true
        checkpointRef: qwen3-06b-bf16
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
          ...
        ...

Apply it:

$ kubectl apply -f vllm-checkpointref-demo.yaml -n ${NAMESPACE}
$ kubectl get pods -n ${NAMESPACE} -w

The VllmDecodeWorker pod should restore from the ready checkpoint instead of creating a new one.

DGD Auto Flow

checkpointRef is the most explicit path. mode: Auto is the higher-level path: the operator computes the checkpoint identity hash and looks for a DynamoCheckpoint with the same identity. If one exists, Auto mode reuses it. If none exists yet, the first worker cold-starts and the operator creates the checkpoint in the background.

checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: Qwen/Qwen3-0.6B
    backendFramework: vllm
    tensorParallelSize: 1
    dtype: bfloat16
    maxModelLen: 2048

Inside a DynamoGraphDeployment, it looks like this:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-auto-demo
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-runtime:1.0.0

    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      checkpoint:
        enabled: true
        mode: Auto
        identity:
          model: Qwen/Qwen3-0.6B
          backendFramework: vllm
          tensorParallelSize: 1
          dtype: bfloat16
          maxModelLen: 2048
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
          ...
        ...

Auto mode only hashes checkpoint.identity. GMS-specific checkpoint behavior is not yet available.

Useful inspection commands:

$ kubectl get dgd vllm-auto-demo -n ${NAMESPACE} \
>   -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}{"\n"}{.status.checkpoints.VllmDecodeWorker.identityHash}{"\n"}{.status.checkpoints.VllmDecodeWorker.ready}{"\n"}'
$ 
$ kubectl get dckpt -n ${NAMESPACE}
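To see which DynamoCheckpoint, if any, carries a given identity hash, the listing can be filtered on status.identityHash. This is a client-side sketch of the kind of lookup Auto mode is described as performing, not the operator's implementation; the function name is hypothetical.

```shell
# Prints "<name> <phase>" for each DynamoCheckpoint in namespace $2 whose
# status.identityHash equals $1. Prints nothing when no checkpoint matches.
find_checkpoint_by_hash() {
  kubectl get dckpt -n "$2" --no-headers \
    -o custom-columns=NAME:.metadata.name,HASH:.status.identityHash,PHASE:.status.phase \
    | awk -v h="$1" '$2 == h { print $1 " " $3 }'
}

# Example, feeding in the hash reported on the DGD status:
#   HASH=$(kubectl get dgd vllm-auto-demo -n "${NAMESPACE}" \
#     -o jsonpath='{.status.checkpoints.VllmDecodeWorker.identityHash}')
#   find_checkpoint_by_hash "${HASH}" "${NAMESPACE}"
```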

If you want to force a new restore after the checkpoint becomes ready, scale the worker:

$ kubectl patch dgd vllm-auto-demo -n ${NAMESPACE} --type=merge \
>   -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'

Failover Restore

Failover restore is not yet available. The current Snapshot flow does not support Snapshot plus GMS, so failover restore is not a supported checkpoint/restore path. For current GMS and active/passive failover guidance, see Shadow Engine Failover.

Lower-Level Testing With `snapshotctl`

Pods can be checkpointed and restored without the Dynamo operator by using the lower-level snapshotctl utility. The snapshot Helm chart must still be installed, and the namespace must have a running snapshot-agent DaemonSet with the checkpoint PVC mounted.

snapshotctl is intended for lower-level debugging and validation workflows, not as the primary user-facing checkpoint interface. For command details and manifest requirements, see deploy/snapshot/cmd/snapshotctl/README.md.

Checkpoint from a worker pod manifest

$ snapshotctl checkpoint \
>   --manifest ./worker-pod.yaml \
>   --container main \
>   --namespace ${NAMESPACE}

The checkpoint manifest must be for a pod and use a placeholder image. --container names the workload container to checkpoint.

If you do not pass --checkpoint-id, snapshotctl generates one and prints it:

status=completed
namespace=...
name=...
checkpoint_job=...
checkpoint_id=manual-snapshot-...
checkpoint_location=/checkpoints/...
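When scripting a checkpoint followed by a restore, the generated ID can be pulled out of that key=value output. The sketch below assumes the lines shown above are written to standard output in exactly that key=value form; the helper name snapshot_field is hypothetical.

```shell
# Reads key=value lines on stdin and prints the value of the first line
# whose key equals $1. Values may themselves contain "=".
snapshot_field() {
  awk -F= -v k="$1" '!found && $1 == k { sub(/^[^=]*=/, ""); print; found = 1 }'
}

# Example:
#   CHECKPOINT_ID=$(snapshotctl checkpoint --manifest ./worker-pod.yaml \
#     --container main --namespace "${NAMESPACE}" | snapshot_field checkpoint_id)
```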

Restore from a worker pod manifest

$ snapshotctl restore \
>   --manifest ./worker-pod.yaml \
>   --namespace ${NAMESPACE} \
>   --checkpoint-id manual-snapshot-... \
>   --containers main

This creates a new restore pod and returns after the request is submitted. Observe progress through Kubernetes readiness, events, and logs.

Restore an existing pod in place

$ snapshotctl restore \
>   --pod existing-restore-target \
>   --namespace ${NAMESPACE} \
>   --checkpoint-id manual-snapshot-... \
>   --containers main

This patches restore metadata onto an existing pod that is already snapshot-compatible and returns after the patch is accepted.

Checkpoint Identity

Each checkpoint is identified by a 16-character hexadecimal hash (64 bits) derived with SHA256 from the configuration that affects runtime state:

Field	Required	Affects Hash	Example
`model`	✓	✓	`meta-llama/Llama-3-8B`
`backendFramework`	✓	✓	`vllm`
`dynamoVersion`		✓	`0.9.0`, `1.0.0`
`tensorParallelSize`		✓	`1`, `2`, `4`, `8`
`pipelineParallelSize`		✓	`1`, `2`
`dtype`		✓	`float16`, `bfloat16`, `fp8`
`maxModelLen`		✓	`4096`, `8192`
`extraParameters`		✓	custom key-value pairs

Fields that do not change the checkpoint hash include:

replica count
node placement (nodeSelector, affinity, tolerations)
resource requests/limits
logging or observability configuration
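To illustrate why only identity fields matter, the sketch below derives a 16-character digest from a set of key=value pairs. It is illustrative only: the operator's canonical encoding of spec.identity is not described here, so this function will not reproduce status.identityHash. It assumes the sha256sum utility is available; the function name is hypothetical.

```shell
# Prints a 16-hex-character digest of the given identity fields.
# Arguments are key=value pairs. They are sorted first, so argument order
# does not change the result.
# NOTE: demonstration only; this is not the operator's hashing scheme.
identity_hash_sketch() {
  printf '%s\n' "$@" | LC_ALL=C sort | sha256sum | cut -c1-16
}

# Example:
#   identity_hash_sketch model=Qwen/Qwen3-0.6B backendFramework=vllm \
#     tensorParallelSize=1 dtype=bfloat16 maxModelLen=2048
```

Changing any identity field (for example dtype) yields a different digest, while settings such as replica count never enter the input and therefore cannot affect it.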

`DynamoCheckpoint` CRD

DynamoCheckpoint (short name: dckpt) is the operator-managed resource for the checkpoint lifecycle.

Use it when you want:

pre-warmed checkpoints before any DynamoGraphDeployment exists
explicit lifecycle control independent from a DGD
a stable human-readable name that services can reference with checkpointRef

The operator requires:

spec.identity
spec.job.podTemplateSpec

spec.job.backoffLimit is deprecated and ignored. Checkpoint Jobs are always single-attempt.

Check status with:

$ kubectl get dckpt -n ${NAMESPACE}
$ kubectl describe dckpt qwen3-06b-bf16 -n ${NAMESPACE}
$ kubectl get dckpt qwen3-06b-bf16 -n ${NAMESPACE} -o yaml

The status block looks like:

status:
  phase: Ready
  identityHash: 3bff874d069f0ed5
  jobName: checkpoint-job-3bff874d069f0ed5-1
  createdAt: "2026-01-29T10:05:00Z"
  message: ""

Limitations

Backend support is limited: checkpoint/restore currently supports vLLM workers only, and that support is still a limited preview.
Worker coverage is narrow: specialized workers such as multimodal, embedding, and diffusion are not supported.
Multi-GPU remains preview: vLLM tensor-parallel configurations have limited validation and are not yet a broadly supported path across clusters.
GMS restore is not yet available: Snapshot plus GPU Memory Service is blocked by admission.
Network state is sensitive: restore is sensitive to live TCP socket state. Loopback bootstrap/control sockets are the most reliable path today.
Privileged DaemonSet required: snapshot-agent must run privileged to execute CRIU and cuda-checkpoint. Workload pods do not need to be privileged.

Troubleshooting

Checkpoint Job finishes but the checkpoint never becomes `Ready`

A DynamoCheckpoint becomes Ready only after snapshot-agent confirms the checkpoint contents. A completed Job is not enough by itself.

$ kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} \
>   -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,MESSAGE:.status.message,JOB:.status.jobName
$ 
$ JOB_NAME=$(kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o jsonpath='{.status.jobName}')
$ if [ -n "${JOB_NAME}" ]; then
>   kubectl logs job/"${JOB_NAME}" -n ${NAMESPACE}
> fi
$ 
$ kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers

If the worker template is wrong, the most common causes are using the raw runtime image instead of the placeholder image, or leaving out normal mounts and secrets that the worker needs to start.

Restore cannot find or mount checkpoint storage

Restore discovers checkpoint storage from the snapshot-agent DaemonSet in the same namespace. That DaemonSet must be ready and must mount the checkpoint PVC.

$ kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
$ kubectl get daemonset -n ${NAMESPACE} -l app.kubernetes.io/component=snapshot-agent -o wide
$ kubectl get pvc -n ${NAMESPACE}

This is also the path that snapshotctl uses when it resolves checkpoint storage.
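The PVC mount can be checked directly on the DaemonSet spec. This sketch assumes the DaemonSet is named snapshot-agent and the PVC defaults to snapshot-pvc, as in the commands earlier on this page; the function name is hypothetical.

```shell
# Succeeds if the snapshot-agent DaemonSet in namespace $1 references
# PVC $2 (default: snapshot-pvc) among its pod volumes.
# The body runs in a subshell, so "exit" only ends the function.
agent_mounts_pvc() (
  ns=$1; pvc=${2:-snapshot-pvc}
  claims=$(kubectl get daemonset snapshot-agent -n "$ns" \
    -o jsonpath='{.spec.template.spec.volumes[*].persistentVolumeClaim.claimName}')
  for c in $claims; do
    if [ "$c" = "$pvc" ]; then
      exit 0
    fi
  done
  echo "snapshot-agent in $ns does not reference PVC $pvc (found: ${claims:-none})" >&2
  exit 1
)

# Example:
#   agent_mounts_pvc "${NAMESPACE}" snapshot-pvc
```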

`snapshotctl` manifest is rejected or the restore target is wrong

snapshotctl requires a Pod manifest and a target-container list. Multi-container manifests are supported as long as every name passed via --container or --containers exists in the pod spec.

$ snapshotctl checkpoint --manifest ./worker-pod.yaml --container main --namespace ${NAMESPACE}
$ snapshotctl restore --manifest ./worker-pod.yaml --containers main --namespace ${NAMESPACE} --checkpoint-id <checkpoint-id>

If the manifest already carries snapshot target metadata, it must agree with the CLI flag; snapshotctl rejects mismatches instead of silently picking one.

Planned Features

Stabilize multi-GPU support
Additional backend support
Alternative storage backends