Snapshot

⚠️ Experimental Feature: Dynamo Snapshot is currently in preview and may not work in every k8s cluster setup. The Dynamo Snapshot DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.

Dynamo Snapshot is an experimental infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in User-space) and NVIDIA’s cuda-checkpoint utility. Dynamo Snapshot dramatically reduces cold-start times for large models from minutes to seconds by capturing initialized application state and restoring it on-demand.

| Startup Type | Time | What Happens |
| --- | --- | --- |
| Cold Start | ~1 min | Download model, load to GPU, initialize engine |
| Warm Start (restore from checkpoint) | ~10 sec | Restore from checkpoint tar |

⚠️ Restore time may vary depending on cluster configuration (storage bandwidth, GPU model, etc.)

Prerequisites

  • Dynamo Platform/Operator installed on a k8s cluster with x86_64 (amd64) GPU nodes
  • NVIDIA driver 580.xx or newer on the target GPU nodes
  • ReadWriteMany storage if you need cross-node restore
  • vLLM or SGLang backend (TensorRT-LLM is not supported yet)
  • Cluster security policy that permits running a privileged DaemonSet
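The driver floor can be sanity-checked before installing anything. The sketch below only parses a version string; on a GPU node you would obtain the real value with the `nvidia-smi` query shown in the comment (the example version string here is illustrative):

```shell
# Minimal sketch: check a driver version string against the 580.xx floor.
# On a GPU node, obtain the real value with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
driver_version="580.65.06"          # illustrative value
major="${driver_version%%.*}"       # major component, e.g. "580"
if [ "${major}" -ge 580 ]; then
  echo "driver OK: ${driver_version}"
else
  echo "driver too old: ${driver_version} (need 580.xx or newer)" >&2
fi
```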

Quick Start

This guide assumes a normal Dynamo deployment workflow is already present on your Kubernetes cluster.

1. Build and push a placeholder image

Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with the restore tooling. If you do not already have one, build it with the snapshot placeholder target and push it to a registry your cluster can pull from:

$export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
$export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0
$
$cd deploy/snapshot
$
$make docker-build-placeholder \
> PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
> PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
$
$make docker-push-placeholder \
> PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"

This flow is defined in deploy/snapshot/Makefile and deploy/snapshot/Dockerfile. The placeholder image preserves the base runtime entrypoint and command contract, and adds the CRIU, cuda-checkpoint, and nsrestore tooling needed for restore.

2. Enable checkpointing in the platform and verify it

Whether you are installing or upgrading dynamo-platform, the operator must have checkpointing enabled and must point at the same storage that the snapshot chart will use:

```yaml
dynamo-operator:
  checkpoint:
    enabled: true
    storage:
      type: pvc
      pvc:
        pvcName: snapshot-pvc
        basePath: /checkpoints
```

If the platform is already installed, verify that the operator config contains the checkpoint block:

$OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
> -l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
> -o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')
$
$kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
> -o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'

Verify that the rendered config includes enabled: true and the same PVC name and base path you plan to use for the snapshot chart.

For the full platform/operator configuration surface, see deploy/helm/charts/platform/README.md and deploy/helm/charts/platform/components/operator/values.yaml.

3. Install the snapshot chart

$helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
> --namespace ${NAMESPACE} \
> --create-namespace \
> --set storage.pvc.create=true

Cross-node restore requires ReadWriteMany storage. The chart defaults to that mode.

For better restore times, use a fast ReadWriteMany StorageClass for the checkpoint PVC. To reuse an existing checkpoint PVC, install the chart with storage.pvc.create=false and point storage.pvc.name at the existing PVC.
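For that reuse case, a values override might look like the following sketch (the key names follow the storage.pvc.* settings referenced above; check deploy/helm/charts/snapshot/values.yaml for the authoritative schema):

```yaml
# Sketch: reuse an existing ReadWriteMany checkpoint PVC instead of creating one
storage:
  pvc:
    create: false
    name: existing-checkpoint-pvc   # name of your pre-existing PVC (illustrative)
```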

Verify that the PVC and DaemonSet are ready:

$kubectl get pvc snapshot-pvc -n ${NAMESPACE}
$kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}

For the full snapshot chart configuration surface, see deploy/helm/charts/snapshot/README.md and deploy/helm/charts/snapshot/values.yaml.

4. Apply a snapshot-compatible DynamoGraphDeployment

This example is adapted from examples/backends/vllm/deploy/agg.yaml. The worker must use the placeholder image from step 1, and the checkpoint identity must describe the runtime state you want to reuse.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: vllm-snapshot-demo
spec:
  services:
    Frontend:
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-runtime:1.0.0

    VllmDecodeWorker:
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      readinessProbe:
        httpGet:
          path: /live
          port: system
        periodSeconds: 1
        timeoutSeconds: 4
        failureThreshold: 3
      checkpoint:
        enabled: true
        mode: Auto
        identity:
          model: Qwen/Qwen3-0.6B
          backendFramework: vllm
      extraPodSpec:
        mainContainer:
          image: registry.example.com/dynamo/vllm-placeholder:1.0.0
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --disable-custom-all-reduce
          env:
            - name: GLOO_SOCKET_IFNAME
              value: lo
            - name: NCCL_SOCKET_IFNAME
              value: lo
            - name: NCCL_DEBUG
              value: ERROR
            - name: TORCH_CPP_LOG_LEVEL
              value: ERROR
            - name: TORCH_DISTRIBUTED_DEBUG
              value: "OFF"
            - name: CUDA_ERROR_LEVEL
              value: "10"
            - name: NCCL_CUMEM_ENABLE
              value: "0"
            - name: NCCL_CUMEM_HOST_ENABLE
              value: "0"
            - name: NCCL_NVLS_ENABLE
              value: "0"
            - name: NCCL_P2P_DISABLE
              value: "0"
            - name: NCCL_SHM_DISABLE
              value: "1"
            - name: NCCL_IB_DISABLE
              value: "1"
            - name: TORCH_NCCL_ENABLE_MONITORING
              value: "0"
```

For SGLang, use dynamo.sglang, an SGLang placeholder image, backendFramework: sglang, and the matching CLI flags.
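As a sketch of that substitution, only the worker service changes relative to the manifest above. The SGLang placeholder image name and CLI flags below are illustrative assumptions, not verified values:

```yaml
SglangDecodeWorker:
  componentType: worker
  replicas: 1
  checkpoint:
    enabled: true
    mode: Auto
    identity:
      model: Qwen/Qwen3-0.6B
      backendFramework: sglang
  extraPodSpec:
    mainContainer:
      image: registry.example.com/dynamo/sglang-placeholder:1.0.0  # illustrative name
      command: ["python3", "-m", "dynamo.sglang"]
      args: ["--model-path", "Qwen/Qwen3-0.6B"]  # flag name is an assumption
```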

Apply the manifest:

$kubectl apply -f vllm-snapshot-demo.yaml -n ${NAMESPACE}

On the first rollout, the worker cold-starts, the operator creates a DynamoCheckpoint, and the checkpoint Job writes data into snapshot-pvc.

5. Wait for the checkpoint to become ready

Capture the checkpoint name from DGD status, then wait for the DynamoCheckpoint phase to become Ready:

$CHECKPOINT_NAME=$(kubectl get dgd vllm-snapshot-demo -n ${NAMESPACE} \
> -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}')
$
$kubectl wait \
> --for=jsonpath='{.status.phase}'=Ready \
> "dynamocheckpoint/${CHECKPOINT_NAME}" \
> -n ${NAMESPACE} \
> --timeout=30m

The DGD status also reports the computed checkpoint hash at .status.checkpoints.VllmDecodeWorker.identityHash.

6. Trigger restore

Once the checkpoint is ready, scale the worker replicas from 1 to 2:

$kubectl patch dgd vllm-snapshot-demo -n ${NAMESPACE} --type=merge \
> -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'

New worker pods for VllmDecodeWorker will restore from the ready checkpoint automatically.

Checkpoint Configuration

The operator computes the checkpoint identity hash, looks for an existing DynamoCheckpoint with a matching nvidia.com/snapshot-checkpoint-hash label, and creates one if it does not find one:

```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"  # or "sglang"
    tensorParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 4096
```

When a service uses checkpointing, DGD status reports the resolved checkpointName, identityHash, and ready fields under .status.checkpoints.<service-name>.

Manual Management and checkpointRef

Use checkpointRef when you want a service to restore from a specific DynamoCheckpoint CR:

```yaml
checkpoint:
  enabled: true
  checkpointRef: "qwen3-06b-vllm-prewarm"
```

This is useful when:

  • You want to pre-warm checkpoints before creating DGDs
  • You want explicit control over which checkpoint to use

checkpointRef resolves by DynamoCheckpoint.metadata.name, not by status.identityHash. A manual checkpoint can use any valid Kubernetes resource name.

If you are managing checkpoint CRs yourself, set mode: Manual on the service to prevent the operator from creating a new DynamoCheckpoint when identity-based lookup does not find one.
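Putting the two together, a service pinned to a pre-warmed CR and opted out of auto-creation might look like this sketch (field names follow the examples above):

```yaml
checkpoint:
  enabled: true
  mode: Manual                                # operator will not auto-create a DynamoCheckpoint
  checkpointRef: "qwen3-06b-vllm-prewarm"     # resolved by CR name
```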

$# Check checkpoint status by CR name
$kubectl get dynamocheckpoint qwen3-06b-vllm-prewarm -n ${NAMESPACE}
$
$# Now create DGD referencing it
$kubectl apply -f my-dgd.yaml -n ${NAMESPACE}

If you want mode: Auto DGDs to discover a manually created checkpoint by identity, add the label nvidia.com/snapshot-checkpoint-hash=<identity-hash> to that DynamoCheckpoint. Auto-created checkpoints already use that label, and currently use the same hash as the CR name.

Checkpoint Identity

Checkpoints are uniquely identified by a 16-character SHA256 hash (64 bits) of configuration that affects runtime state:

| Field | Example |
| --- | --- |
| model | meta-llama/Llama-3-8B |
| backendFramework | sglang, vllm |
| dynamoVersion | 0.9.0, 1.0.0 |
| tensorParallelSize | 1, 2, 4, 8 (default: 1) |
| pipelineParallelSize | 1, 2 (default: 1) |
| dtype | float16, bfloat16, fp8 |
| maxModelLen | 4096, 8192 |
| extraParameters | Custom key-value pairs |

Not included in the hash (changing these does not invalidate a checkpoint):

  • replicas
  • nodeSelector, affinity, tolerations
  • resources (requests/limits)
  • Logging/observability config
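The 16-character prefix can be illustrated as follows. The operator's actual canonicalization of the identity fields is internal; this sketch simply hashes a sorted key=value serialization as a stand-in:

```shell
# Illustrative only: deriving a 16-character (64-bit) SHA-256 prefix.
# The serialization format below is an assumption, not the operator's real one.
identity="backendFramework=vllm,model=Qwen/Qwen3-0.6B,tensorParallelSize=1"
hash=$(printf '%s' "${identity}" | sha256sum | cut -d' ' -f1 | cut -c1-16)
echo "${hash}"   # 16 lowercase hex characters
```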

Example with all fields:

```yaml
checkpoint:
  enabled: true
  mode: Auto
  identity:
    model: "meta-llama/Llama-3-8B"
    backendFramework: "vllm"
    dynamoVersion: "0.9.0"
    tensorParallelSize: 1
    pipelineParallelSize: 1
    dtype: "bfloat16"
    maxModelLen: 8192
    extraParameters:
      enableChunkedPrefill: "true"
      quantization: "awq"
```

DynamoCheckpoint CRD

The DynamoCheckpoint (shortname: dckpt) is a Kubernetes Custom Resource that manages checkpoint lifecycle.

When to create a DynamoCheckpoint directly:

  • Pre-warming: Create checkpoints before deploying DGDs for instant startup
  • Explicit control: Manage checkpoint lifecycle independently from DGDs

The operator requires spec.identity and spec.job.podTemplateSpec. The pod template should match the worker container you want checkpointed, including image, command, args, secrets, volumes, and resource limits. You do not need to set the checkpoint environment variables manually; the operator injects them for checkpoint jobs and restored pods.

Create a checkpoint:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoCheckpoint
metadata:
  name: qwen3-06b-vllm-prewarm
  labels:
    nvidia.com/snapshot-checkpoint-hash: "e5962d34ba272638"  # Add this if Auto-mode identity lookup should find the CR
spec:
  identity:
    model: Qwen/Qwen3-0.6B
    backendFramework: vllm
    tensorParallelSize: 1
    dtype: bfloat16
    maxModelLen: 4096

  job:
    activeDeadlineSeconds: 3600
    backoffLimit: 3
    ttlSecondsAfterFinished: 300
    podTemplateSpec:
      spec:
        restartPolicy: Never
        containers:
          - name: main
            image: registry.example.com/dynamo/vllm-placeholder:1.0.0
            command:
              - python3
              - -m
              - dynamo.vllm
            args:
              - --model
              - Qwen/Qwen3-0.6B
              - --disable-custom-all-reduce
            env:
              - name: GLOO_SOCKET_IFNAME
                value: lo
              - name: NCCL_SOCKET_IFNAME
                value: lo
            resources:
              limits:
                nvidia.com/gpu: "1"
```

You can name the CR however you want if you plan to use checkpointRef. If you want mode: Auto identity lookup to find a manual CR, set the nvidia.com/snapshot-checkpoint-hash label to the computed 16-character identity hash. Using the hash as the CR name is a convenient convention, but it is not required.

Check status:

$# List all checkpoints
$kubectl get dynamocheckpoint -n ${NAMESPACE}
$# Or use the shortname
$kubectl get dckpt -n ${NAMESPACE}

NAME                     MODEL                   BACKEND   PHASE      HASH               AGE
qwen3-06b-vllm-prewarm   Qwen/Qwen3-0.6B         vllm      Ready      e5962d34ba272638   5m
llama3-8b-vllm-prewarm   meta-llama/Llama-3-8B   vllm      Creating   7ab4f89c12de3456   2m

Phases:

| Phase | Description |
| --- | --- |
| Pending | CR created, waiting for job to start |
| Creating | Checkpoint job is running |
| Ready | Checkpoint available for use |
| Failed | Checkpoint creation failed |

Ready is a value in status.phase, not a Kubernetes condition. The conditions array tracks job lifecycle events:

| Condition Type | Meaning |
| --- | --- |
| JobCreated | The checkpoint Job has been created |
| JobCompleted | The checkpoint Job has finished (succeeded or failed) |

Other useful status fields are:

| Field | Meaning |
| --- | --- |
| status.jobName | Name of the checkpoint Job |
| status.identityHash | Computed 16-character hash for the checkpoint identity |
| status.location | Checkpoint location in the configured storage backend |
| status.storageType | Storage backend type (pvc, s3, or oci) |
| status.createdAt | Timestamp recorded when the checkpoint becomes ready |
| status.message | Failure or progress message, when available |

Detailed status:

$kubectl describe dckpt qwen3-06b-vllm-prewarm -n ${NAMESPACE}
Status:
  Phase:         Ready
  IdentityHash:  e5962d34ba272638
  JobName:       checkpoint-qwen3-06b-vllm-prewarm
  Location:      /checkpoints/e5962d34ba272638.tar
  StorageType:   pvc
  CreatedAt:     2026-01-29T10:05:00Z
  Conditions:
    - Type:    JobCreated
      Status:  "True"
      Reason:  JobCreated
    - Type:    JobCompleted
      Status:  "True"
      Reason:  JobSucceeded

Reference from DGD:

Once the checkpoint is Ready, you can reference it by CR name:

```yaml
spec:
  services:
    VllmDecodeWorker:
      checkpoint:
        enabled: true
        checkpointRef: "qwen3-06b-vllm-prewarm"
```

Or use mode: Auto with the same identity and snapshot-hash label, and the operator will reuse it automatically.

Limitations

  • LLM workers only: Checkpoint/restore supports LLM decode and prefill workers. Specialized workers (multimodal, embedding, diffusion) are not supported.
  • Single-GPU only: Multi-GPU setups may work on very basic hardware configurations, but they are not officially supported yet.
  • Network state: Active TCP connections cannot be checkpointed.
  • Security: Dynamo Snapshot runs a privileged DaemonSet, which is required for CRIU and cuda-checkpoint operations; workload pods themselves do not need to be privileged.

Troubleshooting

Checkpoint Not Ready

  1. Check the checkpoint job:

    $kubectl get dckpt -n ${NAMESPACE}
    $kubectl describe dckpt <checkpoint-name> -n ${NAMESPACE}
    $kubectl logs job/$(kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o jsonpath='{.status.jobName}') -n ${NAMESPACE}
  2. Check the DaemonSet:

    $kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers
  3. Verify that platform and chart storage settings match:

    $kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o yaml

Restore Failing

  1. Check pod logs:

    $kubectl logs <worker-pod> -n ${NAMESPACE}
  2. Describe the restore target pod:

    $kubectl describe pod <worker-pod> -n ${NAMESPACE}
  3. Confirm the referenced checkpoint is still Ready:

    $kubectl get dckpt <checkpoint-name> -n ${NAMESPACE}

Planned Features

  • TensorRT-LLM backend support
  • S3/MinIO storage backend
  • OCI registry storage backend
  • Multi-GPU checkpoints