For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Digest
  • Getting Started
    • Quickstart
    • Introduction
    • Local Installation
    • Building from Source
    • Kubernetes Deployment
    • Contribution Guide
  • Resources
    • Support Matrix
    • Feature Matrix
    • Release Artifacts
    • Examples
    • Glossary
  • Digest
    • NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes
    • DynoSim: Simulating the Pareto Frontier
    • Dynamo Day 0 support for TokenSpeed
    • Multi-Turn Agentic Harnesses
    • Full-Stack Optimizations for Agentic Inference
    • Flash Indexer: Inter-Galactic KV Routing
  • Kubernetes Deployment
      • Service Discovery
      • Webhooks
      • Snapshotting GPU Workers
      • Shadow Engine Failover
      • Developing with Tilt
  • Feature Guides
    • KV Cache Aware Routing
    • Disaggregated Serving
    • KV Cache Offloading
    • Benchmarking
    • Tool Calling & Reasoning Parsing
    • Fault Tolerance
    • Observability (Local)
    • Inference Simulation
    • Agents
    • LoRA Adapters
    • Multimodal
    • Diffusion
    • Fastokens Tokenizer
  • Backends
    • SGLang
    • TensorRT-LLM
    • vLLM
  • Components
    • Frontend
    • Router
    • Planner
    • Profiler
    • KVBM
  • Integrations
  • Design Docs
    • Overall Architecture
    • Architecture Flow
    • Disaggregated Serving
    • Distributed Runtime
  • Documentation
    • Dynamo Docs Guide
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
Digest
On this page
  • Prerequisites
  • Quick Start via DynamoCheckpoint CR
  • 1. Build and push a placeholder image
  • 2. Enable checkpointing in the platform and verify it
  • 3. Install the snapshot chart
  • 4. Create a standalone DynamoCheckpoint
  • 5. Wait for the checkpoint to become ready
  • 6. Deploy a DynamoGraphDeployment that restores from checkpointRef
  • DGD Auto Flow
  • Failover Restore
  • Lower-Level Testing With snapshotctl
  • Checkpoint from a worker pod manifest
  • Restore from a worker pod manifest
  • Restore an existing pod in place
  • Checkpoint IDs and Legacy Identity
  • DynamoCheckpoint CRD
  • Limitations
  • Troubleshooting
  • Checkpoint Job finishes but the checkpoint never becomes Ready
  • Restore cannot find or mount checkpoint storage
  • snapshotctl manifest is rejected or the restore target is wrong
  • Planned Features
  • Related Documentation
Kubernetes DeploymentAdvanced Platform

Snapshotting GPU Workers

||View as Markdown|
Previous

Webhooks

Next

Shadow Engine Failover

⚠️ Experimental Feature: Dynamo Snapshot is currently in preview and may only be functional in some cluster setups. The snapshot-agent DaemonSet runs in privileged mode to perform CRIU operations. See Limitations for details.

Dynamo Snapshot is infrastructure for fast-starting GPU applications in Kubernetes using CRIU (Checkpoint/Restore in Userspace) and NVIDIA’s cuda-checkpoint utility. The usual flow is:

  1. start a worker once and checkpoint its initialized state
  2. store that checkpoint on a namespace-local snapshot volume
  3. restore later workers from that checkpoint instead of cold-starting again
Startup TypeTimeWhat Happens
Cold Start~1 minDownload model, load to GPU, initialize engine
Warm Start (restore from checkpoint)~10 secRestore from a ready checkpoint directory

⚠️ Restore time depends on storage bandwidth, GPU model, and whether the restore stays on the same node.

For more background on the snapshot architecture and startup improvements, see NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes.

Prerequisites

  • x86_64 (amd64) GPU nodes
  • NVIDIA driver 580.xx or newer on the target GPU nodes (590.xx or newer if testing multi-GPU snapshots)
  • vLLM or SGLang backend today
  • Checkpoint storage. ReadWriteMany is the safest default for cross-node or concurrent multi-node access, but podMount mode can also use suitable ReadWriteOnce storage for sequential checkpoint/restore workflows.
  • CRI-O / OpenShift: set runtime.type=crio on the snapshot chart (and openshift.enabled=true on OpenShift). Defaults are for containerd; see the chart README for sockets and Helm flags.

Quick Start via DynamoCheckpoint CR

  1. Build a placeholder image
  2. Install the snapshot chart
  3. Create a DynamoCheckpoint and wait for it to become ready
  4. Deploy a DynamoGraphDeployment that restores from the corresponding checkpointRef

1. Build and push a placeholder image

Snapshot-enabled workers must use a placeholder image that wraps the normal runtime image with restore tooling. If you do not already have one, build it and push it to a registry your cluster can pull from:

$export RUNTIME_IMAGE=registry.example.com/dynamo/vllm-runtime:1.0.0
$export PLACEHOLDER_IMAGE=registry.example.com/dynamo/vllm-placeholder:1.0.0
$
$cd deploy/snapshot
$
$make docker-build-placeholder \
> PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
> PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"
$
$make docker-push-placeholder \
> PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}"

The placeholder image preserves the normal runtime entrypoint/command contract and adds the criu, cuda-checkpoint, and nsrestore tooling needed for checkpoint and restore.

To build either snapshot image against a custom CRIU fork or ref, pass CRIU_REPO and CRIU_REF through make. If they are unset, the Dockerfile defaults are used.

$make docker-build-agent \
> IMG=registry.example.com/dynamo/snapshot-agent:1.0.0 \
> CRIU_REPO="${YOUR_CRIU_REPO}" \
> CRIU_REF="branch-or-sha"
$
$make docker-build-placeholder \
> PLACEHOLDER_BASE_IMG="${RUNTIME_IMAGE}" \
> PLACEHOLDER_IMG="${PLACEHOLDER_IMAGE}" \
> CRIU_REPO="${YOUR_CRIU_REPO}" \
> CRIU_REF="branch-or-sha"

2. Enable checkpointing in the platform and verify it

Whether you are installing or upgrading dynamo-platform, the operator only needs checkpointing enabled:

1dynamo-operator:
2 checkpoint:
3 enabled: true

If the platform is already installed, verify that the operator config contains the checkpoint block:

$OPERATOR_CONFIG=$(kubectl get deploy -n "${PLATFORM_NAMESPACE}" \
> -l app.kubernetes.io/name=dynamo-operator,app.kubernetes.io/component=manager \
> -o jsonpath='{.items[0].spec.template.spec.volumes[?(@.name=="operator-config")].configMap.name}')
$
$kubectl get configmap "${OPERATOR_CONFIG}" -n "${PLATFORM_NAMESPACE}" \
> -o jsonpath='{.data.config\.yaml}' | sed -n '/^checkpoint:/,/^[^[:space:]]/p'

Verify that the rendered config includes enabled: true.

3. Install the snapshot chart

For the default namespace-local mode, install the snapshot chart in each workload namespace. The chart creates the PVC and the agent in that namespace:

$helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
> --namespace ${NAMESPACE} \
> --create-namespace \
> --set storage.pvc.create=true

In the default agentMount mode, the snapshot-agent DaemonSet mounts the checkpoint PVC directly. On a multi-node GPU cluster that means agent pods on multiple nodes may mount the same PVC, so the PVC generally needs ReadWriteMany. The chart defaults to that mode. If your cluster does not have a default storage class, also set storage.pvc.storageClass.

If you are reusing an existing checkpoint PVC, do not set storage.pvc.create=true; install the chart with storage.pvc.create=false and set storage.pvc.name instead.

CRI-O or OpenShift: append for example --set runtime.type=crio and, on OpenShift, --set openshift.enabled=true (see deploy/helm/charts/snapshot/README.md).

For clusters that prefer one privileged snapshot agent instead of one DaemonSet per workload namespace, install the chart once in an infrastructure namespace. In this mode the chart does not create workload PVCs; the Dynamo operator either creates each namespace-local PVC or verifies that it already exists:

$helm upgrade --install snapshot ./deploy/helm/charts/snapshot \
> --namespace dynamo-system \
> --create-namespace \
> --set storage.accessMode=podMount \
> --set storage.pvc.create=false \
> --set rbac.namespaceRestricted=false

To let the operator create the workload PVC in each namespace that uses checkpoint/restore, configure the operator with create: true:

1dynamo-operator:
2 checkpoint:
3 enabled: true
4 storage:
5 type: pvc
6 pvc:
7 pvcName: snapshot-pvc
8 basePath: /checkpoints
9 create: true
10 size: 1Ti
11 storageClassName: ""
12 accessMode: ReadWriteMany

The chart and operator use separate configuration surfaces here: the snapshot chart PVC name is storage.pvc.name, while the operator config field is checkpoint.storage.pvc.pvcName.

This is a key difference from agentMount: podMount removes the requirement that the snapshot-agent DaemonSet mount the checkpoint PVC on every GPU node. Only the active checkpoint/restore workload pod mounts the PVC, and the agent reaches it through that pod’s mount namespace. ReadWriteMany remains the safest operator-managed default, especially when multiple checkpoint/restore pods may access the same PVC concurrently or when restore scheduling can span nodes. Suitable ReadWriteOnce storage classes can still be used for sequential podMount checkpoint/restore flows when the backend can attach the volume to the node running the active workload pod.

podMount depends on the target container remaining alive while the agent resolves /host/proc/<pid>/root/<basePath>. If the container exits or restarts during checkpoint/restore setup, if the runtime cannot expose a stable host PID, or if node security settings prevent host proc traversal, the agent fails or skips that attempt and Kubernetes/operator reconciliation must try again after a fresh container is available.

To use an already-present PVC instead, omit create or set it to false. The operator will fail reconciliation with a clear error if the named PVC does not exist in the workload namespace.

Verify that the DaemonSet is ready. After a checkpoint or restore workload is reconciled, verify the workload namespace PVC:

$kubectl rollout status daemonset/snapshot-agent -n dynamo-system
$kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide
$kubectl get pvc snapshot-pvc -n ${NAMESPACE}

4. Create a standalone DynamoCheckpoint

The checkpoint Job pod template should match the worker container you want to checkpoint. For a standalone checkpoint, the important parts are the legacy spec.identity metadata, a container named main, and the placeholder image; the rest of the pod template should mirror your normal worker config. Extra containers are allowed, but only main is checkpointed unless spec.job.targetContainerName selects another container.

1apiVersion: nvidia.com/v1alpha1
2kind: DynamoCheckpoint
3metadata:
4 name: qwen3-06b-bf16
5spec:
6 identity:
7 model: Qwen/Qwen3-0.6B
8 backendFramework: vllm
9 tensorParallelSize: 1
10 dtype: bfloat16
11 maxModelLen: 2048
12
13 job:
14 activeDeadlineSeconds: 3600
15 podTemplateSpec:
16 spec:
17 ...
18 containers:
19 - name: main
20 image: registry.example.com/dynamo/vllm-placeholder:1.0.0
21 ...

GMS + Snapshot support is currently disabled.

For a full working example, see deploy/operator/config/samples/nvidia.com_v1alpha1_dynamocheckpoint.yaml.

Apply it:

$kubectl apply -f qwen3-checkpoint.yaml -n ${NAMESPACE}

5. Wait for the checkpoint to become ready

$kubectl get dckpt -n ${NAMESPACE} \
> -o custom-columns=NAME:.metadata.name,CHECKPOINT_ID:.status.checkpointID,PHASE:.status.phase
$
$kubectl wait \
> --for=jsonpath='{.status.phase}'=Ready \
> dynamocheckpoint/qwen3-06b-bf16 \
> -n ${NAMESPACE} \
> --timeout=30m

The useful status fields are:

  • status.phase: high-level lifecycle (Pending, Creating, Ready, Failed)
  • status.checkpointID: artifact ID used by the snapshot protocol
  • status.identityHash: deprecated compatibility alias for status.checkpointID
  • status.jobName: checkpoint Job name
  • status.createdAt: timestamp recorded when the checkpoint became ready
  • status.message: progress or failure detail when available

6. Deploy a DynamoGraphDeployment that restores from checkpointRef

Once the checkpoint is Ready, restore a worker from it explicitly:

1apiVersion: nvidia.com/v1alpha1
2kind: DynamoGraphDeployment
3metadata:
4 name: vllm-checkpointref-demo
5spec:
6 services:
7 Frontend:
8 componentType: frontend
9 replicas: 1
10 extraPodSpec:
11 mainContainer:
12 image: registry.example.com/dynamo/vllm-runtime:1.0.0
13
14 VllmDecodeWorker:
15 componentType: worker
16 replicas: 1
17 checkpoint:
18 enabled: true
19 checkpointRef: qwen3-06b-bf16
20 extraPodSpec:
21 mainContainer:
22 image: registry.example.com/dynamo/vllm-placeholder:1.0.0
23 ...
24 ...

Apply it:

$kubectl apply -f vllm-checkpointref-demo.yaml -n ${NAMESPACE}
$kubectl get pods -n ${NAMESPACE} -w

The VllmDecodeWorker pod should restore from the ready checkpoint instead of creating a new one.

DGD Auto Flow

checkpointRef is the most explicit path. If you set it, the DGD uses that existing DynamoCheckpoint and does not create a new automatic checkpoint for the component. This is the escape hatch for users who intentionally want to reuse a pre-existing checkpoint and accept the compatibility risk.

Without checkpointRef, mode: Auto is the DGD-managed path: for each checkpoint-enabled worker generation, the DGD controller creates a DGD-scoped DynamoCheckpoint and the checkpoint controller starts a checkpoint Job. Automatic DGD checkpoints are not reused across DGDs, even when two manifests are identical.

The automatic checkpoint ID is derived from the DGD namespace/name/UID, component name, and active worker hash. The DGD UID prevents cross-DGD reuse; the worker hash keeps a scale down/up on the same worker generation using the same DGD-scoped checkpoint while creating a new checkpoint for a new worker generation.

By default, startupPolicy: Immediate starts workers cold while the checkpoint job runs in the background. Once the checkpoint becomes Ready, only newly-created Pods restore from it. Existing Pods are not mutated or restarted just because the checkpoint became ready.

If you want workers to wait for the checkpoint before starting, set startupPolicy: WaitForCheckpoint. That policy keeps normal worker replicas at zero until the checkpoint is Ready, then starts workers from the checkpoint.

1checkpoint:
2 enabled: true
3 mode: Auto
4 startupPolicy: Immediate # default; optional

Inside a DynamoGraphDeployment, it looks like this:

1apiVersion: nvidia.com/v1alpha1
2kind: DynamoGraphDeployment
3metadata:
4 name: vllm-auto-demo
5spec:
6 services:
7 Frontend:
8 componentType: frontend
9 replicas: 1
10 extraPodSpec:
11 mainContainer:
12 image: registry.example.com/dynamo/vllm-runtime:1.0.0
13
14 VllmDecodeWorker:
15 componentType: worker
16 replicas: 1
17 checkpoint:
18 enabled: true
19 mode: Auto
20 startupPolicy: Immediate
21 extraPodSpec:
22 mainContainer:
23 image: registry.example.com/dynamo/vllm-placeholder:1.0.0
24 ...
25 ...

The legacy checkpoint.identity field is ignored for DGD-managed automatic checkpoints. It is retained only for API compatibility and standalone DynamoCheckpoint workflows.

Useful inspection commands:

$kubectl get dgd vllm-auto-demo -n ${NAMESPACE} \
> -o jsonpath='{.status.checkpoints.VllmDecodeWorker.checkpointName}{"\n"}{.status.checkpoints.VllmDecodeWorker.checkpointID}{"\n"}{.status.checkpoints.VllmDecodeWorker.ready}{"\n"}'
$
$kubectl get dckpt -n ${NAMESPACE}

If you use the default Immediate policy and want to create restored pods after the checkpoint becomes ready, scale the worker:

$kubectl patch dgd vllm-auto-demo -n ${NAMESPACE} --type=merge \
> -p '{"spec":{"services":{"VllmDecodeWorker":{"replicas":2}}}}'

Failover Restore

Failover restore is not yet available. The current Snapshot flow does not support GMS + Snapshot, so do not use failover restore as a supported checkpoint/restore path. For current GMS and active/passive failover guidance, see Shadow Engine Failover.

Lower-Level Testing With snapshotctl

It is possible to checkpoint and restore pods without the Dynamo operator via the lower-level snapshotctl utility. However, the snapshot helm chart must be installed, with a running snapshot-agent DaemonSet in the namespace with the checkpoint PVC mounted.

snapshotctl is intended for lower-level debugging and validation workflows, not as the primary user-facing checkpoint interface. For command details and manifest requirements, see deploy/snapshot/cmd/snapshotctl/README.md.

Checkpoint from a worker pod manifest

$snapshotctl checkpoint \
> --manifest ./worker-pod.yaml \
> --container main \
> --namespace ${NAMESPACE}

The checkpoint manifest must be for a pod and use a placeholder image. --container names the workload container to checkpoint.

If you do not pass --checkpoint-id, snapshotctl generates one and prints it:

status=completed
namespace=...
name=...
checkpoint_job=...
checkpoint_id=manual-snapshot-...
checkpoint_location=/checkpoints/...

Restore from a worker pod manifest

$snapshotctl restore \
> --manifest ./worker-pod.yaml \
> --namespace ${NAMESPACE} \
> --checkpoint-id manual-snapshot-... \
> --containers main

This creates a new restore pod and returns after the request is submitted. Observe progress through Kubernetes readiness, events, and logs.

Restore an existing pod in place

$snapshotctl restore \
> --pod existing-restore-target \
> --namespace ${NAMESPACE} \
> --checkpoint-id manual-snapshot-... \
> --containers main

This patches restore metadata onto an existing pod that is already snapshot-compatible and returns after the patch is accepted.

Checkpoint IDs and Legacy Identity

status.checkpointID is the artifact ID used by the snapshot protocol and the directory name under checkpoint storage. For DGD-managed automatic checkpoints, this ID is scoped to a single DGD/component worker generation. It is not a compatibility claim across DGDs, and identical manifests are not treated as proof that a checkpoint can be reused safely.

The legacy spec.identity shape is still required on standalone DynamoCheckpoint objects and remains the fallback for explicit/manual workflows. When a standalone checkpoint does not already have status.checkpointID or the checkpoint-ID label, the operator computes the legacy 16-character SHA256 hash (64 bits) from these fields:

Legacy fieldRequiredAffects legacy hashExample
model✓✓meta-llama/Llama-3-8B
backendFramework✓✓vllm
dynamoVersion✓0.9.0, 1.0.0
tensorParallelSize✓1, 2, 4, 8
pipelineParallelSize✓1, 2
dtype✓float16, bfloat16, fp8
maxModelLen✓4096, 8192
extraParameters✓custom key-value pairs

Fields that do not change the legacy hash include:

  • replica count
  • node placement (nodeSelector, affinity, tolerations)
  • resource requests/limits
  • logging or observability configuration

DGD-managed automatic checkpoints ignore this legacy identity as a reuse boundary. The DGD controller creates its own DGD-scoped checkpoint ID and synthesizes a legacy identity only because the v1alpha1 DynamoCheckpoint API still requires the field.

DynamoCheckpoint CRD

The DynamoCheckpoint (shortname: dckpt) is the operator-managed resource for checkpoint lifecycle.

Use it when you want:

  • pre-warmed checkpoints before any DynamoGraphDeployment exists
  • explicit lifecycle control independent from a DGD
  • a stable human-readable name that services can reference with checkpointRef

The operator requires:

  • spec.identity
  • spec.job.podTemplateSpec

spec.job.backoffLimit is deprecated and ignored. Checkpoint Jobs are always single-attempt.

Check status with:

$kubectl get dckpt -n ${NAMESPACE}
$kubectl describe dckpt qwen3-06b-bf16 -n ${NAMESPACE}
$kubectl get dckpt qwen3-06b-bf16 -n ${NAMESPACE} -o yaml

The status block looks like:

1status:
2 phase: Ready
3 checkpointID: 3bff874d069f0ed5
4 identityHash: 3bff874d069f0ed5 # deprecated compatibility alias
5 jobName: checkpoint-job-3bff874d069f0ed5-1
6 createdAt: "2026-01-29T10:05:00Z"
7 message: ""

Limitations

  • Backend support is limited: checkpoint/restore currently supports vLLM workers only, and that support is still a limited preview.
  • Worker coverage is narrow: specialized workers such as multimodal, embedding, and diffusion are not supported.
  • Multi-GPU remains preview: vLLM tensor-parallel configurations have limited validation and are not yet a broadly supported path across clusters.
  • GMS restore remains experimental: GMS + Snapshot is currently disabled.
  • Admission is create-only: with DGD startupPolicy: Immediate, only Pods created after a checkpoint is Ready are restore-shaped. Existing Pods cold-started before checkpoint readiness keep running as-is.
  • Network state is sensitive: restore is sensitive to live TCP socket state. Loopback bootstrap/control sockets are the most reliable path today.
  • Privileged DaemonSet required: snapshot-agent must run privileged to execute CRIU and cuda-checkpoint. Workload pods do not need to be privileged.

Troubleshooting

Checkpoint Job finishes but the checkpoint never becomes Ready

Snapshot only becomes Ready after snapshot-agent confirms the checkpoint contents. A completed Job is not enough by itself.

$kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} \
> -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,MESSAGE:.status.message,JOB:.status.jobName
$
$JOB_NAME=$(kubectl get dckpt <checkpoint-name> -n ${NAMESPACE} -o jsonpath='{.status.jobName}')
$if [ -n "${JOB_NAME}" ]; then
$ kubectl logs job/"${JOB_NAME}" -n ${NAMESPACE}
$fi
$
$kubectl logs daemonset/snapshot-agent -n ${NAMESPACE} --all-containers

If the worker template is wrong, the most common causes are using the raw runtime image instead of the placeholder image, or leaving out normal mounts and secrets that the worker needs to start.

Restore cannot find or mount checkpoint storage

For the default agentMount install, restore discovers checkpoint storage from the snapshot-agent DaemonSet in the workload namespace. That DaemonSet must be ready and must mount the checkpoint PVC.

$kubectl rollout status daemonset/snapshot-agent -n ${NAMESPACE}
$kubectl get daemonset -n ${NAMESPACE} -l app.kubernetes.io/component=snapshot-agent -o wide
$kubectl get pvc -n ${NAMESPACE}

For a shared-agent podMount install, the snapshot-agent DaemonSet can run in the infrastructure namespace instead. Verify the shared-agent pods there, then verify that the workload namespace has the checkpoint PVC that the operator created or validated:

$kubectl rollout status daemonset/snapshot-agent -n dynamo-system
$kubectl get pods -n dynamo-system -l app.kubernetes.io/component=snapshot-agent -o wide
$kubectl get pvc snapshot-pvc -n ${NAMESPACE}

In podMount mode the agent reaches the checkpoint through the workload pod’s mount namespace rather than by mounting the PVC itself. Check the workload pod’s checkpoint storage annotations and the snapshot-agent logs to see the actual resolved checkpoint path. snapshotctl uses the chart’s storage resolution path, so for lower-level snapshotctl debugging make sure the snapshot chart configuration matches the access mode you are testing.

snapshotctl manifest is rejected or the restore target is wrong

snapshotctl requires a Pod manifest and a target-container list. Multi-container manifests are supported as long as every name passed via --container or --containers exists in the pod spec.

$snapshotctl checkpoint --manifest ./worker-pod.yaml --container main --namespace ${NAMESPACE}
$snapshotctl restore --manifest ./worker-pod.yaml --containers main --namespace ${NAMESPACE} --checkpoint-id <checkpoint-id>

If the manifest already carries snapshot target metadata, it must agree with the CLI flag; snapshotctl rejects mismatches instead of silently picking one.

Planned Features

  • Stable multi-GPU and multinode support
  • TensorRT-LLM support

Related Documentation

  • Installation Guide
  • Shadow Engine Failover
  • API Reference