# ChReK Standalone Usage Guide

⚠️ **Experimental Feature**: ChReK is currently in beta/preview. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the security implications before deploying.

This guide explains how to use ChReK (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.
## Table of Contents

- Overview
- Prerequisites
- Security Considerations
- Technical Limitations
- Step 1: Deploy ChReK
- Step 2: Build Checkpoint-Enabled Images
- Step 3: Create Checkpoint Jobs
- Step 4: Restore from Checkpoints
- Environment Variables Reference
- Checkpoint Flow Explained
- Troubleshooting
- Getting Help
## Overview

When using ChReK standalone, you are responsible for:

- Deploying the ChReK Helm chart (DaemonSet + PVC)
- Building checkpoint-enabled container images with the restore entrypoint
- Creating checkpoint jobs with the correct environment variables
- Creating restore pods that detect and use the checkpoints

The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automatically once your pods are configured correctly.
## Prerequisites

- A Kubernetes cluster with:
  - NVIDIA GPUs with checkpoint support
  - Privileged security contexts allowed (⚠️ required for CRIU; see Security Considerations)
  - PVC storage (`ReadWriteMany` recommended for multi-node)
- Docker or a compatible container runtime for building images
- Access to the ChReK source code: `deploy/chrek/`
## Security Considerations

⚠️ **Important**: ChReK restore operations require privileged mode, which has significant security implications:

- Privileged containers can access all host devices and bypass most security restrictions
- This may violate security policies in production environments
- A compromised privileged container can compromise the entire node

Recommended for:

- ✅ Development and testing environments
- ✅ Research and experimentation
- ✅ Controlled production environments with appropriate security controls

Not recommended for:

- ❌ Multi-tenant clusters without proper isolation
- ❌ Security-sensitive production workloads without a risk assessment
- ❌ Environments with strict security compliance requirements
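If you do enable privileged workloads, one common guardrail (assuming your cluster uses the built-in Pod Security Admission controller) is to confine the `privileged` profile to the namespace that runs ChReK workloads, for example:

```bash
# Allow privileged pods only in the ChReK namespace; other namespaces
# can stay on the stricter baseline/restricted profiles
kubectl label namespace my-app \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=privileged
```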
## Technical Limitations

⚠️ **Current restrictions**:

- **vLLM backend only**: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
- **Single-node only**: Checkpoints must be created and restored on the same node.
- **Single-GPU only**: Multi-GPU configurations are not yet supported.
- **Network state**: Active TCP connections are closed during restore.
- **Storage**: Only the PVC backend is currently implemented (S3/OCI planned).
## Step 1: Deploy ChReK

### Install the Helm Chart

```bash
# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo

# Install ChReK in your namespace
helm install chrek ./deploy/helm/charts/chrek \
  --namespace my-app \
  --create-namespace \
  --set storage.pvc.size=100Gi \
  --set storage.pvc.storageClass=your-storage-class
```
### Verify Installation

```bash
# Check the DaemonSet is running
kubectl get daemonset -n my-app
# NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE
# chrek-agent   3         3         3       3            3

# Check the PVC is bound
kubectl get pvc -n my-app
# NAME        STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS
# chrek-pvc   Bound    pvc-xyz   100Gi      RWX            your-storage-class
```
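If the agent pods are still coming up, you can block until the rollout completes before moving on:

```bash
# Wait until the ChReK agent is ready on every schedulable node
kubectl rollout status daemonset/chrek-agent -n my-app
```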
## Step 2: Build Checkpoint-Enabled Images

ChReK provides a convenient `placeholder` target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images.

### Quick Start: Using the Placeholder Target (Recommended)

```bash
cd deploy/chrek

# Define your images
export BASE_IMAGE="your-app:latest"                 # Your existing application image
export RESTORE_IMAGE="your-app:checkpoint-enabled"  # Output checkpoint-enabled image

# Build using the placeholder target
docker build \
  --target placeholder \
  --build-arg BASE_IMAGE="$BASE_IMAGE" \
  -t "$RESTORE_IMAGE" \
  .

# Push to your registry
docker push "$RESTORE_IMAGE"
```

Example with a Dynamo vLLM image:

```bash
cd deploy/chrek

export DYNAMO_IMAGE="nvidia/dynamo-vllm:v1.2.0"
export RESTORE_IMAGE="nvidia/dynamo-vllm:v1.2.0-checkpoint"

docker build \
  --target placeholder \
  --build-arg BASE_IMAGE="$DYNAMO_IMAGE" \
  -t "$RESTORE_IMAGE" \
  .
```
### What the Placeholder Target Does

The ChReK Dockerfile's `placeholder` stage automatically:

- ✅ Builds the `restore-entrypoint` binary
- ✅ Injects it into `/usr/local/bin/restore-entrypoint`
- ✅ Adds `smart-entrypoint.sh` to `/usr/local/bin/`
- ✅ Sets executable permissions
- ✅ Configures the entrypoint to detect and restore checkpoints
- ✅ Preserves your original application `CMD`
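As an optional sanity check, you can confirm that the injected files are present at the paths listed above (`RESTORE_IMAGE` comes from the build step):

```bash
# Both files should exist and be executable
docker run --rm --entrypoint ls "$RESTORE_IMAGE" \
  -l /usr/local/bin/restore-entrypoint /usr/local/bin/smart-entrypoint.sh
```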
### Alternative: Manual Multi-Stage Build

If you need more control, you can create your own Dockerfile:

```dockerfile
# Stage 1: Build restore-entrypoint
FROM golang:1.23-alpine AS restore-builder
WORKDIR /build
COPY deploy/chrek/cmd/restore-entrypoint ./cmd/restore-entrypoint
COPY deploy/chrek/pkg ./pkg
COPY deploy/chrek/go.mod deploy/chrek/go.sum ./
RUN go build -o /restore-entrypoint ./cmd/restore-entrypoint

# Stage 2: Your application image
FROM your-base-image:latest

# Copy restore-entrypoint
COPY --from=restore-builder /restore-entrypoint /usr/local/bin/restore-entrypoint

# Copy smart-entrypoint.sh
COPY deploy/chrek/scripts/smart-entrypoint.sh /usr/local/bin/smart-entrypoint.sh
RUN chmod +x /usr/local/bin/smart-entrypoint.sh /usr/local/bin/restore-entrypoint

# Set smart-entrypoint as the default entrypoint
ENTRYPOINT ["/usr/local/bin/smart-entrypoint.sh"]

# Your application command (becomes CMD, can be overridden)
CMD ["python", "your_app.py"]
```
💡 **Tip**: Using the `placeholder` target is the recommended approach, as it is maintained with the ChReK codebase and ensures compatibility.
## Step 3: Create Checkpoint Jobs
A checkpoint job loads your application, waits for the ChReK DaemonSet to checkpoint it, and then exits.
### Required Environment Variables

Your checkpoint job MUST set these environment variables:

| Variable | Description | Example |
|---|---|---|
| `DYN_CHECKPOINT_SIGNAL_FILE` | Path where the DaemonSet writes the completion signal | `/checkpoint-signal/my-checkpoint.done` |
| `DYN_CHECKPOINT_READY_FILE` | Path where your app signals it is ready | `/tmp/checkpoint-ready` |
| `DYN_CHECKPOINT_HASH` | Unique identifier for this checkpoint | `abc123def456` |
| `DYN_CHECKPOINT_LOCATION` | Directory where the checkpoint is stored | `/checkpoints/abc123def456` |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Storage backend type | `pvc` |
### Required Labels

Add this label to the pod template to enable DaemonSet checkpoint detection:

```yaml
labels:
  nvidia.com/checkpoint-source: "true"
```
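Once your workloads are labeled, a quick way to list the pods the DaemonSet will consider for checkpointing:

```bash
kubectl get pods -n my-app -l nvidia.com/checkpoint-source=true
```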
### Example Checkpoint Job

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: checkpoint-my-model
  namespace: my-app
spec:
  template:
    metadata:
      labels:
        nvidia.com/checkpoint-source: "true"  # Required for DaemonSet detection
    spec:
      restartPolicy: Never
      # Init container to clean up stale signal files
      initContainers:
        - name: cleanup-signal-file
          image: busybox:latest
          command:
            - sh
            - -c
            - |
              rm -f /checkpoint-signal/my-checkpoint.done || true
              echo "Signal file cleanup complete"
          volumeMounts:
            - name: checkpoint-signal
              mountPath: /checkpoint-signal
      containers:
        - name: main
          image: my-app:checkpoint-enabled
          # Security context required for CRIU
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]
          # Readiness probe: Pod becomes Ready when the model is loaded.
          # This is what triggers the DaemonSet to start checkpointing.
          readinessProbe:
            exec:
              command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
            initialDelaySeconds: 15
            periodSeconds: 2
          # Remove liveness/startup probes for checkpoint jobs:
          # model loading can take several minutes
          livenessProbe: null
          startupProbe: null
          # Checkpoint-related environment variables
          env:
            - name: DYN_CHECKPOINT_SIGNAL_FILE
              value: "/checkpoint-signal/my-checkpoint.done"
            - name: DYN_CHECKPOINT_READY_FILE
              value: "/tmp/checkpoint-ready"
            - name: DYN_CHECKPOINT_HASH
              value: "abc123def456"
            - name: DYN_CHECKPOINT_LOCATION
              value: "/checkpoints/abc123def456"
            - name: DYN_CHECKPOINT_STORAGE_TYPE
              value: "pvc"
          # GPU request
          resources:
            limits:
              nvidia.com/gpu: 1
          # Required volume mounts
          volumeMounts:
            - name: checkpoint-storage
              mountPath: /checkpoints
            - name: checkpoint-signal
              mountPath: /checkpoint-signal
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: checkpoint-storage
          persistentVolumeClaim:
            claimName: chrek-pvc
        - name: checkpoint-signal
          hostPath:
            path: /var/lib/chrek/signals
            type: DirectoryOrCreate
        - name: tmp
          emptyDir: {}
```
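To run this end to end, apply the manifest and watch the pod: it should become `Ready` once the model is loaded, get checkpointed by the DaemonSet, and then complete once the signal file appears. The manifest filename below is just an example:

```bash
kubectl apply -f checkpoint-my-model.yaml

# Watch the pod progress from Running/Ready to Completed
kubectl get pods -n my-app -l nvidia.com/checkpoint-source=true -w
```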
### Application Code Requirements
Your application must implement the checkpoint flow. Here’s the pattern used by Dynamo vLLM:
```python
import os
import time


def main():
    # 1. Check for checkpoint mode
    signal_file = os.environ.get("DYN_CHECKPOINT_SIGNAL_FILE")
    ready_file = os.environ.get("DYN_CHECKPOINT_READY_FILE")
    restore_marker = os.environ.get("DYN_RESTORE_MARKER_FILE", "/tmp/dynamo-restored")

    is_checkpoint_mode = signal_file is not None
    if is_checkpoint_mode:
        print("Checkpoint mode detected")

    # 2. Load your model/application
    model = load_model()

    # 3. Optional: Put the model to sleep to reduce memory footprint
    # model.sleep()

    # 4. Write the ready file (for application use, not the DaemonSet)
    if ready_file:
        with open(ready_file, "w") as f:
            f.write("ready")
        print(f"Wrote checkpoint ready file: {ready_file}")

    # Steps 5-6 only apply in checkpoint mode (signal_file is set)
    if is_checkpoint_mode:
        # 5. Log readiness messages (helps debugging)
        print("CHECKPOINT_READY: Model loaded, ready for container checkpoint")
        print(f"CHECKPOINT_READY: Waiting for signal file: {signal_file}")
        print(f"CHECKPOINT_READY: Or restore marker file: {restore_marker}")

        # 6. Wait for checkpoint completion OR restore detection
        while True:
            # Restored: marker file is created by the restore entrypoint
            if os.path.exists(restore_marker):
                print(f"Detected restore from checkpoint (marker: {restore_marker})")
                break  # Continue with normal application flow

            # Checkpoint complete: signal file is created by the DaemonSet
            if os.path.exists(signal_file):
                print(f"Checkpoint signal file detected: {signal_file}")
                print("Checkpoint complete, exiting")
                return  # Exit gracefully

            time.sleep(1)

    # Normal application flow (or post-restore flow)
    run_application()
```
**Important Notes**:

- **Ready file & readiness probe**: The checkpoint job must have a readiness probe that checks for the ready file:

  ```yaml
  readinessProbe:
    exec:
      command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
    initialDelaySeconds: 15
    periodSeconds: 2
  ```

  The ChReK DaemonSet triggers checkpointing when:

  - The pod has the `nvidia.com/checkpoint-source: "true"` label
  - The pod status is `Ready` (readiness probe passes = ready file exists)

- **Restore marker**: Created by `restore-entrypoint` before the CRIU restore; it lets the restored process detect that it was restored.

- **Two exit paths**:

  - Signal file found: checkpoint complete, exit gracefully
  - Restore marker found: the process was restored, continue running
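You can confirm which exit path a finished job took from its logs, since both paths print the messages shown in the pattern above:

```bash
kubectl logs -n my-app job/checkpoint-my-model | grep -E "CHECKPOINT_READY|Checkpoint|restore"
```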
## Step 4: Restore from Checkpoints

Restore pods automatically detect and restore from checkpoints if they exist.

### Example Restore Pod
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-restored
  namespace: my-app
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: my-app:checkpoint-enabled
      # Security context required for CRIU restore
      securityContext:
        privileged: true
        capabilities:
          add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]
      # Set checkpoint environment variables
      env:
        - name: DYN_CHECKPOINT_HASH
          value: "abc123def456"  # Must match checkpoint job
        - name: DYN_CHECKPOINT_PATH
          value: "/checkpoints"  # Base path (hash appended automatically)
        # Optional: Customize restore marker file path
        # - name: DYN_RESTORE_MARKER_FILE
        #   value: "/tmp/dynamo-restored"
      # GPU request
      resources:
        limits:
          nvidia.com/gpu: 1
      # Mount checkpoint storage (read-only is fine for restore)
      volumeMounts:
        - name: checkpoint-storage
          mountPath: /checkpoints
          readOnly: true
        - name: checkpoint-signal
          mountPath: /checkpoint-signal
  volumes:
    - name: checkpoint-storage
      persistentVolumeClaim:
        claimName: chrek-pvc
    - name: checkpoint-signal
      hostPath:
        path: /var/lib/chrek/signals
        type: DirectoryOrCreate
```
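To exercise the restore, apply the manifest and check the logs; a warm start should report the restore marker rather than a fresh model load (the filename is just an example):

```bash
kubectl apply -f my-app-restored.yaml

# A restored process logs the restore-marker detection from the wait loop
kubectl logs -n my-app my-app-restored | grep -i "restore"
```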
### How Restore Works

1. **Smart entrypoint detects checkpoint**: `smart-entrypoint.sh` checks whether a checkpoint exists at `/checkpoints/${DYN_CHECKPOINT_HASH}/`.
2. **Calls restore entrypoint**: If one is found, it calls `/usr/local/bin/restore-entrypoint`, which invokes CRIU.
3. **CRIU restores the process**: The entire process tree is restored from the checkpoint, including GPU state.
4. **Application continues**: Your application resumes exactly where it was checkpointed.
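For intuition, the decision in step 1 looks roughly like the sketch below. This is an illustrative sketch only, not the shipped script: the `checkpoint.done` marker comes from the restore flow diagram later in this guide, and the argument passed to `restore-entrypoint` is an assumption; see `deploy/chrek/scripts/smart-entrypoint.sh` for the authoritative logic.

```bash
#!/bin/sh
# Sketch of the smart-entrypoint decision (illustrative, not the real script)
CHECKPOINT_DIR="/checkpoints/${DYN_CHECKPOINT_HASH}"

if [ -n "${DYN_CHECKPOINT_HASH}" ] && [ -f "${CHECKPOINT_DIR}/checkpoint.done" ]; then
    # Warm start: hand off to the restore entrypoint, which invokes CRIU
    exec /usr/local/bin/restore-entrypoint "${CHECKPOINT_DIR}"
else
    # Cold start: run the image's original CMD
    exec "$@"
fi
```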
## Environment Variables Reference

### Checkpoint Jobs

| Variable | Required | Description |
|---|---|---|
| `DYN_CHECKPOINT_SIGNAL_FILE` | Yes | Full path to the signal file (e.g., `/checkpoint-signal/my-checkpoint.done`) |
| `DYN_CHECKPOINT_READY_FILE` | Yes | Full path where the app signals readiness (e.g., `/tmp/checkpoint-ready`) |
| `DYN_CHECKPOINT_HASH` | Yes | Unique checkpoint identifier (alphanumeric string) |
| `DYN_CHECKPOINT_LOCATION` | Yes | Directory where the checkpoint is stored (e.g., `/checkpoints/abc123def456`) |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Yes | Storage backend: `pvc` |
### Restore Pods

| Variable | Required | Description |
|---|---|---|
| `DYN_CHECKPOINT_HASH` | Yes | Checkpoint identifier (must match the checkpoint job) |
| `DYN_CHECKPOINT_PATH` | Yes | Base checkpoint directory (hash appended automatically) |
| `DYN_RESTORE_MARKER_FILE` | No | Path for the restore marker file (default: `/tmp/dynamo-restored`) |
### Optional CRIU Tuning (Advanced)

| Variable | Default | Description |
|---|---|---|
| | | CRIU operation timeout in seconds |
| | | CRIU log verbosity (0-4) |
| | | CRIU working directory |
| | | Path to CRIU CUDA plugin |
| | | Skip in-flight TCP connections |
| | | Enable auto-deduplication |
| | | Enable lazy page migration (experimental) |
| | | Wait for checkpoint to appear before starting |
| | | Max seconds to wait for checkpoint |
| | | Enable debug mode (sleeps 300s on error) |
## Checkpoint Flow Explained

### 1. Checkpoint Creation Flow

```text
┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with nvidia.com/checkpoint-source=true label  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Application loads model and creates ready file           │
│    /tmp/checkpoint-ready                                    │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Pod becomes Ready (kubelet readiness probe passes)       │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. ChReK DaemonSet detects:                                 │
│    - Pod is Ready                                           │
│    - Has checkpoint-source label                            │
│    - Ready file exists: /tmp/checkpoint-ready               │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. DaemonSet executes CRIU checkpoint via runc:             │
│    - Freezes container process                              │
│    - Dumps memory (CPU + GPU)                               │
│    - Saves to /checkpoints/${HASH}/                         │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. DaemonSet writes signal file:                            │
│    /checkpoint-signal/${HASH}.done                          │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 7. Application detects signal file and exits gracefully     │
└─────────────────────────────────────────────────────────────┘
```
### 2. Restore Flow

```text
┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with DYN_CHECKPOINT_HASH set                  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. smart-entrypoint.sh checks for checkpoint:               │
│    /checkpoints/${DYN_CHECKPOINT_HASH}/checkpoint.done      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ├─ Not Found ─────────────┐
                       │                         │
                       ▼                         ▼
┌───────────────────────┐             ┌──────────────────────┐
│ Checkpoint exists     │             │ Cold start           │
└──────────┬────────────┘             │ Run original CMD     │
           │                          └──────────────────────┘
           ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Call restore-entrypoint with checkpoint path             │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. restore-entrypoint extracts checkpoint and calls CRIU:   │
│    criu restore --images-dir /checkpoints/${HASH}/images    │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. CRIU restores process from checkpoint                    │
│    - Restores memory (CPU + GPU)                            │
│    - Restores file descriptors                              │
│    - Resumes process execution                              │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. Application continues from checkpointed state            │
│    (Model already loaded, GPU memory initialized)           │
└─────────────────────────────────────────────────────────────┘
```
## Troubleshooting

### Checkpoint Not Created

**Symptom**: Job runs but no checkpoint appears in `/checkpoints/`

**Checks**:

1. Verify the pod has the label:

   ```bash
   kubectl get pod <pod-name> -o jsonpath='{.metadata.labels.nvidia\.com/checkpoint-source}'
   ```

2. Check pod readiness:

   ```bash
   kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
   ```

3. Check the ready file was created:

   ```bash
   kubectl exec <pod-name> -- ls -la /tmp/checkpoint-ready
   ```

4. Check the DaemonSet logs:

   ```bash
   kubectl logs -n my-app daemonset/chrek-agent --all-containers
   ```
### Restore Fails

**Symptom**: Pod fails to restore from the checkpoint

**Checks**:

1. Verify the checkpoint files exist:

   ```bash
   kubectl exec <pod-name> -- ls -la /checkpoints/${DYN_CHECKPOINT_HASH}/
   ```

2. Check privileged mode is enabled:

   ```bash
   kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].securityContext.privileged}'
   ```

3. Check the CRIU logs in `/tmp/criu-restore.log`:

   ```bash
   kubectl exec <pod-name> -- cat /tmp/criu-restore.log
   ```

4. Ensure the checkpoint and restore pods have the same:

   - Container image
   - GPU count
   - Volume mounts
   - Environment variables (except `POD_NAME`, `POD_IP`, etc.)
### Permission Denied Errors

**Symptom**: `CRIU: Permission denied` or `Operation not permitted`

**Solution**: Ensure the pod has:

```yaml
securityContext:
  privileged: true
  capabilities:
    add:
      - SYS_ADMIN
      - SYS_PTRACE
      - SYS_CHROOT
```
### Signal File Not Appearing

**Symptom**: Application waits forever for the signal file

**Checks**:

1. Verify the hostPath mount is correct:

   ```bash
   kubectl get pod <pod-name> -o jsonpath='{.spec.volumes[?(@.name=="checkpoint-signal")]}'
   ```

2. Check the DaemonSet has access to the same path:

   ```bash
   kubectl get daemonset -n my-app chrek-agent -o jsonpath='{.spec.template.spec.volumes[?(@.name=="signal-dir")]}'
   ```

3. Verify the paths match exactly:

   - Pod: `/var/lib/chrek/signals`
   - DaemonSet: `/var/lib/chrek/signals`
## Getting Help

If you encounter issues:

1. Check the Troubleshooting section above
2. Review the DaemonSet logs: `kubectl logs -n <namespace> daemonset/chrek-agent`
3. Open an issue on GitHub