ChReK Standalone Usage Guide#

⚠️ Experimental Feature: ChReK is currently in beta/preview. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the security implications before deploying.

This guide explains how to use ChReK (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.

Overview#

When using ChReK standalone, you are responsible for:

Deploying the ChReK Helm chart (DaemonSet + PVC)
Building checkpoint-enabled container images with the restore entrypoint
Creating checkpoint jobs with the correct environment variables
Creating restore pods that detect and use the checkpoints

The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automatically once your pods are configured correctly.

Prerequisites#

- Kubernetes cluster with:
  - NVIDIA GPUs with checkpoint support
  - Privileged security context allowed (⚠️ required for CRIU - see Security Considerations)
  - PVC storage (ReadWriteMany recommended for multi-node)
- Docker or compatible container runtime for building images
- Access to the ChReK source code: deploy/chrek/

Security Considerations#

⚠️ Important: ChReK restore operations require privileged mode, which has significant security implications:

Privileged containers can access all host devices and bypass most security restrictions
This may violate security policies in production environments
Privileged containers, if compromised, can potentially compromise node security

Recommended for:

✅ Development and testing environments
✅ Research and experimentation
✅ Controlled production environments with appropriate security controls

Not recommended for:

❌ Multi-tenant clusters without proper isolation
❌ Security-sensitive production workloads without risk assessment
❌ Environments with strict security compliance requirements

Technical Limitations#

⚠️ Current Restrictions:

vLLM backend only: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.
Single-node only: Checkpoints must be created and restored on the same node
Single-GPU only: Multi-GPU configurations are not yet supported
Network state: Active TCP connections are closed during restore
Storage: Only PVC backend currently implemented (S3/OCI planned)

Step 1: Deploy ChReK#

Install the Helm Chart#

# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo

# Install ChReK in your namespace
helm install chrek ./deploy/helm/charts/chrek \
  --namespace my-app \
  --create-namespace \
  --set storage.pvc.size=100Gi \
  --set storage.pvc.storageClass=your-storage-class

Verify Installation#

# Check the DaemonSet is running
kubectl get daemonset -n my-app
# NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE
# chrek-agent   3         3         3       3            3

# Check the PVC is bound
kubectl get pvc -n my-app
# NAME        STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS
# chrek-pvc   Bound    pvc-xyz    100Gi      RWX            your-storage-class
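
The visual check above can also be scripted. The following is a minimal sketch, assuming you feed it the JSON produced by `kubectl get daemonset chrek-agent -n my-app -o json`; the function name `daemonset_is_ready` is hypothetical, while `desiredNumberScheduled` and `numberReady` are standard Kubernetes DaemonSet status fields.

```python
import json
from typing import Any, Mapping


def daemonset_is_ready(daemonset: Mapping[str, Any]) -> bool:
    """Return True when every scheduled DaemonSet pod reports Ready."""
    status = daemonset.get("status") or {}
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    # A DaemonSet with zero desired pods runs nowhere, so treat it as not ready.
    return desired > 0 and ready == desired


# Example with a trimmed-down status object (hypothetical sample data).
sample = json.loads('{"status": {"desiredNumberScheduled": 3, "numberReady": 3}}')
print(daemonset_is_ready(sample))  # prints True
```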

Step 2: Build Checkpoint-Enabled Images#

ChReK provides a convenient placeholder target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images.

Quick Start: Using the Placeholder Target (Recommended)#

cd deploy/chrek

# Define your images
export BASE_IMAGE="your-app:latest"           # Your existing application image
export RESTORE_IMAGE="your-app:checkpoint-enabled"  # Output checkpoint-enabled image

# Build using the placeholder target
docker build \
  --target placeholder \
  --build-arg BASE_IMAGE="$BASE_IMAGE" \
  -t "$RESTORE_IMAGE" \
  .

# Push to your registry
docker push "$RESTORE_IMAGE"

Example with a Dynamo vLLM image:

cd deploy/chrek

export DYNAMO_IMAGE="nvidia/dynamo-vllm:v1.2.0"
export RESTORE_IMAGE="nvidia/dynamo-vllm:v1.2.0-checkpoint"

docker build \
  --target placeholder \
  --build-arg BASE_IMAGE="$DYNAMO_IMAGE" \
  -t "$RESTORE_IMAGE" \
  .

What the Placeholder Target Does#

The ChReK Dockerfile’s placeholder stage automatically:

✅ Builds the restore-entrypoint binary
✅ Injects it into /usr/local/bin/restore-entrypoint
✅ Adds smart-entrypoint.sh to /usr/local/bin/
✅ Sets executable permissions
✅ Configures the entrypoint to detect and restore checkpoints
✅ Preserves your original application CMD

Alternative: Manual Multi-Stage Build#

If you need more control, you can create your own Dockerfile:

# Build context: the repository root (the COPY paths below are relative to it)
# Stage 1: Build restore-entrypoint
FROM golang:1.23-alpine AS restore-builder
WORKDIR /build
COPY deploy/chrek/cmd/restore-entrypoint ./cmd/restore-entrypoint
COPY deploy/chrek/pkg ./pkg
COPY deploy/chrek/go.mod deploy/chrek/go.sum ./

RUN go build -o /restore-entrypoint ./cmd/restore-entrypoint

# Stage 2: Your application image
FROM your-base-image:latest

# Copy restore-entrypoint
COPY --from=restore-builder /restore-entrypoint /usr/local/bin/restore-entrypoint

# Copy smart-entrypoint.sh
COPY deploy/chrek/scripts/smart-entrypoint.sh /usr/local/bin/smart-entrypoint.sh
RUN chmod +x /usr/local/bin/smart-entrypoint.sh /usr/local/bin/restore-entrypoint

# Set smart-entrypoint as the default entrypoint
ENTRYPOINT ["/usr/local/bin/smart-entrypoint.sh"]

# Your application command (becomes CMD, can be overridden)
CMD ["python", "your_app.py"]

💡 Tip: Using the placeholder target is the recommended approach as it’s maintained with the ChReK codebase and ensures compatibility.

Step 3: Create Checkpoint Jobs#

A checkpoint job loads your application, waits for the ChReK DaemonSet to checkpoint it, and then exits.

Required Environment Variables#

Your checkpoint job MUST set these environment variables:

| Variable | Description | Example |
| --- | --- | --- |
| `DYN_CHECKPOINT_SIGNAL_FILE` | Path where the DaemonSet writes the completion signal | `/checkpoint-signal/my-checkpoint.done` |
| `DYN_CHECKPOINT_READY_FILE` | Path where your app signals it’s ready | `/tmp/checkpoint-ready` |
| `DYN_CHECKPOINT_HASH` | Unique identifier for this checkpoint | `abc123def456` |
| `DYN_CHECKPOINT_LOCATION` | Directory where the checkpoint is stored | `/checkpoints/abc123def456` |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Storage backend type | `pvc` |
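
Because a missing variable can leave a checkpoint job waiting without any obvious error, it may help to fail fast at startup. This is a small sketch, not part of ChReK itself; the helper name `missing_checkpoint_vars` is hypothetical, and the variable names are the ones listed in the table above.

```python
import os
from typing import List, Mapping

REQUIRED_CHECKPOINT_VARS = (
    "DYN_CHECKPOINT_SIGNAL_FILE",
    "DYN_CHECKPOINT_READY_FILE",
    "DYN_CHECKPOINT_HASH",
    "DYN_CHECKPOINT_LOCATION",
    "DYN_CHECKPOINT_STORAGE_TYPE",
)


def missing_checkpoint_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required checkpoint variables that are unset or empty."""
    return [name for name in REQUIRED_CHECKPOINT_VARS if not env.get(name)]
```

A checkpoint job could call `missing_checkpoint_vars()` before loading the model and exit with a clear message if the returned list is not empty.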

Required Labels#

Add this label to enable DaemonSet checkpoint detection:

labels:
  nvidia.com/checkpoint-source: "true"

Example Checkpoint Job#

apiVersion: batch/v1
kind: Job
metadata:
  name: checkpoint-my-model
  namespace: my-app
spec:
  template:
    metadata:
      labels:
        nvidia.com/checkpoint-source: "true"  # Required for DaemonSet detection
    spec:
      restartPolicy: Never

      # Init container to clean up stale signal files
      initContainers:
      - name: cleanup-signal-file
        image: busybox:latest
        command:
        - sh
        - -c
        - |
          rm -f /checkpoint-signal/my-checkpoint.done || true
          echo "Signal file cleanup complete"
        volumeMounts:
        - name: checkpoint-signal
          mountPath: /checkpoint-signal

      containers:
      - name: main
        image: my-app:checkpoint-enabled

        # Security context required for CRIU
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]

        # Readiness probe: Pod becomes Ready when model is loaded
        # This is what triggers the DaemonSet to start checkpointing
        readinessProbe:
          exec:
            command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
          initialDelaySeconds: 15
          periodSeconds: 2

        # Remove liveness/startup probes for checkpoint jobs
        # Model loading can take several minutes
        livenessProbe: null
        startupProbe: null

        # Checkpoint-related environment variables
        env:
        - name: DYN_CHECKPOINT_SIGNAL_FILE
          value: "/checkpoint-signal/my-checkpoint.done"
        - name: DYN_CHECKPOINT_READY_FILE
          value: "/tmp/checkpoint-ready"
        - name: DYN_CHECKPOINT_HASH
          value: "abc123def456"
        - name: DYN_CHECKPOINT_LOCATION
          value: "/checkpoints/abc123def456"
        - name: DYN_CHECKPOINT_STORAGE_TYPE
          value: "pvc"

        # GPU request
        resources:
          limits:
            nvidia.com/gpu: 1

        # Required volume mounts
        volumeMounts:
        - name: checkpoint-storage
          mountPath: /checkpoints
        - name: checkpoint-signal
          mountPath: /checkpoint-signal
        - name: tmp
          mountPath: /tmp

      volumes:
      - name: checkpoint-storage
        persistentVolumeClaim:
          claimName: chrek-pvc
      - name: checkpoint-signal
        hostPath:
          path: /var/lib/chrek/signals
          type: DirectoryOrCreate
      - name: tmp
        emptyDir: {}

Application Code Requirements#

Your application must implement the checkpoint flow. Here’s the pattern used by Dynamo vLLM:

import os
import time

def main():
    # 1. Check for checkpoint mode
    signal_file = os.environ.get("DYN_CHECKPOINT_SIGNAL_FILE")
    ready_file = os.environ.get("DYN_CHECKPOINT_READY_FILE")
    restore_marker = os.environ.get("DYN_RESTORE_MARKER_FILE", "/tmp/dynamo-restored")

    is_checkpoint_mode = signal_file is not None

    if is_checkpoint_mode:
        print("Checkpoint mode detected")

        # 2. Load your model/application
        model = load_model()

        # 3. Optional: Put model to sleep to reduce memory footprint
        # model.sleep()

        # 4. Write ready file (read by the pod's readiness probe, which marks the pod Ready)
        if ready_file:
            with open(ready_file, "w") as f:
                f.write("ready")
            print(f"Wrote checkpoint ready file: {ready_file}")

        # 5. Log readiness messages (helps debugging)
        print("CHECKPOINT_READY: Model loaded, ready for container checkpoint")
        print(f"CHECKPOINT_READY: Waiting for signal file: {signal_file}")
        print(f"CHECKPOINT_READY: Or restore marker file: {restore_marker}")

        # 6. Wait for checkpoint completion OR restore detection
        while True:
            # Check if we've been restored (marker file created by restore entrypoint)
            if os.path.exists(restore_marker):
                print(f"Detected restore from checkpoint (marker: {restore_marker})")
                # Continue with normal application flow
                break

            # Check if checkpoint is complete (signal file created by DaemonSet)
            if os.path.exists(signal_file):
                print(f"Checkpoint signal file detected: {signal_file}")
                print("Checkpoint complete, exiting")
                return  # Exit gracefully

            time.sleep(1)

    # Normal application flow (or post-restore flow)
    run_application()

Important Notes:

- Ready File & Readiness Probe: The checkpoint job must define a readiness probe that checks for the ready file:
```
readinessProbe:
  exec:
    command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
  initialDelaySeconds: 15
  periodSeconds: 2
```
- The ChReK DaemonSet triggers checkpointing when both of these hold:
  - The pod has the nvidia.com/checkpoint-source: "true" label
  - The pod status is Ready (the readiness probe passes, which means the ready file exists)
- Restore Marker: restore-entrypoint creates this file before running the CRIU restore, so the restored process can detect that it was restored
- Two Exit Paths:
  - Signal file found: the checkpoint is complete, so exit gracefully
  - Restore marker found: the process was restored, so continue running
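
The two trigger conditions can be expressed as a predicate over a pod object. The sketch below is illustrative only and is not the DaemonSet's actual implementation; it assumes a pod dictionary shaped like the output of `kubectl get pod <pod-name> -o json`, and the function name `is_checkpoint_candidate` is hypothetical.

```python
from typing import Any, Mapping

CHECKPOINT_LABEL = "nvidia.com/checkpoint-source"


def is_checkpoint_candidate(pod: Mapping[str, Any]) -> bool:
    """Return True when the pod carries the label and reports Ready."""
    labels = (pod.get("metadata") or {}).get("labels") or {}
    if labels.get(CHECKPOINT_LABEL) != "true":
        return False
    conditions = (pod.get("status") or {}).get("conditions") or []
    # Kubernetes reports readiness as a condition with type "Ready" and status "True".
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )
```

Running this against a stuck pod's JSON is a quick way to see which of the two conditions is not met.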

Step 4: Restore from Checkpoints#

Restore pods automatically detect and restore from checkpoints if they exist.

Example Restore Pod#

apiVersion: v1
kind: Pod
metadata:
  name: my-app-restored
  namespace: my-app
spec:
  restartPolicy: Never

  containers:
  - name: main
    image: my-app:checkpoint-enabled

    # Security context required for CRIU restore
    securityContext:
      privileged: true
      capabilities:
        add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]

    # Set checkpoint environment variables
    env:
    - name: DYN_CHECKPOINT_HASH
      value: "abc123def456"  # Must match checkpoint job
    - name: DYN_CHECKPOINT_PATH
      value: "/checkpoints"  # Base path (hash appended automatically)

    # Optional: Customize restore marker file path
    # - name: DYN_RESTORE_MARKER_FILE
    #   value: "/tmp/dynamo-restored"

    # GPU request
    resources:
      limits:
        nvidia.com/gpu: 1

    # Mount checkpoint storage (READ-ONLY is fine for restore)
    volumeMounts:
    - name: checkpoint-storage
      mountPath: /checkpoints
      readOnly: true
    - name: checkpoint-signal
      mountPath: /checkpoint-signal

  volumes:
  - name: checkpoint-storage
    persistentVolumeClaim:
      claimName: chrek-pvc
  - name: checkpoint-signal
    hostPath:
      path: /var/lib/chrek/signals
      type: DirectoryOrCreate

How Restore Works#

1. Smart Entrypoint Detects Checkpoint: The smart-entrypoint.sh checks if a checkpoint exists at /checkpoints/${DYN_CHECKPOINT_HASH}/
2. Calls Restore Entrypoint: If found, calls /usr/local/bin/restore-entrypoint which invokes CRIU
3. CRIU Restores Process: The entire process tree is restored from the checkpoint, including GPU state
4. Application Continues: Your application resumes exactly where it was checkpointed
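
The decision made by the entrypoint can be sketched in Python for readers who want to reason about it or test it locally. This is not the contents of smart-entrypoint.sh; it is an approximation that assumes the `checkpoint.done` marker shown in the restore flow diagram of this guide, and the argument form passed to restore-entrypoint is an assumption. The names `find_checkpoint` and `startup_command` are hypothetical.

```python
import os
from typing import List, Mapping, Optional

RESTORE_ENTRYPOINT = "/usr/local/bin/restore-entrypoint"


def find_checkpoint(env: Mapping[str, str]) -> Optional[str]:
    """Return the checkpoint directory if it looks complete, else None."""
    base = env.get("DYN_CHECKPOINT_PATH")
    checkpoint_hash = env.get("DYN_CHECKPOINT_HASH")
    if not base or not checkpoint_hash:
        return None
    directory = os.path.join(base, checkpoint_hash)
    # Assumption: a checkpoint.done file marks a finished checkpoint.
    marker = os.path.join(directory, "checkpoint.done")
    return directory if os.path.isfile(marker) else None


def startup_command(env: Mapping[str, str], original_cmd: List[str]) -> List[str]:
    """Pick the restore path when a checkpoint exists, otherwise cold start."""
    directory = find_checkpoint(env)
    if directory is None:
        return list(original_cmd)  # cold start: run the original CMD
    # Assumption: the real restore-entrypoint arguments may differ.
    return [RESTORE_ENTRYPOINT, directory]
```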

Environment Variables Reference#

Checkpoint Jobs#

| Variable | Required | Description |
| --- | --- | --- |
| `DYN_CHECKPOINT_SIGNAL_FILE` | Yes | Full path to signal file (e.g., `/checkpoint-signal/my-checkpoint.done`) |
| `DYN_CHECKPOINT_READY_FILE` | Yes | Full path where app signals readiness (e.g., `/tmp/checkpoint-ready`) |
| `DYN_CHECKPOINT_HASH` | Yes | Unique checkpoint identifier (alphanumeric string) |
| `DYN_CHECKPOINT_LOCATION` | Yes | Directory where checkpoint is stored (e.g., `/checkpoints/abc123`) |
| `DYN_CHECKPOINT_STORAGE_TYPE` | Yes | Storage backend. Only `pvc` is currently implemented; `s3` and `oci` are planned |

Restore Pods#

| Variable | Required | Description |
| --- | --- | --- |
| `DYN_CHECKPOINT_HASH` | Yes | Checkpoint identifier (must match checkpoint job) |
| `DYN_CHECKPOINT_PATH` | Yes | Base checkpoint directory (hash appended automatically) |
| `DYN_RESTORE_MARKER_FILE` | No | Path for restore marker file (default: `/tmp/dynamo-restored`) |
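
Since the restore pod derives its checkpoint directory from `DYN_CHECKPOINT_PATH` plus the hash, while the checkpoint job states the full directory in `DYN_CHECKPOINT_LOCATION`, a mismatch between the two is easy to introduce. The sketch below checks that they agree. It assumes both pods mount the PVC at the same path, as in the examples in this guide; the function names are hypothetical.

```python
import posixpath
from typing import Mapping


def expected_checkpoint_location(base_path: str, checkpoint_hash: str) -> str:
    """Join the base path and hash the way the restore side is described to."""
    return posixpath.join(base_path, checkpoint_hash)


def locations_match(
    checkpoint_env: Mapping[str, str], restore_env: Mapping[str, str]
) -> bool:
    """Return True when the job and the restore pod point at the same directory."""
    job_hash = checkpoint_env.get("DYN_CHECKPOINT_HASH")
    restore_hash = restore_env.get("DYN_CHECKPOINT_HASH")
    base = restore_env.get("DYN_CHECKPOINT_PATH")
    location = checkpoint_env.get("DYN_CHECKPOINT_LOCATION")
    if not (job_hash and restore_hash and base and location):
        return False
    if job_hash != restore_hash:
        return False
    expected = expected_checkpoint_location(base, restore_hash)
    # normpath removes trailing slashes so "/checkpoints/" and "/checkpoints" agree.
    return posixpath.normpath(location) == posixpath.normpath(expected)
```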

Optional CRIU Tuning (Advanced)#

| Variable | Default | Description |
| --- | --- | --- |
| `CRIU_TIMEOUT` | `0` (unlimited) | CRIU operation timeout in seconds |
| `CRIU_LOG_LEVEL` | `4` | CRIU log verbosity (0-4) |
| `CRIU_WORK_DIR` | `/tmp` | CRIU working directory |
| `CUDA_PLUGIN_DIR` | `/usr/local/lib/criu` | Path to CRIU CUDA plugin |
| `CRIU_SKIP_IN_FLIGHT` | `false` | Skip in-flight TCP connections |
| `CRIU_AUTO_DEDUP` | `false` | Enable auto-deduplication |
| `CRIU_LAZY_PAGES` | `false` | Enable lazy page migration (experimental) |
| `WAIT_FOR_CHECKPOINT` | `false` | Wait for checkpoint to appear before starting |
| `RESTORE_WAIT_TIMEOUT` | `300` | Max seconds to wait for checkpoint |
| `DEBUG` | `false` | Enable debug mode (sleeps 300s on error) |
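
To see at a glance which tuning values a pod will effectively run with, the defaults from the table can be mirrored in a small configuration object. This is a sketch for tooling or tests, not how restore-entrypoint reads its settings; in particular, the accepted boolean spellings (`1`, `true`, `yes`) are an assumption, and the `CriuTuning` class is hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _flag(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Parse a boolean variable; the accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")


@dataclass(frozen=True)
class CriuTuning:
    timeout_s: int = 0  # 0 means unlimited
    log_level: int = 4
    work_dir: str = "/tmp"
    cuda_plugin_dir: str = "/usr/local/lib/criu"
    skip_in_flight: bool = False
    auto_dedup: bool = False
    lazy_pages: bool = False
    wait_for_checkpoint: bool = False
    restore_wait_timeout_s: int = 300
    debug: bool = False

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "CriuTuning":
        """Build the settings from environment variables, using table defaults."""
        return cls(
            timeout_s=int(env.get("CRIU_TIMEOUT", "0")),
            log_level=int(env.get("CRIU_LOG_LEVEL", "4")),
            work_dir=env.get("CRIU_WORK_DIR", "/tmp"),
            cuda_plugin_dir=env.get("CUDA_PLUGIN_DIR", "/usr/local/lib/criu"),
            skip_in_flight=_flag(env, "CRIU_SKIP_IN_FLIGHT"),
            auto_dedup=_flag(env, "CRIU_AUTO_DEDUP"),
            lazy_pages=_flag(env, "CRIU_LAZY_PAGES"),
            wait_for_checkpoint=_flag(env, "WAIT_FOR_CHECKPOINT"),
            restore_wait_timeout_s=int(env.get("RESTORE_WAIT_TIMEOUT", "300")),
            debug=_flag(env, "DEBUG"),
        )
```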

Checkpoint Flow Explained#

1. Checkpoint Creation Flow#

┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with nvidia.com/checkpoint-source=true label  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Application loads model and creates ready file           │
│    /tmp/checkpoint-ready                                     │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Pod becomes Ready (kubelet readiness probe passes)       │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. ChReK DaemonSet detects:                                 │
│    - Pod is Ready                                            │
│    - Has checkpoint-source label                             │
│    - Ready file exists: /tmp/checkpoint-ready               │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. DaemonSet executes CRIU checkpoint via runc:             │
│    - Freezes container process                               │
│    - Dumps memory (CPU + GPU)                                │
│    - Saves to /checkpoints/${HASH}/                          │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. DaemonSet writes signal file:                            │
│    /checkpoint-signal/${HASH}.done                           │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 7. Application detects signal file and exits gracefully     │
└─────────────────────────────────────────────────────────────┘

2. Restore Flow#

┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with DYN_CHECKPOINT_HASH set                  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. smart-entrypoint.sh checks for checkpoint:               │
│    /checkpoints/${DYN_CHECKPOINT_HASH}/checkpoint.done      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ├─ Not Found ─────────────────┐
                       │                              │
                       ▼                              ▼
           ┌───────────────────────┐    ┌──────────────────────┐
           │ Checkpoint exists     │    │ Cold start           │
           └──────────┬────────────┘    │ Run original CMD     │
                      │                 └──────────────────────┘
                      ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Call restore-entrypoint with checkpoint path             │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. restore-entrypoint extracts checkpoint and calls CRIU:   │
│    criu restore --images-dir /checkpoints/${HASH}/images    │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. CRIU restores process from checkpoint                    │
│    - Restores memory (CPU + GPU)                             │
│    - Restores file descriptors                               │
│    - Resumes process execution                               │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. Application continues from checkpointed state            │
│    (Model already loaded, GPU memory initialized)           │
└─────────────────────────────────────────────────────────────┘

Troubleshooting#

Checkpoint Not Created#

Symptom: Job runs but no checkpoint appears in /checkpoints/

Checks:

Verify the pod has the label:

kubectl get pod <pod-name> -o jsonpath='{.metadata.labels.nvidia\.com/checkpoint-source}'

Check pod readiness:

kubectl get pod <pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

Check ready file was created:

kubectl exec <pod-name> -- ls -la /tmp/checkpoint-ready

Check DaemonSet logs:

kubectl logs -n my-app daemonset/chrek-agent --all-containers

Restore Fails#

Symptom: Pod fails to restore from checkpoint

Checks:

Verify checkpoint files exist:

kubectl exec <pod-name> -- ls -la /checkpoints/${DYN_CHECKPOINT_HASH}/

Check privileged mode is enabled:

kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].securityContext.privileged}'

Check CRIU logs in /tmp/criu-restore.log:

kubectl exec <pod-name> -- cat /tmp/criu-restore.log

Ensure the checkpoint job and the restore pod use the same:
- Container image
- GPU count
- Volume mounts
- Environment variables (except POD_NAME, POD_IP, etc.)
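
This comparison can be automated against the two container specs. The sketch below is one possible check, assuming container dictionaries taken from `.spec.containers[0]` of each pod's JSON; the function `spec_differences` is hypothetical. The ignored set starts with `POD_NAME` and `POD_IP` from the list above and will likely need extending for your workload, for example with checkpoint-only variables.

```python
from typing import Any, Dict, FrozenSet, List, Mapping, Set

DEFAULT_IGNORED_ENV: FrozenSet[str] = frozenset({"POD_NAME", "POD_IP"})


def _env_map(container: Mapping[str, Any], ignored: FrozenSet[str]) -> Dict[str, Any]:
    """Map env names to values, skipping the ignored names."""
    entries = container.get("env") or []
    return {e["name"]: e.get("value") for e in entries if e["name"] not in ignored}


def _gpu_limit(container: Mapping[str, Any]) -> str:
    limits = (container.get("resources") or {}).get("limits") or {}
    return str(limits.get("nvidia.com/gpu", "0"))


def _mount_paths(container: Mapping[str, Any]) -> Set[str]:
    return {m["mountPath"] for m in container.get("volumeMounts") or []}


def spec_differences(
    checkpoint: Mapping[str, Any],
    restore: Mapping[str, Any],
    ignored_env: FrozenSet[str] = DEFAULT_IGNORED_ENV,
) -> List[str]:
    """Return the names of the categories that differ between the two containers."""
    diffs: List[str] = []
    if checkpoint.get("image") != restore.get("image"):
        diffs.append("image")
    if _gpu_limit(checkpoint) != _gpu_limit(restore):
        diffs.append("gpu count")
    if _mount_paths(checkpoint) != _mount_paths(restore):
        diffs.append("volume mounts")
    if _env_map(checkpoint, ignored_env) != _env_map(restore, ignored_env):
        diffs.append("environment variables")
    return diffs
```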

Permission Denied Errors#

Symptom: CRIU: Permission denied or Operation not permitted

Solution: Ensure pod has:

securityContext:
  privileged: true
  capabilities:
    add:
    - SYS_ADMIN
    - SYS_PTRACE
    - SYS_CHROOT

Signal File Not Appearing#

Symptom: Application waits forever for signal file

Checks:

Verify hostPath mount is correct:

kubectl get pod <pod-name> -o jsonpath='{.spec.volumes[?(@.name=="checkpoint-signal")]}'

Check DaemonSet has access to the same path:

kubectl get daemonset -n my-app chrek-agent -o jsonpath='{.spec.template.spec.volumes[?(@.name=="signal-dir")]}'

Verify paths match exactly:
- Pod: /var/lib/chrek/signals
- DaemonSet: /var/lib/chrek/signals
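
While diagnosing this problem, it can be useful to bound the wait so the job fails visibly instead of hanging. The following is a sketch of the wait loop from the application example with a deadline added; the timeout is an addition for debugging and not part of the documented flow, and the function name `wait_for_checkpoint_outcome` is hypothetical.

```python
import os
import time


def wait_for_checkpoint_outcome(
    signal_file: str,
    restore_marker: str,
    timeout_s: float,
    poll_s: float = 1.0,
) -> str:
    """Return "restored", "checkpointed", or "timeout"."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Check the restore marker first, so a restored process never
        # mistakes a leftover signal file or a stale deadline for its state.
        if os.path.exists(restore_marker):
            return "restored"
        if os.path.exists(signal_file):
            return "checkpointed"
        if time.monotonic() >= deadline:
            return "timeout"
        time.sleep(poll_s)
```

On `"timeout"`, the job can log the signal file path it was waiting for and exit non-zero, which makes a hostPath mismatch much easier to spot in the pod logs.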

Additional Resources#

Getting Help#

If you encounter issues:

Check the Troubleshooting section
Review DaemonSet logs: kubectl logs -n <namespace> daemonset/chrek-agent
Open an issue on GitHub