ChReK Standalone Usage Guide#

⚠️ Experimental Feature: ChReK is currently in beta/preview. It requires privileged mode for restore operations, which may not be suitable for all production environments. Review the security implications before deploying.

This guide explains how to use ChReK (Checkpoint/Restore for Kubernetes) as a standalone component without deploying the full Dynamo platform. This is useful if you want to add checkpoint/restore capabilities to your own GPU workloads.


Overview#

When using ChReK standalone, you are responsible for:

  1. Deploying the ChReK Helm chart (DaemonSet + PVC)

  2. Building checkpoint-enabled container images with the restore entrypoint

  3. Creating checkpoint jobs with the correct environment variables

  4. Creating restore pods that detect and use the checkpoints

The ChReK DaemonSet handles the actual CRIU checkpoint/restore operations automatically once your pods are configured correctly.


Prerequisites#

  • Kubernetes cluster with:

    • NVIDIA GPUs with checkpoint support

    • Privileged security context allowed (⚠️ required for CRIU - see Security Considerations)

    • PVC storage (ReadWriteMany recommended for multi-node)

  • Docker or compatible container runtime for building images

  • Access to the ChReK source code: deploy/chrek/

Security Considerations#

⚠️ Important: ChReK restore operations require privileged mode, which has significant security implications:

  • Privileged containers can access all host devices and bypass most security restrictions

  • This may violate security policies in production environments

  • A compromised privileged container can be used to compromise the node it runs on

Recommended for:

  • ✅ Development and testing environments

  • ✅ Research and experimentation

  • ✅ Controlled production environments with appropriate security controls

Not recommended for:

  • ❌ Multi-tenant clusters without proper isolation

  • ❌ Security-sensitive production workloads without risk assessment

  • ❌ Environments with strict security compliance requirements

Technical Limitations#

⚠️ Current Restrictions:

  • vLLM backend only: Currently only the vLLM backend supports checkpoint/restore. SGLang and TensorRT-LLM support is planned.

  • Single-node only: Checkpoints must be created and restored on the same node

  • Single-GPU only: Multi-GPU configurations are not yet supported

  • Network state: Active TCP connections are closed during restore

  • Storage: Only PVC backend currently implemented (S3/OCI planned)


Step 1: Deploy ChReK#

Install the Helm Chart#

# Clone the repository
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo

# Install ChReK in your namespace
helm install chrek ./deploy/helm/charts/chrek \
  --namespace my-app \
  --create-namespace \
  --set storage.pvc.size=100Gi \
  --set storage.pvc.storageClass=your-storage-class

Verify Installation#

# Check the DaemonSet is running
kubectl get daemonset -n my-app
# NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE
# chrek-agent   3         3         3       3            3

# Check the PVC is bound
kubectl get pvc -n my-app
# NAME        STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS
# chrek-pvc   Bound    pvc-xyz    100Gi      RWX            your-storage-class
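
The same checks can be scripted, for example as a gate in a deployment pipeline. The sketch below is one possible way to do that in Python. It assumes kubectl is on the PATH and that the resources are named chrek-agent and chrek-pvc as in the sample output above; the helper names are illustrative and are not part of ChReK.

```python
import json
import subprocess


def daemonset_ready(ds: dict) -> bool:
    """True when every pod the DaemonSet wants scheduled reports Ready."""
    status = ds.get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    return desired > 0 and status.get("numberReady", 0) == desired


def pvc_bound(pvc: dict) -> bool:
    """True when the PersistentVolumeClaim is in the Bound phase."""
    return pvc.get("status", {}).get("phase") == "Bound"


def kubectl_json(*args: str) -> dict:
    """Run a kubectl command with JSON output and parse the result."""
    result = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def install_ok(namespace: str = "my-app") -> bool:
    """Check the DaemonSet and PVC created by the Helm chart (names assumed)."""
    ds = kubectl_json("get", "daemonset", "chrek-agent", "-n", namespace)
    pvc = kubectl_json("get", "pvc", "chrek-pvc", "-n", namespace)
    return daemonset_ready(ds) and pvc_bound(pvc)
```

Calling install_ok("my-app") returns True only when both checks pass; the two pure helpers can also be applied to JSON captured elsewhere.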

Step 2: Build Checkpoint-Enabled Images#

ChReK provides a convenient placeholder target in its Dockerfile that automatically injects checkpoint/restore capabilities into your existing container images.

What the Placeholder Target Does#

The ChReK Dockerfile’s placeholder stage automatically:

  • ✅ Builds the restore-entrypoint binary

  • ✅ Injects it into /usr/local/bin/restore-entrypoint

  • ✅ Adds smart-entrypoint.sh to /usr/local/bin/

  • ✅ Sets executable permissions

  • ✅ Configures the entrypoint to detect and restore checkpoints

  • ✅ Preserves your original application CMD

Alternative: Manual Multi-Stage Build#

If you need more control, you can create your own Dockerfile:

# Stage 1: Build restore-entrypoint
FROM golang:1.23-alpine AS restore-builder
WORKDIR /build
COPY deploy/chrek/cmd/restore-entrypoint ./cmd/restore-entrypoint
COPY deploy/chrek/pkg ./pkg
COPY deploy/chrek/go.mod deploy/chrek/go.sum ./

RUN go build -o /restore-entrypoint ./cmd/restore-entrypoint

# Stage 2: Your application image
FROM your-base-image:latest

# Copy restore-entrypoint
COPY --from=restore-builder /restore-entrypoint /usr/local/bin/restore-entrypoint

# Copy smart-entrypoint.sh
COPY deploy/chrek/scripts/smart-entrypoint.sh /usr/local/bin/smart-entrypoint.sh
RUN chmod +x /usr/local/bin/smart-entrypoint.sh /usr/local/bin/restore-entrypoint

# Set smart-entrypoint as the default entrypoint
ENTRYPOINT ["/usr/local/bin/smart-entrypoint.sh"]

# Your application command (becomes CMD, can be overridden)
CMD ["python", "your_app.py"]

💡 Tip: Using the placeholder target is the recommended approach as it’s maintained with the ChReK codebase and ensures compatibility.


Step 3: Create Checkpoint Jobs#

A checkpoint job loads your application, waits for the ChReK DaemonSet to checkpoint it, and then exits.

Required Environment Variables#

Your checkpoint job MUST set these environment variables:

| Variable | Description | Example |
| --- | --- | --- |
| DYN_CHECKPOINT_SIGNAL_FILE | Path where DaemonSet writes completion signal | /checkpoint-signal/my-checkpoint.done |
| DYN_CHECKPOINT_READY_FILE | Path where your app signals it’s ready | /tmp/checkpoint-ready |
| DYN_CHECKPOINT_HASH | Unique identifier for this checkpoint | abc123def456 |
| DYN_CHECKPOINT_LOCATION | Directory where checkpoint is stored | /checkpoints/abc123def456 |
| DYN_CHECKPOINT_STORAGE_TYPE | Storage backend type | pvc |
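
A missing variable is easier to diagnose at startup than halfway through a checkpoint. The following is a minimal sketch of a startup check an application could run in checkpoint mode. The function name is hypothetical, and the rejection of non-pvc backends simply reflects the limitation stated earlier in this guide.

```python
import os

REQUIRED_CHECKPOINT_VARS = (
    "DYN_CHECKPOINT_SIGNAL_FILE",
    "DYN_CHECKPOINT_READY_FILE",
    "DYN_CHECKPOINT_HASH",
    "DYN_CHECKPOINT_LOCATION",
    "DYN_CHECKPOINT_STORAGE_TYPE",
)


def checkpoint_env_problems(env=None) -> list:
    """Return human-readable problems; an empty list means the env looks complete."""
    env = os.environ if env is None else env
    problems = [
        f"{name} is not set" for name in REQUIRED_CHECKPOINT_VARS if not env.get(name)
    ]
    # Only the PVC backend is implemented at the time of writing.
    storage = env.get("DYN_CHECKPOINT_STORAGE_TYPE")
    if storage and storage != "pvc":
        problems.append(
            f"DYN_CHECKPOINT_STORAGE_TYPE={storage!r} is not implemented yet"
        )
    return problems
```

An application could log the returned list and exit with a non-zero status when it is not empty.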

Required Labels#

Add this label to enable DaemonSet checkpoint detection:

labels:
  nvidia.com/checkpoint-source: "true"

Example Checkpoint Job#

apiVersion: batch/v1
kind: Job
metadata:
  name: checkpoint-my-model
  namespace: my-app
spec:
  template:
    metadata:
      labels:
        nvidia.com/checkpoint-source: "true"  # Required for DaemonSet detection
    spec:
      restartPolicy: Never

      # Init container to clean up stale signal files
      initContainers:
      - name: cleanup-signal-file
        image: busybox:latest
        command:
        - sh
        - -c
        - |
          rm -f /checkpoint-signal/my-checkpoint.done || true
          echo "Signal file cleanup complete"
        volumeMounts:
        - name: checkpoint-signal
          mountPath: /checkpoint-signal

      containers:
      - name: main
        image: my-app:checkpoint-enabled

        # Security context required for CRIU
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]

        # Readiness probe: Pod becomes Ready when model is loaded
        # This is what triggers the DaemonSet to start checkpointing
        readinessProbe:
          exec:
            command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
          initialDelaySeconds: 15
          periodSeconds: 2

        # Remove liveness/startup probes for checkpoint jobs
        # Model loading can take several minutes
        livenessProbe: null
        startupProbe: null

        # Checkpoint-related environment variables
        env:
        - name: DYN_CHECKPOINT_SIGNAL_FILE
          value: "/checkpoint-signal/my-checkpoint.done"
        - name: DYN_CHECKPOINT_READY_FILE
          value: "/tmp/checkpoint-ready"
        - name: DYN_CHECKPOINT_HASH
          value: "abc123def456"
        - name: DYN_CHECKPOINT_LOCATION
          value: "/checkpoints/abc123def456"
        - name: DYN_CHECKPOINT_STORAGE_TYPE
          value: "pvc"

        # GPU request
        resources:
          limits:
            nvidia.com/gpu: 1

        # Required volume mounts
        volumeMounts:
        - name: checkpoint-storage
          mountPath: /checkpoints
        - name: checkpoint-signal
          mountPath: /checkpoint-signal
        - name: tmp
          mountPath: /tmp

      volumes:
      - name: checkpoint-storage
        persistentVolumeClaim:
          claimName: chrek-pvc
      - name: checkpoint-signal
        hostPath:
          path: /var/lib/chrek/signals
          type: DirectoryOrCreate
      - name: tmp
        emptyDir: {}
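
In the manifest above the hash appears in two places (DYN_CHECKPOINT_HASH and DYN_CHECKPOINT_LOCATION) and must stay consistent. One way to avoid drift is to generate the env list from a single value. The sketch below does that in Python; deriving the identifier from a SHA-256 of the configuration is an assumption made for illustration, not a scheme ChReK prescribes, and both function names are hypothetical.

```python
import hashlib
import json


def checkpoint_hash(config: dict, length: int = 12) -> str:
    """Derive a stable identifier from the settings that define a checkpoint.

    Hypothetical scheme: SHA-256 over a canonical JSON encoding, truncated.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def checkpoint_env(ckpt_hash: str, signal_name: str = "my-checkpoint") -> list:
    """Build the container env list used by the example checkpoint Job."""
    values = {
        "DYN_CHECKPOINT_SIGNAL_FILE": f"/checkpoint-signal/{signal_name}.done",
        "DYN_CHECKPOINT_READY_FILE": "/tmp/checkpoint-ready",
        "DYN_CHECKPOINT_HASH": ckpt_hash,
        "DYN_CHECKPOINT_LOCATION": f"/checkpoints/{ckpt_hash}",
        "DYN_CHECKPOINT_STORAGE_TYPE": "pvc",
    }
    return [{"name": name, "value": value} for name, value in values.items()]


# Example: the config keys here are placeholders for whatever defines your checkpoint.
example_hash = checkpoint_hash({"image": "my-app:checkpoint-enabled", "model": "my-model"})
print(json.dumps(checkpoint_env(example_hash), indent=2))
```

The printed JSON is also valid YAML, so it can be pasted under the container's env key or fed to a templating step.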

Application Code Requirements#

Your application must implement the checkpoint flow. Here’s the pattern used by Dynamo vLLM:

import os
import time

def main():
    # 1. Check for checkpoint mode
    signal_file = os.environ.get("DYN_CHECKPOINT_SIGNAL_FILE")
    ready_file = os.environ.get("DYN_CHECKPOINT_READY_FILE")
    restore_marker = os.environ.get("DYN_RESTORE_MARKER_FILE", "/tmp/dynamo-restored")

    is_checkpoint_mode = signal_file is not None

    if is_checkpoint_mode:
        print("Checkpoint mode detected")

        # 2. Load your model/application
        model = load_model()

        # 3. Optional: Put model to sleep to reduce memory footprint
        # model.sleep()

        # 4. Write ready file (read by the readiness probe; the DaemonSet reacts to Pod readiness)
        if ready_file:
            with open(ready_file, "w") as f:
                f.write("ready")
            print(f"Wrote checkpoint ready file: {ready_file}")

        # 5. Log readiness messages (helps debugging)
        print("CHECKPOINT_READY: Model loaded, ready for container checkpoint")
        print(f"CHECKPOINT_READY: Waiting for signal file: {signal_file}")
        print(f"CHECKPOINT_READY: Or restore marker file: {restore_marker}")

        # 6. Wait for checkpoint completion OR restore detection
        while True:
            # Check if we've been restored (marker file created by restore entrypoint)
            if os.path.exists(restore_marker):
                print(f"Detected restore from checkpoint (marker: {restore_marker})")
                # Continue with normal application flow
                break

            # Check if checkpoint is complete (signal file created by DaemonSet)
            if os.path.exists(signal_file):
                print(f"Checkpoint signal file detected: {signal_file}")
                print("Checkpoint complete, exiting")
                return  # Exit gracefully

            time.sleep(1)

    # Normal application flow (or post-restore flow)
    run_application()
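
The wait loop in step 6 runs until one of the two files appears. If the DaemonSet never writes the signal file, the job hangs. A variant with a deadline is sketched below; the function name, the return values, and the default timeout are assumptions for illustration and are not taken from the Dynamo vLLM code.

```python
import os
import time


def wait_for_checkpoint_or_restore(signal_file, restore_marker,
                                   timeout_s=600.0, poll_s=1.0):
    """Poll until the process is restored, checkpointed, or the deadline passes.

    Returns "restored", "checkpointed", or "timeout".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Restore marker first: a restored process should continue running.
        if os.path.exists(restore_marker):
            return "restored"
        # Signal file: the checkpoint finished and this process may exit.
        if os.path.exists(signal_file):
            return "checkpointed"
        if time.monotonic() >= deadline:
            return "timeout"
        time.sleep(poll_s)
```

The restore marker is checked before the deadline so that a restored process normally sees the marker before it evaluates a deadline that was computed before the checkpoint was taken. On "timeout" the application would typically log the two paths and exit with a non-zero status so the Job is marked failed.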

Important Notes:

  1. Ready File & Readiness Probe: The checkpoint job must have a readiness probe that checks for the ready file:

    readinessProbe:
      exec:
        command: ["sh", "-c", "cat ${DYN_CHECKPOINT_READY_FILE}"]
      initialDelaySeconds: 15
      periodSeconds: 2
    

    The ChReK DaemonSet triggers checkpointing when:

    • Pod has nvidia.com/checkpoint-source: "true" label

    • Pod status is Ready (readiness probe passes = ready file exists)

  2. Restore Marker: Created by restore-entrypoint before the CRIU restore so that the restored process can detect that it was restored

  3. Two Exit Paths:

    • Signal file found: Checkpoint complete, exit gracefully

    • Restore marker found: Process was restored, continue running


Step 4: Restore from Checkpoints#

Restore pods automatically detect and restore from checkpoints if they exist.

Example Restore Pod#

apiVersion: v1
kind: Pod
metadata:
  name: my-app-restored
  namespace: my-app
spec:
  restartPolicy: Never

  containers:
  - name: main
    image: my-app:checkpoint-enabled

    # Security context required for CRIU restore
    securityContext:
      privileged: true
      capabilities:
        add: ["SYS_ADMIN", "SYS_PTRACE", "SYS_CHROOT"]

    # Set checkpoint environment variables
    env:
    - name: DYN_CHECKPOINT_HASH
      value: "abc123def456"  # Must match checkpoint job
    - name: DYN_CHECKPOINT_PATH
      value: "/checkpoints"  # Base path (hash appended automatically)

    # Optional: Customize restore marker file path
    # - name: DYN_RESTORE_MARKER_FILE
    #   value: "/tmp/dynamo-restored"

    # GPU request
    resources:
      limits:
        nvidia.com/gpu: 1

    # Mount checkpoint storage (READ-ONLY is fine for restore)
    volumeMounts:
    - name: checkpoint-storage
      mountPath: /checkpoints
      readOnly: true
    - name: checkpoint-signal
      mountPath: /checkpoint-signal

  volumes:
  - name: checkpoint-storage
    persistentVolumeClaim:
      claimName: chrek-pvc
  - name: checkpoint-signal
    hostPath:
      path: /var/lib/chrek/signals
      type: DirectoryOrCreate

How Restore Works#

  1. Smart Entrypoint Detects Checkpoint: The smart-entrypoint.sh checks if a checkpoint exists at /checkpoints/${DYN_CHECKPOINT_HASH}/

  2. Calls Restore Entrypoint: If found, calls /usr/local/bin/restore-entrypoint which invokes CRIU

  3. CRIU Restores Process: The entire process tree is restored from the checkpoint, including GPU state

  4. Application Continues: Your application resumes exactly where it was checkpointed
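
The decision made in steps 1 and 2 can be illustrated in a few lines of Python. This is only a sketch of the logic, not the contents of smart-entrypoint.sh: the function names are hypothetical, and the checkpoint.done marker name is taken from the restore flow diagram later in this guide.

```python
from pathlib import Path
from typing import Mapping, Optional


def checkpoint_dir(env: Mapping[str, str]) -> Optional[Path]:
    """Join the base path and the hash, or return None if either is missing."""
    base = env.get("DYN_CHECKPOINT_PATH")
    ckpt_hash = env.get("DYN_CHECKPOINT_HASH")
    if not base or not ckpt_hash:
        return None
    return Path(base) / ckpt_hash


def startup_mode(env: Mapping[str, str], marker: str = "checkpoint.done") -> str:
    """Return "restore" when a completed checkpoint exists, else "cold-start"."""
    ckpt = checkpoint_dir(env)
    if ckpt is not None and (ckpt / marker).is_file():
        return "restore"
    return "cold-start"
```

With the restore pod above, startup_mode(os.environ) would return "restore" only if /checkpoints/abc123def456/checkpoint.done exists; otherwise the original CMD runs as a cold start.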


Environment Variables Reference#

Checkpoint Jobs#

| Variable | Required | Description |
| --- | --- | --- |
| DYN_CHECKPOINT_SIGNAL_FILE | Yes | Full path to signal file (e.g., /checkpoint-signal/my-checkpoint.done) |
| DYN_CHECKPOINT_READY_FILE | Yes | Full path where app signals readiness (e.g., /tmp/checkpoint-ready) |
| DYN_CHECKPOINT_HASH | Yes | Unique checkpoint identifier (alphanumeric string) |
| DYN_CHECKPOINT_LOCATION | Yes | Directory where checkpoint is stored (e.g., /checkpoints/abc123) |
| DYN_CHECKPOINT_STORAGE_TYPE | Yes | Storage backend. Only pvc is currently implemented; s3 and oci are planned |

Restore Pods#

| Variable | Required | Description |
| --- | --- | --- |
| DYN_CHECKPOINT_HASH | Yes | Checkpoint identifier (must match checkpoint job) |
| DYN_CHECKPOINT_PATH | Yes | Base checkpoint directory (hash appended automatically) |
| DYN_RESTORE_MARKER_FILE | No | Path for restore marker file (default: /tmp/dynamo-restored) |

Optional CRIU Tuning (Advanced)#

| Variable | Default | Description |
| --- | --- | --- |
| CRIU_TIMEOUT | 0 (unlimited) | CRIU operation timeout in seconds |
| CRIU_LOG_LEVEL | 4 | CRIU log verbosity (0-4) |
| CRIU_WORK_DIR | /tmp | CRIU working directory |
| CUDA_PLUGIN_DIR | /usr/local/lib/criu | Path to CRIU CUDA plugin |
| CRIU_SKIP_IN_FLIGHT | false | Skip in-flight TCP connections |
| CRIU_AUTO_DEDUP | false | Enable auto-deduplication |
| CRIU_LAZY_PAGES | false | Enable lazy page migration (experimental) |
| WAIT_FOR_CHECKPOINT | false | Wait for checkpoint to appear before starting |
| RESTORE_WAIT_TIMEOUT | 300 | Max seconds to wait for checkpoint |
| DEBUG | false | Enable debug mode (sleeps 300s on error) |
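
For tooling that needs to reason about these settings (for example, a script that prints the effective configuration of a restore pod), the table can be mirrored in a small typed structure. The sketch below uses the defaults listed above. The class name is hypothetical, and the set of strings treated as true is an assumption; the restore-entrypoint binary may parse booleans differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}  # assumption: accepted spellings may differ


def _flag(env: Mapping[str, str], name: str) -> bool:
    """Interpret an environment variable as a boolean, defaulting to False."""
    return env.get(name, "false").strip().lower() in TRUTHY


@dataclass(frozen=True)
class CriuTuning:
    timeout_s: int
    log_level: int
    work_dir: str
    cuda_plugin_dir: str
    skip_in_flight: bool
    auto_dedup: bool
    lazy_pages: bool
    wait_for_checkpoint: bool
    restore_wait_timeout_s: int
    debug: bool

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "CriuTuning":
        """Read the tuning variables, falling back to the documented defaults."""
        env = os.environ if env is None else env
        log_level = int(env.get("CRIU_LOG_LEVEL", "4"))
        if not 0 <= log_level <= 4:
            raise ValueError(f"CRIU_LOG_LEVEL must be 0-4, got {log_level}")
        return cls(
            timeout_s=int(env.get("CRIU_TIMEOUT", "0")),
            log_level=log_level,
            work_dir=env.get("CRIU_WORK_DIR", "/tmp"),
            cuda_plugin_dir=env.get("CUDA_PLUGIN_DIR", "/usr/local/lib/criu"),
            skip_in_flight=_flag(env, "CRIU_SKIP_IN_FLIGHT"),
            auto_dedup=_flag(env, "CRIU_AUTO_DEDUP"),
            lazy_pages=_flag(env, "CRIU_LAZY_PAGES"),
            wait_for_checkpoint=_flag(env, "WAIT_FOR_CHECKPOINT"),
            restore_wait_timeout_s=int(env.get("RESTORE_WAIT_TIMEOUT", "300")),
            debug=_flag(env, "DEBUG"),
        )
```

CriuTuning.from_env() with no argument reads the current process environment; passing a dict makes the parsing easy to test.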


Checkpoint Flow Explained#

1. Checkpoint Creation Flow#

┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with nvidia.com/checkpoint-source=true label  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Application loads model and creates ready file           │
│    /tmp/checkpoint-ready                                     │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Pod becomes Ready (kubelet readiness probe passes)       │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. ChReK DaemonSet detects:                                 │
│    - Pod is Ready                                            │
│    - Has checkpoint-source label                             │
│    - Ready file exists: /tmp/checkpoint-ready               │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. DaemonSet executes CRIU checkpoint via runc:             │
│    - Freezes container process                               │
│    - Dumps memory (CPU + GPU)                                │
│    - Saves to /checkpoints/${HASH}/                          │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. DaemonSet writes signal file:                            │
│    /checkpoint-signal/${HASH}.done                           │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 7. Application detects signal file and exits gracefully     │
└─────────────────────────────────────────────────────────────┘
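
The trigger condition in steps 3 and 4 can be expressed as a predicate over the Pod object. The sketch below works on the JSON that kubectl get pod -o json returns; it illustrates the documented condition (label present and Pod Ready) and is not the DaemonSet's actual implementation. The function name is hypothetical.

```python
CHECKPOINT_LABEL = "nvidia.com/checkpoint-source"


def should_checkpoint(pod: dict) -> bool:
    """True when the Pod carries the checkpoint label and reports Ready."""
    labels = pod.get("metadata", {}).get("labels") or {}
    if labels.get(CHECKPOINT_LABEL) != "true":
        return False
    conditions = pod.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )
```

This is handy when debugging: if the predicate is False for a Pod you expected to be checkpointed, either the label is missing or the readiness probe has not passed yet.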

2. Restore Flow#

┌─────────────────────────────────────────────────────────────┐
│ 1. Pod starts with DYN_CHECKPOINT_HASH set                  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. smart-entrypoint.sh checks for checkpoint:               │
│    /checkpoints/${DYN_CHECKPOINT_HASH}/checkpoint.done      │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ├─ Not Found ─────────────────┐
                       │                              │
                       ▼                              ▼
           ┌───────────────────────┐    ┌──────────────────────┐
           │ Checkpoint exists     │    │ Cold start           │
           └──────────┬────────────┘    │ Run original CMD     │
                      │                 └──────────────────────┘
                      ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Call restore-entrypoint with checkpoint path             │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. restore-entrypoint extracts checkpoint and calls CRIU:   │
│    criu restore --images-dir /checkpoints/${HASH}/images    │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. CRIU restores process from checkpoint                    │
│    - Restores memory (CPU + GPU)                             │
│    - Restores file descriptors                               │
│    - Resumes process execution                               │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│ 6. Application continues from checkpointed state            │
│    (Model already loaded, GPU memory initialized)           │
└─────────────────────────────────────────────────────────────┘

Troubleshooting#

Checkpoint Not Created#

Symptom: Job runs but no checkpoint appears in /checkpoints/

Checks:

  1. Verify the pod has the label:

    kubectl get pod <pod-name> -n my-app -o jsonpath='{.metadata.labels.nvidia\.com/checkpoint-source}'
    
  2. Check pod readiness:

    kubectl get pod <pod-name> -n my-app -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
    
  3. Check ready file was created:

    kubectl exec <pod-name> -n my-app -- ls -la /tmp/checkpoint-ready
    
  4. Check DaemonSet logs:

    kubectl logs -n my-app daemonset/chrek-agent --all-containers
    

Restore Fails#

Symptom: Pod fails to restore from checkpoint

Checks:

  1. Verify checkpoint files exist:

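    # Quote the command so ${DYN_CHECKPOINT_HASH} is expanded inside the pod, not by your local shell
    kubectl exec <pod-name> -n my-app -- sh -c 'ls -la /checkpoints/${DYN_CHECKPOINT_HASH}/'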
    
  2. Check privileged mode is enabled:

    kubectl get pod <pod-name> -n my-app -o jsonpath='{.spec.containers[0].securityContext.privileged}'
    
  3. Check CRIU logs in /tmp/criu-restore.log:

    kubectl exec <pod-name> -n my-app -- cat /tmp/criu-restore.log
    
  4. Ensure the checkpoint job and the restore pod use the same:

    • Container image

    • GPU count

    • Volume mounts

    • Environment variables (except POD_NAME, POD_IP, etc.)
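
The comparison in check 4 can be automated by diffing the two container specs. The sketch below takes the container dictionaries from kubectl get ... -o json and reports the fields that differ. The function names are hypothetical, and the default ignore list contains only the two variables named above; in practice you will probably need to extend it, since the checkpoint-control variables in this guide's own examples differ between the job and the restore pod.

```python
from typing import Iterable


def container_summary(container: dict, ignored_env: Iterable[str]) -> dict:
    """Extract the fields that should match between checkpoint and restore."""
    ignored = set(ignored_env)
    limits = container.get("resources", {}).get("limits", {})
    return {
        "image": container.get("image"),
        # Normalize to str: the API returns quantities as strings, YAML may give ints.
        "gpus": str(limits.get("nvidia.com/gpu", "0")),
        "mounts": sorted(m["mountPath"] for m in container.get("volumeMounts", [])),
        "env": {
            e["name"]: e.get("value")
            for e in container.get("env", [])
            if e["name"] not in ignored
        },
    }


def spec_differences(checkpoint: dict, restore: dict,
                     ignored_env: Iterable[str] = ("POD_NAME", "POD_IP")) -> dict:
    """Map each differing field to a (checkpoint value, restore value) pair."""
    a = container_summary(checkpoint, ignored_env)
    b = container_summary(restore, ignored_env)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}
```

An empty result means the compared fields match; any key in the result points at a likely cause of the restore failure.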

Permission Denied Errors#

Symptom: CRIU: Permission denied or Operation not permitted

Solution: Ensure pod has:

securityContext:
  privileged: true
  capabilities:
    add:
    - SYS_ADMIN
    - SYS_PTRACE
    - SYS_CHROOT

Signal File Not Appearing#

Symptom: Application waits forever for signal file

Checks:

  1. Verify hostPath mount is correct:

    kubectl get pod <pod-name> -n my-app -o jsonpath='{.spec.volumes[?(@.name=="checkpoint-signal")]}'
    
  2. Check DaemonSet has access to the same path:

    kubectl get daemonset -n my-app chrek-agent -o jsonpath='{.spec.template.spec.volumes[?(@.name=="signal-dir")]}'
    
  3. Verify paths match exactly:

    • Pod: /var/lib/chrek/signals

    • DaemonSet: /var/lib/chrek/signals
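
The path comparison in check 3 can also be done on the JSON output of the two commands above. This is a small sketch with hypothetical function names; it assumes the volume names used in this guide (checkpoint-signal on the Pod, signal-dir on the DaemonSet).

```python
from typing import Optional


def host_path(volumes: list, name: str) -> Optional[str]:
    """Return the hostPath of the named volume, or None if it is absent."""
    for volume in volumes:
        if volume.get("name") == name:
            return volume.get("hostPath", {}).get("path")
    return None


def signal_paths_match(pod: dict, daemonset: dict) -> bool:
    """True when the Pod and the DaemonSet mount the same host directory."""
    pod_path = host_path(pod["spec"].get("volumes", []), "checkpoint-signal")
    ds_spec = daemonset["spec"]["template"]["spec"]
    ds_path = host_path(ds_spec.get("volumes", []), "signal-dir")
    return pod_path is not None and pod_path == ds_path
```

The comparison is an exact string match, in line with the requirement that the two paths match exactly.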



Getting Help#

If you encounter issues:

  1. Check the Troubleshooting section

  2. Review DaemonSet logs: kubectl logs -n <namespace> daemonset/chrek-agent

  3. Open an issue on GitHub