Fault Tolerance Testing#

This document describes the test infrastructure for validating Dynamo’s fault tolerance mechanisms. The framework covers request cancellation, request migration, etcd high availability (HA), hardware fault injection, and end-to-end deployment scenarios.

Overview#

Dynamo’s fault tolerance test suite is located in tests/fault_tolerance/ and includes:

| Test Category | Location | Purpose |
| --- | --- | --- |
| Cancellation | cancellation/ | Request cancellation during in-flight operations |
| Migration | migration/ | Request migration when workers fail |
| etcd HA | etcd_ha/ | etcd failover and recovery |
| Hardware | hardware/ | GPU and network fault injection |
| Deployment | deploy/ | End-to-end deployment testing |

Test Directory Structure#

tests/fault_tolerance/
├── cancellation/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── migration/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── etcd_ha/
│   ├── test_vllm.py
│   ├── test_trtllm.py
│   ├── test_sglang.py
│   └── utils.py
├── hardware/
│   └── fault_injection_service/
│       ├── api_service/
│       └── agents/
├── deploy/
│   ├── test_deployment.py
│   ├── scenarios.py
│   ├── base_checker.py
│   └── ...
└── client.py

Request Cancellation Tests#

Test that in-flight requests can be properly canceled.

Running Cancellation Tests#

# Run all cancellation tests
pytest tests/fault_tolerance/cancellation/ -v

# Run for specific backend
pytest tests/fault_tolerance/cancellation/test_vllm.py -v

Cancellation Test Utilities#

The cancellation/utils.py module provides:

CancellableRequest#

Thread-safe request cancellation via TCP socket manipulation:

import time
from threading import Thread

from tests.fault_tolerance.cancellation.utils import CancellableRequest

request = CancellableRequest()

# Send the request in a separate thread (send_request here stands for a
# helper that issues the HTTP call using this CancellableRequest)
thread = Thread(target=send_request, args=(request,))
thread.start()

# Cancel after some time
time.sleep(1)
request.cancel()  # Closes the underlying socket
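
The core idea behind socket-close cancellation can be sketched self-contained with a plain socket pair standing in for the HTTP connection. The names below are illustrative, not the utility's actual API:

```python
import socket
import threading

class SocketCancellable:
    """Illustrative sketch: cancel an in-flight request by shutting down
    the underlying TCP socket, which unblocks any pending recv()."""

    def __init__(self, sock):
        self._sock = sock
        self.cancelled = False

    def cancel(self):
        self.cancelled = True
        # shutdown() wakes a thread blocked in recv() with an EOF (b"").
        self._sock.shutdown(socket.SHUT_RDWR)

# A local socket pair stands in for a live HTTP connection.
client_side, server_side = socket.socketpair()
req = SocketCancellable(client_side)

result = {}

def worker():
    # Blocks like a streaming response read until cancel() fires.
    result["data"] = client_side.recv(4096)

t = threading.Thread(target=worker)
t.start()
req.cancel()
t.join(timeout=5)
print(result["data"])  # b'' -> connection torn down, request aborted
```

Shutting down (rather than merely closing) the socket is what reliably unblocks a thread sitting in a blocking read.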

send_completion_request / send_chat_completion_request#

Send cancellable completion requests:

from tests.fault_tolerance.cancellation.utils import (
    CancellableRequest,
    send_completion_request,
    send_chat_completion_request,
)

# Non-streaming
response = send_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    prompt="Hello, world!",
    max_tokens=100
)

# Streaming with cancellation
request = CancellableRequest()
responses = send_chat_completion_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    cancellable_request=request
)

poll_for_pattern#

Wait for specific patterns in logs:

from tests.fault_tolerance.cancellation.utils import poll_for_pattern

# Wait for cancellation confirmation
found = poll_for_pattern(
    log_file="/var/log/dynamo/worker.log",
    pattern="Request cancelled",
    timeout=30,
    interval=0.5
)
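
The polling loop itself is simple; a minimal re-implementation of the idea (illustrative only, not the real utility) looks like this:

```python
import re
import time
from pathlib import Path

def poll_for_pattern_sketch(log_file, pattern, timeout=30.0, interval=0.5):
    """Re-read the log file until the regex matches or the timeout expires."""
    deadline = time.monotonic() + timeout
    regex = re.compile(pattern)
    path = Path(log_file)
    while time.monotonic() < deadline:
        # Tolerate a log file that does not exist yet or is mid-write.
        if path.exists() and regex.search(path.read_text(errors="ignore")):
            return True
        time.sleep(interval)
    return False
```

The deadline is computed with `time.monotonic()` so wall-clock adjustments during a long test run cannot skew the timeout.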

Migration Tests#

Test that requests migrate to healthy workers when failures occur.

Running Migration Tests#

# Run all migration tests
pytest tests/fault_tolerance/migration/ -v

# Run for specific backend
pytest tests/fault_tolerance/migration/test_vllm.py -v

Migration Test Utilities#

The migration/utils.py module provides:

  • Frontend wrapper with configurable request planes

  • Long-running request spawning for migration scenarios

  • Health check disabling for controlled testing

Example Migration Test#

def test_migration_on_worker_failure():
    # Helper names (start_deployment, spawn_long_request, kill_worker) are
    # illustrative; see migration/utils.py for the actual utilities.

    # Start a deployment with 2 workers
    deployment = start_deployment(workers=2)

    # Send a long-running request in a background thread
    request_thread = spawn_long_request(max_tokens=1000)

    # Kill one worker mid-generation
    kill_worker(deployment.workers[0])

    # Verify the request completes on the remaining worker
    response = request_thread.join()
    assert response.status_code == 200
    assert len(response.tokens) > 0

etcd HA Tests#

Test system behavior during etcd failures and recovery.

Running etcd HA Tests#

pytest tests/fault_tolerance/etcd_ha/ -v

Test Scenarios#

  • Leader failover: etcd leader node fails, cluster elects new leader

  • Network partition: etcd node becomes unreachable

  • Recovery: System recovers after etcd becomes available

Hardware Fault Injection#

The fault injection service enables testing under simulated hardware failures.

Fault Injection Service#

Located at tests/fault_tolerance/hardware/fault_injection_service/, this FastAPI service orchestrates fault injection:

# Start the fault injection service
cd tests/fault_tolerance/hardware/fault_injection_service
python -m api_service.main

Supported Fault Types#

GPU Faults#

| Fault Type | Description |
| --- | --- |
| XID_ERROR | Simulate GPU XID error (various codes) |
| THROTTLE | GPU thermal throttling |
| MEMORY_PRESSURE | GPU memory exhaustion |
| OVERHEAT | GPU overheating condition |
| COMPUTE_OVERLOAD | GPU compute saturation |

Network Faults#

| Fault Type | Description |
| --- | --- |
| FRONTEND_WORKER | Partition between frontend and workers |
| WORKER_NATS | Partition between workers and NATS |
| WORKER_WORKER | Partition between workers |
| CUSTOM | Custom network partition |

Fault Injection API#

Inject GPU Fault#

curl -X POST http://localhost:8080/api/v1/faults/gpu/inject \
  -H "Content-Type: application/json" \
  -d '{
    "target_pod": "vllm-worker-0",
    "fault_type": "XID_ERROR",
    "severity": "HIGH"
  }'

Inject Specific XID Error#

# Inject XID 79 (GPU memory page fault)
curl -X POST http://localhost:8080/api/v1/faults/gpu/inject/xid-79 \
  -H "Content-Type: application/json" \
  -d '{"target_pod": "vllm-worker-0"}'

Supported XID codes: 43, 48, 74, 79, 94, 95, 119, 120

Inject Network Partition#

curl -X POST http://localhost:8080/api/v1/faults/network/inject \
  -H "Content-Type: application/json" \
  -d '{
    "partition_type": "FRONTEND_WORKER",
    "duration_seconds": 30
  }'

Recover from Fault#

curl -X POST http://localhost:8080/api/v1/faults/{fault_id}/recover

List Active Faults#

curl http://localhost:8080/api/v1/faults
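
The same endpoints can be driven programmatically for cleanup. A minimal sketch, assuming the list endpoint returns a JSON array of fault objects with an id field as the curl examples suggest (the exact response shape may differ):

```python
import json
import urllib.request

def recover_all_faults(base_url="http://localhost:8080"):
    """List active faults, then POST a recover request for each one.

    Assumes the list endpoint returns a JSON array of objects with an
    "id" field; adjust if the service's response shape differs."""
    with urllib.request.urlopen(f"{base_url}/api/v1/faults") as resp:
        faults = json.load(resp)
    recovered = []
    for fault in faults:
        req = urllib.request.Request(
            f"{base_url}/api/v1/faults/{fault['id']}/recover", method="POST"
        )
        with urllib.request.urlopen(req):
            pass
        recovered.append(fault["id"])
    return recovered
```

This is handy in pytest teardown fixtures, where leftover faults would otherwise bleed into the next test.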

GPU Fault Injector Agent#

The GPU fault injector runs as a DaemonSet on worker nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-fault-injector
spec:
  selector:
    matchLabels:
      app: gpu-fault-injector
  template:
    metadata:
      labels:
        app: gpu-fault-injector
    spec:
      containers:
      - name: agent
        image: dynamo/gpu-fault-injector:latest
        securityContext:
          privileged: true
        volumeMounts:
        - name: dev
          mountPath: /dev
      volumes:
      - name: dev
        hostPath:
          path: /dev

The agent injects fake XID messages via /dev/kmsg to trigger NVSentinel detection.

Deployment Testing Framework#

The deploy/ directory contains an end-to-end testing framework.

Test Phases#

Tests run through three phases:

| Phase | Description |
| --- | --- |
| STANDARD | Baseline performance under normal conditions |
| OVERFLOW | System behavior during fault/overload |
| RECOVERY | System recovery after fault resolution |

Scenario Configuration#

Define test scenarios in scenarios.py:

from tests.fault_tolerance.deploy.scenarios import Scenario, Load, Failure

scenario = Scenario(
    name="worker_failure_migration",
    backend="vllm",
    load=Load(
        clients=10,
        requests_per_client=100,
        max_tokens=256
    ),
    failure=Failure(
        type="pod_kill",
        target="vllm-worker-0",
        trigger_after_requests=50
    )
)

Running Deployment Tests#

# Run all deployment tests
pytest tests/fault_tolerance/deploy/test_deployment.py -v

# Run specific scenario
pytest tests/fault_tolerance/deploy/test_deployment.py::test_worker_failure -v

Validation Checkers#

The framework includes pluggable validators:

from tests.fault_tolerance.deploy.base_checker import BaseChecker, ValidationContext

class MigrationChecker(BaseChecker):
    def check(self, context: ValidationContext) -> bool:
        # Verify migrations occurred
        migrations = context.metrics.get("migrations_total", 0)
        return migrations > 0
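
The pluggable pattern amounts to running every registered checker against a shared context. A self-contained sketch (Context and Checker here are stand-ins for the real ValidationContext and BaseChecker):

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    # Stand-in for ValidationContext: carries metrics scraped after the run.
    metrics: dict = field(default_factory=dict)

class Checker:
    """Base class: each validator implements a single boolean check."""
    def check(self, context: Context) -> bool:
        raise NotImplementedError

class MigrationOccurred(Checker):
    def check(self, context: Context) -> bool:
        # Pass only if at least one migration was recorded.
        return context.metrics.get("migrations_total", 0) > 0

# A test passes when every registered checker passes.
checkers = [MigrationOccurred()]
ctx = Context(metrics={"migrations_total": 3})
print(all(c.check(ctx) for c in checkers))  # True
```

New validators are added by subclassing and appending to the checker list; no test code needs to change.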

Results Parsing#

Parse test results for analysis:

from tests.fault_tolerance.deploy.parse_results import process_overflow_recovery_test

results = process_overflow_recovery_test(log_dir="/path/to/logs")
print(f"Success rate: {results['success_rate']}")
print(f"P99 latency: {results['p99_latency_ms']}ms")
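
How such numbers fall out of raw per-request records can be sketched directly (illustrative only, not the real parse_results implementation):

```python
def summarize(records):
    """Compute success rate and p99 latency from per-request records.

    Each record is assumed to carry a "status" code and a "latency_ms"
    value; p99 is taken by index into the sorted successful latencies."""
    ok = [r for r in records if r["status"] == 200]
    latencies = sorted(r["latency_ms"] for r in ok)
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
    return {"success_rate": len(ok) / len(records), "p99_latency_ms": p99}

# 99 successful requests with rising latency, plus one failure.
records = [{"status": 200, "latency_ms": 50 + i} for i in range(99)]
records.append({"status": 500, "latency_ms": 0})
summary = summarize(records)
print(summary["success_rate"])  # 0.99
```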

Client Utilities#

The client.py module provides shared client functionality:

Multi-Threaded Load Generation#

from tests.fault_tolerance.client import client

# Generate load with multiple clients
results = client(
    base_url="http://localhost:8000",
    num_clients=10,
    requests_per_client=100,
    model="Qwen/Qwen3-0.6B",
    max_tokens=256,
    log_dir="/tmp/test_logs"
)
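
Under the hood this is a thread fan-out: one worker per client, each issuing its quota of requests. The pattern can be sketched with a stub standing in for the HTTP call (illustrative, not client.py's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def run_client(client_id, requests_per_client, send_one):
    # One simulated client issuing its requests sequentially.
    return [send_one(client_id, i) for i in range(requests_per_client)]

def generate_load(num_clients, requests_per_client, send_one):
    """Fan out num_clients concurrent clients and flatten their results."""
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        futures = [
            pool.submit(run_client, c, requests_per_client, send_one)
            for c in range(num_clients)
        ]
        return [result for f in futures for result in f.result()]

# Stub request function standing in for the real HTTP call.
results = generate_load(4, 5, lambda c, i: {"client": c, "ok": True})
print(len(results))  # 20
```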

Request Options#

| Parameter | Description |
| --- | --- |
| base_url | Frontend URL |
| num_clients | Number of concurrent clients |
| requests_per_client | Requests per client |
| model | Model name |
| max_tokens | Max tokens per request |
| log_dir | Directory for client logs |
| endpoint | completions or chat/completions |

Running the Full Test Suite#

Prerequisites#

  1. Kubernetes cluster with GPU nodes

  2. Dynamo deployment

  3. etcd cluster (for HA tests)

  4. Fault injection service (for hardware tests)

Environment Setup#

export KUBECONFIG=/path/to/kubeconfig
export DYNAMO_NAMESPACE=dynamo-test
export FRONTEND_URL=http://localhost:8000

Run All Tests#

# Install test dependencies
pip install pytest pytest-asyncio

# Run all fault tolerance tests
pytest tests/fault_tolerance/ -v --tb=short

# Run with specific markers
pytest tests/fault_tolerance/ -v -m "not slow"

Test Markers#

| Marker | Description |
| --- | --- |
| slow | Long-running tests (> 5 minutes) |
| gpu | Requires GPU resources |
| k8s | Requires Kubernetes cluster |
| etcd_ha | Requires multi-node etcd |

Best Practices#

1. Isolate Test Environments#

Run fault tolerance tests in dedicated namespaces:

kubectl create namespace dynamo-fault-test

2. Clean Up After Tests#

Ensure fault injection is recovered:

# List and recover all active faults
curl http://localhost:8080/api/v1/faults | jq -r '.[].id' | \
  xargs -I {} curl -X POST http://localhost:8080/api/v1/faults/{}/recover

3. Collect Logs#

Preserve logs for debugging:

pytest tests/fault_tolerance/ -v \
  --log-dir=/tmp/fault_test_logs \
  --capture=no

4. Monitor During Tests#

Watch system state during tests:

# Terminal 1: Watch pods
watch kubectl get pods -n dynamo-test

# Terminal 2: Watch metrics
watch 'curl -s localhost:8000/metrics | grep -E "(migration|rejection)"'