# Graceful Shutdown

This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.

## Overview

Graceful shutdown in Dynamo ensures that:

1. **No new requests are accepted** - Endpoints are immediately invalidated
2. **In-flight requests complete** - Existing requests finish processing (configurable per endpoint via `graceful_shutdown`)
3. **Resources are cleaned up** - Engines, connections, and temporary files are released
4. **Pods restart cleanly** - A non-zero exit code after a fatal error signals Kubernetes to restart the pod

## Signal Handling

All Dynamo components handle Unix signals for graceful shutdown:

| Signal | Trigger | Behavior |
|--------|---------|----------|
| `SIGTERM` | Kubernetes pod termination | Graceful shutdown initiated |
| `SIGINT` | Ctrl+C / manual interrupt | Graceful shutdown initiated |

### Implementation

Each component registers signal handlers at startup:

```python
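import asyncio
import signal

def signal_handler():
    # Signal callbacks are synchronous, so schedule the coroutine as a task
    asyncio.create_task(graceful_shutdown(runtime))

loop = asyncio.get_running_loop()  # called from within the running event loop
for sig in (signal.SIGTERM, signal.SIGINT):
    loop.add_signal_handler(sig, signal_handler)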
```

The `graceful_shutdown()` function:
1. Logs the shutdown signal
2. Calls `runtime.shutdown()` to invalidate endpoints
3. Waits for in-flight requests (based on configuration)
4. Returns to allow cleanup to proceed
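
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual Dynamo implementation: `FakeRuntime` is a hypothetical stand-in for the runtime object, and the sketch assumes that waiting for in-flight requests happens inside `runtime.shutdown()`. The log messages are the ones shown later in this document.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FakeRuntime:
    """Hypothetical stand-in for the Dynamo runtime; it only records the call."""

    def __init__(self) -> None:
        self.shutdown_called = False

    def shutdown(self) -> None:
        # Assumption: the real runtime invalidates endpoints here and, depending
        # on each endpoint's configuration, waits for in-flight requests.
        self.shutdown_called = True


async def graceful_shutdown(runtime) -> None:
    # Step 1: log the shutdown signal
    logger.info("Received shutdown signal, shutting down DistributedRuntime")
    # Steps 2 and 3: invalidate endpoints and drain according to configuration
    runtime.shutdown()
    logger.info("DistributedRuntime shutdown complete")
    # Step 4: returning lets the caller's `finally` blocks run resource cleanup


def run_demo() -> bool:
    runtime = FakeRuntime()
    asyncio.run(graceful_shutdown(runtime))
    return runtime.shutdown_called
```

Calling `run_demo()` drives the coroutine once and reports whether the runtime's `shutdown()` was invoked.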

## Endpoint Draining

When `runtime.shutdown()` is called, endpoints are immediately invalidated so no new requests are accepted. The behavior for in-flight requests depends on the `graceful_shutdown` parameter when serving the endpoint.

### Configuration

When registering an endpoint, the `graceful_shutdown` parameter controls draining behavior:

```python
generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=True,  # Wait for all requests to finish
    metrics_labels=[("model", model_name)],
    health_check_payload=health_check_payload,
)
```

| `graceful_shutdown` | Behavior |
|---------------------|----------|
| `True` | Wait for all in-flight requests to complete before returning |
| `False` | Return immediately without waiting for requests |
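
To make the difference between the two settings concrete, here is a small simulation. `ToyEndpoint` is a hypothetical class written for this example; it is not the Dynamo endpoint API and only mimics the behavior described in the table above.

```python
import asyncio


class ToyEndpoint:
    """Hypothetical endpoint that tracks in-flight request tasks."""

    def __init__(self, graceful_shutdown: bool) -> None:
        self.graceful_shutdown = graceful_shutdown
        self.accepting = True
        self.in_flight = set()
        self.completed = 0

    async def _handle(self, duration: float) -> None:
        await asyncio.sleep(duration)  # simulated request processing
        self.completed += 1

    def submit(self, duration: float) -> bool:
        if not self.accepting:
            return False  # an invalidated endpoint rejects new requests
        task = asyncio.create_task(self._handle(duration))
        self.in_flight.add(task)
        task.add_done_callback(self.in_flight.discard)
        return True

    async def shutdown(self) -> None:
        self.accepting = False  # stop accepting new requests immediately
        if self.graceful_shutdown:
            # graceful_shutdown=True: wait for every in-flight request
            await asyncio.gather(*self.in_flight)
        # graceful_shutdown=False: return without waiting


async def demo(graceful: bool):
    endpoint = ToyEndpoint(graceful_shutdown=graceful)
    endpoint.submit(0.05)
    endpoint.submit(0.05)
    await endpoint.shutdown()
    accepted_after_shutdown = endpoint.submit(0.05)
    return endpoint.completed, accepted_after_shutdown
```

Running `asyncio.run(demo(True))` waits for both simulated requests before returning, while `asyncio.run(demo(False))` returns before either has finished. In both cases a request submitted after shutdown is rejected.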

### Component-Specific Behavior

| Component | Default Behavior | Rationale |
|-----------|------------------|-----------|
| **Frontend** | N/A (HTTP server) | HTTP server handles its own shutdown |
| **Prefill Workers** | `graceful_shutdown=True` | Prefill operations must complete to avoid wasted computation |
| **Decode Workers** | Conditional | If migration is enabled (`migration_limit > 0`), shut down immediately to allow migration; otherwise wait |
| **Router** | `graceful_shutdown=True` | Ensure routing decisions complete |

### Decode Worker Migration Integration

Decode workers use conditional draining based on whether request migration is supported:

```python
generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=config.migration_limit <= 0,  # If no migration, wait for requests
    # ... (other arguments elided)
)
```

When `migration_limit > 0`:
- Worker shuts down immediately (`graceful_shutdown=False`)
- In-flight requests are migrated to healthy workers
- Requests are not lost as long as migration succeeds within the configured limit

When `migration_limit <= 0`:
- Worker waits for in-flight requests (`graceful_shutdown=True`)
- Migration is not available
- Requests complete on the shutting-down worker
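
The decision rule above is small enough to express as a helper. In this sketch `DecodeConfig` and `should_wait_for_requests` are hypothetical names introduced for illustration; only `migration_limit` and the `<= 0` comparison come from this document.

```python
from dataclasses import dataclass


@dataclass
class DecodeConfig:
    """Hypothetical config holder; only `migration_limit` is from this document."""

    migration_limit: int = 0


def should_wait_for_requests(config: DecodeConfig) -> bool:
    # Mirrors `graceful_shutdown=config.migration_limit <= 0`:
    # without migration, the worker must drain its own requests.
    return config.migration_limit <= 0


for limit in (0, 3):
    waits = should_wait_for_requests(DecodeConfig(migration_limit=limit))
    mode = "drain in place" if waits else "exit and migrate"
    print(f"migration_limit={limit}: {mode}")
```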

## Resource Cleanup

After endpoint draining, components clean up their resources in `finally` blocks:

### vLLM Worker Cleanup

```python
try:
    # Serve endpoints until shutdown is requested (elided)
    ...
finally:
    logger.debug("Cleaning up worker")
    handler.cleanup()
```

The handler's `cleanup()` method:
- Removes temporary directories (LoRA adapters, etc.)
- Releases engine resources

### SGLang Worker Cleanup

```python
def cleanup(self) -> None:
    # Cancel pending consume tasks
    for task in self._consume_tasks:
        if not task.done():
            task.cancel()
    self._consume_tasks.clear()

    # Shutdown engine
    self.engine.shutdown()
```

### TensorRT-LLM Worker Cleanup

```python
async def cleanup(self):
    if self._llm:
        try:
            self._llm.shutdown()
        except Exception as e:
            logging.error(f"Error during cleanup: {e}")
        finally:
            self._llm = None
```

## Error-Initiated Shutdown

Workers can initiate graceful shutdown when fatal errors occur:

### Engine Health Monitoring (vLLM)

The `VllmEngineMonitor` continuously checks engine health:

```python
async def _check_engine_health(self):
    while True:
        try:
            await self.engine_client.check_health()
            await asyncio.sleep(HEALTH_CHECK_INTERVAL)  # 2 seconds
        except EngineDeadError as e:
            logger.error(f"Health check failed: {e}")
            self._shutdown_engine()
            self.runtime.shutdown()
            # Exit with a non-zero code; os._exit skips `finally` blocks and atexit handlers
            os._exit(1)
```

Configuration:
- `HEALTH_CHECK_INTERVAL`: 2 seconds between checks
- `ENGINE_SHUTDOWN_TIMEOUT`: 30 seconds max for engine shutdown
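
This document does not show how the shutdown timeout is enforced. One plausible approach, sketched here under that assumption, is to bound an asynchronous shutdown call with `asyncio.wait_for`. The helper `shutdown_with_timeout` and the two demo coroutines are hypothetical.

```python
import asyncio

ENGINE_SHUTDOWN_TIMEOUT = 30.0  # seconds, the value listed above


async def shutdown_with_timeout(shutdown_fn, timeout: float = ENGINE_SHUTDOWN_TIMEOUT) -> bool:
    """Return True if shutdown finished in time, False if it timed out."""
    try:
        await asyncio.wait_for(shutdown_fn(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # The engine did not stop in time; the caller can force an exit
        return False


async def fast_shutdown() -> None:
    await asyncio.sleep(0.01)  # simulated engine that stops promptly


async def hung_shutdown() -> None:
    await asyncio.sleep(3600)  # simulated engine that never finishes
```

With a short timeout, `asyncio.run(shutdown_with_timeout(hung_shutdown, timeout=0.05))` gives up instead of blocking the process exit, which is the point of bounding the engine shutdown.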

### Fatal Error Handling (TensorRT-LLM)

```python
async def _initiate_shutdown(self, error: Exception):
    logging.warning(f"Initiating graceful shutdown due to: {error}")

    try:
        if self.runtime:
            self.runtime.shutdown()
        if self.engine:
            await self.engine.cleanup()
    except Exception as cleanup_error:
        logging.error(f"Error during graceful shutdown: {cleanup_error}")
    finally:
        logging.critical("Forcing process exit for restart")
        os._exit(1)
```

## Kubernetes Integration

### Pod Termination Flow

1. Kubernetes sends `SIGTERM` to the containers in the pod
2. Dynamo initiates graceful shutdown
3. The pod has `terminationGracePeriodSeconds` to finish shutting down (Kubernetes default: 30s)
4. If the pod is still running when the grace period expires, Kubernetes sends `SIGKILL`
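
The interaction between draining and the grace period can be illustrated with a deadline-based wait. This is a sketch only: `drain_with_deadline` and its `reserve_s` parameter (time held back for resource cleanup) are hypothetical and not part of Dynamo. Cancelling the unfinished tasks stands in for what `SIGKILL` would do to requests that outlive the grace period.

```python
import asyncio


async def drain_with_deadline(tasks, grace_period_s: float, reserve_s: float = 0.0):
    """Wait for tasks within the drain budget; return (finished, unfinished) counts."""
    budget = max(grace_period_s - reserve_s, 0.0)
    if not tasks:
        return 0, 0  # asyncio.wait rejects an empty collection
    done, pending = await asyncio.wait(tasks, timeout=budget)
    for task in pending:
        task.cancel()  # stand-in for work lost when SIGKILL arrives
    return len(done), len(pending)


async def demo():
    # Two short requests and one that outlasts the (scaled-down) grace period
    tasks = [asyncio.create_task(asyncio.sleep(d)) for d in (0.01, 0.02, 5.0)]
    return await drain_with_deadline(tasks, grace_period_s=0.2, reserve_s=0.05)
```

In `demo()`, the two short requests finish inside the budget and the long one does not, which mirrors a pod whose `terminationGracePeriodSeconds` is shorter than its longest in-flight request.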

### Recommended Configuration

For production deployments, configure an adequate termination grace period:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VllmWorker:
      extraPodSpec:
        terminationGracePeriodSeconds: 60  # Allow time for request draining
```

### Health Check Integration

Kubernetes uses health endpoints to determine pod readiness:

- **During shutdown**: Endpoints become unavailable
- **Readiness probe fails**: Traffic stops routing to the pod
- **Graceful draining**: Existing requests complete
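
The readiness behavior can be pictured with a tiny state holder. `ReadinessState` is a hypothetical class for illustration; the actual Dynamo health endpoints are described in the Health Checks documentation. The status codes rely on the Kubernetes convention that an HTTP probe succeeds for responses in the 200 to 399 range.

```python
class ReadinessState:
    """Hypothetical readiness tracker backing an HTTP readiness probe."""

    def __init__(self) -> None:
        self.shutting_down = False

    def begin_shutdown(self) -> None:
        # Called once a shutdown signal has been received
        self.shutting_down = True

    def probe_status(self) -> int:
        # 503 makes the readiness probe fail, so traffic stops routing to the pod
        return 503 if self.shutting_down else 200


state = ReadinessState()
before = state.probe_status()
state.begin_shutdown()
after = state.probe_status()
print(before, after)
```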

## Best Practices

### 1. Set Appropriate Grace Periods

Match `terminationGracePeriodSeconds` to your expected request completion time:
- Short requests (\< 10s): 30s grace period
- Long generation (> 30s): 120s+ grace period

### 2. Enable Request Migration for Decode Workers

If using disaggregated serving, enable migration for decode workers:

```bash
--migration-limit 3  # Allow up to 3 migration attempts
```

This allows immediate shutdown while preserving request state.

### 3. Monitor Shutdown Logs

Track shutdown behavior via logs:

```text
INFO  Received shutdown signal, shutting down DistributedRuntime
INFO  DistributedRuntime shutdown complete
DEBUG Cleaning up worker
```

### 4. Handle Cleanup Errors

Ensure cleanup methods handle errors gracefully:

```python
def cleanup(self):
    for resource in self.resources:
        try:
            resource.cleanup()
        except Exception as e:
            logger.warning(f"Cleanup failed: {e}")
            # Continue with other resources
```

## Related Documentation

- [Request Migration](/dynamo/v-0-9-0/user-guides/fault-tolerance/request-migration) - How requests migrate during shutdown
- [Request Cancellation](/dynamo/v-0-9-0/user-guides/fault-tolerance/request-cancellation) - Canceling in-flight requests
- [Health Checks](/dynamo/v-0-9-0/user-guides/observability-local/health-checks) - Liveness and readiness probes