# Graceful Shutdown
This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.
## Overview

Graceful shutdown in Dynamo ensures that:

- **No new requests are accepted** - Endpoints are immediately invalidated
- **In-flight requests complete** - Existing requests finish processing (configurable)
- **Resources are cleaned up** - Engines, connections, and temporary files are released
- **Pods restart cleanly** - Exit codes signal Kubernetes for proper restart behavior
## Signal Handling
All Dynamo components handle Unix signals for graceful shutdown:
| Signal | Trigger | Behavior |
|---|---|---|
| `SIGTERM` | Kubernetes pod termination | Graceful shutdown initiated |
| `SIGINT` | Ctrl+C / manual interrupt | Graceful shutdown initiated |
### Implementation

Each component registers signal handlers at startup:

```python
def signal_handler():
    asyncio.create_task(graceful_shutdown(runtime))

for sig in (signal.SIGTERM, signal.SIGINT):
    loop.add_signal_handler(sig, signal_handler)
```
The `graceful_shutdown()` function (sketched below):

1. Logs the shutdown signal
2. Calls `runtime.shutdown()` to invalidate endpoints
3. Waits for in-flight requests (based on configuration)
4. Returns to allow cleanup to proceed
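A minimal sketch of what such a coroutine can look like, based on the steps above; the exact signature and wait mechanics in Dynamo may differ, and the log text is borrowed from the "Monitor Shutdown Metrics" section below:

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative only: the real implementation lives in each component's main loop.
async def graceful_shutdown(runtime):
    # 1. Log the shutdown signal
    logger.info("Received shutdown signal, shutting down DistributedRuntime")
    # 2. Invalidate endpoints so no new requests reach this worker
    runtime.shutdown()
    # 3. serve_endpoint(...) unblocks once in-flight requests drain
    #    (or immediately when graceful_shutdown=False)
    # 4. Returning lets the component's finally blocks run resource cleanup
```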
## Endpoint Draining

When `runtime.shutdown()` is called, endpoints are immediately invalidated so no new requests are accepted. The behavior for in-flight requests depends on the `graceful_shutdown` parameter used when serving the endpoint.
### Configuration

When registering an endpoint, the `graceful_shutdown` parameter controls draining behavior:

```python
generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=True,  # Wait for all requests to finish
    metrics_labels=[("model", model_name)],
    health_check_payload=health_check_payload,
)
```
| `graceful_shutdown` | Behavior |
|---|---|
| `True` | Wait for all in-flight requests to complete before returning |
| `False` | Return immediately without waiting for requests |
### Component-Specific Behavior
| Component | Default Behavior | Rationale |
|---|---|---|
| Frontend | N/A (HTTP server) | HTTP server handles its own shutdown |
| Prefill Workers | `graceful_shutdown=True` | Prefill operations must complete to avoid wasted computation |
| Decode Workers | Conditional | If migration is enabled (`migration_limit > 0`), in-flight requests are migrated instead of drained |
| Router | `graceful_shutdown=True` | Ensure routing decisions complete |
### Decode Worker Migration Integration
Decode workers use conditional draining based on whether request migration is supported:
```python
generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=config.migration_limit <= 0,  # If no migration, wait for requests
    ...
)
```
When `migration_limit > 0`:

- Worker shuts down immediately (`graceful_shutdown=False`)
- In-flight requests are migrated to healthy workers
- No request loss occurs

When `migration_limit <= 0`:

- Worker waits for in-flight requests (`graceful_shutdown=True`)
- Migration is not available
- Requests complete on the shutting-down worker
## Resource Cleanup

After endpoint draining, components clean up their resources in `finally` blocks:

### vLLM Worker Cleanup
```python
finally:
    logger.debug("Cleaning up worker")
    handler.cleanup()
```
The handler's `cleanup()` method (a sketch follows):

- Removes temporary directories (LoRA adapters, etc.)
- Releases engine resources
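A minimal sketch of what such a method can cover; the attribute names (`self._temp_dirs`, `self.engine_client`) and the exact shutdown call are illustrative assumptions, not taken from the Dynamo source:

```python
import shutil

def cleanup(self) -> None:
    # Remove temporary directories (e.g. extracted LoRA adapters)
    for temp_dir in self._temp_dirs:
        shutil.rmtree(temp_dir, ignore_errors=True)
    self._temp_dirs.clear()

    # Release engine resources; the exact call depends on the engine client in use
    if self.engine_client is not None:
        self.engine_client.shutdown()
```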
### SGLang Worker Cleanup

```python
def cleanup(self) -> None:
    # Cancel pending consume tasks
    for task in self._consume_tasks:
        if not task.done():
            task.cancel()
    self._consume_tasks.clear()

    # Shutdown engine
    self.engine.shutdown()
```
### TensorRT-LLM Worker Cleanup

```python
async def cleanup(self):
    if self._llm:
        try:
            self._llm.shutdown()
        except Exception as e:
            logging.error(f"Error during cleanup: {e}")
        finally:
            self._llm = None
```
## Error-Initiated Shutdown

Workers can initiate graceful shutdown when fatal errors occur:

### Engine Health Monitoring (vLLM)

The `VllmEngineMonitor` continuously checks engine health:
```python
async def _check_engine_health(self):
    while True:
        try:
            await self.engine_client.check_health()
            await asyncio.sleep(HEALTH_CHECK_INTERVAL)  # 2 seconds
        except EngineDeadError as e:
            logger.error(f"Health check failed: {e}")
            self._shutdown_engine()
            self.runtime.shutdown()
            os._exit(1)
```
Configuration:

- `HEALTH_CHECK_INTERVAL`: 2 seconds between checks
- `ENGINE_SHUTDOWN_TIMEOUT`: 30 seconds max for engine shutdown
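For reference, these correspond to simple module-level constants; the definitions below are a sketch using the values documented above, not the exact Dynamo source:

```python
# Illustrative constants matching the documented defaults
HEALTH_CHECK_INTERVAL = 2       # seconds between engine health checks
ENGINE_SHUTDOWN_TIMEOUT = 30    # max seconds to wait for engine shutdown before exiting
```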
### Fatal Error Handling (TensorRT-LLM)

```python
async def _initiate_shutdown(self, error: Exception):
    logging.warning(f"Initiating graceful shutdown due to: {error}")
    try:
        if self.runtime:
            self.runtime.shutdown()
        if self.engine:
            await self.engine.cleanup()
    except Exception as cleanup_error:
        logging.error(f"Error during graceful shutdown: {cleanup_error}")
    finally:
        logging.critical("Forcing process exit for restart")
        os._exit(1)
```
## Kubernetes Integration

### Pod Termination Flow

1. Kubernetes sends `SIGTERM` to the pod
2. Dynamo initiates graceful shutdown
3. The pod has `terminationGracePeriodSeconds` to complete (default: 30s)
4. If the pod has not terminated in time, Kubernetes sends `SIGKILL`
### Recommended Configuration
For production deployments, configure adequate termination grace period:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    VllmWorker:
      extraPodSpec:
        terminationGracePeriodSeconds: 60  # Allow time for request draining
```
### Health Check Integration

Kubernetes uses health endpoints to determine pod readiness:

- **During shutdown**: Endpoints become unavailable
- **Readiness probe fails**: Traffic stops routing to the pod
- **Graceful draining**: Existing requests complete
## Best Practices

### 1. Set Appropriate Grace Periods

Match `terminationGracePeriodSeconds` to your expected request completion time:

- Short requests (< 10s): 30s grace period
- Long generation (> 30s): 120s+ grace period
### 2. Enable Request Migration for Decode Workers

If using disaggregated serving, enable migration for decode workers:

```bash
--migration-limit 3  # Allow up to 3 migration attempts
```
This allows immediate shutdown while preserving request state.
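For context, a hypothetical decode worker launch passing this flag; only `--migration-limit` comes from this document, while the entrypoint, model name, and any other flags are illustrative assumptions:

```bash
# Illustrative launch command; adjust the entrypoint, model, and other flags
# to match your deployment
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --migration-limit 3
```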
### 3. Monitor Shutdown Metrics

Track shutdown behavior via logs:

```text
INFO Received shutdown signal, shutting down DistributedRuntime
INFO DistributedRuntime shutdown complete
DEBUG Cleaning up worker
```
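In Kubernetes, one way to inspect these lines for a pod that has already restarted is to read the previous container's logs (the pod name below is a placeholder):

```bash
# Show the previous container's logs and filter for shutdown-related lines
kubectl logs vllm-worker-0 --previous | grep -iE "shutdown|cleaning up"
```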
### 4. Handle Cleanup Errors

Ensure cleanup methods handle errors gracefully:

```python
def cleanup(self):
    for resource in self.resources:
        try:
            resource.cleanup()
        except Exception as e:
            logger.warning(f"Cleanup failed: {e}")
            # Continue with other resources
```