Graceful Shutdown#

This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.

Overview#

Graceful shutdown in Dynamo ensures that:

  1. No new requests are accepted - Endpoints are immediately invalidated

  2. In-flight requests complete - Existing requests finish processing (configurable)

  3. Resources are cleaned up - Engines, connections, and temporary files are released

  4. Pods restart cleanly - Exit codes signal Kubernetes for proper restart behavior

Signal Handling#

All Dynamo components handle Unix signals for graceful shutdown:

| Signal  | Trigger                    | Behavior                    |
|---------|----------------------------|-----------------------------|
| SIGTERM | Kubernetes pod termination | Graceful shutdown initiated |
| SIGINT  | Ctrl+C / manual interrupt  | Graceful shutdown initiated |

Implementation#

Each component registers signal handlers at startup:

import asyncio
import signal

loop = asyncio.get_running_loop()

def signal_handler():
    asyncio.create_task(graceful_shutdown(runtime))

for sig in (signal.SIGTERM, signal.SIGINT):
    loop.add_signal_handler(sig, signal_handler)

The graceful_shutdown() function performs the following steps (a sketch appears after the list):

  1. Logs the shutdown signal

  2. Calls runtime.shutdown() to invalidate endpoints

  3. Waits for in-flight requests (based on configuration)

  4. Returns to allow cleanup to proceed
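
A minimal sketch of such a coroutine, assuming a Dynamo-style runtime object with a synchronous shutdown() method; this is illustrative, not the exact Dynamo implementation:

import logging

logger = logging.getLogger(__name__)

async def graceful_shutdown(runtime):
    # 1. Log the shutdown signal
    logger.info("Received shutdown signal, shutting down DistributedRuntime")
    # 2. Invalidate endpoints so no new requests are accepted
    runtime.shutdown()
    # 3. In-flight requests are then awaited (or not) according to each
    #    endpoint's graceful_shutdown flag; see "Endpoint Draining" below
    # 4. Return so the caller's finally blocks can clean up resources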

Endpoint Draining#

When runtime.shutdown() is called, endpoints are immediately invalidated so no new requests are accepted. The behavior for in-flight requests depends on the graceful_shutdown parameter when serving the endpoint.

Configuration#

When registering an endpoint, the graceful_shutdown parameter controls draining behavior:

generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=True,  # Wait for all requests to finish
    metrics_labels=[("model", model_name)],
    health_check_payload=health_check_payload,
)

| graceful_shutdown | Behavior                                                      |
|-------------------|---------------------------------------------------------------|
| True              | Wait for all in-flight requests to complete before returning  |
| False             | Return immediately without waiting for requests               |

Component-Specific Behavior#

| Component       | Default Behavior        | Rationale                                                                                                 |
|-----------------|-------------------------|-----------------------------------------------------------------------------------------------------------|
| Frontend        | N/A (HTTP server)       | HTTP server handles its own shutdown                                                                       |
| Prefill Workers | graceful_shutdown=True  | Prefill operations must complete to avoid wasted computation                                               |
| Decode Workers  | Conditional             | If migration is enabled (migration_limit > 0), shut down immediately to allow migration; otherwise wait    |
| Router          | graceful_shutdown=True  | Ensure routing decisions complete                                                                           |

Decode Worker Migration Integration#

Decode workers use conditional draining based on whether request migration is supported:

generate_endpoint.serve_endpoint(
    handler.generate,
    graceful_shutdown=config.migration_limit <= 0,  # If no migration, wait for requests
    ...
)

When migration_limit > 0:

  • Worker shuts down immediately (graceful_shutdown=False)

  • In-flight requests are migrated to healthy workers

  • No request loss occurs

When migration_limit <= 0:

  • Worker waits for in-flight requests (graceful_shutdown=True)

  • Migration is not available

  • Requests complete on the shutting-down worker

Resource Cleanup#

After endpoint draining, components clean up their resources in finally blocks:

vLLM Worker Cleanup#

finally:
    logger.debug("Cleaning up worker")
    handler.cleanup()

The handler’s cleanup() method (a sketch follows the list):

  • Removes temporary directories (LoRA adapters, etc.)

  • Releases engine resources
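
An illustrative sketch of what such a cleanup() method can look like; the class and attribute names here are assumptions, not the actual vLLM handler code:

import shutil

class Handler:
    def __init__(self):
        self.temp_dirs: list[str] = []  # e.g. directories holding downloaded LoRA adapters
        self.engine_client = None       # engine handle held by the worker

    def cleanup(self) -> None:
        # Remove temporary directories (LoRA adapters, etc.)
        for path in self.temp_dirs:
            shutil.rmtree(path, ignore_errors=True)
        self.temp_dirs.clear()
        # Drop the engine reference so its resources can be released
        self.engine_client = None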

SGLang Worker Cleanup#

def cleanup(self) -> None:
    # Cancel pending consume tasks
    for task in self._consume_tasks:
        if not task.done():
            task.cancel()
    self._consume_tasks.clear()

    # Shutdown engine
    self.engine.shutdown()

TensorRT-LLM Worker Cleanup#

async def cleanup(self):
    if self._llm:
        try:
            self._llm.shutdown()
        except Exception as e:
            logging.error(f"Error during cleanup: {e}")
        finally:
            self._llm = None

Error-Initiated Shutdown#

Workers can initiate graceful shutdown when fatal errors occur:

Engine Health Monitoring (vLLM)#

The VllmEngineMonitor continuously checks engine health:

async def _check_engine_health(self):
    while True:
        try:
            await self.engine_client.check_health()
            await asyncio.sleep(HEALTH_CHECK_INTERVAL)  # 2 seconds
        except EngineDeadError as e:
            logger.error(f"Health check failed: {e}")
            self._shutdown_engine()
            self.runtime.shutdown()
            os._exit(1)

Configuration:

  • HEALTH_CHECK_INTERVAL: 2 seconds between checks

  • ENGINE_SHUTDOWN_TIMEOUT: 30 seconds max for engine shutdown (one way to enforce such a bound is sketched below)
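
The helper below is an illustrative sketch, not a Dynamo API: it treats engine.shutdown() as a blocking call, runs it in a worker thread, and bounds how long the shutdown may take:

import asyncio

ENGINE_SHUTDOWN_TIMEOUT = 30  # seconds

async def shutdown_engine_with_timeout(engine) -> None:
    try:
        # Run the blocking shutdown in a thread and bound how long we wait for it
        await asyncio.wait_for(asyncio.to_thread(engine.shutdown), ENGINE_SHUTDOWN_TIMEOUT)
    except asyncio.TimeoutError:
        # Engine did not stop within the timeout; proceed with process exit anyway
        pass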

Fatal Error Handling (TensorRT-LLM)#

async def _initiate_shutdown(self, error: Exception):
    logging.warning(f"Initiating graceful shutdown due to: {error}")

    try:
        if self.runtime:
            self.runtime.shutdown()
        if self.engine:
            await self.engine.cleanup()
    except Exception as cleanup_error:
        logging.error(f"Error during graceful shutdown: {cleanup_error}")
    finally:
        logging.critical("Forcing process exit for restart")
        os._exit(1)
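
For illustration only (not taken from the Dynamo worker), a request handler on the same class can route an unrecoverable engine failure into this shutdown path; generate() and self.engine.generate_async() are placeholder names:

async def generate(self, request):
    try:
        # Placeholder for the real engine call
        return await self.engine.generate_async(request)
    except Exception as e:
        # Any unrecoverable engine failure triggers the restart path above
        await self._initiate_shutdown(e)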

Kubernetes Integration#

Pod Termination Flow#

  1. Kubernetes sends SIGTERM to the pod

  2. Dynamo initiates graceful shutdown

  3. Pod has terminationGracePeriodSeconds to complete (default: 30s)

  4. If not terminated, Kubernetes sends SIGKILL

Health Check Integration#

Kubernetes uses health endpoints to determine pod readiness:

  • During shutdown: Endpoints become unavailable

  • Readiness probe fails: Traffic stops routing to the pod

  • Graceful draining: Existing requests complete

Best Practices#

1. Set Appropriate Grace Periods#

Match terminationGracePeriodSeconds to your expected request completion time:

  • Short requests (< 10s): 30s grace period

  • Long generation (> 30s): 120s+ grace period

2. Enable Request Migration for Decode Workers#

If using disaggregated serving, enable migration for decode workers:

--migration-limit 3  # Allow up to 3 migration attempts

This allows immediate shutdown while preserving request state.

3. Monitor Shutdown Metrics#

Track shutdown behavior via logs:

INFO  Received shutdown signal, shutting down DistributedRuntime
INFO  DistributedRuntime shutdown complete
DEBUG Cleaning up worker

4. Handle Cleanup Errors#

Ensure cleanup methods handle errors gracefully:

def cleanup(self):
    for resource in self.resources:
        try:
            resource.cleanup()
        except Exception as e:
            logger.warning(f"Cleanup failed: {e}")
            # Continue with other resources