Writing a Python Unified Backend
New — Dynamo’s unified backend. This guide covers the new unified backend infrastructure in
`dynamo.common.backend`: a shared `LLMEngine` ABC that vLLM, SGLang, TRT-LLM, and a sample engine already implement, and that any custom Python engine can plug into the same way. For the Rust version of the same contract see Writing a Rust Unified Backend. For the older lower-level Python worker path (`register_model` + `serve_endpoint`) — still the right choice for features the unified backend does not yet cover — see Writing Python Workers.

Beta — actively under development. The unified backend surface is beta quality and may change without backwards compatibility between releases. See Feature gaps below for what the unified path covers today versus the existing (non-unified) backend paths.
This guide walks through building a Python backend for an inference
engine that plugs into Dynamo’s distributed runtime via
dynamo.common.backend. A “unified backend” is a Python entry point
that implements the shared LLMEngine ABC and lets the framework own
runtime lifecycle (signal handling, model registration, graceful
shutdown, cancellation monitoring) — your code just owns inference.
Your backend lives in its own package and does not need to be part
of the dynamo repository. It depends on ai-dynamo from PyPI (or
the git source) and imports dynamo.common.backend. The steps below
assume you’re starting a fresh package in your own repo.
The reference example is the sample engine at
sample_engine.py
— a complete, runnable implementation under 120 lines. Read it
alongside this guide.
Where to look for what:
- This guide — step-by-step walkthrough for someone starting a new backend from scratch.
- `LLMEngine` ABC docstrings — authoritative method-by-method contract.
- Package README — in-tree reference: `GenerateRequest`/`GenerateChunk` field definitions, per-engine cancellation cookbook (vLLM / SGLang / TRT-LLM), full `DynamoException` table, file index, and the per-engine feature-gap matrix.
Feature gaps
The unified backend is in beta and does not yet cover the full feature set of Dynamo’s existing (non-unified) backend paths. The summary below is the common contract — what every engine on the unified path gets. Per-engine gaps (vLLM, SGLang, TRT-LLM specifics like LoRA, diffusion, attention DP scheduling) live in the package README.
Supported today
- Aggregated token-in-token-out inference
- Disaggregated serving (`agg`/`prefill`/`decode`) with bootstrap (SGLang) or internal KV transport (vLLM, TRT-LLM)
- Model registration with discovery and endpoint types
- Request cancellation via `abort()` + `context.is_stopped()`
- `DynamoException` error chain wrapping
- Graceful shutdown with signal handling
- Finish reason normalization (handled by the Rust layer)
Not yet on the unified path
If you need one of these features today, keep that workload on the
existing per-engine entry point (dynamo.<backend>.main) until the
unified path catches up.
What you’re building
A backend is two things:
- An engine class that subclasses `LLMEngine` — owns the model, accepts preprocessed token requests, streams output chunks.
- A `main.py` entry point — a three-line shim that hands the engine class to `run()` from `dynamo.common.backend.run`, which drives the lifecycle.
The dynamo.common.backend package handles everything else: signal
handling, distributed runtime setup, model registration with
discovery, the serving loop, graceful shutdown, cancellation
monitoring, and error chain wrapping. (The lifecycle state machine
actually lives in Rust; dynamo.common.backend.Worker is a thin
Python shim over it.)
Prerequisites
- Python 3.11 or newer. `dynamo` uses `typing.Required`, which is 3.11+.
- NATS and etcd reachable for end-to-end runs. The dynamo repo’s `deploy/docker-compose.yml` brings up both in one command if you don’t already have them running.
- `uv` or `pip` for installing dependencies.
- Familiarity with async Python (`asyncio`, async generators) and `argparse`.
Step 1: Create the package
Minimal pyproject.toml:
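A minimal sketch, with illustrative package name, script entry, and build backend (pin versions to match your own setup):

```toml
[project]
name = "my-backend"                 # illustrative package name
version = "0.1.0"
requires-python = ">=3.11"          # dynamo uses typing.Required (3.11+)
dependencies = ["ai-dynamo"]        # pulls in dynamo.common.backend

[project.optional-dependencies]
dev = ["pytest", "pytest-asyncio"]  # used in Step 7

[project.scripts]
my-backend = "my_backend.main:main" # console command used in Step 5

[build-system]
requires = ["hatchling"]            # any PEP 517 backend works
build-backend = "hatchling.build"
```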
For a bleeding-edge dependency on the dynamo source tree, install the runtime wheel from a clone:
Building the wheel needs a Rust toolchain plus clang, cmake,
protobuf-compiler, and libssl-dev.
Step 2: Subclass LLMEngine
In src/my_backend/engine.py, declare a class that subclasses
LLMEngine and owns whatever state your engine needs. Construction
must be cheap and side-effect-free — heavy work goes in start().
GenerateRequest and GenerateChunk are TypedDicts describing the
shared shape — see Step 4 for the fields.
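A self-contained sketch of the subclass skeleton. The `LLMEngine` base, `GenerateRequest`, and `GenerateChunk` below are minimal stand-ins so the snippet runs on its own; real backends import them from `dynamo.common.backend`.

```python
from typing import Any, TypedDict


class GenerateRequest(TypedDict, total=False):  # stand-in; real type in dynamo.common.backend
    token_ids: list[int]
    sampling_params: dict[str, Any]


class GenerateChunk(TypedDict, total=False):  # stand-in; real type in dynamo.common.backend
    token_ids: list[int]
    index: int
    finish_reason: str
    completion_usage: dict[str, int]


class LLMEngine:  # stand-in; subclass the real ABC from dynamo.common.backend
    pass


class MyEngine(LLMEngine):
    """Owns engine state. Construction is cheap and side-effect-free."""

    def __init__(self, model_path: str, max_tokens: int = 512) -> None:
        self.model_path = model_path  # config only, no heavy work here
        self.max_tokens = max_tokens
        self._engine: Any = None      # the real engine handle is created in start()
```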
Step 3: Implement from_args
from_args is a classmethod factory that parses CLI args and returns
(engine, WorkerConfig). The engine is constructed but not
started.
from_args is async to match the ABC; you can await from it if
your CLI parsing reads config from a file or hits an API. Most
backends don’t need to.
For backends that already have a DynamoRuntimeConfig-shaped
config object (e.g. ones derived from vLLM’s, SGLang’s, or
TRT-LLM’s existing config), prefer the
WorkerConfig.from_runtime_config(runtime_cfg, model_name=...)
helper — it pulls the shared discovery / request-plane / parser
fields off the config in one line.
Step 4: Implement LLMEngine methods
The ABC has three required methods (start, generate, cleanup)
plus two with default no-op implementations (abort, drain).
start()
Start the engine and return EngineConfig metadata. After this
returns, generate() MUST be ready for concurrent calls.
worker_id is an opaque per-worker identifier — most engines ignore
it. Backends needing a stable cluster-wide key (e.g. TRT-LLM’s
disagg_machine_id snowflake) should derive from it instead of
hashing host/pid or asking operators for a CLI override.
Every EngineConfig field except model is optional. None means
“don’t advertise”; KV-aware routing falls back to round-robin when KV
fields are unset.
generate()
An async generator that yields GenerateChunk dicts for a single
request. Called concurrently for multiple in-flight requests.
Contract (chunk shape is defined by the GenerateChunk TypedDict
— see
Request / Response Types
in the package README for the field reference):
- Every chunk carries `token_ids` and `index` (use `0` for single choice).
- The final chunk additionally carries `finish_reason` and `completion_usage`.
- The framework’s cancellation monitor calls `engine.abort(context)` when the client disconnects or cancels; your loop should also poll `context.is_stopped()` between yields and exit cleanly with a `finish_reason="cancelled"` chunk.
Finish reason normalization ("abort" → "cancelled", etc.) is
handled by the Rust layer — emit whatever your engine uses
natively.
abort(context) — optional
Called by the framework only when the client disconnects or the request is cancelled. NOT called on silent stream drops. Override to release engine-side resources (KV slots, scheduler entries, remote schedulers):
For cleanup that must run on every drop path — including silent
drops — use a try/finally or a context manager inside generate,
not abort. The sample engine doesn’t override abort because it
has no engine-side state to release; the default is a no-op.
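A sketch of an `abort()` override for an engine that tracks per-request engine-side state. The `context.request_id` attribute used to key the table is an assumption here; check the real context object's API in the `LLMEngine` docstrings.

```python
import asyncio


class MyEngine:
    def __init__(self) -> None:
        # request id -> opaque engine-side handle (KV slot, scheduler entry, ...)
        self._inflight: dict[str, object] = {}

    async def abort(self, context) -> None:
        # Release engine-side resources for the cancelled request.
        # context.request_id is an assumed attribute for this sketch.
        handle = self._inflight.pop(context.request_id, None)
        if handle is not None:
            pass  # real engines: call the engine's own abort/release API here
```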
drain() — optional
Runs once before shutdown, after the discovery unregister + grace-period sleep, while NATS/etcd are still alive. Use it for backend-side draining that must complete before transport teardown (e.g. in-flight NIXL KV transfers on prefill workers). Default is no-op.
cleanup()
Two real requirements, both pinned by the Rust-side conformance kit:
- Null-safe against partial `start()` failure. If `start()` raises partway through, fields you allocate incrementally may still be `None`. `cleanup()` must guard each resource (`if self._engine is not None: …`) so the post-failure call doesn’t crash on half-initialized state.
- Idempotent. A second call after a successful first must return cleanly without re-entering teardown.
The Rust Worker drives both: it calls cleanup() after start()
returns Ok on shutdown, and the conformance kit (run_conformance)
additionally calls cleanup() on a never-started engine and twice in a
row, failing your tests with CleanupWithoutStartFailed /
SecondCleanupFailed if either invariant breaks. The guarded
single-shot pattern below covers both:
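One way to write the guarded single-shot pattern, covering both invariants; the attribute names are illustrative.

```python
import asyncio


class MyEngine:
    def __init__(self) -> None:
        self._engine = None      # allocated in start()
        self._kv_pool = None     # allocated in start()
        self._cleaned_up = False

    async def cleanup(self) -> None:
        if self._cleaned_up:
            return  # idempotent: second call returns cleanly
        self._cleaned_up = True
        # Guard each resource: any of these may still be None if start()
        # raised partway through, or never ran at all.
        if self._kv_pool is not None:
            self._kv_pool = None  # real engines: release KV memory here
        if self._engine is not None:
            self._engine = None   # real engines: shut the engine down here
```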
Step 5: Write main.py
Three lines.
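A sketch of the shim. The imports are deferred into `main()` only so this snippet stands alone without `ai-dynamo` installed; a real `main.py` imports at module top level, and `run()`’s exact signature is documented in `dynamo.common.backend.run`.

```python
def main() -> None:
    from dynamo.common.backend.run import run  # framework entry point
    from my_backend.engine import MyEngine     # your Step 2 engine class

    run(MyEngine)  # drives from_args -> start -> serve -> cleanup


if __name__ == "__main__":
    main()
```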
run installs signal handlers, builds the distributed runtime,
calls engine.start(worker_id) with a runtime-allocated identifier,
registers the model with discovery, serves the endpoint, and runs the
graceful-shutdown orchestrator on SIGTERM/SIGINT.
Pair this with the [project.scripts] entry from Step 1’s
pyproject.toml so my-backend ... works as a console command.
Step 6: Errors and logging
Errors: the framework wraps non-DynamoException errors raised
from generate() (or lifecycle methods) as Unknown. For typed
error reporting, raise a DynamoException subclass directly from
dynamo.llm.exceptions
— it propagates unchanged through the Rust bridge:
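A sketch of typed error reporting from `start()`. The exception classes below are stand-ins so the snippet is self-contained; real backends import `DynamoException` subclasses from `dynamo.llm.exceptions` instead of defining them.

```python
import asyncio


class DynamoException(Exception):  # stand-in for dynamo.llm.exceptions.DynamoException
    pass


class EngineShutdown(DynamoException):  # stand-in for the real subclass
    pass


class MyEngine:
    async def start(self, worker_id: str):
        if not self._load_model():
            # Typed exceptions propagate unchanged through the Rust bridge;
            # anything else would be wrapped as Unknown.
            raise EngineShutdown("engine failed to initialize")

    def _load_model(self) -> bool:
        return False  # illustrative init failure
```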
The package README has the full table of exception types and which
lifecycle phase raises which one. Engine-init failures should raise
EngineShutdown from start(). Cleanup shouldn’t normally raise —
log and swallow if a subsystem fails.
Logging: keep levels consistent across unified backends so operators see the same surface regardless of which engine they’re running:
- `logger.info` — lifecycle milestones (engine init complete, serving started, engine shutdown).
- `logger.debug` — per-request events (request abort, cancellation).
- `logger.warning` — recoverable problems (empty outputs, unexpected finish reasons).
- `logger.error` — unrecoverable failures only.
The framework also configures dynamo.runtime.logging for you; you
just call logger = logging.getLogger(__name__) at the top of your
module and use it.
Step 7: Test your engine
Install the dev extras (pytest, pytest-asyncio) declared in Step 1:
The sample engine has a unit-test suite that you can copy as a starting point. The shape of a useful test:
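The shape, sketched with plain `asyncio` so the snippet is self-contained; a real suite would use pytest + pytest-asyncio from the dev extras. `MyEngine` here is a stand-in with just the contract-relevant behavior.

```python
import asyncio


class MyEngine:  # stand-in; substitute your real engine class
    async def generate(self, request, context):
        for i in range(request["max_tokens"]):
            yield {"token_ids": [i], "index": 0}
        yield {
            "token_ids": [],
            "index": 0,
            "finish_reason": "length",
            "completion_usage": {"completion_tokens": request["max_tokens"]},
        }


async def test_terminal_chunk_shape():
    engine = MyEngine()
    chunks = [c async for c in engine.generate({"max_tokens": 2}, context=None)]
    # Every chunk carries token_ids + index; the final one adds the terminals.
    assert all("token_ids" in c and "index" in c for c in chunks)
    assert chunks[-1]["finish_reason"] == "length"
    assert chunks[-1]["completion_usage"]["completion_tokens"] == 2
```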
Cover the happy path, cancellation, and any backend-specific edge
cases (stop tokens, max-tokens cap, empty prompt). Three to five
focused tests is plenty — the framework already pins the lifecycle
state machine and cancellation contract with Rust-side tests in
lib/backend-common.
Step 8: Run it locally
Three moving parts need to come up: NATS + etcd (discovery and the event/request planes), the Dynamo frontend (HTTP → backend discovery), and your backend.
Then send a request:
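A hedged example, assuming the frontend listens on `localhost:8000` and the model was registered as `my-model`; substitute your own host and model name.

```shell
# Assumptions: frontend on localhost:8000, model registered as "my-model".
payload='{"model": "my-model", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'
curl -s --max-time 10 localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$payload"
```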
A successful response has non-empty choices[0].message.content
and a finish_reason of stop or length.
jq -e '.choices[0].finish_reason' is a good one-liner for a CI
smoke test.
If your backend looks silent, set DYN_LOG=info (or
DYN_LOG=debug,dynamo=debug for finer scoping) before launching —
the framework configures tracing from DYN_LOG.
Reference: the sample engine
sample_engine.py
is the canonical minimal reference. Run it as-is:
It generates rotating token IDs with no ML dependencies, so it’s a useful stand-in for AIPerf / end-to-end pipeline smoke tests. Lift these patterns:
- `from_args` parses CLI args and returns `(engine, WorkerConfig)` with no awaits.
- `start()` returns an `EngineConfig` whose KV fields are illustrative but not load-bearing (no real KV cache).
- `generate()` polls `context.is_stopped()` between yields and emits a `cancelled` terminal on observation.
- `cleanup()` is a no-op because the engine holds no resources.
Checklist
Before shipping:
- `LLMEngine` subclassed; `from_args` returns `(engine, WorkerConfig)`.
- `start()` returns `EngineConfig` with at least a non-empty `model`.
- `generate()` polls `context.is_stopped()` between yields and emits a `"cancelled"` terminal on observation.
- Final chunk has `finish_reason` and `completion_usage`.
- Typed `DynamoException` subclasses used for error reporting where the category matters.
- `cleanup()` releases all engine resources.
- Logging levels match the standards in Step 6.
See also
- `LLMEngine` ABC — authoritative contract.
- Package README — feature gaps, error model, request/response contract.
- Sample engine — example user guide.
- Writing a Rust Unified Backend — the Rust counterpart, same contract, lower-level.