New: Dynamo’s unified backend. This guide covers the new unified backend infrastructure in
dynamo.common.backend: a shared LLMEngine ABC that vLLM, SGLang, TRT-LLM, and a sample engine already implement, and that any custom Python engine can plug into the same way. For the Rust version of the same contract, see Writing a Rust Unified Backend. For the older, lower-level Python worker path (register_model + serve_endpoint), which is still the right choice for features the unified backend does not yet cover, see Writing Python Workers.

Beta: actively under development. The unified backend surface is beta quality and may change without backwards compatibility between releases. See Feature gaps below for what the unified path covers today versus the existing (non-unified) backend paths.
This guide walks through building a Python backend for an inference
engine that plugs into Dynamo’s distributed runtime via
dynamo.common.backend. A “unified backend” is a Python entry point
that implements the shared LLMEngine ABC and lets the framework own
runtime lifecycle (signal handling, model registration, graceful
shutdown, cancellation monitoring) — your code just owns inference.
Your backend lives in its own package and does not need to be part
of the dynamo repository. It depends on ai-dynamo from PyPI (or
the git source) and imports dynamo.common.backend. The steps below
assume you’re starting a fresh package in your own repo.
The reference example is the sample engine at
sample_engine.py
— a complete, runnable implementation under 120 lines. Read it
alongside this guide.
Where to look for what:
- LLMEngine ABC docstrings: authoritative method-by-method contract.
- Package README: GenerateRequest / GenerateChunk field definitions, per-engine cancellation cookbook (vLLM / SGLang / TRT-LLM), full DynamoException table, file index, and the per-engine feature-gap matrix.

The unified backend is in beta. The summary below is the common contract (what every engine on the unified path gets) plus the gaps that apply to all three engines. Per-engine specifics (vLLM sleep/wake, SGLang diffusion, TRT-LLM custom logits processors, etc.) live in the package README.
Supported today
Lifecycle and runtime:
- agg / prefill / decode: KV transfer uses NIXL across all three engines; SGLang exchanges a Dynamo-level bootstrap address (host/port/room), while vLLM and TRT-LLM use an engine-internal handshake
- abort() + context.is_stopped()
- drain() hook for pre-cleanup work (e.g. in-flight NIXL transfers)
- DynamoException error chain wrapping

Observability:
- health_check_payload() (plus DYN_HEALTH_CHECK_PAYLOAD / --health-check-payload overrides)
- vllm: / sglang: / trtllm_ / lmcache: via register_prometheus()
- cleanup_time_seconds, drain_time_seconds, model_load_time_seconds: always on
- dynamo_component_* gauges + router kv_used_blocks signal via component_metrics_dp_ranks() + attach_snapshot_publisher() + ComponentSnapshot push
- kv_event_sources() returning ZmqSource or PushSource
- dp_rank.forced_dp_rank / validate_global_dp_rank + EngineConfig.data_parallel_{size, start_rank}
- telemetry.current_span / start_span plus W3C trace header propagation through telemetry.engine_trace_kwargs(context)

Request handling:
- vLLM (StructuredOutputsParams) and TRT-LLM (GuidedDecodingParams) cover JSON schema / regex / grammar / choice; SGLang (_get_guided_decoding_params) covers JSON schema only, and regex / grammar / choice are silently dropped today (see the SGLang-specific gaps in the package README)
- WorkerConfig.structural_tag_{mode, scope, schema} and serialize_structural_tag
- WorkerConfig.custom_jinja_template (frontend applies; the backend advertises through model registration)
- tool_call_parser, reasoning_parser, exclude_tools_when_tool_choice_none

Not yet on the unified path (common to all engines)
If you need one of these features today, keep that workload on the
existing per-engine entry point (dynamo.<backend>.main) until the
unified path catches up.
A backend is two things:
- An LLMEngine subclass: owns the model, accepts preprocessed token requests, streams output chunks.
- A main.py entry point: a three-line shim that hands the engine class to run() from dynamo.common.backend.run, which drives the lifecycle.

The dynamo.common.backend package handles everything else: signal
handling, distributed runtime setup, model registration with
discovery, the serving loop, graceful shutdown, cancellation
monitoring, and error chain wrapping. (The lifecycle state machine
actually lives in Rust; dynamo.common.backend.Worker is a thin
Python shim over it.)
- Python 3.11 or newer: dynamo uses typing.Required, which was added in Python 3.11.
- NATS and etcd: deploy/docker-compose.yml brings up both in one command if you don’t already have them running.
- uv or pip for installing dependencies.
- Familiarity with async Python (asyncio, async generators) and argparse.

Minimal pyproject.toml:
For a bleeding-edge dependency on the dynamo source tree, install the runtime wheel from a clone:
Building the wheel needs a Rust toolchain plus clang, cmake,
protobuf-compiler, and libssl-dev.
LLMEngine

In src/my_backend/engine.py, declare a class that subclasses
LLMEngine and owns whatever state your engine needs. Construction
must be cheap and side-effect-free — heavy work goes in start().
GenerateRequest and GenerateChunk are TypedDicts describing the
shared shape — see Step 4 for the fields.
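A minimal sketch of the class shape described above. It assumes local stand-ins so it runs without ai-dynamo installed: the LLMEngine class here only mirrors the three required method names, the plain dict returned by start() stands in for EngineConfig, and MyEngine, model_name, and _engine are hypothetical names. In a real backend, import the ABC from the dynamo.common.backend package instead.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, AsyncIterator, Optional


class LLMEngine(ABC):
    """Local stand-in for the real ABC; only the required methods are mirrored."""

    @abstractmethod
    async def start(self, worker_id: Any) -> dict: ...

    @abstractmethod
    def generate(self, request: dict, context: Any) -> AsyncIterator[dict]: ...

    @abstractmethod
    async def cleanup(self) -> None: ...


class MyEngine(LLMEngine):
    def __init__(self, model_name: str) -> None:
        # Cheap and side-effect-free: store configuration only.
        self.model_name = model_name
        self._engine: Optional[object] = None  # allocated later, in start()

    async def start(self, worker_id: Any) -> dict:
        self._engine = object()  # placeholder for heavy model loading
        return {"model": self.model_name}  # stand-in for EngineConfig

    async def generate(self, request: dict, context: Any) -> AsyncIterator[dict]:
        yield {"token_ids": [0], "index": 0}

    async def cleanup(self) -> None:
        self._engine = None


async def demo() -> dict:
    engine = MyEngine("demo")
    assert engine._engine is None  # the constructor allocated nothing
    return await engine.start("worker-0")
```

Keeping the constructor trivial means a failed argument parse or an early shutdown never leaves a half-loaded model behind.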
from_args

from_args is a classmethod factory that parses CLI args and returns
(engine, WorkerConfig). The engine is constructed but not
started.
from_args is async to match the ABC; you can await from it if
your CLI parsing reads config from a file or hits an API. Most
backends don’t need to.
For backends that already have a DynamoRuntimeConfig-shaped
config object (e.g. ones derived from vLLM’s, SGLang’s, or
TRT-LLM’s existing config), prefer the
WorkerConfig.from_runtime_config(runtime_cfg, model_name=...)
helper — it pulls the shared discovery / request-plane / parser
fields off the config in one line.
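One way the factory could look, as a sketch under stated assumptions: WorkerConfig below is a local dataclass with hypothetical fields (the real class has its own schema), the argv parameter and the --model-name, --max-tokens, and --namespace flags are invented for illustration, and the engine is a bare class rather than a real LLMEngine subclass.

```python
import argparse
import asyncio
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class WorkerConfig:
    """Stand-in with hypothetical fields; not the real WorkerConfig schema."""

    model_name: str
    namespace: str = "dynamo"


class MyEngine:
    def __init__(self, model_name: str, max_tokens: int) -> None:
        self.model_name = model_name
        self.max_tokens = max_tokens

    @classmethod
    async def from_args(
        cls, argv: Optional[Sequence[str]] = None
    ) -> Tuple["MyEngine", WorkerConfig]:
        parser = argparse.ArgumentParser(prog="my-backend")
        parser.add_argument("--model-name", default="my-model")
        parser.add_argument("--max-tokens", type=int, default=16)
        parser.add_argument("--namespace", default="dynamo")
        args = parser.parse_args(argv)

        # The engine is constructed here but not started.
        engine = cls(args.model_name, args.max_tokens)
        config = WorkerConfig(model_name=args.model_name, namespace=args.namespace)
        return engine, config


async def demo() -> Tuple[MyEngine, WorkerConfig]:
    return await MyEngine.from_args(["--model-name", "demo", "--max-tokens", "4"])
```

Accepting an explicit argument list makes the factory easy to exercise from tests without touching sys.argv.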
LLMEngine methods

The ABC has three required methods (start, generate, cleanup)
plus two with default no-op implementations (abort, drain).
start()

Start the engine and return EngineConfig metadata. After this
returns, generate() MUST be ready for concurrent calls.
worker_id is an opaque per-worker identifier — most engines ignore
it. Backends needing a stable cluster-wide key (e.g. TRT-LLM’s
disagg_machine_id snowflake) should derive from it instead of
hashing host/pid or asking operators for a CLI override.
Every EngineConfig field except model is optional. None means
“don’t advertise”; KV-aware routing falls back to round-robin when KV
fields are unset.
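The "only model is required, None means don’t advertise" rule can be sketched as follows. EngineConfig here is a local dataclass, and every field name other than model (context_length, kv_cache_block_size, total_kv_blocks) is a hypothetical placeholder; check the real EngineConfig for the actual names. The advertised() helper is also an illustration, not part of the framework.

```python
import asyncio
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class EngineConfig:
    """Stand-in: field names other than `model` are hypothetical."""

    model: str
    context_length: Optional[int] = None
    kv_cache_block_size: Optional[int] = None
    total_kv_blocks: Optional[int] = None


class MyEngine:
    def __init__(self, model_name: str) -> None:
        self.model_name = model_name
        self._engine: Optional[object] = None

    async def start(self, worker_id: Any) -> EngineConfig:
        self._engine = object()  # heavy initialization belongs here
        # KV fields stay None, so nothing is advertised for them.
        return EngineConfig(model=self.model_name, context_length=4096)


def advertised(cfg: EngineConfig) -> dict:
    """Keep only the fields that carry a value."""
    return {key: value for key, value in asdict(cfg).items() if value is not None}


async def demo() -> dict:
    return advertised(await MyEngine("demo").start("worker-0"))
```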
generate()

An async generator that yields GenerateChunk dicts for a single
request. Called concurrently for multiple in-flight requests.
Contract (chunk shape is defined by the GenerateChunk TypedDict
— see
Request / Response Types
in the package README for the field reference):
- Each chunk includes token_ids and index (use 0 for single choice).
- The final chunk includes finish_reason and completion_usage.
- The framework calls engine.abort(context) when the client disconnects or cancels; your loop should also poll context.is_stopped() between yields and exit cleanly with a finish_reason="cancelled" chunk.

Finish reason normalization ("abort" → "cancelled", etc.) is
handled by the Rust layer — emit whatever your engine uses
natively.
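The contract above can be sketched as a toy generator. This is an illustration under assumptions, not the sample engine: FakeContext mirrors only is_stopped(), the request keys token_ids and max_tokens are hypothetical, and the completion_usage shape (prompt_tokens, completion_tokens, total_tokens) is assumed to follow the OpenAI-style layout.

```python
import asyncio
from typing import AsyncIterator, List, Optional


class FakeContext:
    """Stand-in for the request context; only is_stopped() is mirrored."""

    def __init__(self, stop_after: Optional[int] = None) -> None:
        self._polls = 0
        self._stop_after = stop_after

    def is_stopped(self) -> bool:
        # Reports "stopped" once it has been polled more than stop_after times.
        self._polls += 1
        return self._stop_after is not None and self._polls > self._stop_after


def usage(prompt_tokens: int, completion_tokens: int) -> dict:
    # Assumed usage shape; verify against the GenerateChunk definition.
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }


async def generate(request: dict, context: FakeContext) -> AsyncIterator[dict]:
    prompt_len = len(request["token_ids"])
    max_tokens = request["max_tokens"]
    for produced in range(max_tokens):
        if context.is_stopped():
            # Exit cleanly with a terminal chunk instead of just returning.
            yield {
                "token_ids": [],
                "index": 0,
                "finish_reason": "cancelled",
                "completion_usage": usage(prompt_len, produced),
            }
            return
        await asyncio.sleep(0)  # stand-in for one decode step
        chunk = {"token_ids": [produced], "index": 0}
        if produced + 1 == max_tokens:
            chunk["finish_reason"] = "length"
            chunk["completion_usage"] = usage(prompt_len, produced + 1)
        yield chunk


async def collect(context: FakeContext) -> List[dict]:
    request = {"token_ids": [1, 2, 3], "max_tokens": 4}
    return [chunk async for chunk in generate(request, context)]
```

Polling before each decode step keeps the cancellation latency bounded by one step, and the terminal chunk still reports how many tokens were produced.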
abort(context) (optional)

Called by the framework only when the client disconnects or the request is cancelled. NOT called on silent stream drops. Override to release engine-side resources (KV slots, scheduler entries, remote schedulers).
For cleanup that must run on every drop path — including silent
drops — use a try/finally or a context manager inside generate,
not abort. The sample engine doesn’t override abort because it
has no engine-side state to release; the default is a no-op.
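A small sketch of the try/finally approach, using only standard asyncio behavior: closing an async generator raises GeneratorExit at the suspended yield, so the finally block runs even when the consumer walks away mid-stream. The slot name and the released list are hypothetical stand-ins for a real engine-side resource.

```python
import asyncio
from typing import AsyncIterator, List


async def generate(request_id: str, released: List[str]) -> AsyncIterator[dict]:
    slot = f"kv-slot-{request_id}"  # hypothetical engine-side resource
    try:
        for i in range(3):
            await asyncio.sleep(0)  # stand-in for one decode step
            yield {"token_ids": [i], "index": 0}
    finally:
        # Runs on normal completion, on errors, and when the stream is closed early.
        released.append(slot)


async def demo() -> List[str]:
    released: List[str] = []

    # Path 1: the consumer reads one chunk, then drops the stream.
    stream = generate("dropped", released)
    await stream.__anext__()
    await stream.aclose()

    # Path 2: the consumer reads the stream to the end.
    async for _ in generate("complete", released):
        pass

    return released
```

Both paths release the slot exactly once, which is the property abort alone cannot give you on silent drops.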
drain() (optional)

Runs once before shutdown, after the discovery unregister + grace-period sleep, while NATS/etcd are still alive. Use it for backend-side draining that must complete before transport teardown (e.g. in-flight NIXL KV transfers on prefill workers). Default is no-op.
cleanup()

Two real requirements, both pinned by the Rust-side conformance kit:
- Idempotent. cleanup() may be called more than once (the conformance kit calls it twice in a row), so a repeat call must not fail.
- Safe after a partial start() failure. If start() raises partway through, fields you allocate incrementally may still be None. cleanup() must guard each resource (if self._engine is not None: …) so the post-failure call doesn’t crash on half-initialized state.

The Rust Worker drives both: it calls cleanup() after start()
returns Ok on shutdown, and the conformance kit (run_conformance)
additionally calls cleanup() on a never-started engine and twice in a
row, failing your tests with CleanupWithoutStartFailed /
SecondCleanupFailed if either invariant breaks. The guarded
single-shot pattern below covers both:
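A minimal sketch of such a cleanup, assuming a single hypothetical resource (FakeRuntime, which only counts shutdown calls). It is one way to satisfy both invariants, not necessarily the exact pattern the sample engine or the conformance kit expects.

```python
import asyncio
from typing import Any, Optional


class FakeRuntime:
    """Hypothetical engine resource that counts shutdown calls."""

    def __init__(self) -> None:
        self.shutdown_calls = 0

    async def shutdown(self) -> None:
        self.shutdown_calls += 1


class MyEngine:
    def __init__(self) -> None:
        self._engine: Optional[FakeRuntime] = None

    async def start(self, worker_id: Any) -> dict:
        self._engine = FakeRuntime()
        return {"model": "demo"}

    async def cleanup(self) -> None:
        # Take the reference and clear the field first, so a second call
        # (or a call on a never-started engine) finds None and does nothing.
        engine, self._engine = self._engine, None
        if engine is not None:
            await engine.shutdown()


async def demo() -> int:
    never_started = MyEngine()
    await never_started.cleanup()  # must not crash on a never-started engine
    await never_started.cleanup()

    started = MyEngine()
    await started.start("worker-0")
    runtime = started._engine
    await started.cleanup()
    await started.cleanup()  # second call is a no-op
    return runtime.shutdown_calls
```

With several resources, repeat the take-and-clear step per field so that a failure while releasing one does not leave the others referenced.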
main.py

Three lines.
run installs signal handlers, builds the distributed runtime,
calls engine.start(worker_id) with a runtime-allocated identifier,
registers the model with discovery, serves the endpoint, and runs the
graceful-shutdown orchestrator on SIGTERM/SIGINT.
Pair this with the [project.scripts] entry from Step 1’s
pyproject.toml so my-backend ... works as a console command.
Errors: the framework wraps non-DynamoException errors raised
from generate() (or lifecycle methods) as Unknown. For typed
error reporting, raise a DynamoException subclass directly from
dynamo.llm.exceptions
— it propagates unchanged through the Rust bridge:
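The wrapping rule can be modeled with a toy, under loud assumptions: the three exception classes below are local stand-ins named after the types mentioned in this guide (the real ones live in dynamo.llm.exceptions), load_model is a hypothetical loader that always fails, and call_wrapped only imitates what the framework is described as doing. It is not the Rust bridge.

```python
import asyncio
from typing import Any, Awaitable


class DynamoException(Exception):
    """Stand-in for the real base class."""


class EngineShutdown(DynamoException):
    """Stand-in for the typed init-failure error."""


class Unknown(DynamoException):
    """Stand-in for the category used for untyped errors."""


def load_model(path: str) -> object:
    raise FileNotFoundError(path)  # hypothetical loader that always fails


class MyEngine:
    async def start(self, worker_id: Any) -> dict:
        try:
            self._engine = load_model("/nonexistent/weights")
        except OSError as exc:
            # Typed error: the category survives the trip to the caller.
            raise EngineShutdown(f"engine init failed: {exc}") from exc
        return {"model": "demo"}


async def call_wrapped(awaitable: Awaitable[Any]) -> Any:
    """Toy model of the rule: typed errors pass through, others become Unknown."""
    try:
        return await awaitable
    except DynamoException:
        raise
    except Exception as exc:
        raise Unknown(str(exc)) from exc


async def boom() -> None:
    raise ValueError("unexpected")


def error_type(awaitable: Awaitable[Any]) -> str:
    try:
        asyncio.run(call_wrapped(awaitable))
    except DynamoException as exc:
        return type(exc).__name__
    return "no error"
```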
The package README has the full table of exception types and which
lifecycle phase raises which one. Engine-init failures should raise
EngineShutdown from start(). Cleanup shouldn’t normally raise —
log and swallow if a subsystem fails.
Logging: keep levels consistent across unified backends so operators see the same surface regardless of which engine they’re running:
- logger.info: lifecycle milestones (engine init complete, serving started, engine shutdown).
- logger.debug: per-request events (request abort, cancellation).
- logger.warning: recoverable problems (empty outputs, unexpected finish reasons).
- logger.error: unrecoverable failures only.

The framework also configures dynamo.runtime.logging for you; you
just call logger = logging.getLogger(__name__) at the top of your
module and use it.
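As an illustration of those levels, here is a small helper that picks the level for a finished request. The helper name, its parameters, and the set of expected finish reasons are assumptions made for this sketch; only the level guidance comes from the list above.

```python
import logging

logger = logging.getLogger(__name__)

# Assumed set of finish reasons an engine normally produces.
EXPECTED_FINISH_REASONS = {"stop", "length", "cancelled"}


def log_request_end(request_id: str, finish_reason: str, num_tokens: int) -> int:
    """Log one finished request and return the level that was used."""
    if finish_reason == "cancelled":
        level = logging.DEBUG  # per-request event
    elif num_tokens == 0 or finish_reason not in EXPECTED_FINISH_REASONS:
        level = logging.WARNING  # recoverable: empty output or odd finish reason
    else:
        level = logging.DEBUG  # ordinary per-request completion
    logger.log(
        level,
        "request %s finished: reason=%s tokens=%d",
        request_id,
        finish_reason,
        num_tokens,
    )
    return level
```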
Install the dev extras (pytest, pytest-asyncio) declared in Step 1:
The sample engine has a unit-test suite that you can copy as a starting point. The shape of a useful test:
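A sketch of that shape, written so it runs under plain pytest without plugins (each test drives the event loop with asyncio.run). ToyEngine, StoppableContext, and the request keys are hypothetical; swap in your own LLMEngine subclass and the real request and context types.

```python
import asyncio
from typing import Any, AsyncIterator, List


class StoppableContext:
    """Stand-in for the request context; only is_stopped() is mirrored."""

    def __init__(self, stopped: bool = False) -> None:
        self.stopped = stopped

    def is_stopped(self) -> bool:
        return self.stopped


class ToyEngine:
    """Hypothetical engine under test."""

    async def start(self, worker_id: Any) -> dict:
        return {"model": "toy"}

    async def generate(self, request: dict, context: Any) -> AsyncIterator[dict]:
        total = request["max_tokens"]
        for i in range(total):
            if context.is_stopped():
                yield {"token_ids": [], "index": 0, "finish_reason": "cancelled"}
                return
            chunk = {"token_ids": [i], "index": 0}
            if i + 1 == total:
                chunk["finish_reason"] = "length"
            yield chunk

    async def cleanup(self) -> None:
        return None


async def run_request(context: StoppableContext, max_tokens: int = 3) -> List[dict]:
    engine = ToyEngine()
    await engine.start("worker-0")
    try:
        request = {"token_ids": [1], "max_tokens": max_tokens}
        return [chunk async for chunk in engine.generate(request, context)]
    finally:
        await engine.cleanup()  # always release, even if an assertion fails


def test_happy_path_ends_with_length() -> None:
    chunks = asyncio.run(run_request(StoppableContext()))
    assert len(chunks) == 3
    assert chunks[-1]["finish_reason"] == "length"


def test_cancellation_emits_cancelled_terminal() -> None:
    chunks = asyncio.run(run_request(StoppableContext(stopped=True)))
    assert [c["finish_reason"] for c in chunks] == ["cancelled"]
```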
Cover the happy path, cancellation, and any backend-specific edge
cases (stop tokens, max-tokens cap, empty prompt). Three to five
focused tests is plenty — the framework already pins the lifecycle
state machine and cancellation contract with Rust-side tests in
lib/backend-common.
Three moving parts need to come up: NATS + etcd (discovery and the event/request planes), the Dynamo frontend (HTTP → backend discovery), and your backend.
Then send a request:
A successful response has non-empty choices[0].message.content
and a finish_reason of stop or length.
jq -e '.choices[0].finish_reason' is a good one-liner for a CI
smoke test.
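The same success criterion can be checked from Python, for example inside a test harness. This is a sketch: the function name is hypothetical, and it assumes the response body follows the chat-completions shape described above.

```python
import json


def is_successful_completion(body: str) -> bool:
    """Return True if the first choice has content and a normal finish reason."""
    response = json.loads(body)
    choices = response.get("choices") or []
    if not choices:
        return False
    first = choices[0]
    content = (first.get("message") or {}).get("content")
    return bool(content) and first.get("finish_reason") in {"stop", "length"}
```

An empty content string or a finish_reason outside stop and length both count as failures, matching the criterion stated above.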
If your backend looks silent, set DYN_LOG=info (or
DYN_LOG=debug,dynamo=debug for finer scoping) before launching —
the framework configures tracing from DYN_LOG.
sample_engine.py
is the canonical minimal reference. Run it as-is:
It generates rotating token IDs with no ML dependencies, so it’s a useful stand-in for AIPerf / end-to-end pipeline smoke tests. Lift these patterns:
- from_args parses CLI args and returns (engine, WorkerConfig) with no awaits.
- start() returns an EngineConfig whose KV fields are illustrative but not load-bearing (no real KV cache).
- generate() polls context.is_stopped() between yields and emits a cancelled terminal on observation.
- cleanup() is a no-op because the engine holds no resources.

Before shipping:
- LLMEngine subclassed; from_args returns (engine, WorkerConfig).
- start() returns EngineConfig with at least a non-empty model.
- generate() polls context.is_stopped() between yields and emits a "cancelled" terminal on observation.
- The final chunk includes finish_reason and completion_usage.
- DynamoException subclasses used for error reporting where the category matters.
- cleanup() releases all engine resources.

LLMEngine ABC: authoritative contract.