New: Dynamo’s unified backend. This guide covers the new unified backend infrastructure in
dynamo-backend-common: a shared LLMEngine contract that vLLM, SGLang, TRT-LLM, and the mocker already implement, and that any custom engine can plug into the same way.

Beta: actively under development. The Rust native backend surface is beta quality and may change without backwards compatibility between releases. See Feature gaps below for what the unified path covers today versus the existing (non-unified) backend paths.
This guide walks through building a Rust unified backend for an
inference engine that plugs into Dynamo’s distributed runtime. A
unified backend is a standalone Rust binary that owns its engine and
serves requests via the shared
LLMEngine contract in
dynamo-backend-common — no Python
worker runtime required. For the Python version of the same contract
see Writing a Python Unified Backend.
Your backend lives in its own crate and does not need to be part of
the dynamo repository. It pulls dynamo-backend-common in as a normal
git or path dependency. The steps below assume you’re starting a fresh
crate in your own repo; an optional note in Step 1 covers the in-tree
variant for contributors landing a backend inside ai-dynamo/dynamo.
For a Python engine, use Writing a Python Unified Backend — same contract, lighter setup. The non-unified fallback for feature gaps (multimodal, LoRA, logprobs, etc.) is Python-only; see Writing Python Workers if you need one of those today.
The reference example is the mocker backend at
lib/backend-common/examples/mocker
— a small, complete, pure-Rust implementation. Read it alongside this
guide.
Where to look for what:
- LLMEngine trait doc comments: authoritative method-by-method contract.
- backend-common design notes: rationale and invariants.

The unified backend is in beta. The summary below is the common
contract — what every engine on the unified path gets, whether
written in Rust directly or plugged in from Python via the PyO3
Worker shim. Per-engine specifics (vLLM sleep/wake, SGLang
diffusion, TRT-LLM custom logits processors, etc.) live in the
Python package README.
Supported today
Lifecycle and runtime:
- (Aggregated / Prefill / Decode): KV transfer uses NIXL across all production engines; SGLang exchanges a Dynamo-level bootstrap address, vLLM and TRT-LLM use an engine-internal handshake. The Rust mocker example exercises the same wire format CPU-only
- ctx.is_stopped() polling plus the framework’s out-of-band abort() monitor
- drain() hook for pre-cleanup work
- DynamoError with ErrorType::Backend(BackendError::X)
- testing::run_conformance kit

Observability:
- LLMEngine::health_check_payload() plus the operator override (DYN_HEALTH_CHECK_PAYLOAD / --health-check-payload)
- /metrics output via LLMEngine::setup_metrics(), plus framework-owned lifecycle gauges (dynamo_component_{cleanup_time_seconds, drain_time_seconds, model_load_time_seconds}) and per-rank dynamo_component_* gauges driven by SnapshotPublisher
- kv_event_sources() returning KvEventSource::Zmq or KvEventSource::Push
- EngineConfig::data_parallel_size / data_parallel_start_rank; read the router-forced rank off request.routing.dp_rank in generate()
- engine.generate span around every generate() call with attributes for model / input_tokens / disagg_role / ttft_ms / output_tokens / finish_reason / ITL percentiles. Static-name spans opened with tracing::info_span! inside generate() nest under it automatically; for dynamic span names use dynamo_backend_common::telemetry::start_span(name). For outbound calls that need to carry trace context (custom HTTP/TCP transports), use dynamo_runtime::logging::inject_trace_headers_into_map. NATS egress is auto-injected; engines do nothing.

Request handling:
- SamplingOptions::guided_decoding (GuidedDecodingOptions); engine-side coverage on the existing Python-bridged engines is: vLLM and TRT-LLM forward JSON schema / regex / grammar / choice; SGLang forwards JSON schema only (regex / grammar / choice are silently dropped today). A new Rust engine should forward whichever variants its backend supports
- WorkerConfig::structural_tag_{mode, scope, schema} (typed enums)
- WorkerConfig::custom_jinja_template flows to LocalModelBuilder::custom_template_path and the frontend applies the template at preprocessing time
- WorkerConfig (tool_call_parser, reasoning_parser, exclude_tools_when_tool_choice_none)

Not yet on the unified path (common to all engines)
If you need one of these features today, keep that workload on the existing per-engine entry point until the unified path catches up.
A backend is two things:
- An LLMEngine trait implementation: owns the model, accepts preprocessed token requests, streams output tokens.
- A main.rs entry point: a three-line shim that hands the engine to dynamo_backend_common::run, which drives the lifecycle.

The dynamo-backend-common crate handles everything else: signal
handling, model registration with discovery, the serving loop, graceful
shutdown, metrics, cancellation plumbing, and the debug-mode contract
validator.
Engines work directly with PreprocessedRequest and LLMEngineOutput
— the same types used by Dynamo’s preprocessing, routing, and frontend.
No Python-shaped translation layer.
- A Rust toolchain that matches dynamo’s pin; on older toolchains, the error feature edition2024 is required surfaces deep inside the build.
- NATS and etcd. deploy/docker-compose.yml brings up both in one command if you don’t already have them running.
- Familiarity with async Rust, tokio, and clap. The trait uses async_trait, and the framework expects a tokio runtime.

Your backend is a standalone Rust binary crate. It can live in its own repository; the dynamo repo is not required to be your parent workspace. Pick whatever layout you prefer:
cargo new --bin my-backend is the fastest starting point; add
src/engine.rs yourself afterwards.
dynamo-backend-common crate

dynamo-backend-common lives in the
ai-dynamo/dynamo repository and
is not on crates.io. Depend on it via git:
The testing feature pulls in the conformance kit used in Step 7.
Pick a SHA with:
No release tags yet. dynamo-backend-common landed after the last tagged release (v1.1.1), so tag = "v1.1.1" won’t resolve the crate. Track main or pin to a specific SHA until a release tag ships that includes the crate.
These are easy to miss and surface as confusing compile errors deep
inside dynamo-runtime:
tokio_unstable cfg flag. dynamo-runtime uses tokio’s
unstable runtime-metrics API. Create .cargo/config.toml in your
crate root:
Without it, you’ll see errors like method blocking_queue_depth not found on RuntimeMetrics while compiling dynamo-runtime.
Rust toolchain pin. Match dynamo’s toolchain so workspace-edition
crates compile. Create rust-toolchain.toml:
Older toolchains fail with feature edition2024 is required.
Tip (local development): while iterating against an unreleased change in
dynamo-backend-common, point the dep at a local clone: dynamo-backend-common = { path = "/path/to/dynamo/lib/backend-common" }. Switch back to the git dep before publishing your crate.
If you’d rather develop inside the dynamo workspace as a new
sub-crate, drop your crate under dynamo/lib/ and use
dynamo-backend-common = { workspace = true } instead. The trait
contract is identical, and the .cargo/config.toml plus toolchain
pin in the dynamo repo cover the two requirements above for you.
In src/engine.rs (or whatever you named it), declare a struct that
owns whatever state your engine needs. Anything you allocate inside
start() later must live behind interior mutability so the trait’s
&self methods can reach it.
async-trait lets the trait use async fn (still required for
object-safety with Arc<dyn LLMEngine>); async-stream’s stream!
macro lets the generate body yield items from inside an async block.
The mocker example uses OnceCell for Inner; RwLock<Option<_>>
also works — pick whichever fits your shutdown semantics.
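As a rough, std-only illustration of the interior-mutability requirement, the sketch below uses std::sync::OnceLock where the mocker uses OnceCell. MyEngine, Inner, and the synchronous start/model_name methods are hypothetical stand-ins; the real trait methods are async and have different signatures.

```rust
use std::sync::OnceLock;

/// State that only exists once `start()` has run (hypothetical).
struct Inner {
    model_name: String,
}

/// Hypothetical engine: cheap to construct, initialized later through `&self`.
struct MyEngine {
    inner: OnceLock<Inner>,
}

impl MyEngine {
    fn new() -> Self {
        Self { inner: OnceLock::new() }
    }

    /// Stand-in for `start(&self)`: fills the cell exactly once.
    fn start(&self, model_name: &str) -> Result<(), String> {
        self.inner
            .set(Inner { model_name: model_name.to_string() })
            .map_err(|_| "engine already started".to_string())
    }

    /// Stand-in for any `&self` method that needs the started state.
    fn model_name(&self) -> Result<&str, String> {
        self.inner
            .get()
            .map(|inner| inner.model_name.as_str())
            .ok_or_else(|| "engine not started".to_string())
    }
}
```

A write-once cell fits engines that never re-initialize; an engine that must take its state back out during shutdown would lean toward the RwLock<Option<_>> shape instead.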
Every backend’s CLI shares a common base (--namespace, --component,
--endpoint, etc.) provided by CommonArgs. Flatten that into your
engine’s Args struct and add your engine-specific flags.
Define an inherent from_args constructor that parses the args and
returns both the engine and a WorkerConfig. from_args is not on
the trait — it stays inherent so the trait can remain object-safe
(Arc<dyn LLMEngine> must work).
The snippet below calls a tiny invalid_arg helper that builds a
typed BackendError::InvalidArgument. Its full definition lives in
Step 6 — for now, mentally substitute “any function that returns a
DynamoError with category InvalidArgument.”
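The shape of an inherent from_args can be sketched with std only. Everything here is a hypothetical stand-in: the simplified Engine trait, EchoEngine, WorkerConfigSketch, the InvalidArgument struct, and the two string parameters (a real backend would parse a clap Args struct that flattens CommonArgs).

```rust
use std::sync::Arc;

/// Stand-in for the typed error; the real helper builds a `DynamoError`.
#[derive(Debug, PartialEq)]
struct InvalidArgument(String);

fn invalid_arg(msg: impl Into<String>) -> InvalidArgument {
    InvalidArgument(msg.into())
}

/// Hypothetical, simplified trait: it has no constructor, so `dyn Engine` is allowed.
trait Engine: Send + Sync {
    fn name(&self) -> &str;
}

/// Hypothetical stand-in for `WorkerConfig`.
#[derive(Debug, Default, PartialEq)]
struct WorkerConfigSketch {
    namespace: String,
}

struct EchoEngine {
    max_tokens: u32,
}

impl EchoEngine {
    /// Inherent constructor: validates flags and returns the engine plus its config.
    fn from_args(
        namespace: &str,
        raw_max_tokens: &str,
    ) -> Result<(Self, WorkerConfigSketch), InvalidArgument> {
        let max_tokens: u32 = raw_max_tokens.parse().map_err(|_| {
            invalid_arg(format!("--max-tokens must be an integer, got {raw_max_tokens:?}"))
        })?;
        if max_tokens == 0 {
            return Err(invalid_arg("--max-tokens must be greater than zero"));
        }
        let config = WorkerConfigSketch { namespace: namespace.to_string() };
        Ok((Self { max_tokens }, config))
    }
}

impl Engine for EchoEngine {
    fn name(&self) -> &str {
        "echo"
    }
}

/// Compiles only because `Engine` stays object-safe.
fn into_dyn(engine: EchoEngine) -> Arc<dyn Engine> {
    Arc::new(engine)
}
```

Keeping the constructor off the trait means the trait never mentions Self by value, which is what lets the runtime hold the engine behind Arc<dyn ...>.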
WorkerConfig::default() sets model_input to ModelInput::Tokens,
which is the only mode Worker currently supports — the framework
validates this at startup. Engines needing raw text or tensor inputs
aren’t supported yet.
If your engine branches on the disaggregation role inside generate
(prefill vs decode), keep the same DisaggregationMode on your engine
struct so the runtime registration (WorkerConfig) and the per-request
dispatch stay in lockstep.
LLMEngine trait

The trait has three required methods (start, generate, cleanup)
plus two with default implementations you can override (abort, drain).

start()

Start the engine and return EngineConfig metadata. After this
returns, the engine MUST be ready for concurrent generate() calls.
Use interior mutability for anything you initialize here.
worker_id is an opaque per-worker identifier — most engines ignore
it with _worker_id. Backends needing a stable cluster-wide key
(e.g. TRT-LLM’s disagg_machine_id snowflake) should derive from it.
Every EngineConfig field except model is optional. None means
“don’t advertise”; KV-aware routing falls back to round-robin when KV
fields are unset. Engines wrapping an external runtime can read these
values from the live engine after it comes up, instead of hard-coding
them. The ..Default::default() is load-bearing: EngineConfig
sometimes grows new fields (e.g. bootstrap_host/bootstrap_port
for SGLang disagg) and the default keeps existing engines compiling.
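The struct-update point can be shown with a stand-in type. EngineConfigSketch is hypothetical, and the field names other than model and bootstrap_port are assumptions made for illustration, not the real EngineConfig layout.

```rust
/// Hypothetical stand-in for `EngineConfig`.
#[derive(Debug, Default, Clone, PartialEq)]
struct EngineConfigSketch {
    model: String,
    context_length: Option<u32>,
    kv_cache_block_size: Option<u32>,
    // Imagine this field arriving in a later release: the function below
    // keeps compiling because it ends with `..Default::default()`.
    bootstrap_port: Option<u16>,
}

fn engine_config() -> EngineConfigSketch {
    EngineConfigSketch {
        model: "my-model".to_string(),
        context_length: Some(4096),
        // Everything not listed stays `None`, i.e. "don't advertise".
        ..Default::default()
    }
}
```

Without the trailing ..Default::default(), every new field would turn into a missing-field compile error in each engine that builds the struct literally.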
generate()

Yield a stream of Result<LLMEngineOutput, DynamoError> items for a
single request. Called concurrently for multiple in-flight requests.
ctx: GenerateContext is a thin wrapper that Derefs to
dyn AsyncEngineContext, so the cancellation methods (stopped(),
is_stopped(), id()) you’d expect are still there. The wrapper
additionally exposes notify_first_token() for decode-mode requests
— most engines can ignore it; the framework auto-fires on the first
non-empty chunk.
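For readers unfamiliar with the "wrapper that Derefs to a trait object" pattern, here is a std-only sketch of the idea. It is not the real GenerateContext: EngineContext, BasicContext, GenerateContextSketch, and the bool returned by notify_first_token are all hypothetical simplifications.

```rust
use std::ops::Deref;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical, simplified cancellation-context trait.
trait EngineContext: Send + Sync {
    fn id(&self) -> &str;
    fn is_stopped(&self) -> bool;
}

struct BasicContext {
    id: String,
    stopped: AtomicBool,
}

impl EngineContext for BasicContext {
    fn id(&self) -> &str {
        &self.id
    }
    fn is_stopped(&self) -> bool {
        self.stopped.load(Ordering::Acquire)
    }
}

/// Thin wrapper: adds one method and forwards everything else through `Deref`.
struct GenerateContextSketch {
    inner: Arc<dyn EngineContext>,
    first_token_sent: AtomicBool,
}

impl GenerateContextSketch {
    /// Returns true only on the first call (illustrative behaviour).
    fn notify_first_token(&self) -> bool {
        !self.first_token_sent.swap(true, Ordering::AcqRel)
    }
}

impl Deref for GenerateContextSketch {
    type Target = dyn EngineContext;
    fn deref(&self) -> &Self::Target {
        &*self.inner
    }
}
```

Method calls such as ctx.id() resolve through auto-deref, so code written against the inner trait keeps working unchanged when it is handed the wrapper.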
Contract (the debug-mode validator panics on violations):
- End the stream with a terminal item: Ok(chunk) with finish_reason set, or an Err(DynamoError). No items may be yielded after a terminal.
- Non-terminal chunks use chunk::token(id) and leave finish_reason unset.
- The stream must be 'static: clone or move any state from &self or request into the stream body before constructing it.

Terminal chunks come from one of four LLMEngineOutput constructors,
optionally chained with the LLMEngineOutputExt setters
(.with_tokens(...), .with_usage(...)):
- LLMEngineOutput::stop(): natural completion (e.g. you reached your echo limit, the engine hit a stop string).
- LLMEngineOutput::length(): max_tokens cap reached.
- LLMEngineOutput::cancelled(): you observed ctx.stopped().
- LLMEngineOutput::error(msg): message-only error terminal (loses the typed BackendError variant; yield Err(DynamoError) instead when the category matters).

Non-terminal chunks use chunk::token(id) (single-token convenience).
A streaming-generate template:
biased is load-bearing for the channel-receiving pattern above:
during shutdown a stream can observe both ctx.stopped() and
rx.recv() -> None simultaneously; biased picks the clean
cancellation path instead of erroring on a closed channel. The
mocker’s stream body spells this out.

If your engine doesn’t have a receiver (e.g. you’re computing tokens inline like a deterministic echo backend), the body collapses to a plain loop that polls cancellation between yields:
No channel-close race to worry about; biased is still cheap and
recommended for consistency.
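The inline variant can be sketched without any async machinery. The std-only example below is an analog under stated assumptions, not the framework API: Chunk, echo_generate, and the on_chunk callback (standing in for yield) are hypothetical, and an AtomicBool plays the role of ctx.is_stopped().

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the chunks a stream yields.
#[derive(Debug, Clone, PartialEq)]
enum Chunk {
    Token(u32),
    Stop,
    Length,
    Cancelled,
}

/// Echoes `prompt` back token by token, checking `cancel` before each yield.
/// `on_chunk` stands in for `yield`; it receives every chunk in order.
fn echo_generate(
    prompt: &[u32],
    max_tokens: usize,
    cancel: &AtomicBool,
    mut on_chunk: impl FnMut(&Chunk),
) {
    for (emitted, &token) in prompt.iter().enumerate() {
        // Cancellation is checked first, so it wins over every other outcome.
        if cancel.load(Ordering::Acquire) {
            on_chunk(&Chunk::Cancelled);
            return;
        }
        // The cap was reached while input remains: terminal Length.
        if emitted == max_tokens {
            on_chunk(&Chunk::Length);
            return;
        }
        on_chunk(&Chunk::Token(token));
    }
    // Ran out of input: natural completion.
    on_chunk(&Chunk::Stop);
}
```

Each path emits exactly one terminal chunk and then returns, which mirrors the rule that nothing may follow a terminal.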
Cancellation rules:
- Poll ctx.is_stopped() (or await ctx.stopped()) between yields.
- On cancel, emit FinishReason::Cancelled, not Length or Stop. The conformance kit treats any other terminal after cancellation as ignoring the signal.

Typed errors vs. string errors:
Use typed errors when the failure category matters to the caller. Use string errors when it doesn’t.
abort() and per-request cleanup

abort is called by the framework only when ctx.stopped() or
ctx.killed() fires — i.e. an explicit client/operator cancel. It is
NOT called when the stream is silently dropped (TCP reset, consumer
timeout without cancellation).
For cleanup that must run on any drop path (releasing a scheduler
slot, freeing a request handle), use RAII inside the generate stream
body:
The mocker’s ActiveRequestGuard is the canonical example.
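A minimal std-only sketch of the RAII idea follows. SlotGuard and handle_request are hypothetical names; the mocker's actual guard may track different state.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical guard in the spirit of `ActiveRequestGuard`:
/// holds one scheduler slot for as long as it is alive.
struct SlotGuard {
    active: Arc<AtomicUsize>,
}

impl SlotGuard {
    fn acquire(active: &Arc<AtomicUsize>) -> Self {
        active.fetch_add(1, Ordering::SeqCst);
        Self { active: Arc::clone(active) }
    }
}

impl Drop for SlotGuard {
    fn drop(&mut self) {
        // Runs on normal completion, early return, and when the owner is dropped.
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Stand-in for a stream body; returns early when `fail` is set.
fn handle_request(active: &Arc<AtomicUsize>, fail: bool) -> Result<usize, String> {
    let _guard = SlotGuard::acquire(active);
    if fail {
        return Err("engine error".to_string());
    }
    // While the guard is alive, the slot is counted.
    Ok(active.load(Ordering::SeqCst))
}
```

In an async stream the guard would be created inside the stream body, so dropping the stream (for any reason) drops the guard and releases the slot.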
Use abort only for out-of-band notifications (e.g. telling a remote
scheduler to stop computing for this request).
drain() and cleanup()

drain() runs once before shutdown, after the discovery
unregister + grace-period sleep, while NATS/etcd are still alive.
Use it for backend-side draining that must complete before the
transport layer goes away (e.g. in-flight NIXL KV transfers on
prefill workers). Default is no-op.

cleanup() is called once on shutdown. Release all engine
resources. The framework guarantees cleanup() runs exactly once if
start() succeeded, even if registration or serve fails afterward.

Make cleanup() idempotent and tolerant of being called from a
half-initialized state. Engines like vLLM/TRT-LLM tear down NCCL groups
in cleanup() and a second attempt can hang.
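One common way to get idempotence is an atomic "already cleaned up" flag. The sketch below is std-only and synchronous; Engine, teardown_runs, and the method body are hypothetical stand-ins for a real async cleanup().

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical engine whose teardown must not run twice.
struct Engine {
    cleaned_up: AtomicBool,
    teardown_runs: AtomicUsize,
}

impl Engine {
    fn new() -> Self {
        Self {
            cleaned_up: AtomicBool::new(false),
            teardown_runs: AtomicUsize::new(0),
        }
    }

    /// Stand-in for `cleanup(&self)`: only the first caller performs teardown.
    fn cleanup(&self) {
        // `swap` returns the previous value, so exactly one caller sees `false`.
        if self.cleaned_up.swap(true, Ordering::AcqRel) {
            return;
        }
        // Expensive, non-repeatable teardown would go here.
        self.teardown_runs.fetch_add(1, Ordering::Relaxed);
    }
}
```

Pairing the flag with optional state (for example a cell that may still be empty) also covers the half-initialized case, since teardown can skip whatever was never created.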
main.rs

Three lines. That’s it.
run installs signal handlers, builds the distributed runtime, calls
engine.start(), registers the model with discovery, serves the
endpoint, and runs the full graceful-shutdown orchestrator on
SIGTERM/SIGINT.
Errors: every error returned from start, generate, cleanup,
and from_args uses ErrorType::Backend(BackendError::X). From the
frontend’s perspective, anything bubbling up through the backend has
“originated from the backend” — engine code vs. framework code is not
distinguished. Top-level ErrorType::X variants are reserved for
non-backend paths.
A small helper module per backend keeps the call sites clean:
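The shape of such a module might look like the sketch below. It uses local stand-in types only: BackendCategory and BackendFailure are hypothetical, and a real helper would construct a DynamoError with ErrorType::Backend(BackendError::X) through whatever constructor the crate exposes.

```rust
use std::fmt;

/// Hypothetical mirror of a few nested backend categories.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BackendCategory {
    InvalidArgument,
    CannotConnect,
    EngineShutdown,
}

/// Hypothetical stand-in for a backend-tagged `DynamoError`.
#[derive(Debug, Clone, PartialEq)]
struct BackendFailure {
    category: BackendCategory,
    message: String,
}

impl fmt::Display for BackendFailure {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?}: {}", self.category, self.message)
    }
}

impl std::error::Error for BackendFailure {}

/// Single constructor; the helpers below only pick the category.
fn backend_error(category: BackendCategory, message: impl Into<String>) -> BackendFailure {
    BackendFailure { category, message: message.into() }
}

fn invalid_arg(message: impl Into<String>) -> BackendFailure {
    backend_error(BackendCategory::InvalidArgument, message)
}

fn cannot_connect(message: impl Into<String>) -> BackendFailure {
    backend_error(BackendCategory::CannotConnect, message)
}

fn engine_shutdown(message: impl Into<String>) -> BackendFailure {
    backend_error(BackendCategory::EngineShutdown, message)
}
```

Call sites then read as return Err(invalid_arg("...")), and the category travels with the error instead of being encoded in the message text.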
Common nested categories: InvalidArgument, CannotConnect,
EngineShutdown, StreamIncomplete, Cancelled, ResponseTimeout,
Disconnected, ConnectionTimeout, Unknown.
Logging: keep levels consistent across Rust backends so operators see the same surface everywhere.
- tracing::info! for lifecycle milestones (engine started, cleanup complete). Worker already logs “Serving {model} on …” and “Engine cleanup complete”; add your own only for events those don’t cover.
- tracing::debug! for per-request events (cancellation, abort).
- tracing::warn! for recoverable problems.
- tracing::error! only for unrecoverable failures.

Before merging, prove your engine satisfies the contract. The conformance kit is one call:
The kit runs start/generate/cleanup directly against your engine
— no external service is involved. If your engine needs a real GPU,
remote model server, or other heavyweight resource to construct, gate
the test with #[ignore] and require an explicit opt-in env var.
What it asserts:
For tests that don’t need a real engine, use testing::mock_context()
or testing::cancelling_context(after) to drive generate manually.
Three moving parts need to come up: NATS + etcd (discovery and the event/request planes), the Dynamo Python frontend (HTTP → backend discovery), and your backend.
The fastest path is to copy the mocker example’s
docker-compose.yml
and Dockerfile.frontend,
swap in your image, and run docker compose up --build. That brings
up NATS + etcd + the Python frontend (built from the dynamo workspace
at the same SHA as your backend) + your backend, all on one network.
For a non-Docker dev loop:
Then send a request:
A successful response has non-empty choices[0].message.content and a
finish_reason of stop or length. jq -e '.choices[0].finish_reason'
is a good one-liner for a CI smoke test.
run initializes tracing from the DYN_LOG env var (defaults to
info); set DYN_LOG=debug or
DYN_LOG=info,dynamo_backend_common=trace for more detail. RUST_LOG
is not honored — DYN_LOG replaces it.
lib/backend-common/examples/mocker
is the canonical small-but-complete reference. Lift these patterns:
- mpsc channels.
- ActiveRequestGuard for RAII cleanup that runs on any stream drop.
- biased select with ctx.stopped() first, channel second: the shutdown-race fix discussed in Step 4.
- cleanup() signals every active stream via ctx.stop_generating() so each yields a clean Cancelled terminal instead of an error from channel-close.

Before shipping:

- LLMEngine implemented; from_args is inherent (not on the trait).
- ErrorType::Backend(BackendError::X).
- generate polls ctx.is_stopped() between yields and emits FinishReason::Cancelled on cancel.
- abort.
- cleanup is idempotent.
- testing::run_conformance(|| ...).

- LLMEngine trait: authoritative contract.
- Worker: runtime lifecycle internals (signal handling, graceful shutdown, model registration).
- run_conformance, mock_context, cancelling_context.