Writing a Rust Unified Backend
New — Dynamo’s unified backend. This guide covers the new unified backend infrastructure in `dynamo-backend-common`: a shared `LLMEngine` contract that vLLM, SGLang, TRT-LLM, and the mocker already implement, and that any custom engine can plug into the same way.

Beta — actively under development. The Rust native backend surface is beta quality and may change without backwards compatibility between releases. See Feature gaps below for what the unified path covers today versus the existing (non-unified) backend paths.
This guide walks through building a Rust unified backend for an
inference engine that plugs into Dynamo’s distributed runtime. A
unified backend is a standalone Rust binary that owns its engine and
serves requests via the shared
`LLMEngine` contract in `dynamo-backend-common` — no Python
worker runtime required. For the Python version of the same contract
see Writing a Python Unified Backend.
Your backend lives in its own crate and does not need to be part of
the dynamo repository. It pulls `dynamo-backend-common` in as a normal
git or path dependency. The steps below assume you’re starting a fresh
crate in your own repo; an optional note in Step 1 covers the in-tree
variant for contributors landing a backend inside ai-dynamo/dynamo.
For a Python engine, use Writing a Python Unified Backend — same contract, lighter setup. The non-unified fallback for feature gaps (multimodal, LoRA, logprobs, etc.) is Python-only; see Writing Python Workers if you need one of those today.
The reference example is the mocker backend at
lib/backend-common/examples/mocker
— a small, complete, pure-Rust implementation. Read it alongside this
guide.
Where to look for what:
- This guide — step-by-step walkthrough for someone starting a new backend from scratch.
- `LLMEngine` trait doc comments — authoritative method-by-method contract.
- Crate README — in-tree reference: architecture, file index, disaggregation contract, error taxonomy, conformance kit.
- `backend-common` design notes — rationale and invariants.
Feature gaps
The unified backend is in beta and does not yet cover the full feature
set of Dynamo’s existing (non-unified) backend paths. The summary
below is the common contract — what every engine on the unified path
gets, whether written in Rust directly or plugged in from Python via
the PyO3 Worker shim. Per-engine gaps (vLLM, SGLang, TRT-LLM
specifics like LoRA, diffusion, attention DP scheduling) live in the
Python package README.
Supported today
- Aggregated token-in-token-out inference
- Disaggregated serving (`Aggregated`/`Prefill`/`Decode`) with bootstrap (SGLang) or internal KV transport (vLLM, TRT-LLM); the Rust mocker example exercises the same wire format CPU-only
- Model registration with discovery and endpoint types
- Request cancellation via in-stream `ctx.is_stopped()` polling plus the framework’s out-of-band `abort()` monitor
- Typed `DynamoError` with `ErrorType::Backend(BackendError::X)`
- Graceful shutdown with signal handling, `drain()` hook, and 3-phase distributed-runtime teardown
- Debug-build stream validator and the `testing::run_conformance` kit
Not yet on the unified path
If you need one of these features today, keep that workload on the existing per-engine entry point until the unified path catches up.
What you’re building
A backend is two things:
- An engine type that implements the `LLMEngine` trait — owns the model, accepts preprocessed token requests, streams output tokens.
- A `main.rs` entry point — a three-line shim that hands the engine to `dynamo_backend_common::run`, which drives the lifecycle.
The `dynamo-backend-common` crate handles everything else: signal
handling, model registration with discovery, the serving loop, graceful
shutdown, metrics, cancellation plumbing, and the debug-mode contract
validator.
Engines work directly with `PreprocessedRequest` and `LLMEngineOutput` — the same types used by Dynamo’s preprocessing, routing, and frontend. No Python-shaped translation layer.
Prerequisites
- Rust 1.85 or newer (the dynamo workspace is edition 2024). The toolchain pin in Step 1 locks this in for you; older toolchains will fail with `feature edition2024 is required` deep inside the build.
- NATS and etcd reachable for end-to-end runs. The dynamo repo’s `deploy/docker-compose.yml` brings up both in one command if you don’t already have them running.
- Familiarity with `async` Rust, `tokio`, and `clap`. The trait uses `async_trait`, and the framework expects a `tokio` runtime.
Step 1: Create the crate
Your backend is a standalone Rust binary crate. It can live in its own repository — the dynamo repo is not required to be your parent workspace. Pick whatever layout you prefer:
`cargo new --bin my-backend` is the fastest starting point; add `src/engine.rs` yourself afterwards.
Getting the dynamo-backend-common crate
`dynamo-backend-common` lives in the
ai-dynamo/dynamo repository and
is not on crates.io. Depend on it via git:
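A sketch of what that dependency entry can look like (replace `<sha>` with a pinned commit; the `testing` feature name comes from this guide’s Step 7):

```toml
# Cargo.toml — sketch; pin rev to a commit you have tested against.
[dependencies]
dynamo-backend-common = { git = "https://github.com/ai-dynamo/dynamo", rev = "<sha>", features = ["testing"] }
```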
The `testing` feature pulls in the conformance kit used in Step 7.
Pick a SHA with `git ls-remote https://github.com/ai-dynamo/dynamo main`.
No release tags yet.
`dynamo-backend-common` landed after the last tagged release (v1.1.1), so `tag = "v1.1.1"` won’t resolve the crate. Track `main` or pin to a specific SHA until a release tag ships that includes the crate.
Two build-time requirements you cannot skip
These are easy to miss and surface as confusing compile errors deep
inside dynamo-runtime:
- `tokio_unstable` cfg flag. `dynamo-runtime` uses tokio’s unstable runtime-metrics API. Create `.cargo/config.toml` in your crate root. Without it, you’ll see errors like `method blocking_queue_depth not found on RuntimeMetrics` while compiling `dynamo-runtime`.
- Rust toolchain pin. Match dynamo’s toolchain so workspace-edition crates compile. Create `rust-toolchain.toml`. Older toolchains fail with `feature edition2024 is required`.
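Concretely, the two files can look like this. The `rustflags` line is the standard way to set a cfg flag; the channel value is an assumption based on the 1.85 floor above, so mirror whatever dynamo’s own `rust-toolchain.toml` pins:

```toml
# .cargo/config.toml — enables tokio's unstable runtime-metrics API
[build]
rustflags = ["--cfg", "tokio_unstable"]
```

```toml
# rust-toolchain.toml — channel is assumed; match dynamo's own pin
[toolchain]
channel = "1.85"
```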
Tip — local development: while iterating against an unreleased change in `dynamo-backend-common`, point the dep at a local clone: `dynamo-backend-common = { path = "/path/to/dynamo/lib/backend-common" }`. Switch back to the git dep before publishing your crate.
If you’d rather develop inside the dynamo workspace as a new
sub-crate, drop your crate under `dynamo/lib/` and use
`dynamo-backend-common = { workspace = true }` instead. The trait
contract is identical, and the `.cargo/config.toml` plus toolchain
pin in the dynamo repo cover the two requirements above for you.
Step 2: Define your engine struct
In `src/engine.rs` (or whatever you named it), declare a struct that owns whatever state your engine needs. Anything you allocate inside `start()` later must live behind interior mutability so the trait’s `&self` methods can reach it.
`async-trait` lets the trait use `async fn` (still required for object-safety with `Arc<dyn LLMEngine>`); `async-stream`’s `stream!` macro lets the `generate` body yield items from inside an async block. The mocker example uses `OnceCell` for `Inner`; `RwLock<Option<_>>` also works — pick whichever fits your shutdown semantics.
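As a concrete sketch (field names are illustrative, not the real API; std’s `OnceLock` stands in for tokio’s `OnceCell` to keep the snippet dependency-free):

```rust
use std::sync::OnceLock;

/// Illustrative engine struct; field names are assumptions.
pub struct MyEngine {
    /// Known at construction time, e.g. parsed from CLI args.
    max_tokens_cap: u32,
    /// Created inside start(). The OnceLock provides the interior
    /// mutability that lets the trait's &self methods reach state
    /// initialized after construction.
    inner: OnceLock<Inner>,
}

/// State that only exists once the engine is running.
struct Inner {
    model_name: String,
}
```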
Step 3: Wire up CLI arguments
Every backend’s CLI shares a common base (`--namespace`, `--component`, `--endpoint`, etc.) provided by `CommonArgs`. Flatten that into your engine’s `Args` struct and add your engine-specific flags.
Define an inherent `from_args` constructor that parses the args and returns both the engine and a `WorkerConfig`. `from_args` is not on the trait — it stays inherent so the trait can remain object-safe (`Arc<dyn LLMEngine>` must work).
The snippet below calls a tiny `invalid_arg` helper that builds a typed `BackendError::InvalidArgument`. Its full definition lives in Step 6 — for now, mentally substitute “any function that returns a `DynamoError` with category `InvalidArgument`.”
`WorkerConfig::default()` sets `model_input` to `ModelInput::Tokens`, which is the only mode `Worker` currently supports — the framework validates this at startup. Engines needing raw text or tensor inputs aren’t supported yet.
If your engine branches on the disaggregation role inside `generate` (prefill vs decode), keep the same `DisaggregationMode` on your engine struct so the runtime registration (`WorkerConfig`) and the per-request dispatch stay in lockstep.
Step 4: Implement the LLMEngine trait
The trait has three required methods (`start`, `generate`, `cleanup`) plus two with default implementations you can override (`abort`, `drain`).
start()
Start the engine and return `EngineConfig` metadata. After this returns, the engine MUST be ready for concurrent `generate()` calls. Use interior mutability for anything you initialize here.

`worker_id` is an opaque per-worker identifier — most engines ignore it with `_worker_id`. Backends needing a stable cluster-wide key (e.g. TRT-LLM’s `disagg_machine_id` snowflake) should derive from it.
Every `EngineConfig` field except `model` is optional. `None` means “don’t advertise”; KV-aware routing falls back to round-robin when KV fields are unset. Engines wrapping an external runtime can read these values from the live engine after it comes up, instead of hard-coding them. The `..Default::default()` is load-bearing: `EngineConfig` sometimes grows new fields (e.g. `bootstrap_host`/`bootstrap_port` for SGLang disagg) and the default keeps existing engines compiling.
generate()
Yield a stream of `Result<LLMEngineOutput, DynamoError>` items for a single request. Called concurrently for multiple in-flight requests.

`ctx: GenerateContext` is a thin wrapper that `Deref`s to `dyn AsyncEngineContext`, so the cancellation methods (`stopped()`, `is_stopped()`, `id()`) you’d expect are still there. The wrapper additionally exposes `notify_first_token()` for decode-mode requests — most engines can ignore it; the framework auto-fires on the first non-empty chunk.
Contract (the debug-mode validator panics on violations):
- Exactly one terminal item must be the last item yielded. A terminal is either an `Ok(chunk)` with `finish_reason` set, or an `Err(DynamoError)`. No items may be yielded after a terminal.
- Non-terminal chunks use `chunk::token(id)` and leave `finish_reason` unset.
- The returned stream is `'static`: clone or move any state from `&self` or `request` into the stream body before constructing it.
Terminal chunks come from one of four `LLMEngineOutput` constructors, optionally chained with the `LLMEngineOutputExt` setters (`.with_tokens(...)`, `.with_usage(...)`):

- `LLMEngineOutput::stop()` — natural completion (e.g. you reached your echo limit, the engine hit a stop string).
- `LLMEngineOutput::length()` — `max_tokens` cap reached.
- `LLMEngineOutput::cancelled()` — you observed `ctx.stopped()`.
- `LLMEngineOutput::error(msg)` — message-only error terminal (loses the typed `BackendError` variant — yield `Err(DynamoError)` instead when the category matters).
Non-terminal chunks use `chunk::token(id)` (single-token convenience).
A streaming-generate template:
`biased` is load-bearing for the channel-receiving pattern above:
- When cancellation and a pending token are both ready, yield the cancellation, not one more token.
- During cleanup the stream sees both `ctx.stopped()` and `rx.recv() -> None` simultaneously; `biased` picks the clean cancellation path instead of erroring on a closed channel. The mocker’s stream body spells this out.
If your engine doesn’t have a receiver — e.g. you’re computing tokens inline like a deterministic echo backend — the body collapses to a plain loop that polls cancellation between yields:
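A dependency-free sketch of that collapsed shape, again with `Chunk` standing in for the real output type and a closure standing in for `ctx.is_stopped()`:

```rust
/// Stand-ins for the real LLMEngineOutput terminals.
#[derive(Debug, PartialEq)]
enum Chunk {
    Token(u32),
    Cancelled,
    Stop,
}

/// Inline-token generate body for a deterministic echo engine: no channel,
/// just a cancellation poll between yields (Vec pushes stand in for yields).
fn echo(tokens: &[u32], is_stopped: &dyn Fn() -> bool) -> Vec<Chunk> {
    let mut out = Vec::new();
    for &id in tokens {
        // Poll cancellation between yields.
        if is_stopped() {
            out.push(Chunk::Cancelled);
            return out;
        }
        out.push(Chunk::Token(id));
    }
    out.push(Chunk::Stop);
    out
}
```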
No channel-close race to worry about; `biased` is still cheap and recommended for consistency.
Cancellation rules:
- The stream must poll `ctx.is_stopped()` (or `await ctx.stopped()`) between yields.
- On cancellation, emit a terminal with `FinishReason::Cancelled` — not `Length` or `Stop`. The conformance kit treats any other terminal after cancellation as ignoring the signal.
Typed errors vs. string errors:
Use typed errors when the failure category matters to the caller. Use string errors when it doesn’t.
abort() and per-request cleanup
`abort` is called by the framework only when `ctx.stopped()` or `ctx.killed()` fires — i.e. an explicit client/operator cancel. It is NOT called when the stream is silently dropped (TCP reset, consumer timeout without cancellation).
For cleanup that must run on any drop path (releasing a scheduler
slot, freeing a request handle), use RAII inside the generate stream
body:
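A minimal sketch of the pattern (the slot counter is a stand-in for whatever per-request resource your engine actually tracks):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// RAII guard modeled on the mocker's ActiveRequestGuard. Constructed at
/// the top of the generate stream body; Drop runs on every exit path:
/// normal terminal, cancellation, or the stream being dropped mid-flight.
struct ActiveRequestGuard {
    active: Arc<AtomicUsize>,
}

impl ActiveRequestGuard {
    fn new(active: Arc<AtomicUsize>) -> Self {
        active.fetch_add(1, Ordering::SeqCst);
        Self { active }
    }
}

impl Drop for ActiveRequestGuard {
    fn drop(&mut self) {
        // Release the slot no matter how the stream ended.
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}
```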
The mocker’s `ActiveRequestGuard` is the canonical example.
Use `abort` only for out-of-band notifications (e.g. telling a remote scheduler to stop computing for this request).
drain() and cleanup()
- `drain()` runs once before shutdown, after the discovery unregister + grace-period sleep, while NATS/etcd are still alive. Use it for backend-side draining that must complete before the transport layer goes away (e.g. in-flight NIXL KV transfers on prefill workers). Default is no-op.
- `cleanup()` is called once on shutdown. Release all engine resources. The framework guarantees `cleanup()` runs exactly once if `start()` succeeded — even if registration or serve fails afterward.
Make `cleanup()` idempotent and tolerant of being called from a half-initialized state. Engines like vLLM/TRT-LLM tear down NCCL groups in `cleanup()`, and a second attempt can hang.
Step 5: Write main.rs
Three lines. That’s it.
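The shape is roughly this (a sketch only; the exact `run` and `from_args` signatures are defined by `dynamo-backend-common`, so check the crate docs):

```rust
// src/main.rs — sketch; assumes from_args returns (engine, WorkerConfig).
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let (engine, config) = MyEngine::from_args()?;
    dynamo_backend_common::run(engine, config).await
}
```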
`run` installs signal handlers, builds the distributed runtime, calls `engine.start()`, registers the model with discovery, serves the endpoint, and runs the full graceful-shutdown orchestrator on SIGTERM/SIGINT.
Step 6: Errors and logging
Errors: every error returned from `start`, `generate`, `cleanup`, and `from_args` uses `ErrorType::Backend(BackendError::X)`. From the frontend’s perspective, anything bubbling up through the backend has “originated from the backend” — engine code vs. framework code is not distinguished. Top-level `ErrorType::X` variants are reserved for non-backend paths.
A small helper module per backend keeps the call sites clean:
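The shape of such a helper, sketched here with stand-in types mirroring the taxonomy (the real `DynamoError` type and its constructors live in `dynamo-backend-common`):

```rust
/// Stand-ins mirroring the crate's error taxonomy; the real types and
/// constructors come from dynamo-backend-common.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum BackendError {
    InvalidArgument,
    CannotConnect,
    EngineShutdown,
}

#[derive(Debug, PartialEq)]
enum ErrorType {
    Backend(BackendError),
}

#[derive(Debug)]
struct DynamoError {
    kind: ErrorType,
    message: String,
}

/// Keeps call sites to one expression, e.g.:
///   return Err(invalid_arg("block-size must be a power of two"));
fn invalid_arg(msg: impl Into<String>) -> DynamoError {
    DynamoError {
        kind: ErrorType::Backend(BackendError::InvalidArgument),
        message: msg.into(),
    }
}
```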
Common nested categories: `InvalidArgument`, `CannotConnect`, `EngineShutdown`, `StreamIncomplete`, `Cancelled`, `ResponseTimeout`, `Disconnected`, `ConnectionTimeout`, `Unknown`.
Logging: keep levels consistent across Rust backends so operators see the same surface everywhere.
- `tracing::info!` for lifecycle milestones (engine started, cleanup complete). `Worker` already logs “Serving {model} on …” and “Engine cleanup complete” — add your own only for events those don’t cover.
- `tracing::debug!` for per-request events (cancellation, abort).
- `tracing::warn!` for recoverable problems.
- `tracing::error!` only for unrecoverable failures.
Step 7: Run the conformance kit
Before merging, prove your engine satisfies the contract. The conformance kit is one call:
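The call has roughly this shape (a sketch; the factory-closure signature is an assumption, so see the `testing` module docs for the exact form):

```rust
// tests/conformance.rs — sketch; assumes the closure builds a fresh
// engine per conformance scenario.
#[tokio::test]
async fn conformance() {
    dynamo_backend_common::testing::run_conformance(|| MyEngine::for_test()).await;
}
```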
The kit runs `start`/`generate`/`cleanup` directly against your engine — no external service is involved. If your engine needs a real GPU, remote model server, or other heavyweight resource to construct, gate the test with `#[ignore]` and require an explicit opt-in env var.
What it asserts: the Step 4 stream contract (exactly one terminal item, nothing yielded after it, and a `Cancelled` terminal when cancellation fires).
For tests that don’t need a real engine, use `testing::mock_context()` or `testing::cancelling_context(after)` to drive `generate` manually.
Step 8: Run it locally
Three moving parts need to come up: NATS + etcd (discovery and the event/request planes), the Dynamo Python frontend (HTTP → backend discovery), and your backend.
The fastest path is to copy the mocker example’s
docker-compose.yml
and Dockerfile.frontend,
swap in your image, and run docker compose up --build. That brings
up NATS + etcd + the Python frontend (built from the dynamo workspace
at the same SHA as your backend) + your backend, all on one network.
For a non-Docker dev loop, bring up the same three pieces by hand: NATS + etcd (the repo’s `deploy/docker-compose.yml` can supply just those two), the Python frontend, and your backend binary via `cargo run`.
Then send a request:
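For example, against the frontend’s OpenAI-compatible endpoint (the port and model name here are assumptions — use whatever your frontend and model registration actually advertise):

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "hello"}],
        "max_tokens": 32
      }'
```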
A successful response has non-empty `choices[0].message.content` and a `finish_reason` of `stop` or `length`. `jq -e '.choices[0].finish_reason'` is a good one-liner for a CI smoke test.
`run` initializes tracing from the `DYN_LOG` env var (defaults to `info`); set `DYN_LOG=debug` or `DYN_LOG=info,dynamo_backend_common=trace` for more detail. `RUST_LOG` is not honored — `DYN_LOG` replaces it.
Reference: the mocker backend
lib/backend-common/examples/mocker
is the canonical small-but-complete reference. Lift these patterns:
- Single shared scheduler driving many concurrent streams via a fan-out task and per-request `mpsc` channels.
- `ActiveRequestGuard` for RAII cleanup that runs on any stream drop.
- `biased` select with `ctx.stopped()` first, channel second — the shutdown-race fix discussed in Step 4.
- `cleanup()` signals every active stream via `ctx.stop_generating()` so each yields a clean `Cancelled` terminal instead of an error from channel-close.
Checklist
Before shipping:
- `LLMEngine` implemented; `from_args` is inherent (not on the trait).
- All errors use `ErrorType::Backend(BackendError::X)`.
- `generate` polls `ctx.is_stopped()` between yields and emits `FinishReason::Cancelled` on cancel.
- Per-request cleanup uses RAII guards, not just `abort`.
- `cleanup` is idempotent.
- Conformance kit runs green: `testing::run_conformance(|| ...)`.
- Logging levels match the standards in Step 6.
See also
- Crate README — in-tree reference (architecture, file index, contracts at a glance).
- `LLMEngine` trait — authoritative contract.
- Design notes — rationale and invariants.
- `Worker` — runtime lifecycle internals (signal handling, graceful shutdown, model registration).
- Conformance kit — `run_conformance`, `mock_context`, `cancelling_context`.
- Mocker backend — example user guide.
- Python sibling — Python ABC layered over this crate.