nemo_curator.core.serve.dynamo.backend
NVIDIA Dynamo inference backend.
Aggregated: one detached Ray placement group (PG) per replica carries its tensor-parallel (TP) bundles. Disaggregated: one detached PG per prefill / decode worker, each single-bundle. A separate STRICT_PACK PG co-locates etcd, NATS, and the Dynamo frontend.
Module Contents
Classes
API
Bases: InferenceBackend
Dynamo backend for InferenceServer — aggregated serving on Ray PGs.
start() enters the nemo_curator_dynamo namespace, sweeps any leftover actors + PGs from a prior driver session, then deploys infra → workers → frontend and blocks on a /v1/models health check. stop() re-enters the same namespace in a fresh Ray session; because ActorHandle objects do not survive a ray.shutdown() boundary, the stored handles are refreshed by name before the parallel SIGTERM → SIGKILL teardown runs. Replica + infra PGs are then removed.
Detect subprocess exits via ray.wait() on the cached run refs.
Validate, create PGs, launch infra/workers/frontend, health-check.
Launch the Dynamo frontend bound to the infra node.
Emits --router-mode and --[no-]router-kv-events from the
resolved values; anything else in router_kwargs (temperature,
ttl_secs …) is forwarded verbatim via snake-to-kebab CLI flag
translation.
effective_router_mode / effective_router_kv_events let
_deploy_and_healthcheck pass in auto-resolved values (e.g.
"kv" + True when any model is disagg). When either is
None the corresponding typed router field is used verbatim.
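The flag translation described above can be sketched in a few lines. This is an illustration only: build_router_flags is a hypothetical helper, not the backend's own function, and it assumes every extra router_kwargs entry is rendered as "--flag value" with no added prefix, which the text does not confirm.

```python
from typing import Any, Dict, List, Optional


def build_router_flags(
    router_mode: Optional[str],
    router_kv_events: Optional[bool],
    router_kwargs: Dict[str, Any],
) -> List[str]:
    """Hypothetical helper: turn resolved router settings into CLI arguments."""
    args: List[str] = []
    if router_mode is not None:
        args += ["--router-mode", router_mode]
    if router_kv_events is not None:
        # --[no-]router-kv-events is a boolean flag with a negated form.
        args.append("--router-kv-events" if router_kv_events else "--no-router-kv-events")
    for key, value in router_kwargs.items():
        # Snake-to-kebab: ttl_secs -> --ttl-secs (assumed rendering of the value).
        args += ["--" + key.replace("_", "-"), str(value)]
    return args


flags = build_router_flags("kv", True, {"temperature": 0.5, "ttl_secs": 120})
```

Leaving a setting as None emits no flag at all, so the frontend's own default applies.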
Resolve (router_mode, router_kv_events) for the frontend.
mode: honor router.mode if set; otherwise auto-pick "kv" when any model uses mode="disagg", else leave unset so the Dynamo frontend falls back to its own round_robin default.
kv_events: when we auto-pick mode="kv" we also auto-enable kv_events so the router consumes what prefill workers publish unconditionally in disagg. If the user set router.mode explicitly (to any value) we honor their router.kv_events as-is.
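A minimal sketch of this resolution rule follows. resolve_router_settings and its parameters are hypothetical names. Two details are assumptions of the sketch: when no mode is set and no model is disaggregated, kv_events is passed through unchanged, and the auto-picked "kv" mode always enables kv_events.

```python
from typing import Optional, Sequence, Tuple


def resolve_router_settings(
    router_mode: Optional[str],
    router_kv_events: Optional[bool],
    model_modes: Sequence[str],
) -> Tuple[Optional[str], Optional[bool]]:
    """Hypothetical helper mirroring the (router_mode, router_kv_events) rule."""
    if router_mode is not None:
        # Explicit mode (any value): honor the user's kv_events as-is.
        return router_mode, router_kv_events
    if any(mode == "disagg" for mode in model_modes):
        # Auto-pick "kv" and enable kv_events so the router consumes
        # the events that prefill workers publish in disagg.
        return "kv", True
    # Leave the mode unset so the frontend uses its own default.
    return None, router_kv_events
```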
Reap any detached actors left behind by a prior driver session.
remove_named_pgs_with_prefix force-kills actors scheduled into
the reaped PGs, which would orphan the subprocess tree; sweeping
named actors first lets graceful_stop_actors killpg each
process group cleanly.
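The SIGTERM → SIGKILL escalation on a process group can be pictured with the standard library alone. This is a POSIX-only sketch, not the code behind graceful_stop_actors; stop_process_group and the grace period are hypothetical.

```python
import os
import signal
import subprocess
import sys


def stop_process_group(proc: subprocess.Popen, grace_s: float = 5.0) -> int:
    """SIGTERM the child's process group, then SIGKILL it if it outlives grace_s."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()


# start_new_session=True makes the child a process-group leader,
# so killpg reaches the child and anything it spawns.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=True,
)
exit_code = stop_process_group(child)
```

Signalling the group rather than the single PID is what keeps grandchildren from being orphaned when the parent exits first.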
Parallel-stop every actor, then release the placement groups.
ActorHandle objects stored on self during start() belong to
that session’s Ray job and are invalid here (stop() opened its own
session with ray.init()), so the handles are refreshed by detached-actor
name before any .remote() call is issued.
Coarse fail-fast on cluster-wide GPU over-commit and disagg TP fit.
Ray’s per-PG STRICT_PACK / STRICT_SPREAD is the authoritative
admission gate; this produces a better error than the admission timeout.
For disagg models we also reject configurations where a single role’s
TP group would not fit on one node — disagg does not support multi-node TP.
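The two coarse checks can be illustrated as follows. ModelPlan and find_gpu_fit_errors are hypothetical, and the sketch assumes per-node GPU counts are already known; how the backend gathers them is not described here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ModelPlan:
    """Hypothetical summary of one model's GPU demand (not a NeMo Curator type)."""
    name: str
    mode: str            # "agg" or "disagg"
    tp_sizes: List[int]  # TP size of each replica (agg) or each prefill/decode worker (disagg)


def find_gpu_fit_errors(plans: List[ModelPlan], gpus_per_node: Dict[str, int]) -> List[str]:
    """Return human-readable problems; an empty list means the coarse checks pass."""
    errors: List[str] = []
    total = sum(gpus_per_node.values())
    needed = sum(sum(p.tp_sizes) for p in plans)
    if needed > total:
        errors.append(f"models request {needed} GPUs but the cluster has {total}")
    largest_node = max(gpus_per_node.values(), default=0)
    for p in plans:
        if p.mode != "disagg":
            continue
        # A disagg role's TP group must fit on a single node.
        too_big = [tp for tp in p.tp_sizes if tp > largest_node]
        if too_big:
            errors.append(f"{p.name}: TP={max(too_big)} exceeds the largest node ({largest_node} GPUs)")
    return errors
```

A check like this only catches obvious over-commit early; fragmentation across nodes is still left to Ray's placement-group admission.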
Reject duplicate model names and component-slug collisions.
Dynamo registers each worker under a dyn://namespace.component.endpoint
URI; duplicate model names (or names that sanitize to the same slug)
would silently overwrite each other inside etcd.
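To see how two distinct names can collide after sanitizing, consider the sketch below. slugify is a hypothetical sanitizer (lowercase, runs of non-alphanumerics collapsed to "-"); the backend's real rule may differ.

```python
import re
from collections import defaultdict
from typing import Dict, List


def slugify(name: str) -> str:
    """Hypothetical sanitizer, assumed for illustration only."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def find_name_collisions(model_names: List[str]) -> Dict[str, List[str]]:
    """Map each slug claimed by more than one model name to those names."""
    by_slug: Dict[str, List[str]] = defaultdict(list)
    for name in model_names:
        by_slug[slugify(name)].append(name)
    return {slug: names for slug, names in by_slug.items() if len(names) > 1}


collisions = find_name_collisions(["meta/Llama-3", "meta_llama.3", "qwen"])
```

Exact duplicates fall out of the same grouping, since identical names always share a slug.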
Poll /v1/models until all expected_models appear.
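A polling loop of this shape might look like the sketch below. It assumes the endpoint returns an OpenAI-style body of the form {"data": [{"id": ...}]}; fetch_model_ids, wait_for_models, the timeout, and the interval are hypothetical names and values.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Set


def fetch_model_ids(base_url: str, timeout_s: float = 5.0) -> Set[str]:
    """Read model ids from <base_url>/v1/models (assumed OpenAI-style response)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
        payload = json.load(resp)
    return {entry["id"] for entry in payload.get("data", [])}


def wait_for_models(
    expected_models: Iterable[str],
    fetch: Callable[[], Set[str]],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
) -> None:
    """Poll fetch() until every expected model is listed, or raise TimeoutError."""
    expected = set(expected_models)
    missing = expected
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            missing = expected - fetch()
        except (OSError, ValueError):
            pass  # frontend not accepting connections yet, or body not valid JSON
        if not missing:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"models not ready: {sorted(missing)}")
        time.sleep(interval_s)
```

Passing the fetcher as a callable keeps the loop testable without a live server; against a real frontend one would pass something like lambda: fetch_model_ids(base_url) with the frontend's address.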