nemo_curator.core.serve.dynamo.vllm
Dynamo vLLM worker launch helpers for aggregated and disaggregated serving.
Module Contents
Functions
Data
API
Launch the N workers for a single disagg role (prefill or decode).
Spawn one python -m dynamo.vllm actor, pinned to bundle node_rank.
Rank 0 is the “real” worker (model registration + scheduler + KV events
publisher). Ranks > 0 run with --headless and have no scheduler, so KV
events are always disabled for them even if rank 0 publishes.
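The rank rule above can be sketched as a small argv builder. This is only an illustration: build_worker_argv and its publish_kv_events parameter are hypothetical names, and the real launcher passes more arguments than shown here.

```python
from typing import List, Tuple


def build_worker_argv(model: str, rank: int, publish_kv_events: bool) -> Tuple[List[str], bool]:
    """Hypothetical helper: command line plus KV-events decision for one rank."""
    argv = ["python", "-m", "dynamo.vllm", "--model", model]
    headless = rank > 0
    if headless:
        # Ranks above 0 run without a scheduler.
        argv.append("--headless")
    # Only the scheduler-owning rank 0 can publish KV events.
    kv_events_enabled = publish_kv_events and not headless
    return argv, kv_events_enabled


rank0_argv, rank0_kv = build_worker_argv("example-model", 0, publish_kv_events=True)
rank1_argv, rank1_kv = build_worker_argv("example-model", 1, publish_kv_events=True)
```

Here rank 1 gets --headless and has KV events turned off even though publishing was requested for the model.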
True if this aggregated model should publish ZMQ KV events.
JSON blob for --kv-events-config.
Always passed explicitly. Without this, Dynamo’s args.py auto-creates
a KVEventsConfig bound to tcp://*:20080 when prefix_caching is
enabled (vLLM >=0.16 default), causing every worker on the same node to
fight over the same port.
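A minimal sketch of building such a blob with a distinct port per worker follows. The JSON field names (enable_kv_cache_events, publisher, endpoint) are assumed to match vLLM's KVEventsConfig, and the worker_index port arithmetic is a hypothetical scheme, not necessarily what this module does.

```python
import json

BASE_KV_EVENTS_PORT = 20080  # the auto-created default port named above


def kv_events_config_json(enabled: bool, worker_index: int = 0) -> str:
    """Hypothetical helper: always return an explicit --kv-events-config value."""
    if not enabled:
        # Explicitly disabled, so no default config gets auto-created.
        return json.dumps({"enable_kv_cache_events": False})
    return json.dumps(
        {
            "enable_kv_cache_events": True,
            "publisher": "zmq",
            # Assumed scheme: one port per worker so co-located workers never collide.
            "endpoint": f"tcp://*:{BASE_KV_EVENTS_PORT + worker_index}",
        }
    )


disabled_blob = kv_events_config_json(False)
worker2_blob = kv_events_config_json(True, worker_index=2)
```

Passing the disabled form explicitly matters as much as the enabled one, since omitting the flag is what triggers the shared default port.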
Merge the user’s runtime_env with the Dynamo-vLLM package pin.
Write the actor-venv --override file at a fixed path on every alive node.
The file pins ray=={ray.__version__} (read from the driver) so the
actor venv keeps the same Ray patch version as the cluster head; Ray
rejects any mismatch.
Must run inside an active Ray context, before any worker spawned with
DYNAMO_VLLM_RUNTIME_ENV lands. The runtime_env_agent on each
worker reads the file from the node-local filesystem; a single
driver-side write doesn’t reach remote nodes.
Call it again after cluster topology changes (autoscale, node restart); it is a one-shot write and is not triggered automatically.
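The node-local write described above could look roughly like the sketch below. write_ray_override is a hypothetical name, the path and version are demo values, and the version is passed in as a string so the example runs without Ray installed; in the real helper it comes from ray.__version__ on the driver and the write is repeated on every alive node.

```python
import os
import tempfile


def write_ray_override(path: str, ray_version: str) -> str:
    """Hypothetical helper: write a one-line override file pinning Ray."""
    line = f"ray=={ray_version}\n"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(line)
    return line


# Demo only: a temp directory stands in for the fixed node-local path.
demo_path = os.path.join(tempfile.mkdtemp(), "overrides", "ray-pin.txt")
written = write_ray_override(demo_path, "2.48.0")
```

Because each runtime_env_agent reads its own filesystem, a function like this would have to run once per node rather than once on the driver.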
Plan PGs and launch every worker actor for one disagg model.
Each role (prefill/decode) becomes its own pool of single-bundle PGs
so roles can scale independently. Only the prefill pool publishes KV
events (decode reads them via Nixl). KV transfer defaults to
NixlConnector with kv_both unless the user overrides via
DynamoVLLMModelConfig.kv_transfer_config.
worker_index_offset lets the caller thread a global counter across
multiple disagg models so their port seeds don’t overlap — without it,
the first worker of every model lands on the same Nixl/KV-events seed
and same-node placement risks a bind race.
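To illustrate why the offset matters, here is a small sketch of threading a global counter across models. assign_port_seeds, the base seed value, and the model names are all hypothetical; only the idea of a shared running offset comes from the text above.

```python
from typing import Dict, List


def assign_port_seeds(num_workers: int, worker_index_offset: int = 0, base_seed: int = 20080) -> List[int]:
    """Hypothetical helper: one port seed per worker, shifted by the global offset."""
    return [base_seed + worker_index_offset + i for i in range(num_workers)]


# The caller owns the counter and advances it after each model.
offset = 0
seeds_by_model: Dict[str, List[int]] = {}
for model_name, worker_count in [("model-a", 3), ("model-b", 2)]:
    seeds_by_model[model_name] = assign_port_seeds(worker_count, offset)
    offset += worker_count

# Without the offset, both models would start from the same seed.
colliding = assign_port_seeds(3)[0] == assign_port_seeds(2)[0]
```

With the counter threaded through, model-b starts where model-a ended, so the two seed ranges are disjoint.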
Plan PGs and launch every worker actor for one non-disagg model.
Returns (replica_pgs, worker_actors, manifest_entries); callers own
the returned handles and are responsible for teardown.
Merge every model’s runtime_env onto the Dynamo-vLLM pin for the shared frontend actor.
Plan a single-bundle PG spec for one disagg worker.
Disagg does not support multi-node TP — each role’s TP group must fit
on one node. Raise early if plan_replica_bundle_shape hands back
a multi-bundle (multi-node) spec.
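The early check could be sketched as follows. require_single_bundle is a hypothetical name, and the bundle dictionaries only imitate the shape of a placement-group bundle; the actual return type of plan_replica_bundle_shape is not shown in this page.

```python
from typing import Dict, List


def require_single_bundle(bundles: List[Dict[str, float]], role: str) -> Dict[str, float]:
    """Hypothetical helper: accept only a one-bundle (single-node) plan."""
    if len(bundles) != 1:
        raise ValueError(
            f"disagg role {role!r} needs its TP group on one node, "
            f"but the plan has {len(bundles)} bundles"
        )
    return bundles[0]


single = require_single_bundle([{"GPU": 4, "CPU": 1}], role="prefill")

try:
    require_single_bundle([{"GPU": 4}, {"GPU": 4}], role="decode")
    multi_node_rejected = False
except ValueError:
    multi_node_rejected = True
```

Failing at planning time keeps the error close to the configuration mistake instead of surfacing later as a stuck placement group.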
Resolve (num_replicas, engine_kwargs) for one disagg role.
Role-level engine_kwargs merges over the model-wide
engine_kwargs so users can override only what they need per role
(for example a smaller TP on decode).
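The merge order can be shown with plain dictionaries. resolve_role_engine_kwargs is a hypothetical name and the keys are example values; the sketch assumes a shallow merge, which may differ from the real behavior for nested settings.

```python
from typing import Any, Dict, Optional


def resolve_role_engine_kwargs(
    model_kwargs: Dict[str, Any], role_kwargs: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Hypothetical helper: role-level keys win over model-wide keys."""
    merged = dict(model_kwargs)  # copy so the model-wide dict is left untouched
    merged.update(role_kwargs or {})
    return merged


model_wide = {"tensor_parallel_size": 4, "max_model_len": 8192}
decode_kwargs = resolve_role_engine_kwargs(model_wide, {"tensor_parallel_size": 2})
prefill_kwargs = resolve_role_engine_kwargs(model_wide, None)
```

In this example decode runs with a smaller TP while still inheriting max_model_len, and prefill keeps the model-wide settings unchanged.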