For general vLLM features and configuration, see the Reference Guide.
Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for vLLM so you can plug in custom processors.
dynamo.logits_processing.BaseLogitsProcessor, which defines __call__(input_ids, logits) and modifies logits in-place.LogitsProcessorSpec (see dynamo.common.backend.engine). The shared logits_processors_for_request helper owns the generation-stage gating (activate only on AGGREGATED / DECODE) and the per-request freshness policy.SamplingParams.extra_args (vLLM’s vllm_xargs). Dynamo’s adapter lives at dynamo.vllm.logits_processing.adapter.DYN_ENABLE_TEST_LOGITS_PROCESSOR=1 is a built-in test hook (not a production processor loader) that forces the model to respond with “Hello world!”. It verifies the callback path without modifying your model or engine code:
Send a normal chat/completions request; the response should contain “Hello world!”.
The quick test targets aggregated deployments. In disaggregated mode the prefill worker emits one token before decode resumes, and the test processor has per-request state. The unified backend skips the test hook on the prefill role (the shared generation-stage gating returns no entries there), but the decode-side output can still be affected by the prefill-produced leading token. Use aggregated mode to verify the wiring.
The unified vLLM engine threads logits processors through the shared spec layer in dynamo.common.backend.engine and the per-backend realizer at dynamo.vllm.logits_processing.adapter:
start() registers the engine-loaded adapter (DynamoVllmLogitsProcessor) onto engine_args.logits_processors before building the engine config — but only when the env hook is on and the worker is a generation role (AGGREGATED / DECODE). Production paths leave logits_processors untouched. After the engine (and tokenizer) is up, it resolves a LogitsProcessorSpec once via resolve_test_logits_processor_spec, tokenizing "Hello world!" into a ForcedTokenSequenceSpec with the token IDs already resolved. None when the env var is off or on a non-generation role.generate() calls logits_processors_for_request(spec, disaggregation_mode=...) to get the per-request entry list (empty on PREFILL or when spec is None), then activate_logits_processors(sampling_params, entries) serializes the entries into sampling_params.extra_args["dynamo_logits"].DynamoVllmLogitsProcessor.new_req_logits_processor(params) (called once per request by vLLM) reads extra_args["dynamo_logits"], realizes a fresh per-request ForcedSequenceLogitsProcessor, and returns a request callable that applies it. Requests with no activation return None, so vLLM skips them.The same shared layer hosts the TRT-LLM and SGLang slices; each backend translates the same LogitsProcessorSpec into its native shape. The public config-driven loader (when it lands) plugs in by resolving a LogitsProcessorSpec from CLI/config instead of from this env var; no engine code changes.
ForcedTokenSequenceSpec (pre-resolved token IDs). Arbitrary Dynamo BaseLogitsProcessor instances and a public import-string/plugin loader are deferred follow-ups (see the design doc).PythonProcessorSpec (the TRT-LLM in-process escape hatch wrapping a live callable) is not serializable, so the vLLM adapter rejects it.