Custom Logits Processing#

NIM LLM supports custom logits processing by passing through to vLLM’s native logits processor API. You can write your own processor, mount it into the container as a volume, and enable it with a single CLI flag.

Custom logits processors operate at the batch level. vLLM batches multiple requests into a single tensor for GPU efficiency, so your processor receives the entire batch at once, with each row in the tensor corresponding to one request. You can still toggle behavior per request by having clients pass custom arguments in vllm_xargs: your processor checks each request's extra_args and modifies only the rows that opt in.
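Conceptually, the per-request opt-in pattern looks like the following sketch. Plain Python lists stand in for the logits tensor, and apply_ban is an illustrative name, not part of the vLLM API:

```python
NEG_INF = float("-inf")

def apply_ban(logits_rows, per_row_extra_args):
    """logits_rows: one list of logits per request in the batch.
    per_row_extra_args: one dict (or None) per request, as sent via vllm_xargs."""
    for row, extra_args in zip(logits_rows, per_row_extra_args):
        ban = (extra_args or {}).get("ban_token_id")
        if ban is not None:
            row[ban] = NEG_INF  # only this opted-in request is modified
    return logits_rows

rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
out = apply_ban(rows, [{"ban_token_id": 2}, None])
```

Only the first request opted in, so only its row is masked; the second row passes through untouched.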

Run a Custom Logits Processor#

To run a custom logits processor, complete the following steps:

  1. Place your processor script in a local directory.

  2. Volume-mount that directory into the container.

  3. Pass the processor’s module path to NIM by using the --logits-processors CLI argument:

    docker run --gpus all \
    -v /home/user/my_processors:/opt/nim/my_processors \
    -e PYTHONPATH=/opt/nim \
    -e NIM_MODEL_PATH=hf://meta-llama/Llama-3.2-1B-Instruct \
    -p 8000:8000 \
    nim-llm:local \
    nim-serve --logits-processors my_processors.token_filter:BadWordFilterProcessor
    

The --logits-processors value uses Python’s module.submodule:ClassName format. The code must be importable from PYTHONPATH.
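For the command above, the mounted directory would need a layout along these lines (the file and class names here are illustrative and match the example flag; the __init__.py may be empty, and with Python 3 namespace packages it is optional):

```
/home/user/my_processors/
├── __init__.py        # makes my_processors an importable package
└── token_filter.py    # defines BadWordFilterProcessor
```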

Write a Custom Logits Processor#

Custom processors extend vllm.v1.sample.logits_processor.LogitsProcessor and implement the following five methods:

| Method | Purpose |
| --- | --- |
| validate_params | Class method. Validates per-request parameters when a request arrives. Raise ValueError to reject the request. |
| __init__ | Runs during server startup. Initialize any state your processor needs. |
| apply | Called every engine step. Receives the full logits tensor (num_requests, vocab_size) and returns the modified tensor. |
| is_argmax_invariant | Return True if your processor never changes which token has the highest logit. Return False otherwise. |
| update_state | Called every engine step with a BatchUpdate describing which requests were added, removed, or moved in the batch. Use this to track per-request state. |

Example: BanTokenProcessor#

The following processor bans a single token per request. Requests opt in by sending {"vllm_xargs": {"ban_token_id": <int>}}. Requests that do not send this parameter are unaffected.

import torch
from vllm.config import VllmConfig
from vllm.sampling_params import SamplingParams
from vllm.v1.sample.logits_processor import BatchUpdate, LogitsProcessor, MoveDirectionality


class BanTokenProcessor(LogitsProcessor):

    @classmethod
    def validate_params(cls, params: SamplingParams):
        ban = (params.extra_args or {}).get("ban_token_id")
        if ban is not None and not isinstance(ban, int):
            raise ValueError(f"ban_token_id must be int, got {type(ban).__name__}")

    def __init__(self, vllm_config: VllmConfig, device: torch.device, is_pin_memory: bool):
        self.banned: dict[int, int] = {}

    def is_argmax_invariant(self) -> bool:
        return False

    def update_state(self, batch_update: BatchUpdate | None) -> None:
        if not batch_update:
            return

        for index, params, _, _ in batch_update.added:
            ban = (params.extra_args or {}).get("ban_token_id")
            if ban is not None:
                self.banned[index] = ban
            else:
                self.banned.pop(index, None)

        if not self.banned:
            return

        for index in batch_update.removed:
            self.banned.pop(index, None)

        for src, dst, direction in batch_update.moved:
            src_val = self.banned.pop(src, None)
            dst_val = self.banned.pop(dst, None)
            if src_val is not None:
                self.banned[dst] = src_val
            if direction == MoveDirectionality.SWAP and dst_val is not None:
                self.banned[src] = dst_val

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        if not self.banned:
            return logits

        for req_idx, token_id in self.banned.items():
            logits[req_idx, token_id] = float("-inf")

        return logits
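The trickiest part of update_state is the moved handling: when two requests swap batch slots, their per-request state must swap with them. The following standalone sketch replays that bookkeeping with plain dicts; the SWAP constant stands in for vLLM's MoveDirectionality.SWAP and is only for illustration:

```python
SWAP = "swap"  # stands in for MoveDirectionality.SWAP

def apply_moves(banned, moved):
    """banned: {batch_index: banned_token_id}.
    moved: list of (src, dst, direction) tuples, as in BatchUpdate.moved."""
    for src, dst, direction in moved:
        src_val = banned.pop(src, None)
        dst_val = banned.pop(dst, None)
        if src_val is not None:
            banned[dst] = src_val          # state follows the request to dst
        if direction == SWAP and dst_val is not None:
            banned[src] = dst_val          # on a swap, dst's state moves to src
    return banned

# Slot 0 bans token 7 and slot 2 bans token 9; swapping slots 0 and 2
# must exchange the two entries.
state = apply_moves({0: 7, 2: 9}, [(0, 2, SWAP)])
# state is now {2: 7, 0: 9}
```

Popping both entries before reassigning is what keeps a swap from clobbering one side's state.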

Per-Request Toggling#

Custom logits processors operate at the batch level, but clients can enable specific logits processors per request by passing custom arguments using vllm_xargs.

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "Once upon a time...",
    "max_tokens": 64,
    "vllm_xargs": {"ban_token_id": 42}
  }'
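If you are calling the endpoint from Python instead, the same request body can be assembled with the standard library. This sketch only builds the JSON payload; sending it (for example with urllib, or via the extra_body parameter of an OpenAI-compatible client) depends on your setup:

```python
import json

payload = {
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "Once upon a time...",
    "max_tokens": 64,
    # Extra arguments forwarded to custom logits processors per request.
    "vllm_xargs": {"ban_token_id": 42},
}
body = json.dumps(payload)
```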

For more details on invoking custom logits processors per request, refer to the vLLM documentation.