Custom Guided Decoding Backend (Experimental)#
While structured generation enables a variety of methods to generate constrained outputs, NIM also includes experimental support for custom guided decoding backends for cases where structured generation is not enough. You can define your own logits processor in Python, which is dynamically loaded at runtime to constrain the generation.
This is an experimental feature that is turned OFF by default. Loading custom code dynamically at runtime can lead to unexpected behavior such as performance degradation, incorrect outputs, or even application crashes.
Loading the custom guided decoding backends#
The custom guided decoding backend directory needs to be mounted to the container at runtime. In addition, the following environment variables need to be set:
NIM_TRUST_CUSTOM_CODE=1
NIM_CUSTOM_GUIDED_DECODING_BACKENDS=/path/to/mounted/custom/backends/directory
NIM_GUIDED_DECODING_BACKEND=/name/of/subdirectory/in/NIM_CUSTOM_GUIDED_DECODING_BACKENDS
Requests will use NIM_GUIDED_DECODING_BACKEND by default. To launch the container, use the following command:
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-v /local/path/to/custom/backends:/custom-backends \
-u $(id -u) \
-e NIM_TRUST_CUSTOM_CODE=1 \
-e NIM_CUSTOM_GUIDED_DECODING_BACKENDS=/custom-backends/ \
-e NIM_GUIDED_DECODING_BACKEND=my-custom-backend \
-p 8000:8000 \
$IMG_NAME
Specifying a custom backend at runtime#
NIM_GUIDED_DECODING_BACKEND sets the default backend. You can specify the guided decoding backend per request by setting nvext.guided_decoding_backend=my-custom-backend, where my-custom-backend is the name of a subdirectory in NIM_CUSTOM_GUIDED_DECODING_BACKENDS that holds a custom backend definition, as outlined in the next section.
Example query:
{
    "model": "my-model",
    "prompt": "My prompt",
    "top_p": 1,
    "n": 1,
    "frequency_penalty": 1.0,
    "stream": false,
    "max_tokens": 15,
    "nvext": {
        "guided_decoding_backend": "my-custom-backend"
    }
}
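The same request can also be sent from Python. The following is a minimal sketch using the requests library; it assumes the container from the launch command above is reachable at localhost:8000 and exposes the standard OpenAI-compatible /v1/completions endpoint.
import requests

payload = {
    "model": "my-model",
    "prompt": "My prompt",
    "max_tokens": 15,
    "nvext": {
        # Overrides the default NIM_GUIDED_DECODING_BACKEND for this request only
        "guided_decoding_backend": "my-custom-backend",
    },
}

response = requests.post("http://localhost:8000/v1/completions", json=payload)
print(response.json())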
Custom guided decoding backend specifications#
To launch the custom guided decoding backend, you must provide the name of a directory that contains a single backend.py file and any *.whl Python wheel files that are required as additional dependencies (including transitive dependencies) not already included in NIM. The directory structure should look something like the following:
custom_backends/my-custom-backend
|___ backend.py
|___ my_dep_-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
The backend.py file must contain an asynchronous function named get_custom_guided_decoding_logits_processor with the following function signature. It must return a valid LogitsProcessor object.
from typing import Optional, Union
from transformers import PreTrainedTokenizer
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest
from vllm.sampling_params import LogitsProcessor

async def get_custom_guided_decoding_logits_processor(
    request: Union[CompletionRequest, ChatCompletionRequest], tokenizer: PreTrainedTokenizer
) -> Optional[LogitsProcessor]:
    ...
Note that the function name, argument names, and argument type hints must exactly match the signature shown.
The following example backend.py file contains a custom logits processor that only outputs the response string Custom decoding test.
from typing import List, Optional, Union

import torch
from transformers import PreTrainedTokenizer
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest
from vllm.sampling_params import LogitsProcessor

RESPONSE = "Custom decoding test"


class SingleLetterLogitsProcessor:
    '''
    Sample Logits Processor that always outputs a hardcoded
    response (`RESPONSE`), no matter the input
    '''

    def __init__(self, tokenizer: PreTrainedTokenizer):
        self.tokenizer = tokenizer
        self.token_ids = tokenizer.encode(RESPONSE)
        self.eos_id = tokenizer.eos_token_id
        self.state = 0

    def __call__(self, input_ids: Union[List[int], torch.LongTensor], scores: torch.FloatTensor) -> torch.FloatTensor:
        mask = torch.full_like(scores, float("-inf"))
        if self.state < len(self.token_ids):
            token_idx = self.token_ids[self.state]
        else:
            token_idx = self.eos_id
        # Allow only a single token to be output
        mask[token_idx] = 0.0
        # The `scores` tensor *must* also be modified in-place
        scores.add_(mask)
        self.state += 1
        return scores


async def get_custom_guided_decoding_logits_processor(
    request: Union[CompletionRequest, ChatCompletionRequest], tokenizer: PreTrainedTokenizer
) -> Optional[LogitsProcessor]:
    '''
    The `backend.py` file must contain a function with
    the name `get_custom_guided_decoding_logits_processor`
    with this exact signature. It must take in as input
    the `request` and the `tokenizer` with the above types.
    The return type must be `Optional[LogitsProcessor]`
    '''
    return SingleLetterLogitsProcessor(tokenizer)
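To see what the example processor does outside of NIM, you can drive it directly with a Hugging Face tokenizer. This is a hypothetical local check, not part of the NIM API; it assumes the SingleLetterLogitsProcessor class above is in scope (for example, appended to the same file) and uses gpt2 purely as a stand-in tokenizer.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
processor = SingleLetterLogitsProcessor(tokenizer)

generated = []
for _ in range(20):
    # Uniform scores, so the processor alone decides which token survives
    scores = torch.zeros(tokenizer.vocab_size)
    scores = processor(generated, scores)
    next_token = int(torch.argmax(scores))
    if next_token == tokenizer.eos_token_id:
        break
    generated.append(next_token)

print(tokenizer.decode(generated))  # Expected output: Custom decoding test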
LogitsProcessor specifications#
The scores tensor must be modified in-place in addition to being returned, like so: scores.add_(mask). Transfers of tensors between CPU and GPU memory should be minimized to avoid performance hits.
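To illustrate that guidance, the sketch below allocates its mask once, directly on the device where the scores tensor lives, instead of rebuilding it on the CPU and copying it over at every decoding step. The class and attribute names are illustrative, not part of the NIM API.
import torch

class CachedMaskLogitsProcessor:
    # Illustrative only: keeps a reusable -inf mask on the same device as `scores`
    def __init__(self, allowed_token_id: int):
        self.allowed_token_id = allowed_token_id
        self._mask = None

    def __call__(self, input_ids, scores: torch.FloatTensor) -> torch.FloatTensor:
        if self._mask is None or self._mask.device != scores.device:
            # Build the mask directly on the scores device; no per-step CPU-to-GPU copy
            self._mask = torch.full_like(scores, float("-inf"))
            self._mask[self.allowed_token_id] = 0.0
        # Modify `scores` in-place *and* return it, per the specification above
        scores.add_(self._mask)
        return scores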
Statement-level postprocessing#
NIM also allows statement-level postprocessing for the custom backend. To enable this, define a function get_guided_decoding_constrained_generator with the exact function signature that follows, which updates the final response generator.
from typing import AsyncGenerator, Union
from fastapi import Request
from vllm.entrypoints.openai.protocol import ChatCompletionResponse, CompletionResponse, ErrorResponse

async def get_guided_decoding_constrained_generator(
    response: Union[AsyncGenerator[str, None], CompletionResponse, ChatCompletionResponse, ErrorResponse],
    raw_request: Request,
) -> Union[AsyncGenerator[str, None], CompletionResponse, ChatCompletionResponse, ErrorResponse]:
    # Modify response here
    updated_response = ...
    return updated_response
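As a concrete illustration, the sketch below rewrites the text of a fully materialized completion and passes any other response type through unchanged. It assumes the CompletionResponse object exposes choices with a text field, as in vllm's OpenAI-compatible protocol; the trimming itself is only an example.
from typing import AsyncGenerator, Union
from fastapi import Request
from vllm.entrypoints.openai.protocol import ChatCompletionResponse, CompletionResponse, ErrorResponse

async def get_guided_decoding_constrained_generator(
    response: Union[AsyncGenerator[str, None], CompletionResponse, ChatCompletionResponse, ErrorResponse],
    raw_request: Request,
) -> Union[AsyncGenerator[str, None], CompletionResponse, ChatCompletionResponse, ErrorResponse]:
    # Only touch non-streaming completion responses; leave streams and errors as-is
    if isinstance(response, CompletionResponse):
        for choice in response.choices:
            choice.text = choice.text.strip()
    return response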
Maintaining custom state#
NIM allows you to store and maintain custom state for use in a custom backend. To enable this, define a function set_custom_guided_decoding_parameters in backend.py:
from typing import Union
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest

async def set_custom_guided_decoding_parameters(request: Union[ChatCompletionRequest, CompletionRequest]) -> None:
    # Set state
    ...
This function will be called once, before the (chat) completion is generated. It is the responsibility of the user code to maintain and fetch the state thereafter. As an example, the state may be set in the form of a Python ContextVar:
from contextvars import ContextVar
from typing import Union
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest

my_context_var = ContextVar("my_state")

async def set_custom_guided_decoding_parameters(request: Union[ChatCompletionRequest, CompletionRequest]) -> None:
    my_state = ...
    my_context_var.set(my_state)
After NIM calls the user-defined set_custom_guided_decoding_parameters function, the user-defined code may then access and update the state stored in my_context_var.
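For example, the logits-processor factory in the same backend.py could read that state back when the request is processed. This is a hypothetical sketch: my_context_var is the variable defined above, and SingleLetterLogitsProcessor stands in for whatever processor the backend actually builds from the stored state.
from typing import Optional, Union
from transformers import PreTrainedTokenizer
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest
from vllm.sampling_params import LogitsProcessor

async def get_custom_guided_decoding_logits_processor(
    request: Union[CompletionRequest, ChatCompletionRequest], tokenizer: PreTrainedTokenizer
) -> Optional[LogitsProcessor]:
    # NIM has already called set_custom_guided_decoding_parameters for this request,
    # so the stored state can be read back here; fall back to None if it was never set
    my_state = my_context_var.get(None)
    if my_state is None:
        return None
    # Configure the processor (here the earlier example class) from `my_state`
    return SingleLetterLogitsProcessor(tokenizer)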
Other considerations#
It is crucial that you do not import any additional vllm modules that are not required, such as vllm.model_executor.guided_decoding or vllm.entrypoints.openai.api_server, because they can interfere with the dynamic imports and monkeypatching in NIM.
Note that custom guided decoding does not apply any specific ordering to the output.