Custom Guided Decoding Backend (Experimental)#

While structured generation enables a variety of methods to generate constrained outputs, NIM also includes experimental support for custom guided decoding backends for cases where structured generation is not enough. You can define your own logits processors in Python, which are dynamically loaded at runtime to constrain generation.

This is an experimental feature that is turned off by default. Loading custom code dynamically at runtime can lead to unexpected behavior such as performance degradation, incorrect outputs, or even application crashes.

Loading the custom guided decoding backends#

The custom guided decoding backend directory needs to be mounted into the container at runtime. In addition, the following environment variables need to be set:

  • NIM_TRUST_CUSTOM_CODE=1

  • NIM_CUSTOM_GUIDED_DECODING_BACKENDS=/path/to/mounted/custom/backends/directory

  • NIM_GUIDED_DECODING_BACKEND=/name/of/subdirectory/in/NIM_CUSTOM_GUIDED_DECODING_BACKENDS

Requests will use NIM_GUIDED_DECODING_BACKEND by default. To launch the container, use the following command:

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v /local/path/to/custom/backends:/custom-backends \
  -u $(id -u) \
  -e NIM_TRUST_CUSTOM_CODE=1 \
  -e NIM_CUSTOM_GUIDED_DECODING_BACKENDS=/custom-backends/ \
  -e NIM_GUIDED_DECODING_BACKEND=my-custom-backend \
  -p 8000:8000 \
  $IMG_NAME

Specifying a custom backend at runtime#

NIM_GUIDED_DECODING_BACKEND sets the default backend. You can also specify the guided decoding backend per request by setting nvext.guided_decoding_backend=my-custom-backend, where my-custom-backend is the name of a subdirectory in NIM_CUSTOM_GUIDED_DECODING_BACKENDS that holds a custom backend definition, as outlined in the next section.

Example query:

{
    "model": "my-model",
    "prompt": "My prompt",
    "top_p": 1,
    "n": 1,
    "frequency_penalty": 1.0,
    "stream": false,
    "max_tokens": 15,
    "nvext": {
        "guided_decoding_backend": "my-custom-backend"
    }
}
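
For reference, the request above can also be sent from Python. The following is only a minimal sketch: it assumes the container from the launch command above is listening on localhost:8000 and serving the OpenAI-compatible completions endpoint, and that the requests package is installed.

import requests

# Assumed endpoint and model name; adjust them to match your deployment.
url = "http://localhost:8000/v1/completions"
payload = {
    "model": "my-model",
    "prompt": "My prompt",
    "max_tokens": 15,
    "nvext": {"guided_decoding_backend": "my-custom-backend"},
}

response = requests.post(url, json=payload, timeout=60)
print(response.json())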

Custom guided decoding backend specifications#

To use a custom guided decoding backend, you must provide the name of a directory that contains a single backend.py file, along with any *.whl Python wheel files required as additional dependencies (including transitive dependencies) that are not already included in NIM. The directory structure should look similar to the following:

custom_backends/my-custom-backend
|___ backend.py
|___ my_dep-0.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

The backend.py file must contain an asynchronous function named get_custom_guided_decoding_logits_processor with the following signature. It must return a valid LogitsProcessor object.

from typing import Optional, Union
from transformers import PreTrainedTokenizer
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest
from vllm.sampling_params import LogitsProcessor

async def get_custom_guided_decoding_logits_processor(
    request: Union[CompletionRequest, ChatCompletionRequest], tokenizer: PreTrainedTokenizer
) -> Optional[LogitsProcessor]:
    ...

Note that the function name, argument names, and argument type hints must exactly match the signature shown above.

The following example backend.py file contains a custom logits processor that always outputs the response string Custom decoding test.

from typing import List, Optional, Union

import torch
from transformers import PreTrainedTokenizer
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest
from vllm.sampling_params import LogitsProcessor

RESPONSE = "Custom decoding test"


class SingleLetterLogitsProcessor:
    '''
    Sample logits processor that always outputs a hardcoded
    response (`RESPONSE`), no matter the input
    '''

    def __init__(self, tokenizer: PreTrainedTokenizer):
        self.tokenizer = tokenizer
        self.token_ids = tokenizer.encode(RESPONSE)
        self.eos_id = tokenizer.eos_token_id
        self.state = 0

    def __call__(self, input_ids: Union[List[int], torch.LongTensor], scores: torch.FloatTensor) -> torch.FloatTensor:

        mask = torch.full_like(scores, float("-inf"))

        if self.state < len(self.token_ids):
            token_idx = self.token_ids[self.state]
        else:
            token_idx = self.eos_id
        # Allow only a single token to be output
        mask[token_idx] = 0.0

        # The `scores` tensor *must* also be modified in-place
        scores.add_(mask)
        self.state += 1

        return scores


async def get_custom_guided_decoding_logits_processor(
    request: Union[CompletionRequest, ChatCompletionRequest], tokenizer: PreTrainedTokenizer
) -> Optional[LogitsProcessor]:
    '''
    The `backend.py` file must contain a function with
    the name `get_custom_guided_decoding_logits_processor`
    with this exact signature. It must take in as input
    the `request` and the `tokenizer` with the above types.
    The return type must be `Optional[LogitsProcessor]`
    '''
    return SingleLetterLogitsProcessor(tokenizer)

LogitsProcessor specifications#

The scores tensor must be modified in place in addition to being returned, as in scores.add_(mask). Transfers of tensors between CPU and GPU memory should be minimized to avoid performance degradation.
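
For example, a processor that knows its allowed token IDs up front can build the mask directly on the device that already holds scores. The helper below is only a sketch; the function name and the allowed_ids argument are illustrative and not part of the NIM API.

import torch

def apply_allowed_token_mask(scores: torch.FloatTensor, allowed_ids: torch.LongTensor) -> torch.FloatTensor:
    # Build the mask on the same device as `scores` to avoid CPU-GPU transfers
    mask = torch.full_like(scores, float("-inf"))
    mask[allowed_ids.to(scores.device)] = 0.0
    # Modify `scores` in place and return the same tensor object
    scores.add_(mask)
    return scores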

Statement-level postprocessing#

NIM also supports statement-level postprocessing for custom backends. To enable this, define a function get_guided_decoding_constrained_generator with the exact signature shown below; it updates the final response generator.

from vllm.entrypoints.openai.protocol import ChatCompletionResponse, CompletionResponse, ErrorResponse
from fastapi import Request
from typing import Union, AsyncGenerator

async def get_guided_decoding_constrained_generator(
    response: Union[AsyncGenerator[str, None], CompletionResponse, ChatCompletionResponse, ErrorResponse],
    raw_request: Request,
) -> Union[AsyncGenerator[str, None], CompletionResponse, ChatCompletionResponse, ErrorResponse]:
    # Modify response here
    updated_response = ...
    return updated_response
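
As an illustration, the sketch below appends a fixed marker to non-streaming completion responses and passes every other response type through unchanged. It assumes the vLLM OpenAI-compatible CompletionResponse exposes its generated text as choices[i].text; adapt the body to your own postprocessing.

from typing import AsyncGenerator, Union

from fastapi import Request
from vllm.entrypoints.openai.protocol import ChatCompletionResponse, CompletionResponse, ErrorResponse


async def get_guided_decoding_constrained_generator(
    response: Union[AsyncGenerator[str, None], CompletionResponse, ChatCompletionResponse, ErrorResponse],
    raw_request: Request,
) -> Union[AsyncGenerator[str, None], CompletionResponse, ChatCompletionResponse, ErrorResponse]:
    # Only rewrite fully materialized completion responses; leave streaming
    # generators, chat responses, and error responses untouched.
    if isinstance(response, CompletionResponse):
        for choice in response.choices:
            choice.text = choice.text + " [postprocessed]"
    return response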

Maintaining custom state#

NIM also lets you store and maintain custom state for use in a custom backend. To enable this, define a function set_custom_guided_decoding_parameters in backend.py:

from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest
from typing import Union

async def set_custom_guided_decoding_parameters(request: Union[ChatCompletionRequest, CompletionRequest]) -> None:
    # Set state here
    ...

This function is called once, before the (chat) completion is generated. User code is responsible for maintaining and fetching the state thereafter. For example, the state may be stored in a Python ContextVar:

from contextvars import ContextVar
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest
from typing import Union

my_context_var = ContextVar("my_state")

async def set_custom_guided_decoding_parameters(request: Union[ChatCompletionRequest, CompletionRequest]) -> None:
    my_state = ...
    my_context_var.set(my_state)

After NIM calls the user-defined set_custom_guided_decoding_parameters function, user-defined code can then access and update the state stored in my_context_var.
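
Putting the pieces together, the factory in backend.py can then read the stored state when it constructs the logits processor. The sketch below uses a hypothetical StatefulLogitsProcessor class and a trivial piece of state to show the pattern; neither is part of the NIM API.

from contextvars import ContextVar
from typing import List, Optional, Union

import torch
from transformers import PreTrainedTokenizer
from vllm.entrypoints.openai.protocol import ChatCompletionRequest, CompletionRequest
from vllm.sampling_params import LogitsProcessor

my_context_var = ContextVar("my_state")


class StatefulLogitsProcessor:
    '''Hypothetical processor that keeps the per-request state around.'''

    def __init__(self, tokenizer: PreTrainedTokenizer, state):
        self.tokenizer = tokenizer
        self.state = state

    def __call__(self, input_ids: Union[List[int], torch.LongTensor], scores: torch.FloatTensor) -> torch.FloatTensor:
        # Use `self.state` here to decide how to constrain `scores`.
        return scores


async def set_custom_guided_decoding_parameters(request: Union[ChatCompletionRequest, CompletionRequest]) -> None:
    # Derive whatever per-request state the processor needs and store it.
    my_context_var.set({"request_id": id(request)})


async def get_custom_guided_decoding_logits_processor(
    request: Union[CompletionRequest, ChatCompletionRequest], tokenizer: PreTrainedTokenizer
) -> Optional[LogitsProcessor]:
    # `set_custom_guided_decoding_parameters` has already run for this request,
    # so the state can be fetched from the context variable.
    my_state = my_context_var.get(None)
    return StatefulLogitsProcessor(tokenizer, my_state)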

Other considerations#

Do not import additional vllm modules that are not required, such as vllm.model_executor.guided_decoding or vllm.entrypoints.openai.api_server, because they can interfere with the dynamic imports and monkeypatching in NIM.

Note that custom guided decoding does not guarantee any specific ordering of the output.