Sampling Control#

NIM for VLMs exposes a suite of sampling parameters that gives users fine-grained control over the generation behavior of VLMs. The sections below are a complete reference for configuring these parameters in an inference request.

Sampling Parameters: OpenAI API#

| Params | Type | Default | Notes |
|---|---|---|---|
| presence_penalty | float | 0.0 | Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. Must be in [-2, 2]. |
| frequency_penalty | float | 0.0 | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. Must be in [-2, 2]. |
| repetition_penalty | float | 1.0 | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. Must be in (0, 2]. |
| temperature | float | 1.0 | Controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Must be >= 0. Set to 0 for greedy sampling. |
| top_p | float | 1.0 | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | int | -1 | Controls the number of top tokens to consider. Set to -1 to consider all tokens; must be >= 1 otherwise. |
| min_p | float | 0.0 | Minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable min_p. |
| seed | int | None | Random seed to use for the generation. |
| stop | str or List[str] | None | A string or list of strings that stop the generation when they are generated. The returned output will not contain the stop strings. |
| ignore_eos | bool | False | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. Useful for performance benchmarking. |
| max_tokens | int | 16 | Maximum number of tokens to generate per output sequence. Must be >= 1. |
| min_tokens | int | 0 | Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated. Must be >= 0. |
| logprobs | int | None | Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. Following the OpenAI API, the log probability of the sampled token is always returned, so the response may contain up to logprobs + 1 elements. Must be >= 0. |
| prompt_logprobs | int | None | Number of log probabilities to return per prompt token. Must be >= 0. |
| response_format | Dict[str, str] | None | Specifies the format that the model must output. Set to {'type': 'json_object'} to enable JSON mode, which guarantees that the output the model generates is valid JSON. See Structured Generation. |

Examples#

From command line:

curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "<model-name>",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    }
                ]
            }
        ],
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 256
    }'

Using the OpenAI Python API library:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    temperature=0.2,
    top_p=0.7,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
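
The other sampling parameters in the table above are passed the same way, as keyword arguments to chat.completions.create. The following is a minimal sketch; the parameter values (penalties, seed, stop strings) are illustrative only, not recommendations:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# A short text-only request for illustration; image content works the same way as above.
messages = [{"role": "user", "content": "List three colors."}]

chat_response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    temperature=0.7,
    presence_penalty=0.5,     # discourage reusing tokens that already appear in the output
    frequency_penalty=0.5,    # penalize tokens proportionally to how often they appear
    seed=42,                  # fix the random seed for more reproducible sampling
    stop=["\n\n"],            # stop generation at the first blank line
    max_tokens=64,
)
print(chat_response.choices[0].message.content)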

Advanced: Guided Decoding#

NIM for VLMs additionally supports guided decoding for structured generation through nvext. See Structured Generation for example use cases.

| Params | Type | Default | Notes |
|---|---|---|---|
| guided_json | str, dict, or Pydantic BaseModel | None | If specified, the output will follow the JSON schema. |
| guided_regex | str | None | If specified, the output will follow the regex pattern. |
| guided_choice | List[str] | None | If specified, the output will be exactly one of the choices. |
| guided_grammar | str | None | If specified, the output will follow the context-free grammar. |
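
As a brief sketch of how these fields can be sent from the OpenAI Python client, the nvext object is typically passed through the client's extra_body argument; the guided_choice values below are illustrative, and the authoritative payloads are documented on the Structured Generation page:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [{"role": "user", "content": "Is this a photo or a drawing? Answer with one word."}]

# Sketch: constrain the output to exactly one of two choices via the nvext extension.
chat_response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    max_tokens=16,
    extra_body={"nvext": {"guided_choice": ["photo", "drawing"]}},
)
print(chat_response.choices[0].message.content)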

Sampling Parameters: Llama Stack API#

Params

Type

Default

Notes

strategy

str

“greedy”

Sampling strategy for generation. Must be one of greedy, top_p, and top_k.

repetition_penalty

float

1.0

Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. Must be in (0, 2].

temperature

float

1.0

Controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Must be >= 0. Set to 0 for greedy sampling.

top_p

float

1.0

Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.

top_k

int

-1

Controls the number of top tokens to consider. Set to -1 to consider all tokens. Must be >=1 otherwise.

Important

The Llama Stack API currently does not support guided decoding.

Examples#

From command line:

curl -X 'POST' \
    'http://0.0.0.0:8000/inference/chat_completion' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "<model-name>",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "image":
                            {
                                "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    },
                    "What is in this image?"
                ]
            }
        ],
        "sampling_params": {
            "temperature": 0.2,
            "top_p": 0.7,
            "max_tokens": 256
        }
    }'

Using the Llama Stack Client Python Library:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": {
                    "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            },
            "What is in this image?"
        ]
    }
]

iterator = client.inference.chat_completion(
    model="<model-name>",
    messages=messages,
    sampling_params={
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 256,
    },
    stream=True
)

for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)
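
The strategy field from the table above is set inside the same sampling_params object. The following is a minimal sketch, assuming the request is otherwise identical to the example above and that the top_p strategy is combined with the temperature and top_p values shown; the repetition_penalty value is illustrative:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

# A short text-only request for illustration; image content follows the example above.
messages = [{"role": "user", "content": ["Describe a lake in one sentence."]}]

iterator = client.inference.chat_completion(
    model="<model-name>",
    messages=messages,
    sampling_params={
        "strategy": "top_p",        # one of greedy, top_p, top_k (see table above)
        "temperature": 0.2,
        "top_p": 0.7,
        "repetition_penalty": 1.1,  # illustrative value
        "max_tokens": 256,
    },
    stream=True
)

for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)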