Sampling Control#
NIM for VLMs exposes a suite of sampling parameters that gives users fine-grained control over the generation behavior of VLMs. The reference below describes how to configure these parameters in an inference request.
Sampling Parameters: OpenAI API#
| Params | Type | Default | Notes | 
|---|---|---|---|
| presence_penalty | float | 0.0 | Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. Must be in [-2, 2]. | 
| frequency_penalty | float | 0.0 | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. Must be in [-2, 2]. | 
| repetition_penalty | float | 1.0 | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. Must be in (0, 2]. | 
| temperature | float | 1.0 | Controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Must be >= 0. Set to 0 for greedy sampling. | 
| top_p | float | 1.0 | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | 
| top_k | int | -1 | Controls the number of top tokens to consider. Set to -1 to consider all tokens. Must be >=1 otherwise. | 
| min_p | float | 0.0 | Represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable. | 
| seed | int | None | Random seed to use for the generation. | 
| stop | str or List[str] | None | A string or list of strings that stop the generation when they are generated. The returned output will not contain the stop strings. | 
| ignore_eos | bool | False | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. Useful for performance benchmarking. | 
| max_tokens | int | 16 | Maximum number of tokens to generate per output sequence. Must be >= 1. | 
| min_tokens | int | 0 | Minimum number of tokens to generate per output sequence before EOS or a stop string can be generated. | 
| logprobs | int | None | Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. Note that the implementation follows the OpenAI API: the API always returns the log probability of the sampled token, so there may be up to logprobs + 1 elements in the response. | 
| prompt_logprobs | int | None | Number of log probabilities to return per prompt token. Must be >= 0. | 
| response_format | Dict[str, str] | None | Specifies the format that the model must output. Set to {"type": "json_object"} to enable JSON mode, which ensures the generated message is valid JSON. | 
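For illustration, several of the OpenAI-schema parameters above can be combined in a single raw request body. The following sketch assumes the requests library and reuses the endpoint, model placeholder, and image from the examples below; the parameter values are arbitrary, not recommendations.
import requests

# Request body combining several of the sampling parameters documented above.
payload = {
    "model": "<model-name>",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    }
                }
            ]
        }
    ],
    "temperature": 0.2,        # lower values are more deterministic; 0 is greedy
    "top_p": 0.7,              # nucleus sampling over the top 70% of probability mass
    "frequency_penalty": 0.5,  # discourage tokens that repeat frequently
    "presence_penalty": 0.5,   # discourage tokens that have already appeared
    "seed": 42,                # make sampling reproducible
    "stop": ["\n\n"],          # stop generation at the first blank line
    "max_tokens": 256
}

response = requests.post(
    "http://0.0.0.0:8000/v1/chat/completions",
    headers={"Accept": "application/json", "Content-Type": "application/json"},
    json=payload,
)
print(response.json()["choices"][0]["message"]["content"])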
Examples#
From the command line:
curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "<model-name>",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    }
                ]
            }
        ],
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 256
    }'
Using the OpenAI Python API library:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    temperature=0.2,
    top_p=0.7,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
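Sampling parameters that are not part of the standard OpenAI schema (for example top_k, min_p, repetition_penalty, ignore_eos, and min_tokens) can be supplied through the OpenAI Python client's extra_body argument, which merges the extra fields into the request body. A minimal sketch, reusing the client and messages defined above; the values are illustrative only.
# Non-OpenAI-schema sampling parameters from the table above are passed via extra_body.
chat_response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    temperature=0.6,
    max_tokens=256,
    extra_body={
        "top_k": 40,                # consider only the 40 most likely tokens
        "min_p": 0.05,              # drop tokens below 5% of the top token's probability
        "repetition_penalty": 1.1,  # mildly discourage repetition
        "min_tokens": 16            # generate at least 16 tokens before EOS or a stop string
    }
)
print(chat_response.choices[0].message.content)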
Advanced: Guided Decoding#
NIM for VLMs additionally supports guided decoding for structured generation through nvext. See Structured Generation for example use cases; a brief sketch also follows the table below.
| Params | Type | Default | Notes | 
|---|---|---|---|
| guided_json | str, dict, or Pydantic BaseModel | None | If specified, the output will follow the JSON schema. | 
| guided_regex | str | None | If specified, the output will follow the regex pattern. | 
| guided_choice | List[str] | None | If specified, the output will be exactly one of the choices. | 
| guided_grammar | str | None | If specified, the output will follow the context-free grammar. |
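As a brief illustration, a guided decoding constraint can be attached to a request by nesting it under nvext in the OpenAI Python client's extra_body. The exact payload shape shown here is a sketch based on the nvext convention above; the prompt and choices are placeholders, and Structured Generation remains the reference for complete use cases.
# Constrain the answer to one of a fixed set of choices via guided_choice.
# The nvext nesting is assumed from the note above; see Structured Generation for details.
guided_response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,  # reuse the image question from the previous example
    max_tokens=16,
    extra_body={
        "nvext": {
            "guided_choice": ["an outdoor scene", "an indoor scene"]
        }
    }
)
print(guided_response.choices[0].message.content)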