Sampling Control#
NIM for VLMs exposes a suite of sampling parameters that gives users fine-grained control over the generation behavior of VLMs. The following is a complete reference for configuring these sampling parameters in an inference request.
Sampling Parameters: OpenAI API#
Params | Type | Default | Notes
---|---|---|---
presence_penalty | float | 0.0 | Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. Must be in [-2, 2].
frequency_penalty | float | 0.0 | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. Must be in [-2, 2].
repetition_penalty | float | 1.0 | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. Must be in (0, 2].
temperature | float | 1.0 | Controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Must be >= 0. Set to 0 for greedy sampling.
top_p | float | 1.0 | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
top_k | int | -1 | Controls the number of top tokens to consider. Set to -1 to consider all tokens; must be >= 1 otherwise.
min_p | float | 0.0 | Represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable.
seed | int | None | Random seed to use for the generation.
stop | str or List[str] | None | A string or list of strings that stop the generation when they are generated. The returned output will not contain the stop strings.
ignore_eos | bool | False | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. Useful for performance benchmarking.
max_tokens | int | 16 | Maximum number of tokens to generate per output sequence. Must be >= 1.
min_tokens | int | 0 | Minimum number of tokens to generate per output sequence before EOS or a stop sequence can be generated.
logprobs | int | None | Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. Note that the implementation follows the OpenAI API: the API always returns the log probability of the sampled token, so there may be up to logprobs + 1 elements in the response.
prompt_logprobs | int | None | Number of log probabilities to return per prompt token. Must be >= 0.
response_format | Dict[str, str] | None | Specifies the format that the model must output. Set to {"type": "json_object"} to enable JSON mode, which constrains the generated text to valid JSON.
Examples#
From the command line:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "<model-name>",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url":
{
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
}
],
"temperature": 0.2,
"top_p": 0.7,
"max_tokens": 256
}'
Using the OpenAI Python API library:
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
}
]
chat_response = client.chat.completions.create(
model="<model-name>",
messages=messages,
temperature=0.2,
top_p=0.7,
max_tokens=256,
stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
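The other OpenAI-compatible parameters from the table above can be set on the same call. The following is a minimal sketch that reuses the client and messages defined above; the specific values (penalties, seed, stop string) are illustrative, not recommendations.
chat_response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    temperature=0.2,
    top_p=0.7,
    max_tokens=256,
    frequency_penalty=0.5,  # discourage tokens that already appear often in the output
    presence_penalty=0.3,   # nudge the model toward tokens it has not used yet
    seed=42,                # fixed seed for more reproducible sampling
    stop=["\n\n"],          # stop at the first blank line; the stop string is not returned
)
print(chat_response.choices[0].message.content)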
Advanced: Guided Decoding#
NIM for VLMs additionally supports guided decoding for structured generation through nvext. See Structured Generation for example use cases.
Params | Type | Default | Notes
---|---|---|---
guided_json | str, dict, or Pydantic BaseModel | None | If specified, the output will follow the JSON schema.
guided_regex | str | None | If specified, the output will follow the regex pattern.
guided_choice | List[str] | None | If specified, the output will be exactly one of the choices.
guided_grammar | str | None | If specified, the output will follow the context-free grammar.
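As a quick sketch of how this can look with the OpenAI Python client from the earlier example, the request below passes guided_choice under the nvext field of the request body via extra_body. The choice list is illustrative; see Structured Generation for the authoritative usage of these options.
guided_response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    max_tokens=32,
    extra_body={
        "nvext": {
            # Constrain the output to exactly one of these strings.
            "guided_choice": ["outdoors", "indoors", "unclear"]
        }
    },
)
print(guided_response.choices[0].message.content)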
Sampling Parameters: Llama Stack API#
Params | Type | Default | Notes
---|---|---|---
strategy | str | "greedy" | Sampling strategy for generation. Must be one of "greedy", "top_p", or "top_k".
repetition_penalty | float | 1.0 | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. Must be in (0, 2].
temperature | float | 1.0 | Controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Must be >= 0. Set to 0 for greedy sampling.
top_p | float | 1.0 | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
top_k | int | -1 | Controls the number of top tokens to consider. Set to -1 to consider all tokens; must be >= 1 otherwise.
Important
The Llama Stack API currently does not support guided decoding.
Examples#
From the command line:
curl -X 'POST' \
'http://0.0.0.0:8000/inference/chat_completion' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "<model-name>",
"messages": [
{
"role": "user",
"content": [
{
"image":
{
"uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
},
"What is in this image?"
]
}
],
"sampling_params": {
"temperature": 0.2,
"top_p": 0.7,
"max_tokens": 256
}
}'
Using the Llama Stack Client Python library:
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://0.0.0.0:8000")
messages = [
{
"role": "user",
"content": [
{
"image": {
"uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
},
"What is in this image?"
]
}
]
iterator = client.inference.chat_completion(
model="<model-name>",
messages=messages,
sampling_params={
"temperature": 0.2,
"top_p": 0.7,
"max_tokens": 256,
},
stream=True
)
for chunk in iterator:
print(chunk.event.delta, end="", flush=True)
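To use a different sampling strategy from the table above, adjust sampling_params on the same call. The sketch below reuses the client and messages from the previous example and assumes "top_k" is one of the accepted strategy values; the specific numbers are illustrative.
iterator = client.inference.chat_completion(
    model="<model-name>",
    messages=messages,
    sampling_params={
        "strategy": "top_k",        # assumed strategy value; sample from the k most likely tokens
        "top_k": 40,
        "temperature": 0.2,
        "repetition_penalty": 1.1,  # mildly discourage repeated tokens
        "max_tokens": 256,
    },
    stream=True
)
for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)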