Sampling Control#

NIM for VLMs exposes a suite of sampling parameters that gives users fine-grained control over the generation behavior of VLMs. The sections below are a complete reference for configuring these parameters in an inference request.

Sampling Parameters: OpenAI API#

| Params | Type | Default | Notes |
|---|---|---|---|
| presence_penalty | float | 0.0 | Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. Must be in [-2, 2]. |
| frequency_penalty | float | 0.0 | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. Must be in [-2, 2]. |
| repetition_penalty | float | 1.0 | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. Must be in (0, 2]. |
| temperature | float | 1.0 | Controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Must be >= 0. Set to 0 for greedy sampling. |
| top_p | float | 1.0 | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
| top_k | int | -1 | Controls the number of top tokens to consider. Set to -1 to consider all tokens; must be >= 1 otherwise. |
| min_p | float | 0.0 | Minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable min_p. |
| seed | int | None | Random seed to use for the generation. |
| stop | str or List[str] | None | A string or list of strings that stop the generation when they are generated. The returned output will not contain the stop strings. |
| ignore_eos | bool | False | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. Useful for performance benchmarking. |
| max_tokens | int | 16 | Maximum number of tokens to generate per output sequence. Must be >= 1. |
| min_tokens | int | 0 | Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated. Must be >= 0. |
| logprobs | int | None | Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. Following the OpenAI API, the log probability of the sampled token is always returned, so the response may contain up to logprobs + 1 elements. Must be >= 0. |
| prompt_logprobs | int | None | Number of log probabilities to return per prompt token. Must be >= 0. |
| response_format | Dict[str, str] | None | Specifies the format that the model must output. Set to {'type': 'json_object'} to enable JSON mode, which guarantees that the output the model generates is valid JSON. See Structured Generation. |

Examples#

From command line:

curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "<model-name>",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url":
                            {
                                "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    }
                ]
            }
        ],
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 256
    }'

Using the OpenAI Python API library:

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is in this image?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            }
        ]
    }
]
chat_response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    temperature=0.2,
    top_p=0.7,
    max_tokens=256,
    stream=False
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
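
The other sampling parameters in the table above are passed the same way, as keyword arguments to chat.completions.create. The following is a minimal sketch; the parameter values (penalties, seed, stop strings) are illustrative only, not recommendations:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# A short text-only request for illustration; image content works the same way as above.
messages = [{"role": "user", "content": "List three colors."}]

chat_response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    temperature=0.7,
    presence_penalty=0.5,     # discourage reusing tokens that already appear in the output
    frequency_penalty=0.5,    # penalize tokens proportionally to how often they appear
    seed=42,                  # fix the random seed for more reproducible sampling
    stop=["\n\n"],            # stop generation at the first blank line
    max_tokens=64,
)
print(chat_response.choices[0].message.content)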

Advanced: Guided Decoding#

NIM for VLMs additionally supports guided decoding for structured generation through nvext. See Structured Generation for example use cases.

| Params | Type | Default | Notes |
|---|---|---|---|
| guided_json | str, dict, or Pydantic BaseModel | None | If specified, the output will follow the JSON schema. |
| guided_regex | str | None | If specified, the output will follow the regex pattern. |
| guided_choice | List[str] | None | If specified, the output will be exactly one of the choices. |
| guided_grammar | str | None | If specified, the output will follow the context-free grammar. |
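
As a brief sketch of how these fields can be sent from the OpenAI Python client, the nvext object is typically passed through the client's extra_body argument; the guided_choice values below are illustrative, and the authoritative payloads are documented on the Structured Generation page:

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [{"role": "user", "content": "Is this a photo or a drawing? Answer with one word."}]

# Sketch: constrain the output to exactly one of two choices via the nvext extension.
chat_response = client.chat.completions.create(
    model="<model-name>",
    messages=messages,
    max_tokens=16,
    extra_body={"nvext": {"guided_choice": ["photo", "drawing"]}},
)
print(chat_response.choices[0].message.content)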

Sampling Parameters: Llama Stack API#

Params

Type

Default

Notes

strategy

str

“greedy”

Sampling strategy for generation. Must be one of greedy, top_p, and top_k.

repetition_penalty

float

1.0

Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. Must be in (0, 2].

temperature

float

1.0

Controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Must be >= 0. Set to 0 for greedy sampling.

top_p

float

1.0

Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.

top_k

int

-1

Controls the number of top tokens to consider. Set to -1 to consider all tokens. Must be >=1 otherwise.

Important

The Llama Stack API currently does not support guided decoding.

Examples#

From command line:

curl -X 'POST' \
    'http://0.0.0.0:8000/inference/chat_completion' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "<model-name>",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "image":
                            {
                                "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                            }
                    },
                    "What is in this image?"
                ]
            }
        ],
        "sampling_params": {
            "temperature": 0.2,
            "top_p": 0.7,
            "max_tokens": 256
        }
    }'

Using the Llama Stack Client Python Library:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

messages = [
    {
        "role": "user",
        "content": [
            {
                "image": {
                    "uri": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                }
            },
            "What is in this image?"
        ]
    }
]

iterator = client.inference.chat_completion(
    model="<model-name>",
    messages=messages,
    sampling_params={
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 256,
    },
    stream=True
)

for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)
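
The strategy field from the table above is set inside the same sampling_params object. The following is a minimal sketch, assuming the request is otherwise identical to the example above and that the top_p strategy is combined with the temperature and top_p values shown; the repetition_penalty value is illustrative:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://0.0.0.0:8000")

# A short text-only request for illustration; image content follows the example above.
messages = [{"role": "user", "content": ["Describe a lake in one sentence."]}]

iterator = client.inference.chat_completion(
    model="<model-name>",
    messages=messages,
    sampling_params={
        "strategy": "top_p",        # one of greedy, top_p, top_k (see table above)
        "temperature": 0.2,
        "top_p": 0.7,
        "repetition_penalty": 1.1,  # illustrative value
        "max_tokens": 256,
    },
    stream=True
)

for chunk in iterator:
    print(chunk.event.delta, end="", flush=True)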