Notes on NIM Container Variants#

Some NIMs are built with packages that vary from the standard base Docker container. These NIMs can better access features specific to a particular model or can run on GPUs before they are fully supported in the main source code branch. These NIMs, also known as NIM container variants, are designated by the -variant suffix in their version tag name.

These NIM container variants have important, underlying differences from NIMs built with the standard base container. These differences vary according to model. This page documents these differences with respect to the features and functionality of LLM NIM container version 1.15.0. Refer to the following:

DeepSeek-V4-Flash#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_FORCE_DETERMINISTIC
NIM_MAX_MODEL_LEN
NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME: Use NIM_SERVED_MODEL_NAME instead
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_RELAX_MEM_CONSTRAINTS
NIM_REPOSITORY_OVERRIDE
NIM_REWARD_LOGITS_RANGE: Not a reward model
NIM_REWARD_MODEL: Not a reward model
NIM_REWARD_MODEL_STRING: Not a reward model
NIM_TOKENIZER_MODE: Defaults to fast mode
NIM_ENABLE_PROMPT_EMBEDS
NIM_PER_REQ_METRICS_ENABLE
NIM_TELEMETRY_ENABLE_ON_RTX
NIM_TELEMETRY_INTERVAL_MINUTES
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
- vLLM uses guided_json, guided_choice, and guided_regex followed by a string.
- SGLang uses response_format(json), similar to the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or integer (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total
KV cache metrics

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

The NIM supports various thinking modes:

Reasoning Mode	Characteristics	Request Configuration
Non-think	Fast, intuitive responses; Chat mode without reasoning	Default, or `chat_template_kwargs: {"thinking": false}`
Think High	Standard thinking mode; deliberate logical analysis, slower but more accurate	`chat_template_kwargs: {"thinking": true}`
Think Max	Maximum reasoning effort; pushes reasoning to its fullest extent	`chat_template_kwargs: {"thinking": true},` `"reasoning_effort": "max"`

No other changes to usage and features are needed.

DeepSeek-V4-Pro#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_FORCE_DETERMINISTIC
NIM_MAX_MODEL_LEN
NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME: Use NIM_SERVED_MODEL_NAME instead
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_RELAX_MEM_CONSTRAINTS
NIM_REPOSITORY_OVERRIDE
NIM_REWARD_LOGITS_RANGE: Not a reward model
NIM_REWARD_MODEL: Not a reward model
NIM_REWARD_MODEL_STRING: Not a reward model
NIM_TOKENIZER_MODE: Defaults to fast mode
NIM_ENABLE_PROMPT_EMBEDS
NIM_PER_REQ_METRICS_ENABLE
NIM_TELEMETRY_ENABLE_ON_RTX
NIM_TELEMETRY_INTERVAL_MINUTES
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
- vLLM uses guided_json, guided_choice, and guided_regex followed by a string.
- SGLang uses response_format(json), similar to the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or integer (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total
KV cache metrics

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

The NIM supports various thinking modes:

Reasoning Mode	Characteristics	Request Configuration
Non-think	Fast, intuitive responses; Chat mode without reasoning	Default, or `chat_template_kwargs: {"thinking": false}`
Think High	Standard thinking mode; deliberate logical analysis, slower but more accurate	`chat_template_kwargs: {"thinking": true}`
Think Max	Maximum reasoning effort; pushes reasoning to its fullest extent	`chat_template_kwargs: {"thinking": true},` `"reasoning_effort": "max"`

No other changes to usage and features are needed.

DeepSeek-V3.2-Exp#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_DISABLE_OVERLAP_SCHEDULING
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_DETERMINISTIC
NIM_FT_MODEL
NIM_GUIDED_DECODING_BACKEND: SGLang backend supports "xgrammar", "outlines", "llguidance", and "none", and does not support custom guided decoding backends. Refer to Custom Guided Decoding Backends.
NIM_JSONL_LOGGING
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_PER_REQ_METRICS_ENABLE
NIM_REWARD_LOGITS_RANGE
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_SCHEDULER_POLICY: Supported, but SGLang has different possible values ("lpm", "random", "fcfs", "dfs-weight", "lof", and "priority") from the LLM NIM container version 1.14.0.
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

New Additions#

The following new environment variables are supported:

Note

Some variables may not be applicable to every model (for example, not all models support tool calling or thinking).

TOOL_CALL_PARSER / NIM_TOOL_CALL_PARSER
REASONING_PARSER / NIM_REASONING_PARSER
NIM_CHAT_TEMPLATE
NIM_TAGS_SELECTOR

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many API variables lack error handling methods.
Structured output
- vLLM uses guided_json, guided_choice and guided_regex followed by a string.
- SGLang uses response_format(json), such as the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
Tool calling with streams
- For SGLang, the second-to-last chunk contains the complete tool call content.
- For vLLM, all chunks contain streamed tool call content.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or int (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext
Usage stats (for example, prompt_tokens and total_tokens) from the /v1/response endpoint

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

No other changes to usage and features are needed.

DeepSeek-V3.1-Terminus#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_DISABLE_OVERLAP_SCHEDULING
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_DETERMINISTIC
NIM_FT_MODEL
NIM_GUIDED_DECODING_BACKEND: SGLang backend supports "xgrammar", "outlines", "llguidance", and "none", and does not support custom guided decoding backends. Refer to Custom Guided Decoding Backends.
NIM_JSONL_LOGGING
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_MODEL_LEN
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_PER_REQ_METRICS_ENABLE
NIM_REWARD_LOGITS_RANGE
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_SCHEDULER_POLICY: Supported, but SGLang has different possible values ("lpm", "random", "fcfs", "dfs-weight", "lof", and "priority") from the LLM NIM container version 1.14.0.
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

New Additions#

The following new environment variables are supported:

Note

Some variables may not be applicable to every model (for example, not all models support tool calling or thinking).

TOOL_CALL_PARSER / NIM_TOOL_CALL_PARSER
REASONING_PARSER / NIM_REASONING_PARSER
NIM_CHAT_TEMPLATE
NIM_TAGS_SELECTOR

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many variables lack error handling methods.
Structured output
- vLLM uses guided_json, guided_choice and guided_regex followed by a string.
- SGLang uses response_format(json), such as the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
Tool calling with streams
- For SGLang, the second-to-last chunk contains the complete tool call content.
- For vLLM, all chunks contain streamed tool call content.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or int (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

No other changes to usage and features are needed.

GPT-OSS-120B#

Environment Variables#

Not Supported#

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_DISABLE_CUDA_GRAPH : Defaults to False
NIM_DISABLE_OVERLAP_SCHEDULING
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_EMBEDS
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_DETERMINISTIC
NIM_FORCE_TRUST_REMOTE_CODE : Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE : This is no longer required
NIM_PER_REQ_METRICS_ENABLE
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_REWARD_LOGITS_RANGE
NIM_SCHEDULER_POLICY
NIM_SERVED_MODEL_NAME : Only a single name is supported
NIM_TELEMETRY_INTERVAL_MINUTES
NIM_TELEMETRY_MODE
NIM_TOKENIZER_MODE: Defaults to fast mode
SSL_CERT_FILE : Use NIM_SSL_CERT_PATH instead

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

GPT-OSS-20B#

Environment Variables#

Not Supported#

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_DISABLE_CUDA_GRAPH : Defaults to False
NIM_DISABLE_OVERLAP_SCHEDULING
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_EMBEDS
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_DETERMINISTIC
NIM_FORCE_TRUST_REMOTE_CODE : Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE : This is no longer required
NIM_PER_REQ_METRICS_ENABLE
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_REWARD_LOGITS_RANGE
NIM_SCHEDULER_POLICY
NIM_SERVED_MODEL_NAME : Only a single name is supported
NIM_TELEMETRY_INTERVAL_MINUTES
NIM_TELEMETRY_MODE
NIM_TOKENIZER_MODE : Defaults to fast mode
SSL_CERT_FILE : Use NIM_SSL_CERT_PATH instead

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

Llama-3.1-8b-Instruct-DGX-Spark#

This NIM container variant was released with LLM NIM container version 1.14 and uses the 1.0.0-variant tag. For more information, refer to the 1.14 version of this page.

MiMo-V2-Flash#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME: Use NIM_SERVED_MODEL_NAME instead
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_EMBEDS
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_BATCH_SIZE
NIM_MAX_MODEL_LEN
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PER_REQ_METRICS_ENABLE
NIM_RELAX_MEM_CONSTRAINTS
NIM_REWARD_LOGITS_RANGE: Not a reward model
NIM_REWARD_MODEL: Not a reward model
NIM_REWARD_MODEL_STRING: Not a reward model
NIM_SCHEDULER_POLICY
NIM_TELEMETRY_ENABLE_ON_RTX
NIM_TELEMETRY_INTERVAL_MINUTES
NIM_TELEMETRY_MODE
NIM_TOKENIZER_MODE: Defaults to fast mode
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Deployment Considerations#

One of the following configurations is required:

B200 in a TP8 configuration
- Set NIM_ATTENTION_BACKEND=triton and NIM_ENABLE_MTP=False
B200 in a TP4 configuration
- Set NIM_ATTENTION_BACKEND=triton, NIM_ENABLE_MTP=False, NIM_ENABLE_DP_ATTENTION=False, and DP_SIZE=1
H200 in a TP8 configuration
- Set NIM_KV_CACHE_PERCENT=0.80
GB200 in a TP4 configuration
- Set NIM_ATTENTION_BACKEND=triton, NIM_ENABLE_MTP=False, NIM_ENABLE_DP_ATTENTION=False, and DP_SIZE=1
H100 in a TP8 configuration
- Set NIM_KV_CACHE_PERCENT=0.70
H20 in a TP8 configuration
- Set NIM_KV_CACHE_PERCENT=0.70

Usage Changes and Features#

Tool Calling#

For tool calling, you should use the following system prompt:

{"role": "system", "content": "You are a helpful assistant with access to tools."}

Chat Template Keyword and Thinking#

When the chat_template_kwargs field is omitted from the request, the model returns content: null and the response in the reasoning_content field only. When the chat_template_kwargs field is included, both content and reasoning_content fields are populated in the response.

For best performance, add the chat_template_kwargs field to the request and use it to enable thinking:

"chat_template_kwargs": {"enable_thinking": true}

Sampling Parameters#

You should use the following values for sampling parameters:

top_p = 0.95
temperature = 0.8 # For math, writing, and web-dev
temperature = 0.3 # For agentic taks (for example, vibe coding and tool use)

No other changes to usage and features are needed.

MiniMax-M2.5#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME: Use NIM_SERVED_MODEL_NAME instead
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_EMBEDS
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_PER_REQ_METRICS_ENABLE
NIM_RELAX_MEM_CONSTRAINTS
NIM_REPOSITORY_OVERRIDE
NIM_REWARD_LOGITS_RANGE: Not a reward model
NIM_REWARD_MODEL: Not a reward model
NIM_REWARD_MODEL_STRING: Not a reward model
NIM_TELEMETRY_ENABLE_ON_RTX
NIM_TELEMETRY_INTERVAL_MINUTES
NIM_TOKENIZER_MODE: Defaults to fast mode
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
- vLLM uses guided_json, guided_choice, and guided_regex followed by a string.
- SGLang uses response_format(json), similar to the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or integer (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total
KV cache metrics

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

No other changes to usage and features are needed.

MiniMax-M2.7#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_MAX_MODEL_LEN
NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME: Use NIM_SERVED_MODEL_NAME instead
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_RELAX_MEM_CONSTRAINTS
NIM_REPOSITORY_OVERRIDE
NIM_REWARD_LOGITS_RANGE: Not a reward model
NIM_REWARD_MODEL: Not a reward model
NIM_REWARD_MODEL_STRING: Not a reward model
NIM_TOKENIZER_MODE: Defaults to fast mode
NIM_ENABLE_PROMPT_EMBEDS
NIM_PER_REQ_METRICS_ENABLE
NIM_TELEMETRY_ENABLE_ON_RTX
NIM_TELEMETRY_INTERVAL_MINUTES
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
- vLLM uses guided_json, guided_choice, and guided_regex followed by a string.
- SGLang uses response_format(json), similar to the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or integer (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total
KV cache metrics

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

No other changes to usage and features are needed.

Nemotron 3 Nano#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_DISABLE_CUDA_GRAPH: Defaults to False
NIM_DISABLE_OVERLAP_SCHEDULING
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_DETERMINISTIC
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_REWARD_LOGITS_RANGE
NIM_SCHEDULER_POLICY
NIM_SERVED_MODEL_NAME: Only a single name is supported
NIM_TOKENIZER_MODE: Defaults to fast mode
SSL_CERT_FILE: Use NIM_SSL_CERT_PATH instead

Note

Most of these variables are not used with an SGLang backend.

New Additions#

The following new environment variables are supported:

Note

Some variables may not be applicable to every model (for example, not all models support tool calling or thinking).

NIM_TAGS_SELECTOR: Filters tags in the automatic profile selector. You can use a list of key-value pairs, where the key is the profile property name and the value is the desired property value. For example, set NIM_TAGS_SELECTOR="profile=latency" to automatically select the latency profile. Or set NIM_TAGS_SELECTOR="tp=4" to select a throughput profile that supports 4 GPUs.
DISABLE_RADIX_CACHE: Set to 1 to disable KV cache reuse.
NIM_ENABLE_MTP: Set to 1 to enable the LLM to generate several tokens at once, boosting speed, efficiency, and reasoning.
REASONING_PARSER: Set to 1 to turn thinking on.
TOOL_CALL_PARSER: Set to 1 to turn tool calling on.
NIM_CONFIG_FILE: Specifies a configuration YAML file for advanced parameter tuning. Use this file to overwrite the default NIM configuration values. You must convert the hyphens in server argument names to underscores. For example, the following SGLang command arguments:
```
python -m sglang.launch_server --model-path XXX --tp-size 4 \
  --context-length 262144 --mem-fraction-static 0.8
```
are defined by the following content in the configuration YAML file:
```
tp_size: 4
context_length: 262144
mem_fraction_static: 0.8
```
Default value: None.

API Compatibility#

The following API features are not supported:

logprobs
suffix
Guided decoding (including guided_whitespace_pattern and structured_generation)
Echo and role configuration
Reward
Llama API
nvext

nvext features are supported using different parameters in the top-level payload.

Security Features#

No changes to security features. These models maintain the same security features and capabilities as standard models. No additional security limitations or modifications apply.

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

No other changes to usage and features are needed.

Nemotron-3-Super-120B-A12B#

Deployment Considerations#

This NIM might need additional configuration for deployment. In addition to the information in Get Started, use the information in this section to deploy the NIM on a given GPU. Refer to Nemotron-3-Super-120B-A12B on the Supported Models page for information about the supported profiles.

Memory Configuration#

Due to the large model size, you might encounter out of memory (OOM) errors when deploying the following profiles:

GPU	Precision	TP
B200	NVFP4	TP1, TP2
B200	FP8	TP1, TP2
B200	BF16	TP2
H200	FP8	TP1, TP2
H200	BF16	TP2
H200 NVL	FP8	TP1, TP2
H200 NVL	BF16	TP2
GB200	FP8	TP1, TP2
GB200	BF16	TP2
GH200	All profiles
H100	FP8	TP2, TP4
H100	BF16	TP4, TP8
H100 NVL	FP8	TP2
A100	All profiles
A100 40GB	All profiles
L40S	All profiles
NVIDIA RTX PRO 6000 Blackwell Server Edition	FP8	TP2, TP4, TP8
NVIDIA RTX PRO 6000 Blackwell Server Edition	NVFP4	TP1, TP2, TP4, TP8

Set the following environment variables to adjust memory usage:

NIM_MAX_MODEL_LEN
NIM_KVCACHE_PERCENT
NIM_MAX_BATCH_SIZE

Chunked prefill is disabled by default, which can cause errors when running the quantization profile with a low TP (1 or 2). The typical symptom is the log message expr_fits_within_32bit. To resolve this, either reduce the model length by setting NIM_MAX_MODEL_LEN=131072, or re-enable chunked prefill with NIM_ENABLE_CHUNKED_PREFILL=1.

GPU and Profile-Specific Required Settings#

Set the following environment variables per specific GPU and profile:

L40S GPU and FP8 profiles: VLLM_USE_FLASHINFER_MOE_FP8=0
L40S GPU and the FP8 TP4 profile: NIM_MAX_MODEL_LEN=65536
B200 and GB200 GPUs (for maximum accuracy):
- FP8 and NVFP4 profiles:
  - MAMBA_CACHE_RS_ROUNDING=1
  - MAMBA_CACHE_PHILOX_ROUNDS=5
  - NIM_ATTENTION_BACKEND=TRITON_ATTN
- BF16 profiles:
  - NIM_ATTENTION_BACKEND=FLASH_ATTN
L40S, H100, H200, and RTX 6000 GPUs and FP8 profiles: NIM_MAMBA_SSM_CACHE_DTYPE=float32

LoRA Deployment#

To deploy any LoRA profile, set the following environment variables:

NIM_MAX_LORA_RANK: You should set this variable to 32, 16, or lower.
NIM_MAX_GPU_LORAS
NIM_MAX_CPU_LORAS

Note the following:

LoRA NVFP4 profiles are not supported.
To deploy LoRA profiles on Blackwell GPUs, set the environment variable VLLM_LORA_DISABLE_PDL=1.

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_DISABLE_CUDA_GRAPH: Defaults to False
NIM_DISABLE_OVERLAP_SCHEDULING
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_EMBEDS
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_DETERMINISTIC
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_REWARD_LOGITS_RANGE
NIM_SCHEDULER_POLICY
NIM_SERVED_MODEL_NAME: Only a single name is supported
NIM_TELEMETRY_MODE
NIM_TELEMETRY_INTERVAL_MINUTES
NIM_TOKENIZER_MODE: Defaults to fast mode
SSL_CERT_FILE: Use NIM_SSL_CERT_PATH instead

Note

Most of these variables are not used with an SGLang backend.

New Additions#

The following new environment variables are supported:

Note

Some variables might not be applicable to every model (for example, not all models support tool calling or thinking).

NIM_TAGS_SELECTOR: Filters tags in the automatic profile selector. You can use a list of key-value pairs, where the key is the profile property name and the value is the desired property value. For example, set NIM_TAGS_SELECTOR="profile=latency" to automatically select the latency profile. Or set NIM_TAGS_SELECTOR="tp=4" to select a throughput profile that supports four GPUs.
DISABLE_RADIX_CACHE: Set to 1 to disable KV cache reuse.
NIM_ENABLE_MTP: Set to 1 to enable the LLM to generate several tokens at once, boosting speed, efficiency, and reasoning.
REASONING_PARSER: Set to 1 to turn thinking on.
TOOL_CALL_PARSER: Set to 1 to turn tool calling on.
NIM_CONFIG_FILE: Specifies a configuration YAML file for advanced parameter tuning. Use this file to overwrite the default NIM configuration values. You must convert the hyphens in server argument names to underscores. For example, the following SGLang command arguments:
```
python -m sglang.launch_server --model-path XXX --tp-size 4 \
  --context-length 262144 --mem-fraction-static 0.8
```
are defined by the following content in the configuration YAML file:
```
tp_size: 4
context_length: 262144
mem_fraction_static: 0.8
```
Default value: None.

API Compatibility#

The following API features are not supported:

logprobs
suffix
Echo and role configuration
Reward
Llama API
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The /v1/metrics endpoint returns metric names that use the vllm: prefix for the vLLM backend.

The following metrics (from Observability) must be queried using their prefixed names:

Documented Metric Name	Prefixed Metric Name
`gpu_cache_usage_perc`	`vllm:kv_cache_usage_perc`
`num_requests_running`	`vllm:num_requests_running`
`num_requests_waiting`	`vllm:num_requests_waiting`
`prompt_tokens_total`	`vllm:prompt_tokens_total`
`generation_tokens_total`	`vllm:generation_tokens_total`
`time_to_first_token_seconds`	`vllm:time_to_first_token_seconds`
`time_per_output_token_seconds`	`vllm:request_time_per_output_token_seconds`
`e2e_request_latency_seconds`	`vllm:e2e_request_latency_seconds`
`request_prompt_tokens`	`vllm:request_prompt_tokens`
`request_generation_tokens`	`vllm:request_generation_tokens`
`request_success_total`	`vllm:request_success_total`

Note that gpu_cache_usage_perc has also been renamed to kv_cache_usage_perc in addition to the prefix change. Update any Prometheus queries, Grafana dashboards, or alerting rules accordingly.

Security Features#

No changes to security features. These models maintain the same security features and capabilities as standard models. No additional security limitations or modifications apply.

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

Use the reasoning_budget field in the request to control the thinking budget. Use the low_effort field in the request to limit the thinking effort without setting an explicit thinking budget.

For tool calling, the model supports setting tool_choice: "required", which forces the model to call a tool. The model also supports named tool calls, which let you specify a tool by name in tool_choice.

No other changes to usage and features are needed.

NVIDIA-Nemotron-Nano-9B-v2-DGX-Spark#

This NIM container variant was released with LLM NIM container version 1.14 and uses the 1.0.0-variant tag. For more information, refer to the 1.14 version of this page.

Qwen3-Coder-Next#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_DISABLE_CUDA_GRAPH: Defaults to False
NIM_DISABLE_OVERLAP_SCHEDULING
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_JSONL_LOGGING
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_RELAX_MEM_CONSTRAINTS
NIM_REPOSITORY_OVERRIDE
NIM_REWARD_LOGITS_RANGE
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_TOKENIZER_MODE: Defaults to fast mode
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
- vLLM uses guided_json, guided_choice, and guided_regex followed by a string.
- SGLang uses response_format(json), similar to the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
When using tool calling with streams, all chunks contain streamed tool call content.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or integer (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total
KV cache metrics

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

No other changes to usage and features are needed.

Qwen3-Next-80B-A3B-Instruct#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_DISABLE_CUDA_GRAPH: Defaults to False
NIM_DISABLE_OVERLAP_SCHEDULING
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_JSONL_LOGGING
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_RELAX_MEM_CONSTRAINTS
NIM_REPOSITORY_OVERRIDE
NIM_REWARD_LOGITS_RANGE
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_TOKENIZER_MODE: Defaults to fast mode
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many variables lack error handling methods.
Structured output
- vLLM uses guided_json, guided_choice, and guided_regex followed by a string.
- SGLang uses response_format(json), similar to the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
Tool calling with streams
- For SGLang, the second-to-last chunk contains the complete tool call content.
- For vLLM, all chunks contain streamed tool call content.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or integer (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total
KV cache metrics

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

No other changes to usage and features are needed.

Qwen3 Next 80B A3B Thinking#

This NIM container variant was released with LLM NIM container version 1.14 and uses the 1.0.0-variant tag. For more information, refer to the 1.14 version of this page.

Qwen3-32B#

This NIM container variant was released with LLM NIM container version 1.14 and uses the 1.0.0 tag. For more information, refer to the 1.14 version of this page.

Qwen3-32B NIM for DGX Spark#

This NIM container variant was released with LLM NIM container version 1.14 and uses the 1.0.0-variant tag. For more information, refer to the 1.14 version of this page.

GLM-5#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_RELAX_MEM_CONSTRAINTS
NIM_REPOSITORY_OVERRIDE
NIM_REWARD_LOGITS_RANGE
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_TOKENIZER_MODE: Defaults to fast mode
NIM_ENABLE_PROMPT_EMBEDS
NIM_PER_REQ_METRICS_ENABLE
NIM_TELEMETRY_MODE
NIM_TELEMETRY_ENABLE_ON_RTX
NIM_TELEMETRY_INTERVAL_MINUTES
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
- vLLM uses guided_json, guided_choice, and guided_regex followed by a string.
- SGLang uses response_format(json), similar to the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or integer (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total
KV cache metrics

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

No other changes to usage and features are needed.

GLM-5.1#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_RELAX_MEM_CONSTRAINTS
NIM_REPOSITORY_OVERRIDE
NIM_REWARD_LOGITS_RANGE
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_TOKENIZER_MODE: Defaults to fast mode
NIM_ENABLE_PROMPT_EMBEDS
NIM_PER_REQ_METRICS_ENABLE
NIM_TELEMETRY_MODE
NIM_TELEMETRY_ENABLE_ON_RTX
NIM_TELEMETRY_INTERVAL_MINUTES
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
- vLLM uses guided_json, guided_choice, and guided_regex followed by a string.
- SGLang uses response_format(json), similar to the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or integer (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total
KV cache metrics

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

No other changes to usage and features are needed.

GLM-5.2#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_RELAX_MEM_CONSTRAINTS
NIM_REPOSITORY_OVERRIDE
NIM_REWARD_LOGITS_RANGE
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_TOKENIZER_MODE: Defaults to fast mode
NIM_ENABLE_PROMPT_EMBEDS
NIM_PER_REQ_METRICS_ENABLE
NIM_TELEMETRY_MODE
NIM_TELEMETRY_ENABLE_ON_RTX
NIM_TELEMETRY_INTERVAL_MINUTES
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
- vLLM uses guided_json, guided_choice, and guided_regex followed by a string.
- SGLang uses response_format(json), similar to the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or integer (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total
KV cache metrics

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:/opt/nim/workspace

Note

On Kubernetes, set the startupProbe failureThreshold to 540 to allow enough time for this model to start up.

No other changes to usage and features are needed.

Step-3.5-Flash#

Environment Variables#

Not Supported#

The following environment variables are not currently supported:

NIM_CUSTOM_GUIDED_DECODING_BACKENDS
NIM_CUSTOM_MODEL_NAME
NIM_ENABLE_DP_ATTENTION
NIM_ENABLE_KV_CACHE_HOST_OFFLOAD
NIM_ENABLE_PROMPT_LOGPROBS
NIM_FORCE_TRUST_REMOTE_CODE: Defaults to True
NIM_FT_MODEL
NIM_KV_CACHE_HOST_MEM_FRACTION
NIM_LOW_MEMORY_MODE
NIM_MANIFEST_ALLOW_UNSAFE: No longer required
NIM_MAX_CPU_LORAS
NIM_MAX_GPU_LORAS
NIM_NUM_KV_CACHE_SEQ_LENS
NIM_PEFT_REFRESH_INTERVAL
NIM_PEFT_SOURCE
NIM_RELAX_MEM_CONSTRAINTS
NIM_REPOSITORY_OVERRIDE
NIM_REWARD_LOGITS_RANGE
NIM_REWARD_MODEL
NIM_REWARD_MODEL_STRING
NIM_TOKENIZER_MODE: Defaults to fast mode
SSL_CERT_FILE: Set both NIM_SSL_CERT_PATH and SSL_CERT_FILE to the same location

Note

Most of these variables are not used with an SGLang backend.

API Compatibility#

The following API features have differences according to the backend used:

Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
- vLLM uses guided_json, guided_choice, and guided_regex followed by a string.
- SGLang uses response_format(json), similar to the following:
```
response_format={ "type": "json_schema", "schema": "<json-schema-string>" }
```
include_stop_str_in_output and continuous_usage_stats are not supported by SGLang.
top_logprobs
- For TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no top_logprobs (that is, "finish_reason": "stop").
- For vLLM, the final chunk contains content.
Setting a stop word
- For vLLM and TRT-LLM, stop_reason is used.
- For SGLang, matched_stop is used.
Echo configuration
- SGLang supports boolean or integer (1 or 0) input.
- vLLM supports boolean or null input.

The following API features only have support at the function level:

logprobs
Guided decoding (including guided_whitespace_pattern and structured_generation)
Role configuration

The following API features are not supported:

Reward
Llama API
Structured output (guided_json, guided_choice and guided_regex): Use response_format instead
nvext

nvext features are supported using different parameters in the top-level payload.

Metrics#

The output of v1/metrics has differences according to the backend used (SGLang versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.

Additional metrics related to GPU resources have been added.

The following v1/metrics are not supported:

Request success rate metrics:
- request_success_total
- request_failure_total
- request_finish_total
KV cache metrics

Usage Changes and Features#

The container docker run command does not support the -u $(id -u) parameter.

For air gap deployment, add the following parameters to the docker run command:

-e NIM_DISABLE_MODEL_DOWNLOAD=1 -v :/opt/nim/workspace/ \
-v <local-model-path>:<model-weight-path>

No other changes to usage and features are needed.