Notes on NIM Container Variants#
Some NIMs are built with packages that vary from the standard base Docker container. These NIMs can better access features specific to a particular model or can run on GPUs before they are fully supported in the main source code branch. These NIMs, also known as NIM container variants, are designated by the -variant suffix in their version tag name.
These NIM container variants have important, underlying differences from NIMs built with the standard base container. These differences vary according to model. This page documents these differences with respect to the features and functionality of LLM NIM container version 1.15.0. Refer to the following:
DeepSeek-V3.2-Exp#
Environment Variables#
Not Supported#
The following environment variables are not currently supported:
NIM_CUSTOM_GUIDED_DECODING_BACKENDSNIM_CUSTOM_MODEL_NAMENIM_DISABLE_OVERLAP_SCHEDULINGNIM_ENABLE_KV_CACHE_HOST_OFFLOADNIM_ENABLE_PROMPT_LOGPROBSNIM_FORCE_DETERMINISTICNIM_FT_MODELNIM_GUIDED_DECODING_BACKEND: SGLang backend supports"xgrammar","outlines","llguidance", and"none", and does not support custom guided decoding backends. Refer to Custom Guided Decoding Backends.NIM_JSONL_LOGGINGNIM_KV_CACHE_HOST_MEM_FRACTIONNIM_MANIFEST_ALLOW_UNSAFE: No longer requiredNIM_MAX_CPU_LORASNIM_MAX_GPU_LORASNIM_NUM_KV_CACHE_SEQ_LENSNIM_PEFT_REFRESH_INTERVALNIM_PEFT_SOURCENIM_PER_REQ_METRICS_ENABLENIM_REWARD_LOGITS_RANGENIM_REWARD_MODELNIM_REWARD_MODEL_STRINGNIM_SCHEDULER_POLICY: Supported, but SGLang has different possible values ("lpm","random","fcfs","dfs-weight","lof", and"priority") from the LLM NIM container version 1.14.0.SSL_CERT_FILE: Set bothNIM_SSL_CERT_PATHandSSL_CERT_FILEto the same location
Note
Most of these variables are not used with an SGLang backend.
New Additions#
The following new environment variables are supported:
Note
Some variables may not be applicable to every model (for example, not all models support tool calling or thinking).
TOOL_CALL_PARSER/NIM_TOOL_CALL_PARSERREASONING_PARSER/NIM_REASONING_PARSERNIM_CHAT_TEMPLATENIM_TAGS_SELECTOR
API Compatibility#
The following API features have differences according to the backend used:
Error handling: Many API variables lack error handling methods.
Structured output
vLLM uses
guided_json,guided_choiceandguided_regexfollowed by a string.SGLang uses
response_format(json), such as the following:response_format={ "type": "json_schema", "schema": json string, }
include_stop_str_in_outputandcontinuous_usage_statsare not supported by SGLang.Tool calling with streams
For SGLang, the second-to-last chunk contains the complete tool call content.
For vLLM, all chunks contain streamed tool call content.
top_logprobsFor TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no
top_logprobs(that is,"finish_reason": "stop").For vLLM, the final chunk contains content.
Setting a stop word
For vLLM and TRT-LLM,
stop_reasonis used.For SGLang,
matched_stopis used.
Echo configuration
SGLang supports boolean or int (
1or0) input.vLLM supports boolean or null input.
The following API features only have support at the function level:
logprobsGuided decoding (including
guided_whitespace_patternandstructured_generation)Role configuration
The following API features are not supported:
Reward
Llama API
Structured output (
guided_json,guided_choiceandguided_regex): Useresponse_formatinsteadnvextUsage stats (for example,
prompt_tokensandtotal_tokens) from the/v1/responseendpoint
nvext features are supported using different parameters in the top-level
payload.
Metrics#
The output of v1/metrics has differences according to the backend used (SGLang
versus vLLM). Different naming conventions for metrics are used, for example,
SGLang includes a prefix for each metric.
Additional metrics related to GPU resources have been added.
The following v1/metrics are not supported:
Request success rate metrics:
request_success_totalrequest_failure_totalrequest_finish_total
Usage Changes and Features#
The container docker run command does not support the -u $(id -u)
parameter.
For air gap deployment, add the following parameters to
the docker run command:
-e NIM_DISABLE_MODEL_DOWNLOAD=1 -v :/opt/nim/workspace/ \
-v <local-model-path>:<model-weight-path>
No other changes to usage and features are needed.
DeepSeek-V3.1-Terminus#
Environment Variables#
Not Supported#
The following environment variables are not currently supported:
NIM_CUSTOM_GUIDED_DECODING_BACKENDSNIM_CUSTOM_MODEL_NAMENIM_DISABLE_OVERLAP_SCHEDULINGNIM_ENABLE_KV_CACHE_HOST_OFFLOADNIM_ENABLE_PROMPT_LOGPROBSNIM_FORCE_DETERMINISTICNIM_FT_MODELNIM_GUIDED_DECODING_BACKEND: SGLang backend supports"xgrammar","outlines","llguidance", and"none", and does not support custom guided decoding backends. Refer to Custom Guided Decoding Backends.NIM_JSONL_LOGGINGNIM_KV_CACHE_HOST_MEM_FRACTIONNIM_MANIFEST_ALLOW_UNSAFE: No longer requiredNIM_MAX_CPU_LORASNIM_MAX_GPU_LORASNIM_NUM_KV_CACHE_SEQ_LENSNIM_PEFT_REFRESH_INTERVALNIM_PEFT_SOURCENIM_PER_REQ_METRICS_ENABLENIM_REWARD_LOGITS_RANGENIM_REWARD_MODELNIM_REWARD_MODEL_STRINGNIM_SCHEDULER_POLICY: Supported, but SGLang has different possible values ("lpm","random","fcfs","dfs-weight","lof", and"priority") from the LLM NIM container version 1.14.0.SSL_CERT_FILE: Set bothNIM_SSL_CERT_PATHandSSL_CERT_FILEto the same location
Note
Most of these variables are not used with an SGLang backend.
New Additions#
The following new environment variables are supported:
Note
Some variables may not be applicable to every model (for example, not all models support tool calling or thinking).
TOOL_CALL_PARSER/NIM_TOOL_CALL_PARSERREASONING_PARSER/NIM_REASONING_PARSERNIM_CHAT_TEMPLATENIM_TAGS_SELECTOR
API Compatibility#
The following API features have differences according to the backend used:
Error handling: Many variables lack error handling methods.
Structured output
vLLM uses
guided_json,guided_choiceandguided_regexfollowed by a string.SGLang uses
response_format(json), such as the following:response_format={ "type": "json_schema", "schema": json string, }
include_stop_str_in_outputandcontinuous_usage_statsare not supported by SGLang.Tool calling with streams
For SGLang, the second-to-last chunk contains the complete tool call content.
For vLLM, all chunks contain streamed tool call content.
top_logprobsFor TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no
top_logprobs(that is,"finish_reason": "stop").For vLLM, the final chunk contains content.
Setting a stop word
For vLLM and TRT-LLM,
stop_reasonis used.For SGLang,
matched_stopis used.
Echo configuration
SGLang supports boolean or int (
1or0) input.vLLM supports boolean or null input.
The following API features only have support at the function level:
logprobsGuided decoding (including
guided_whitespace_patternandstructured_generation)Role configuration
The following API features are not supported:
Reward
Llama API
Structured output (
guided_json,guided_choiceandguided_regex): Useresponse_formatinsteadnvext
nvext features are supported using different parameters in the top-level
payload.
Metrics#
The output of v1/metrics has differences according to the backend used (SGLang
versus vLLM). Different naming conventions for metrics are used, for example,
SGLang includes a prefix for each metric.
Additional metrics related to GPU resources have been added.
The following v1/metrics are not supported:
Request success rate metrics:
request_success_totalrequest_failure_totalrequest_finish_total
Usage Changes and Features#
The container docker run command does not support the -u $(id -u)
parameter.
For air gap deployment, add the following parameters to
the docker run command:
-e NIM_DISABLE_MODEL_DOWNLOAD=1 -v :/opt/nim/workspace/ \
-v <local-model-path>:<model-weight-path>
No other changes to usage and features are needed.
GPT-OSS-120B#
Environment Variables#
Not Supported#
NIM_CUSTOM_GUIDED_DECODING_BACKENDSNIM_CUSTOM_MODEL_NAMENIM_DISABLE_CUDA_GRAPH: Defaults to FalseNIM_DISABLE_OVERLAP_SCHEDULINGNIM_ENABLE_DP_ATTENTIONNIM_ENABLE_KV_CACHE_HOST_OFFLOADNIM_ENABLE_PROMPT_EMBEDSNIM_ENABLE_PROMPT_LOGPROBSNIM_FORCE_DETERMINISTICNIM_FORCE_TRUST_REMOTE_CODE: Defaults to TrueNIM_FT_MODELNIM_KV_CACHE_HOST_MEM_FRACTIONNIM_LOW_MEMORY_MODENIM_MANIFEST_ALLOW_UNSAFE: This is no longer requiredNIM_PER_REQ_METRICS_ENABLENIM_REWARD_MODELNIM_REWARD_MODEL_STRINGNIM_REWARD_LOGITS_RANGENIM_SCHEDULER_POLICYNIM_SERVED_MODEL_NAME: Only a single name is supportedNIM_TELEMETRY_INTERVAL_MINUTESNIM_TELEMETRY_MODENIM_TOKENIZER_MODE: Defaults to fast modeSSL_CERT_FILE: UseNIM_SSL_CERT_PATHinstead
Usage Changes and Features#
The container docker run command does not support the -u $(id -u)
parameter.
GPT-OSS-20B#
Environment Variables#
Not Supported#
NIM_CUSTOM_GUIDED_DECODING_BACKENDSNIM_CUSTOM_MODEL_NAMENIM_DISABLE_CUDA_GRAPH: Defaults to FalseNIM_DISABLE_OVERLAP_SCHEDULINGNIM_ENABLE_DP_ATTENTIONNIM_ENABLE_KV_CACHE_HOST_OFFLOADNIM_ENABLE_PROMPT_EMBEDSNIM_ENABLE_PROMPT_LOGPROBSNIM_FORCE_DETERMINISTICNIM_FORCE_TRUST_REMOTE_CODE: Defaults to TrueNIM_FT_MODELNIM_KV_CACHE_HOST_MEM_FRACTIONNIM_LOW_MEMORY_MODENIM_MANIFEST_ALLOW_UNSAFE: This is no longer requiredNIM_PER_REQ_METRICS_ENABLENIM_REWARD_MODELNIM_REWARD_MODEL_STRINGNIM_REWARD_LOGITS_RANGENIM_SCHEDULER_POLICYNIM_SERVED_MODEL_NAME: Only a single name is supportedNIM_TELEMETRY_INTERVAL_MINUTESNIM_TELEMETRY_MODENIM_TOKENIZER_MODE: Defaults to fast modeSSL_CERT_FILE: UseNIM_SSL_CERT_PATHinstead
Usage Changes and Features#
The container docker run command does not support the -u $(id -u)
parameter.
Llama-3.1-8b-Instruct-DGX-Spark#
This NIM container variant was released with LLM NIM container version 1.14 and
uses the 1.0.0-variant tag. For more information, refer to the
1.14 version of this page.
MiniMax-M2.5#
Environment Variables#
Not Supported#
The following environment variables are not currently supported:
NIM_CUSTOM_GUIDED_DECODING_BACKENDSNIM_CUSTOM_MODEL_NAME: UseNIM_SERVED_MODEL_NAMEinsteadNIM_ENABLE_DP_ATTENTIONNIM_ENABLE_KV_CACHE_HOST_OFFLOADNIM_ENABLE_PROMPT_EMBEDSNIM_ENABLE_PROMPT_LOGPROBSNIM_FORCE_TRUST_REMOTE_CODE: Defaults to TrueNIM_FT_MODELNIM_KV_CACHE_HOST_MEM_FRACTIONNIM_LOW_MEMORY_MODENIM_MANIFEST_ALLOW_UNSAFE: No longer requiredNIM_MAX_CPU_LORASNIM_MAX_GPU_LORASNIM_NUM_KV_CACHE_SEQ_LENSNIM_PEFT_REFRESH_INTERVALNIM_PEFT_SOURCENIM_PER_REQ_METRICS_ENABLENIM_RELAX_MEM_CONSTRAINTSNIM_REPOSITORY_OVERRIDENIM_REWARD_LOGITS_RANGE: Not a reward modelNIM_REWARD_MODEL: Not a reward modelNIM_REWARD_MODEL_STRING: Not a reward modelNIM_TELEMETRY_ENABLE_ON_RTXNIM_TELEMETRY_INTERVAL_MINUTESNIM_TOKENIZER_MODE: Defaults to fast modeSSL_CERT_FILE: Set bothNIM_SSL_CERT_PATHandSSL_CERT_FILEto the same location
Note
Most of these variables are not used with an SGLang backend.
API Compatibility#
The following API features have differences according to the backend used:
Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
vLLM uses
guided_json,guided_choice, andguided_regexfollowed by a string.SGLang uses
response_format(json), similar to the following:response_format={ "type": "json_schema", "schema": json string, }
include_stop_str_in_outputandcontinuous_usage_statsare not supported by SGLang.top_logprobsFor TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no
top_logprobs(that is,"finish_reason": "stop").For vLLM, the final chunk contains content.
Setting a stop word
For vLLM and TRT-LLM,
stop_reasonis used.For SGLang,
matched_stopis used.
Echo configuration
SGLang supports boolean or integer (
1or0) input.vLLM supports boolean or null input.
The following API features only have support at the function level:
logprobsGuided decoding (including
guided_whitespace_patternandstructured_generation)Role configuration
The following API features are not supported:
Reward
Llama API
Structured output (
guided_json,guided_choiceandguided_regex): Useresponse_formatinsteadnvext
nvext features are supported using different parameters in the top-level
payload.
Metrics#
The output of v1/metrics has differences according to the backend used (SGLang
versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.
Additional metrics related to GPU resources have been added.
The following v1/metrics are not supported:
Request success rate metrics:
request_success_totalrequest_failure_totalrequest_finish_total
KV cache metrics
Usage Changes and Features#
The container docker run command does not support the -u $(id -u)
parameter.
For air gap deployment, add the following parameters to
the docker run command:
-e NIM_DISABLE_MODEL_DOWNLOAD=1 -v :/opt/nim/workspace/ \
-v <local-model-path>:<model-weight-path>
No other changes to usage and features are needed.
Nemotron 3 Nano#
Environment Variables#
Not Supported#
The following environment variables are not currently supported:
NIM_CUSTOM_GUIDED_DECODING_BACKENDSNIM_CUSTOM_MODEL_NAMENIM_DISABLE_CUDA_GRAPH: Defaults to FalseNIM_DISABLE_OVERLAP_SCHEDULINGNIM_ENABLE_DP_ATTENTIONNIM_ENABLE_KV_CACHE_HOST_OFFLOADNIM_ENABLE_PROMPT_LOGPROBSNIM_FORCE_DETERMINISTICNIM_FORCE_TRUST_REMOTE_CODE: Defaults to TrueNIM_FT_MODELNIM_KV_CACHE_HOST_MEM_FRACTIONNIM_LOW_MEMORY_MODENIM_MANIFEST_ALLOW_UNSAFE: No longer requiredNIM_REWARD_MODELNIM_REWARD_MODEL_STRINGNIM_REWARD_LOGITS_RANGENIM_SCHEDULER_POLICYNIM_SERVED_MODEL_NAME: Only a single name is supportedNIM_TOKENIZER_MODE: Defaults to fast modeSSL_CERT_FILE: UseNIM_SSL_CERT_PATHinstead
Note
Most of these variables are not used with an SGLang backend.
New Additions#
The following new environment variables are supported:
Note
Some variables may not be applicable to every model (for example, not all models support tool calling or thinking).
NIM_TAGS_SELECTOR: Filters tags in the automatic profile selector. You can use a list of key-value pairs, where the key is the profile property name and the value is the desired property value. For example, setNIM_TAGS_SELECTOR="profile=latency"to automatically select the latency profile. Or setNIM_TAGS_SELECTOR="tp=4"to select a throughput profile that supports 4 GPUs.DISABLE_RADIX_CACHE: Set to1to disable KV cache reuse.NIM_ENABLE_MTP: Set to1to enable the LLM to generate several tokens at once, boosting speed, efficiency, and reasoning.REASONING_PARSER: Set to1to turn thinking on.TOOL_CALL_PARSER: Set to1to turn tool calling on.NIM_CONFIG_FILE: Specifies a configuration YAML file for advanced parameter tuning. Use this file to overwrite the default NIM configuration values. You must convert the hyphens in server argument names to underscores. For example, the following SGLang command arguments:python -m sglang.launch_server --model-path XXX --tp-size 4 \ --context-length 262144 --mem-fraction-static 0.8
are defined by the following content in the configuration YAML file:
tp_size: 4 context_length: 262144 mem_fraction_static: 0.8
Default value:
None.
API Compatibility#
The following API features are not supported:
logprobssuffixGuided decoding (including
guided_whitespace_patternandstructured_generation)Echo and role configuration
Reward
Llama API
nvext
nvext features are supported using different parameters in the top-level
payload.
Security Features#
No changes to security features. These models maintain the same security features and capabilities as standard models. No additional security limitations or modifications apply.
Usage Changes and Features#
The container docker run command does not support the -u $(id -u)
parameter.
For air gap deployment, add the following parameters to
the docker run command:
-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:<model-weight-path> \
-e NIM_MODEL_PATH=<model-weight-path> \
No other changes to usage and features are needed.
Nemotron-3-Super-120B-A12B#
Deployment Considerations#
This NIM might need additional configuration for deployment. In addition to the information in Get Started, use the information in this section to deploy the NIM on a given GPU. Refer to Nemotron-3-Super-120B-A12B on the Supported Models page for information about the supported profiles.
Memory Configuration#
Due to the large model size, you might encounter out of memory (OOM) errors when deploying the following profiles:
GPU |
Precision |
TP |
|---|---|---|
B200 |
NVFP4 |
TP1, TP2 |
B200 |
FP8 |
TP1, TP2 |
B200 |
BF16 |
TP2 |
H200 |
FP8 |
TP1, TP2 |
H200 |
BF16 |
TP2 |
H200 NVL |
FP8 |
TP1, TP2 |
H200 NVL |
BF16 |
TP2 |
GB200 |
FP8 |
TP1, TP2 |
GB200 |
BF16 |
TP2 |
GH200 |
All profiles |
|
H100 |
FP8 |
TP2, TP4 |
H100 |
BF16 |
TP4, TP8 |
H100 NVL |
FP8 |
TP2 |
A100 |
All profiles |
|
A100 40GB |
All profiles |
|
L40S |
All profiles |
|
NVIDIA RTX PRO 6000 Blackwell Server Edition |
FP8 |
TP2, TP4, TP8 |
NVIDIA RTX PRO 6000 Blackwell Server Edition |
NVFP4 |
TP1, TP2, TP4, TP8 |
Set the following environment variables to adjust memory usage:
NIM_MAX_MODEL_LENNIM_KVCACHE_PERCENTNIM_MAX_BATCH_SIZE
Chunked prefill is disabled by default, which can cause errors when running the
quantization profile with a low TP (1 or 2). The typical symptom is the log
message expr_fits_within_32bit. To resolve this, either reduce the model
length by setting NIM_MAX_MODEL_LEN=131072, or re-enable chunked prefill with
NIM_ENABLE_CHUNKED_PREFILL=1.
GPU and Profile-Specific Required Settings#
Set the following environment variables per specific GPU and profile:
L40S GPU and FP8 profiles:
VLLM_USE_FLASHINFER_MOE_FP8=0L40S GPU and the FP8 TP4 profile:
NIM_MAX_MODEL_LEN=65536B200 and GB200 GPUs (for maximum accuracy):
FP8 and NVFP4 profiles:
MAMBA_CACHE_RS_ROUNDING=1MAMBA_CACHE_PHILOX_ROUNDS=5NIM_ATTENTION_BACKEND=TRITON_ATTN
BF16 profiles:
NIM_ATTENTION_BACKEND=FLASH_ATTN
L40S, H100, H200, and RTX 6000 GPUs and FP8 profiles:
NIM_MAMBA_SSM_CACHE_DTYPE=float32
LoRA Deployment#
To deploy any LoRA profile, set the following environment variables:
NIM_MAX_LORA_RANK: You should set this variable to 32, 16, or lower.NIM_MAX_GPU_LORASNIM_MAX_CPU_LORAS
Note the following:
LoRA NVFP4 profiles are not supported.
To deploy LoRA profiles on Blackwell GPUs, set the environment variable
VLLM_LORA_DISABLE_PDL=1.
Environment Variables#
Not Supported#
The following environment variables are not currently supported:
NIM_CUSTOM_GUIDED_DECODING_BACKENDSNIM_CUSTOM_MODEL_NAMENIM_DISABLE_CUDA_GRAPH: Defaults to FalseNIM_DISABLE_OVERLAP_SCHEDULINGNIM_ENABLE_DP_ATTENTIONNIM_ENABLE_KV_CACHE_HOST_OFFLOADNIM_ENABLE_PROMPT_EMBEDSNIM_ENABLE_PROMPT_LOGPROBSNIM_FORCE_DETERMINISTICNIM_FORCE_TRUST_REMOTE_CODE: Defaults to TrueNIM_FT_MODELNIM_KV_CACHE_HOST_MEM_FRACTIONNIM_LOW_MEMORY_MODENIM_MANIFEST_ALLOW_UNSAFE: No longer requiredNIM_REWARD_MODELNIM_REWARD_MODEL_STRINGNIM_REWARD_LOGITS_RANGENIM_SCHEDULER_POLICYNIM_SERVED_MODEL_NAME: Only a single name is supportedNIM_TELEMETRY_MODENIM_TELEMETRY_INTERVAL_MINUTESNIM_TOKENIZER_MODE: Defaults to fast modeSSL_CERT_FILE: UseNIM_SSL_CERT_PATHinstead
Note
Most of these variables are not used with an SGLang backend.
New Additions#
The following new environment variables are supported:
Note
Some variables might not be applicable to every model (for example, not all models support tool calling or thinking).
NIM_TAGS_SELECTOR: Filters tags in the automatic profile selector. You can use a list of key-value pairs, where the key is the profile property name and the value is the desired property value. For example, setNIM_TAGS_SELECTOR="profile=latency"to automatically select the latency profile. Or setNIM_TAGS_SELECTOR="tp=4"to select a throughput profile that supports four GPUs.DISABLE_RADIX_CACHE: Set to1to disable KV cache reuse.NIM_ENABLE_MTP: Set to1to enable the LLM to generate several tokens at once, boosting speed, efficiency, and reasoning.REASONING_PARSER: Set to1to turn thinking on.TOOL_CALL_PARSER: Set to1to turn tool calling on.NIM_CONFIG_FILE: Specifies a configuration YAML file for advanced parameter tuning. Use this file to overwrite the default NIM configuration values. You must convert the hyphens in server argument names to underscores. For example, the following SGLang command arguments:python -m sglang.launch_server --model-path XXX --tp-size 4 \ --context-length 262144 --mem-fraction-static 0.8
are defined by the following content in the configuration YAML file:
tp_size: 4 context_length: 262144 mem_fraction_static: 0.8
Default value:
None.
API Compatibility#
The following API features are not supported:
logprobssuffixEcho and role configuration
Reward
Llama API
nvext
nvext features are supported using different parameters in the top-level
payload.
Metrics#
The /v1/metrics endpoint returns metric names that use the vllm: prefix for
the vLLM backend.
The following metrics (from Observability) must be queried using their prefixed names:
Documented Metric Name |
Prefixed Metric Name |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Note that gpu_cache_usage_perc has also been renamed to kv_cache_usage_perc
in addition to the prefix change. Update any Prometheus queries, Grafana
dashboards, or alerting rules accordingly.
Security Features#
No changes to security features. These models maintain the same security features and capabilities as standard models. No additional security limitations or modifications apply.
Usage Changes and Features#
The container docker run command does not support the -u $(id -u)
parameter.
For air gap deployment, add the following parameters to
the docker run command:
-e NIM_DISABLE_MODEL_DOWNLOAD=1 \
-v <local-model-path>:<model-weight-path> \
-e NIM_MODEL_PATH=<model-weight-path> \
Use the reasoning_budget field in the request to control
the thinking budget. Use the
low_effort field in the request to limit the thinking effort without
setting an explicit
thinking budget.
For tool calling, the model supports setting
tool_choice: "required", which forces the model to call a tool. The model
also supports named tool calls, which let you specify a tool by name in
tool_choice.
No other changes to usage and features are needed.
NVIDIA-Nemotron-Nano-9B-v2-DGX-Spark#
This NIM container variant was released with LLM NIM container version 1.14 and
uses the 1.0.0-variant tag. For more information, refer to the
1.14 version of this page.
Qwen3-Coder-Next#
Environment Variables#
Not Supported#
The following environment variables are not currently supported:
NIM_CUSTOM_GUIDED_DECODING_BACKENDSNIM_CUSTOM_MODEL_NAMENIM_DISABLE_CUDA_GRAPH: Defaults to FalseNIM_DISABLE_OVERLAP_SCHEDULINGNIM_ENABLE_DP_ATTENTIONNIM_ENABLE_KV_CACHE_HOST_OFFLOADNIM_ENABLE_PROMPT_LOGPROBSNIM_FORCE_TRUST_REMOTE_CODE: Defaults to TrueNIM_FT_MODELNIM_JSONL_LOGGINGNIM_KV_CACHE_HOST_MEM_FRACTIONNIM_LOW_MEMORY_MODENIM_MANIFEST_ALLOW_UNSAFE: No longer requiredNIM_MAX_CPU_LORASNIM_MAX_GPU_LORASNIM_NUM_KV_CACHE_SEQ_LENSNIM_PEFT_REFRESH_INTERVALNIM_PEFT_SOURCENIM_RELAX_MEM_CONSTRAINTSNIM_REPOSITORY_OVERRIDENIM_REWARD_LOGITS_RANGENIM_REWARD_MODELNIM_REWARD_MODEL_STRINGNIM_TOKENIZER_MODE: Defaults to fast modeSSL_CERT_FILE: Set bothNIM_SSL_CERT_PATHandSSL_CERT_FILEto the same location
Note
Most of these variables are not used with an SGLang backend.
API Compatibility#
The following API features have differences according to the backend used:
Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
vLLM uses
guided_json,guided_choice, andguided_regexfollowed by a string.SGLang uses
response_format(json), similar to the following:response_format={ "type": "json_schema", "schema": json string, }
include_stop_str_in_outputandcontinuous_usage_statsare not supported by SGLang.When using tool calling with streams, all chunks contain streamed tool call content.
top_logprobsFor TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no
top_logprobs(that is,"finish_reason": "stop").For vLLM, the final chunk contains content.
Setting a stop word
For vLLM and TRT-LLM,
stop_reasonis used.For SGLang,
matched_stopis used.
Echo configuration
SGLang supports boolean or integer (
1or0) input.vLLM supports boolean or null input.
The following API features only have support at the function level:
logprobsGuided decoding (including
guided_whitespace_patternandstructured_generation)Role configuration
The following API features are not supported:
Reward
Llama API
Structured output (
guided_json,guided_choiceandguided_regex): Useresponse_formatinsteadnvext
nvext features are supported using different parameters in the top-level
payload.
Metrics#
The output of v1/metrics has differences according to the backend used (SGLang
versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.
Additional metrics related to GPU resources have been added.
The following v1/metrics are not supported:
Request success rate metrics:
request_success_totalrequest_failure_totalrequest_finish_total
KV cache metrics
Usage Changes and Features#
The container docker run command does not support the -u $(id -u)
parameter.
For air gap deployment, add the following parameters to
the docker run command:
-e NIM_DISABLE_MODEL_DOWNLOAD=1 -v :/opt/nim/workspace/ \
-v <local-model-path>:<model-weight-path>
No other changes to usage and features are needed.
Qwen3-Next-80B-A3B-Instruct#
Environment Variables#
Not Supported#
The following environment variables are not currently supported:
NIM_CUSTOM_GUIDED_DECODING_BACKENDSNIM_CUSTOM_MODEL_NAMENIM_DISABLE_CUDA_GRAPH: Defaults to FalseNIM_DISABLE_OVERLAP_SCHEDULINGNIM_ENABLE_DP_ATTENTIONNIM_ENABLE_KV_CACHE_HOST_OFFLOADNIM_ENABLE_PROMPT_LOGPROBSNIM_FORCE_TRUST_REMOTE_CODE: Defaults to TrueNIM_FT_MODELNIM_JSONL_LOGGINGNIM_KV_CACHE_HOST_MEM_FRACTIONNIM_LOW_MEMORY_MODENIM_MANIFEST_ALLOW_UNSAFE: No longer requiredNIM_MAX_CPU_LORASNIM_MAX_GPU_LORASNIM_NUM_KV_CACHE_SEQ_LENSNIM_PEFT_REFRESH_INTERVALNIM_PEFT_SOURCENIM_RELAX_MEM_CONSTRAINTSNIM_REPOSITORY_OVERRIDENIM_REWARD_LOGITS_RANGENIM_REWARD_MODELNIM_REWARD_MODEL_STRINGNIM_TOKENIZER_MODE: Defaults to fast modeSSL_CERT_FILE: Set bothNIM_SSL_CERT_PATHandSSL_CERT_FILEto the same location
Note
Most of these variables are not used with an SGLang backend.
API Compatibility#
The following API features have differences according to the backend used:
Error handling: Many variables lack error handling methods.
Structured output
vLLM uses
guided_json,guided_choice, andguided_regexfollowed by a string.SGLang uses
response_format(json), similar to the following:response_format={ "type": "json_schema", "schema": json string, }
include_stop_str_in_outputandcontinuous_usage_statsare not supported by SGLang.Tool calling with streams
For SGLang, the second-to-last chunk contains the complete tool call content.
For vLLM, all chunks contain streamed tool call content.
top_logprobsFor TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no
top_logprobs(that is,"finish_reason": "stop").For vLLM, the final chunk contains content.
Setting a stop word
For vLLM and TRT-LLM,
stop_reasonis used.For SGLang,
matched_stopis used.
Echo configuration
SGLang supports boolean or integer (
1or0) input.vLLM supports boolean or null input.
The following API features only have support at the function level:
logprobsGuided decoding (including
guided_whitespace_patternandstructured_generation)Role configuration
The following API features are not supported:
Reward
Llama API
Structured output (
guided_json,guided_choiceandguided_regex): Useresponse_formatinsteadnvext
nvext features are supported using different parameters in the top-level
payload.
Metrics#
The output of v1/metrics has differences according to the backend used (SGLang
versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.
Additional metrics related to GPU resources have been added.
The following v1/metrics are not supported:
Request success rate metrics:
request_success_totalrequest_failure_totalrequest_finish_total
KV cache metrics
Usage Changes and Features#
The container docker run command does not support the -u $(id -u)
parameter.
For air gap deployment, add the following parameters to
the docker run command:
-e NIM_DISABLE_MODEL_DOWNLOAD=1 -v :/opt/nim/workspace/ \
-v <local-model-path>:<model-weight-path>
No other changes to usage and features are needed.
Qwen3 Next 80B A3B Thinking#
This NIM container variant was released with LLM NIM container version 1.14 and
uses the 1.0.0-variant tag. For more information, refer to the
1.14 version of this page.
Qwen3-32B#
This NIM container variant was released with LLM NIM container version 1.14 and
uses the 1.0.0 tag. For more information, refer to the
1.14 version of this page.
Qwen3-32B NIM for DGX Spark#
This NIM container variant was released with LLM NIM container version 1.14 and
uses the 1.0.0-variant tag. For more information, refer to the
1.14 version of this page.
GLM-5#
Environment Variables#
Not Supported#
The following environment variables are not currently supported:
NIM_CUSTOM_GUIDED_DECODING_BACKENDSNIM_CUSTOM_MODEL_NAMENIM_ENABLE_DP_ATTENTIONNIM_ENABLE_KV_CACHE_HOST_OFFLOADNIM_ENABLE_PROMPT_LOGPROBSNIM_FORCE_TRUST_REMOTE_CODE: Defaults to TrueNIM_FT_MODELNIM_KV_CACHE_HOST_MEM_FRACTIONNIM_LOW_MEMORY_MODENIM_MANIFEST_ALLOW_UNSAFE: No longer requiredNIM_MAX_CPU_LORASNIM_MAX_GPU_LORASNIM_NUM_KV_CACHE_SEQ_LENSNIM_PEFT_REFRESH_INTERVALNIM_PEFT_SOURCENIM_RELAX_MEM_CONSTRAINTSNIM_REPOSITORY_OVERRIDENIM_REWARD_LOGITS_RANGENIM_REWARD_MODELNIM_REWARD_MODEL_STRINGNIM_TOKENIZER_MODE: Defaults to fast modeNIM_ENABLE_PROMPT_EMBEDSNIM_PER_REQ_METRICS_ENABLENIM_TELEMETRY_MODENIM_TELEMETRY_ENABLE_ON_RTXNIM_TELEMETRY_INTERVAL_MINUTESSSL_CERT_FILE: Set bothNIM_SSL_CERT_PATHandSSL_CERT_FILEto the same location
Note
Most of these variables are not used with an SGLang backend.
API Compatibility#
The following API features have differences according to the backend used:
Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
vLLM uses
guided_json,guided_choice, andguided_regexfollowed by a string.SGLang uses
response_format(json), similar to the following:response_format={ "type": "json_schema", "schema": json string, }
include_stop_str_in_outputandcontinuous_usage_statsare not supported by SGLang.top_logprobsFor TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no
top_logprobs(that is,"finish_reason": "stop").For vLLM, the final chunk contains content.
Setting a stop word
For vLLM and TRT-LLM,
stop_reasonis used.For SGLang,
matched_stopis used.
Echo configuration
SGLang supports boolean or integer (
1or0) input.vLLM supports boolean or null input.
The following API features only have support at the function level:
logprobsGuided decoding (including
guided_whitespace_patternandstructured_generation)Role configuration
The following API features are not supported:
Reward
Llama API
Structured output (
guided_json,guided_choiceandguided_regex): Useresponse_formatinsteadnvext
nvext features are supported using different parameters in the top-level
payload.
Metrics#
The output of v1/metrics has differences according to the backend used (SGLang
versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.
Additional metrics related to GPU resources have been added.
The following v1/metrics are not supported:
Request success rate metrics:
request_success_totalrequest_failure_totalrequest_finish_total
KV cache metrics
Usage Changes and Features#
The container docker run command does not support the -u $(id -u)
parameter.
For air gap deployment, add the following parameters to
the docker run command:
-e NIM_DISABLE_MODEL_DOWNLOAD=1 -v :/opt/nim/workspace/ \
-v <local-model-path>:<model-weight-path>
No other changes to usage and features are needed.
Step-3.5-Flash#
Environment Variables#
Not Supported#
The following environment variables are not currently supported:
NIM_CUSTOM_GUIDED_DECODING_BACKENDSNIM_CUSTOM_MODEL_NAMENIM_ENABLE_DP_ATTENTIONNIM_ENABLE_KV_CACHE_HOST_OFFLOADNIM_ENABLE_PROMPT_LOGPROBSNIM_FORCE_TRUST_REMOTE_CODE: Defaults to TrueNIM_FT_MODELNIM_KV_CACHE_HOST_MEM_FRACTIONNIM_LOW_MEMORY_MODENIM_MANIFEST_ALLOW_UNSAFE: No longer requiredNIM_MAX_CPU_LORASNIM_MAX_GPU_LORASNIM_NUM_KV_CACHE_SEQ_LENSNIM_PEFT_REFRESH_INTERVALNIM_PEFT_SOURCENIM_RELAX_MEM_CONSTRAINTSNIM_REPOSITORY_OVERRIDENIM_REWARD_LOGITS_RANGENIM_REWARD_MODELNIM_REWARD_MODEL_STRINGNIM_TOKENIZER_MODE: Defaults to fast modeSSL_CERT_FILE: Set bothNIM_SSL_CERT_PATHandSSL_CERT_FILEto the same location
Note
Most of these variables are not used with an SGLang backend.
API Compatibility#
The following API features have differences according to the backend used:
Error handling: Many variables lack error handling methods, which can cause invalid cases to fail.
Structured output
vLLM uses
guided_json,guided_choice, andguided_regexfollowed by a string.SGLang uses
response_format(json), similar to the following:response_format={ "type": "json_schema", "schema": json string, }
include_stop_str_in_outputandcontinuous_usage_statsare not supported by SGLang.top_logprobsFor TRT-LLM and SGLang, the content of the final chunk is empty, signaling the end, with no
top_logprobs(that is,"finish_reason": "stop").For vLLM, the final chunk contains content.
Setting a stop word
For vLLM and TRT-LLM,
stop_reasonis used.For SGLang,
matched_stopis used.
Echo configuration
SGLang supports boolean or integer (
1or0) input.vLLM supports boolean or null input.
The following API features only have support at the function level:
logprobsGuided decoding (including
guided_whitespace_patternandstructured_generation)Role configuration
The following API features are not supported:
Reward
Llama API
Structured output (
guided_json,guided_choiceandguided_regex): Useresponse_formatinsteadnvext
nvext features are supported using different parameters in the top-level
payload.
Metrics#
The output of v1/metrics has differences according to the backend used (SGLang
versus vLLM). Different naming conventions for metrics are used, for example, SGLang includes a prefix for each metric.
Additional metrics related to GPU resources have been added.
The following v1/metrics are not supported:
Request success rate metrics:
request_success_totalrequest_failure_totalrequest_finish_total
KV cache metrics
Usage Changes and Features#
The container docker run command does not support the -u $(id -u)
parameter.
For air gap deployment, add the following parameters to
the docker run command:
-e NIM_DISABLE_MODEL_DOWNLOAD=1 -v :/opt/nim/workspace/ \
-v <local-model-path>:<model-weight-path>
No other changes to usage and features are needed.