Supported Architectures for Multi-LLM NIM#

Use this documentation to learn the details of supported architectures for the multi-LLM compatible NIM container.

Note

If you are looking for supported models for the LLM-specific NIM container, refer to Supported Models for NVIDIA NIM for LLMs instead.

Text-only Language Models#

The following table lists which text-only model architectures and inference engines have been verified to work with the multi-LLM compatible NIM container. Each cell indicates whether that specific feature is supported (Yes) or not supported (No) for the given model architecture and inference engine.

Model Architecture	vLLM	-	-	-	TRTLLM	-	-	-	SGLang	-	-	-
With Verified HF Model(s)	Base Model	LoRA	Function Calling	Guided Decoding	Base Model	LoRA	Function Calling	Guided Decoding	Base Model	LoRA	Function Calling	Guided Decoding
`BartForConditionalGeneration` Verified Model: `facebook/bart-large-cnn`	Yes	No	Yes	Yes	No	No	No	No	No	No	No	No
`BloomForCausalLM` Verified Model: `bigscience/bloom-560m`	Yes	No	Yes	Yes	No	No	No	No	No	No	No	No
`ChatGLMModel` Verified Model: `THUDM/chatglm3-6b`	Yes	Yes	Yes	Yes	No	No	No	No	No	No	No	No
`DeciLMForCausalLM` Verified Model: `nvidia/Llama-3_3-Nemotron-Super-49B-v1`	No	No	No	No	Yes	No	Yes	Yes	No	No	No	No
`DeepseekV2ForCausalLM` Verified Model: `deepseek-ai/DeepSeek-V2-Lite-Chat`	Yes	No	Yes	Yes	No	No	No	No	Yes	No	Yes	Yes
`DeepseekV3ForCausalLM` Verified Model: `RedHatAI-DeepSeek-CoderV2`	Yes	No	Yes	Yes	No	No	No	No	Yes	No	Yes	Yes
`FalconForCausalLM` Verified Model: `tiiuae/falcon-7b`	Yes	No	Yes	Yes	Yes	No	Yes	Yes	No	No	No	No
`FalconMambaForCausalLM` Verified Model: `tiiuae/falcon-mamba-7b-instruct`	Yes	No	Yes	Yes	No	No	No	No	No	No	No	No
`GemmaForCausalLM` Verified Model: `google/gemma-1.1-2b-it`	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes	Yes	No	Yes	Yes
`Gemma2ForCausalLM` Verified Model: `google/gemma-2-9b`	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes
`GlmForCausalLM` Verified Model: `THUDM/glm-4-9b-chat-hf`	Yes	Yes	Yes	Yes	No	No	No	No	No	No	No	No
`GPTBigCodeForCausalLM` Verified Model: `bigcode/starcoder`	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes	Yes	No	Yes	Yes
`GPT2LMHeadModel` Verified Models: `openai-community/gpt2` `distilbert/distilgpt2`	Yes	No	Yes	Yes	Yes	No	Yes	Yes	No	No	No	No
`GPTNeoXForCausalLM` Verified Model: `EleutherAI/pythia-70`	Yes	No	Yes	Yes	No	No	No	No	Yes	No	Yes	Yes
`GraniteForCausalLM` Verified Models: `ibm-granite/granite-3.3-2b-instruct` `ibm-granite/granite-3.3-8b-instruct`	Yes	Yes	Yes	Yes	No	No	No	No	Yes	No	Yes	Yes
`GraniteMoeForCausalLM` Verified Model: `ibm/PowerMoE-3b`	Yes	Yes	Yes	Yes	No	No	No	No	No	No	No	No
`GritLM` Verified Model: `GritLM-7B`	Yes	Yes	Yes	Yes	No	No	No	No	No	No	No	No
`InternLM2ForCausalLM` Verified Model: `internlm/internlm2-chat-7b`	Yes	No	Yes	Yes	No	No	No	No	Yes	No	Yes	Yes
`InternLM3ForCausalLM` Verified Model: `internlm/internlm3-8b-instruct`	Yes	Yes	Yes	Yes	No	No	No	No	Yes	No	Yes	Yes
`JambaForCausalLM` Verified Models: `ai21labs/AI21-Jamba-Mini-1.5` `ai21labs/AI21-Jamba-Mini-1.6`	Yes	Yes	Yes	Yes	No	No	No	No	No	No	No	No
`LlamaForCausalLLM` Verified Models: `meta-llama/Llama-3.1-8B-Instruct` `meta-llama/Meta-Llama-3-8B` See More `openGPT-X/Teuken-7B-instruct-commercial-v0.4` `RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8` `utter-project/EuroLLM-9B-Instruct` `meta-llama/Llama-2-7b-chat-hf` `modularai/Llama-3.1-8B-Instruct-GGUF` `petals-team/StableBeluga2` `meta-llama/Llama-3.2-1B-Instruct` `TinyLlama/TinyLlama-1.1B-Chat-v1.0` `unsloth/phi-4 (base)` `unsloth/phi-4 (Multi-LoRA)`	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes
`MambaForCausalLM` Verified Model: `state-spaces/mamba-370m-hf`	Yes	No	Yes	Yes	Yes	No	No	No	No	No	No	No
`MistralForCausalLM` Verified Models: `mistralai/Codestral-22B-v0.1` `mistralai/Mistral-7B-Instruct-v0.3` See More `mistralai/Mistral-Small-24B-Instruct-2501` `speakleash/Bielik-11B-v2.3-Instruct` `Fastweb/FastwebMIIA-7B` `sarvamai/sarvam-m` `PocketDoc/Dans-PersonalityEngine-V1.3.0-24b` `Delta-Vector/Sol-Reaver-15B-Instruct` `mistralai/Mistral-7B-Instruct-v0.2`	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes
`MolmoForCausalLM` Verified Model: `allenai-molmo`	Yes	Yes	Yes	Yes	No	No	No	No	No	No	No	No
`Olmo2ForCausalLM` Verified Model: `allenai/OLMo-2-0425-1B`	Yes	No	Yes	Yes	No	No	No	No	Yes	No	Yes	Yes
`OlmoeForCausalLM` Verified Model: `allenai-OLMoe-1B`	Yes	No	Yes	Yes	No	No	No	No	Yes	No	Yes	Yes
`PhiMoEForCausalLM` Verified Model: `microsoft/Phi-3.5-MoE-instruct`	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No	No	No	No
`Phi3ForCausalLM` Verified Model: `microsoft/Phi-3-mini-4k-instruct` `microsoft/Phi-4`	Yes	Yes	Yes	Yes	No	No	No	No	No	No	No	No
`Phi3SmallForCausalLM` Verified Model: `microsoft/Phi-3-small-8k-instruct`	Yes	No	Yes	Yes	No	No	No	No	No	No	No	No
`QWenLMHeadModel` Verified Model: `Qwn-1_8B-Chat`	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes	No	No	No	No
`Qwen2ForCausalLM` Verified Models: `Qwen/Qwen2.5-0.5B-Instruct` `Qwen/Qwen2.5-1.5B-Instruct` See More `Qwen/Qwen2.5-7B-Instruct` `Qwen/Qwen2.5-14B-Instruct-AWQ` `Qwen/Qwen2.5-7B-Instruct-1M`	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes
`Qwen2MoeForCausalLM` Verified Model: `Qwen/Qwen1.5-MoE-A2.7B-Chat`	Yes	No	Yes	Yes	Yes	No	Yes	Yes	Yes	No	Yes	Yes
`RWForCausalLM` Verified Model: `tiiuae/falcon-40b`	Yes	No	Yes	Yes	Yes	No	Yes	Yes	No	No	No	No
`SolarForCausalLM` Verified Model: `upstage/solar-pro-preview-instruct`	Yes	Yes	Yes	Yes	No	No	No	No	No	No	No	No
`StableLMEpochForCausalLM` Verified Model: `TroyDoesAI-Mermaid`	Yes	No	Yes	Yes	No	No	No	No	No	No	No	No
`StableLmForCausalLM` Verified Model: `AI4free-jarvis-3b`	Yes	No	Yes	Yes	No	No	No	No	Yes	No	Yes	Yes
`StarCoder2ForCausalLM` Verified Model: `bigcode/starcoder2-15b`	Yes	No	Yes	Yes	Yes	Yes	Yes	Yes	No	No	No	No

Warning

NVIDIA cannot guarantee the security of any models hosted on non-NVIDIA systems such as HuggingFace. Malicious or insecure models can result in serious security risks up to and including full remote code execution. We strongly recommend that before attempting to load it, you manually verify the safety of any model not provided by NVIDIA, through such mechanisms as a) ensuring that the model weights are serialized using the Safetensors format, b) conducting a manual review of any model or inference code to ensure that it is free of obfuscated or malicious code, and c) validating the signature of the model, if available, to ensure that it comes from a trusted source and has not been modified.

Model Formats#

NVIDIA NIM for LLMs supports the following model formats:

HF safetensor
GGUF safetensor
Unified HF safetensor
TRTLLM checkpoint
TRTLLM engine

Hugging Face safetensors#

The HuggingFace safetensors checkpoint should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration 
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model-00001-of-00004.safetensors      # [Required] Model weights stored as safetensors 
├── model-00002-of-00004.safetensors 
├── model-00003-of-00004.safetensors 
├── model-00004-of-00004.safetensors 
├── ...
├── model.safetensors.index.json          # [Optional] Weights mapping 
├── special_tokens_map.json               # [Optional] Special tokens mapping 
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules etc 
└── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer

Unified Hugging Face safetensors#

The Unified HuggingFace checkpoint should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration 
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model-00001-of-00004.safetensors      # [Required] Model weights stored as safetensors 
├── model-00002-of-00004.safetensors 
├── model-00003-of-00004.safetensors 
├── model-00004-of-00004.safetensors 
├── ...
├── model.safetensors.index.json          # [Optional] Weights mapping 
├── special_tokens_map.json               # [Optional] Special tokens mapping 
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules etc 
└── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer
└── hf_quant_config.json                  # [Required] HuggingFace quantization configuration

GGUF#

The GGUF checkpoint should have the following directory structure:

├── config.json                             # [Required] HuggingFace model configuration 
├── generation_config.json                  # [Optional] Parameters to guide text generation
├── Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf  # [Required] GGUF model weights stored as safetensors 
├── ...
├── model.safetensors.index.json            # [Optional] Weights mapping 
├── special_tokens_map.json                 # [Optional] Special tokens mapping 
├── tokenizer.json                          # [Optional] Tokenization method, vocabulary, pre-tokenization rules etc 
└── tokenizer_config.json                   # [Optional] Configuration details for a specific model's tokenizer

TRTLLM checkpoints#

The TRTLLM checkpoints should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration 
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model.safetensors.index.json          # [Optional] Weights mapping 
├── special_tokens_map.json               # [Optional] Special tokens mapping 
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules etc 
└── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer
└── trtllm_ckpt
    ├── config.json                       # [Required] TRTLLM pretrained configuration
    └── rank0.safetensors                 # [Required] TRTLLM checkpoint safetensors
    ├── ...

Note

The TRTLLM checkpoint root directory should include Hugging Face tokenizer and configuration files. The trtllm_ckpt subfolder should include the TRTLLM checkpoint configuration file and weight tensors.

TRTLLM engines#

The TRTLLM engine should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration 
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model.safetensors.index.json          # [Optional] Weights mapping 
├── special_tokens_map.json               # [Optional] Special tokens mapping 
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules etc 
└── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer
└── trtllm_ckpt
    ├── config.json                       # [Required] TRTLLM engine configuration
    └── rank0.engine                      # [Required] TRTLLM serialized engine
    ├── ...

Note

NIM validates the required files in each folder structure. If one or more required files are missing, NIM raises an exception and prompts you to provide additional files. Tokenizer files are used for inference, and each backend defaults to nonoptimal tokenizers if these files are not present. Refer to Troubleshooting for more details.

Users should ensure that the tensor parallel and pipeline parallel sizes configured at start-up time using NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE matches the TRTLLM checkpoint and engine ranks. Otherwise, NIM throws an exception to prompt users to fix the inference configurations.

Model Precisions#

All full precision models can be deployed in NVIDIA NIM for LLMs given that the architecture and model formats are supported. For quantized models, refer to following section.

Quantization Formats#

NVIDIA NIM for LLMs supports the following quantization formats:

Quantization algorithm	vLLM	TRTLLM	SGLang
`INT4 AWQ`	Y	Y	Y
`FP8`	Y	Y	Y
`NVFP4`	N	Y	N

Vision-Language Models with Text-Only Capabilities#

Vision-Language Models (VLMs) can be deployed with text-only capabilities in the multi-LLM compatible NIM container by configuring the image processing limit. This approach allows VLMs to function as standard text-only language models while maintaining compatibility with the multi-LLM compatible NIM container. This method is only relevant for VLMs not already supported by a VLM-specific NIM. For models with full VLM capabilities, refer to the VLM NIM documentation instead.

Configuration#

To deploy a VLM with text-only capabilities, set the NIM_MAX_IMAGES_PER_PROMPT environment variable to control the vLLM image limit in NIM. This parameter determines the maximum number of images that can be processed per prompt, effectively enabling text-only operation when set to 0.

Example Deployment#

The following example demonstrates how to configure a VLM for text-only operation:

# Choose a container name for bookkeeping
export CONTAINER_NAME=LLM-NIM

# Set the multi-LLM NIM repository
export Repository=nim/nvidia/llm-nim

# Set the tag to latest or a specific version (for example, 1.13.0)
export TAG=latest

# Choose the multi-LLM NIM image from NGC
export IMG_NAME="nvcr.io/$Repository:$TAG"

# Set HF_TOKEN for downloading HuggingFace repository
export HF_TOKEN=hf_xxxxxx

# Choose a HuggingFace model 
export NIM_MODEL_NAME=hf://google/gemma-3-27b-it

# Choose a served model name 
export NIM_SERVED_MODEL_NAME=google/gemma-3-27b-it

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Add write permissions to the NIM cache for downloading model assets
chmod -R a+w "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MODEL_NAME=$NIM_MODEL_NAME \
  -e NIM_SERVED_MODEL_NAME=$NIM_SERVED_MODEL_NAME \
  -e NIM_MAX_IMAGES_PER_PROMPT=0 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Note

When NIM_MAX_IMAGES_PER_PROMPT is set to 0, the VLM will process only text inputs and ignore any image data in the prompt, effectively operating as a text-only language model.

Expected Behavior and Error Handling#

When deploying a VLM with text-only capabilities, you may encounter the following error during the initial server launch due to a self health check:

Warning

ValueError: You set or defaulted to ‘{“image”: 0}’ in --limit-mm-per-prompt, but passed 1 image items in the same prompt.

This error does not indicate a deployment issue and can be safely ignored. The container will remain online and continue to accept requests normally after this error occurs. The error is generated during the initial health check when the system attempts to validate the VLM configuration with a test prompt that includes image data.