Supported Architectures for Multi-LLM NIM#

Use this documentation to learn the details of supported architectures for the multi-LLM compatible NIM container.

Note

If you are looking for supported models for the LLM-specific NIM container, refer to Supported Models for NVIDIA NIM for LLMs instead.

Text-only Language Models#

The following table lists which text-only model architectures and inference engines have been verified to work with the multi-LLM compatible NIM container. Each cell indicates whether that specific feature is supported (Yes) or not supported (No) for the given model architecture and inference engine.

| Model Architecture: Verified HF Model(s) | vLLM Base Model | vLLM LoRA | vLLM Function Calling | vLLM Guided Decoding | TRTLLM Base Model | TRTLLM LoRA | TRTLLM Function Calling | TRTLLM Guided Decoding | SGLang Base Model | SGLang LoRA | SGLang Function Calling | SGLang Guided Decoding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BartForConditionalGeneration: facebook/bart-large-cnn | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| BloomForCausalLM: bigscience/bloom-560m | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| ChatGLMModel: THUDM/chatglm3-6b | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| DeciLMForCausalLM: nvidia/Llama-3_3-Nemotron-Super-49B-v1 | No | No | No | No | Yes | No | Yes | Yes | No | No | No | No |
| DeepseekV2ForCausalLM: deepseek-ai/DeepSeek-V2-Lite-Chat | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| DeepseekV3ForCausalLM: RedHatAI-DeepSeek-CoderV2 | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| FalconForCausalLM: tiiuae/falcon-7b | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| FalconMambaForCausalLM: tiiuae/falcon-mamba-7b-instruct | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| GemmaForCausalLM: google/gemma-1.1-2b-it | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| Gemma2ForCausalLM: google/gemma-2-9b | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| GlmForCausalLM: THUDM/glm-4-9b-chat-hf | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| GPTBigCodeForCausalLM: bigcode/starcoder | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| GPT2LMHeadModel: openai-community/gpt2, distilbert/distilgpt2 | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| GPTNeoXForCausalLM: EleutherAI/pythia-70 | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| GraniteForCausalLM: ibm-granite/granite-3.3-2b-instruct, ibm-granite/granite-3.3-8b-instruct | Yes | Yes | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| GraniteMoeForCausalLM: ibm/PowerMoE-3b | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| GritLM: GritLM-7B | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| InternLM2ForCausalLM: internlm/internlm2-chat-7b | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| InternLM3ForCausalLM: internlm/internlm3-8b-instruct | Yes | Yes | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| JambaForCausalLM: ai21labs/AI21-Jamba-Mini-1.5, ai21labs/AI21-Jamba-Mini-1.6 | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| LlamaForCausalLM: meta-llama/Llama-3.1-8B-Instruct, meta-llama/Meta-Llama-3-8B, openGPT-X/Teuken-7B-instruct-commercial-v0.4, RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8, utter-project/EuroLLM-9B-Instruct, meta-llama/Llama-2-7b-chat-hf, modularai/Llama-3.1-8B-Instruct-GGUF, petals-team/StableBeluga2, meta-llama/Llama-3.2-1B-Instruct, TinyLlama/TinyLlama-1.1B-Chat-v1.0, unsloth/phi-4 (base), unsloth/phi-4 (Multi-LoRA) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| MambaForCausalLM: state-spaces/mamba-370m-hf | Yes | No | Yes | Yes | Yes | No | No | No | No | No | No | No |
| MistralForCausalLM: mistralai/Codestral-22B-v0.1, mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-Small-24B-Instruct-2501, speakleash/Bielik-11B-v2.3-Instruct, Fastweb/FastwebMIIA-7B, sarvamai/sarvam-m, PocketDoc/Dans-PersonalityEngine-V1.3.0-24b, Delta-Vector/Sol-Reaver-15B-Instruct, mistralai/Mistral-7B-Instruct-v0.2 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| MolmoForCausalLM: allenai-molmo | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| Olmo2ForCausalLM: allenai/OLMo-2-0425-1B | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| OlmoeForCausalLM: allenai-OLMoe-1B | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| PhiMoEForCausalLM: microsoft/Phi-3.5-MoE-instruct | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| Phi3ForCausalLM: microsoft/Phi-3-mini-4k-instruct, microsoft/Phi-4 | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| Phi3SmallForCausalLM: microsoft/Phi-3-small-8k-instruct | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| QWenLMHeadModel: Qwen-1_8B-Chat | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| Qwen2ForCausalLM: Qwen/Qwen2.5-0.5B-Instruct, Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-14B-Instruct-AWQ, Qwen/Qwen2.5-7B-Instruct-1M | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| Qwen2MoeForCausalLM: Qwen/Qwen1.5-MoE-A2.7B-Chat | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| RWForCausalLM: tiiuae/falcon-40b | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| SolarForCausalLM: upstage/solar-pro-preview-instruct | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| StableLMEpochForCausalLM: TroyDoesAI-Mermaid | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| StableLmForCausalLM: AI4free-jarvis-3b | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| StarCoder2ForCausalLM: bigcode/starcoder2-15b | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No |

Warning

NVIDIA cannot guarantee the security of any models hosted on non-NVIDIA systems such as HuggingFace. Malicious or insecure models can result in serious security risks, up to and including full remote code execution. We strongly recommend that you manually verify the safety of any model not provided by NVIDIA before attempting to load it, through mechanisms such as a) ensuring that the model weights are serialized in the Safetensors format, b) conducting a manual review of any model or inference code to ensure that it is free of obfuscated or malicious code, and c) validating the signature of the model, if available, to ensure that it comes from a trusted source and has not been modified.
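For example, the following shell commands are a minimal sketch of check (a): they list the weight files in a locally downloaded checkpoint so you can confirm Safetensors serialization and flag pickle-based files for manual review. The directory path is a placeholder.

# Placeholder path to a checkpoint you downloaded for review
MODEL_DIR=./downloaded-model

# Safetensors weights (the expected serialization format)
find "$MODEL_DIR" -maxdepth 1 -name "*.safetensors"

# Pickle-based weights (.bin, .pt, .pth) warrant a manual review before loading
find "$MODEL_DIR" -maxdepth 1 \( -name "*.bin" -o -name "*.pt" -o -name "*.pth" \)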

Model Formats#

NVIDIA NIM for LLMs supports the following model formats:

  • HF safetensor

  • GGUF

  • Unified HF safetensor

  • TRTLLM checkpoint

  • TRTLLM engine
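As a rough sketch of how these formats are consumed, you can point NIM_MODEL_NAME at a Hugging Face repository (as in the deployment example later in this document) or, assuming the checkpoint already exists on local disk in one of the formats above, mount its directory into the container and pass the in-container path. The paths and model directory names below are illustrative placeholders.

# Deploy directly from Hugging Face Hub
export NIM_MODEL_NAME=hf://meta-llama/Llama-3.1-8B-Instruct

# Or deploy a local checkpoint directory (placeholder paths), assuming
# NIM_MODEL_NAME is given a path that is visible inside the container
export LOCAL_MODEL_DIR=/path/to/local/checkpoint
docker run -it --rm --gpus all \
  -e NIM_MODEL_NAME=/opt/models/my-model \
  -v "$LOCAL_MODEL_DIR:/opt/models/my-model" \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/llm-nim:latest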

Hugging Face safetensors#

The HuggingFace safetensors checkpoint should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration 
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model-00001-of-00004.safetensors      # [Required] Model weights stored as safetensors 
├── model-00002-of-00004.safetensors 
├── model-00003-of-00004.safetensors 
├── model-00004-of-00004.safetensors 
├── ...
├── model.safetensors.index.json          # [Optional] Weights mapping 
├── special_tokens_map.json               # [Optional] Special tokens mapping 
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules etc 
└── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer
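For illustration, the following commands sketch how you might fetch a checkpoint in this layout and confirm that the required files are present before deployment; the model and target directory are examples only.

# Download a verified model from the table above into a local directory
pip install -U "huggingface_hub[cli]"
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./llama-3.1-8b-instruct

# config.json and at least one *.safetensors file are required; the rest are optional
test -f ./llama-3.1-8b-instruct/config.json && ls ./llama-3.1-8b-instruct/*.safetensors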

Unified Hugging Face safetensors#

The Unified HuggingFace checkpoint should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration 
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model-00001-of-00004.safetensors      # [Required] Model weights stored as safetensors 
├── model-00002-of-00004.safetensors 
├── model-00003-of-00004.safetensors 
├── model-00004-of-00004.safetensors 
├── ...
├── model.safetensors.index.json          # [Optional] Weights mapping 
├── special_tokens_map.json               # [Optional] Special tokens mapping 
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules etc 
├── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer
└── hf_quant_config.json                  # [Required] HuggingFace quantization configuration

GGUF#

The GGUF checkpoint should have the following directory structure:

├── config.json                             # [Required] HuggingFace model configuration 
├── generation_config.json                  # [Optional] Parameters to guide text generation
├── Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf  # [Required] Model weights stored in GGUF format
├── ...
├── model.safetensors.index.json            # [Optional] Weights mapping 
├── special_tokens_map.json                 # [Optional] Special tokens mapping 
├── tokenizer.json                          # [Optional] Tokenization method, vocabulary, pre-tokenization rules etc 
└── tokenizer_config.json                   # [Optional] Configuration details for a specific model's tokenizer

TRTLLM checkpoints#

The TRTLLM checkpoints should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration 
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model.safetensors.index.json          # [Optional] Weights mapping 
├── special_tokens_map.json               # [Optional] Special tokens mapping 
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules etc 
├── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer
└── trtllm_ckpt
    ├── config.json                       # [Required] TRTLLM pretrained configuration
    ├── rank0.safetensors                 # [Required] TRTLLM checkpoint safetensors
    └── ...

Note

The TRTLLM checkpoint root directory should include Hugging Face tokenizer and configuration files. The trtllm_ckpt subfolder should include the TRTLLM checkpoint configuration file and weight tensors.

TRTLLM engines#

The TRTLLM engine should have the following directory structure:

├── config.json                           # [Required] HuggingFace model configuration 
├── generation_config.json                # [Optional] Parameters to guide text generation
├── model.safetensors.index.json          # [Optional] Weights mapping 
├── special_tokens_map.json               # [Optional] Special tokens mapping 
├── tokenizer.json                        # [Optional] Tokenization method, vocabulary, pre-tokenization rules etc 
├── tokenizer_config.json                 # [Optional] Configuration details for a specific model's tokenizer
└── trtllm_ckpt
    ├── config.json                       # [Required] TRTLLM engine configuration
    ├── rank0.engine                      # [Required] TRTLLM serialized engine
    └── ...

Note

NIM validates the required files in each folder structure. If one or more required files are missing, NIM raises an exception and prompts you to provide the missing files. Tokenizer files are used for inference; if they are not present, each backend falls back to suboptimal default tokenizers. Refer to Troubleshooting for more details.

Ensure that the tensor parallel and pipeline parallel sizes configured at start-up time with NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE match the TRTLLM checkpoint and engine ranks. Otherwise, NIM raises an exception that prompts you to fix the inference configuration.
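The following sketch illustrates the relationship, assuming one rank file per rank as produced by typical TRTLLM builds; the paths are placeholders, and the product of NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE should equal the rank count.

# Count the rank files in the checkpoint (placeholder path)
RANKS=$(ls /path/to/model/trtllm_ckpt/rank*.safetensors | wc -l)
echo "Checkpoint was built for $RANKS ranks"

# Launch with a matching parallel configuration
docker run -it --rm --gpus all \
  -e NIM_MODEL_NAME=/opt/models/my-trtllm-model \
  -e NIM_TENSOR_PARALLEL_SIZE=$RANKS \
  -e NIM_PIPELINE_PARALLEL_SIZE=1 \
  -v "/path/to/model:/opt/models/my-trtllm-model" \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/llm-nim:latest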

Model Precisions#

All full-precision models can be deployed in NVIDIA NIM for LLMs, provided that the architecture and model format are supported. For quantized models, refer to the following section.

Quantization Formats#

NVIDIA NIM for LLMs supports the following quantization formats:

| Quantization algorithm | vLLM | TRTLLM | SGLang |
|---|---|---|---|
| INT4 AWQ | Y | Y | Y |
| FP8 | Y | Y | Y |
| NVFP4 | N | Y | N |
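As a rough heuristic for determining which quantization format a checkpoint uses before deploying it, you can inspect its configuration files. Field names vary by quantization toolchain, so the commands below are only a sketch with a placeholder path.

MODEL_DIR=./my-quantized-model   # placeholder

# Checkpoints quantized with Transformers-compatible tooling usually carry a
# quantization_config block in config.json
jq '.quantization_config // "no quantization_config found"' "$MODEL_DIR/config.json"

# Unified HF safetensors checkpoints ship a separate hf_quant_config.json (see Model Formats above)
cat "$MODEL_DIR/hf_quant_config.json" 2>/dev/null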

Vision-Language Models with Text-Only Capabilities#

Vision-Language Models (VLMs) can be deployed with text-only capabilities in the multi-LLM compatible NIM container by configuring the image processing limit. This approach allows VLMs to function as standard text-only language models while maintaining compatibility with the multi-LLM compatible NIM container. This method is only relevant for VLMs not already supported by a VLM-specific NIM. For models with full VLM capabilities, refer to the VLM NIM documentation instead.

Configuration#

To deploy a VLM with text-only capabilities, set the NIM_MAX_IMAGES_PER_PROMPT environment variable to control the vLLM image limit in NIM. This parameter determines the maximum number of images that can be processed per prompt, effectively enabling text-only operation when set to 0.

Example Deployment#

The following example demonstrates how to configure a VLM for text-only operation:

# Choose a container name for bookkeeping
export CONTAINER_NAME=LLM-NIM

# Set the multi-LLM NIM repository
export Repository=nim/nvidia/llm-nim

# Set the tag to latest or a specific version (for example, 1.13.0)
export TAG=latest

# Choose the multi-LLM NIM image from NGC
export IMG_NAME="nvcr.io/$Repository:$TAG"

# Set HF_TOKEN for downloading HuggingFace repository
export HF_TOKEN=hf_xxxxxx

# Choose a HuggingFace model 
export NIM_MODEL_NAME=hf://google/gemma-3-27b-it

# Choose a served model name 
export NIM_SERVED_MODEL_NAME=google/gemma-3-27b-it

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Add write permissions to the NIM cache for downloading model assets
chmod -R a+w "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MODEL_NAME=$NIM_MODEL_NAME \
  -e NIM_SERVED_MODEL_NAME=$NIM_SERVED_MODEL_NAME \
  -e NIM_MAX_IMAGES_PER_PROMPT=0 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

Note

When NIM_MAX_IMAGES_PER_PROMPT is set to 0, the VLM will process only text inputs and ignore any image data in the prompt, effectively operating as a text-only language model.
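To confirm text-only operation after the container starts, you can send a plain text request through the OpenAI-compatible chat completions endpoint that NIM exposes on port 8000; the served model name below assumes the NIM_SERVED_MODEL_NAME value from the example above.

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "google/gemma-3-27b-it",
        "messages": [{"role": "user", "content": "Write one sentence about GPUs."}],
        "max_tokens": 64
      }'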

Expected Behavior and Error Handling#

When deploying a VLM with text-only capabilities, you may encounter the following error during the initial server launch, caused by the container's startup self health check:

Warning

ValueError: You set or defaulted to '{"image": 0}' in --limit-mm-per-prompt, but passed 1 image items in the same prompt.

This error does not indicate a deployment issue and can be safely ignored. The container will remain online and continue to accept requests normally after this error occurs. The error is generated during the initial health check when the system attempts to validate the VLM configuration with a test prompt that includes image data.
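If you want to verify that the container remained healthy after the error, a simple check is to poll the standard NIM health and model-listing endpoints once start-up completes:

# Readiness probe; returns success once the server can accept requests
curl http://localhost:8000/v1/health/ready

# List the served model names
curl http://localhost:8000/v1/models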