Supported Architectures for LLM-agnostic NIM#
Use this documentation to learn the details of supported architectures for the LLM-agnostic NIM container.
Note
If you are looking for the list of supported models for the LLM-specific NIM container, refer to Supported Models for NVIDIA NIM for LLMs instead.
Text-only Language Models#
The following table lists which text-only model architectures and inference engines have been verified to work with the LLM-agnostic NIM container. Each cell indicates whether the feature is supported (Yes) or not supported (No) for that model architecture and inference engine.
| Model Architecture (with verified HF model(s)) | vLLM: Base Model | vLLM: LoRA | vLLM: Function Calling | vLLM: Guided Decoding | TRTLLM: Base Model | TRTLLM: LoRA | TRTLLM: Function Calling | TRTLLM: Guided Decoding | SGLang: Base Model | SGLang: LoRA | SGLang: Function Calling | SGLang: Guided Decoding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | No | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | No |
| | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | No | No | No | No |
| | Yes | No | Yes | Yes | No | No | No | No | Yes | No | Yes | Yes |
| | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
Warning
NVIDIA cannot guarantee the security of models hosted on non-NVIDIA systems such as HuggingFace. Malicious or insecure models can result in serious security risks, up to and including full remote code execution. We strongly recommend that you manually verify the safety of any model not provided by NVIDIA before attempting to load it, through mechanisms such as a) ensuring that the model weights are serialized in the Safetensors format, b) conducting a manual review of the model and inference code to ensure that it is free of obfuscated or malicious code, and c) validating the model's signature, if available, to ensure that it comes from a trusted source and has not been modified.
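As a concrete illustration of point a) above, the following Python sketch flags weight files in a downloaded checkpoint that are not serialized as safetensors. The directory path is hypothetical, and passing this check is not a substitute for a full review.

```python
from pathlib import Path

# Hypothetical local checkout of a third-party model; adjust as needed.
model_dir = Path("/models/candidate-model")

# Pickle-based serialization formats can run arbitrary code on load,
# unlike safetensors, which stores raw tensors only.
SUSPECT_SUFFIXES = {".bin", ".pt", ".pth", ".pkl", ".ckpt"}

for path in model_dir.rglob("*"):
    if path.suffix in SUSPECT_SUFFIXES:
        print(f"WARNING: non-safetensors weight file found: {path}")
```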
Model Format#
NVIDIA NIM for LLMs supports the following model formats:
HuggingFace safetensors
GGUF
Unified HuggingFace safetensors
TRTLLM checkpoint
TRTLLM engine
HuggingFace safetensors#
The HuggingFace safetensors checkpoint should have the following directory structure:
├── config.json # [Required] HuggingFace model configuration
├── generation_config.json # [Optional] Parameters to guide text generation
├── model-00001-of-00004.safetensors # [Required] Model weights stored as safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── ...
├── model.safetensors.index.json # [Optional] Weights mapping
├── special_tokens_map.json # [Optional] Special tokens mapping
├── tokenizer.json # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
└── tokenizer_config.json # [Optional] Configuration details for a specific model's tokenizer
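As an illustration (this is not NIM's internal validation logic, and the checkpoint path is hypothetical), a pre-flight check for the [Required] files above could look like this:

```python
from pathlib import Path

# Hypothetical path to a local HuggingFace safetensors checkpoint.
ckpt = Path("/models/my-hf-checkpoint")

# config.json and at least one *.safetensors shard are marked [Required] above.
assert (ckpt / "config.json").is_file(), "missing config.json"
assert any(ckpt.glob("*.safetensors")), "missing *.safetensors weight shards"
print("checkpoint layout looks complete")
```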
Unified HuggingFace safetensors#
The Unified HuggingFace checkpoint should have the following directory structure:
├── config.json # [Required] HuggingFace model configuration
├── generation_config.json # [Optional] Parameters to guide text generation
├── model-00001-of-00004.safetensors # [Required] Model weights stored as safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── ...
├── model.safetensors.index.json # [Optional] Weights mapping
├── special_tokens_map.json # [Optional] Special tokens mapping
├── tokenizer.json # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
├── tokenizer_config.json # [Optional] Configuration details for a specific model's tokenizer
└── hf_quant_config.json # [Required] HuggingFace quantization configuration
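The hf_quant_config.json file is what distinguishes this layout: it records the quantization recipe the checkpoint was exported with. The sketch below simply reads it; the path is hypothetical, and the exact schema depends on the tool that exported the checkpoint (checkpoints exported by TensorRT Model Optimizer, for example, typically carry a quantization block with a quant_algo field).

```python
import json
from pathlib import Path

# Hypothetical path to a local unified HF checkpoint.
ckpt = Path("/models/my-unified-checkpoint")

# hf_quant_config.json is the [Required] file that makes this a unified
# checkpoint; inspect it to see which quantization was applied.
quant_cfg = json.loads((ckpt / "hf_quant_config.json").read_text())
print(quant_cfg.get("quantization", {}).get("quant_algo"))
```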
GGUF#
The GGUF checkpoint should have the following directory structure:
├── config.json # [Required] HuggingFace model configuration
├── generation_config.json # [Optional] Parameters to guide text generation
├── Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf # [Required] Model weights stored in GGUF format
├── ...
├── model.safetensors.index.json # [Optional] Weights mapping
├── special_tokens_map.json # [Optional] Special tokens mapping
├── tokenizer.json # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
└── tokenizer_config.json # [Optional] Configuration details for a specific model's tokenizer
TRTLLM checkpoints#
The TRTLLM checkpoints should have the following directory structure:
├── config.json # [Required] HuggingFace model configuration
├── generation_config.json # [Optional] Parameters to guide text generation
├── model.safetensors.index.json # [Optional] Weights mapping
├── special_tokens_map.json # [Optional] Special tokens mapping
├── tokenizer.json # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
├── tokenizer_config.json # [Optional] Configuration details for a specific model's tokenizer
└── trtllm_ckpt
    ├── config.json # [Required] TRTLLM pretrained configuration
    ├── rank0.safetensors # [Required] TRTLLM checkpoint safetensors
    └── ...
Note
The TRTLLM checkpoint root directory should contain the HF tokenizer and configuration files. The subfolder (named trtllm_ckpt in this example) should contain the TRTLLM checkpoint configuration file and the weight tensors.
TRTLLM engines#
The TRTLLM engine should have the following directory structure:
├── config.json # [Required] HuggingFace model configuration
├── generation_config.json # [Optional] Parameters to guide text generation
├── model.safetensors.index.json # [Optional] Weights mapping
├── special_tokens_map.json # [Optional] Special tokens mapping
├── tokenizer.json # [Optional] Tokenization method, vocabulary, pre-tokenization rules, etc.
├── tokenizer_config.json # [Optional] Configuration details for a specific model's tokenizer
└── trtllm_ckpt
    ├── config.json # [Required] TRTLLM engine configuration
    ├── rank0.engine # [Required] TRTLLM serialized engine
    └── ...
Note
The required files in each of the folder structures above are strictly validated by NIM. If one or more required files are missing, NIM throws an exception and prompts the user to provide the missing files. Tokenizer files are used for inference; if they are not present, each backend falls back to a non-optimal default tokenizer. Refer to Troubleshooting for more details.
Ensure that the tensor parallel and pipeline parallel sizes configured at start-up through NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE match the ranks of the TRTLLM checkpoint or engine. Otherwise, NIM throws an exception prompting you to fix the inference configuration.
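As an illustrative sanity check (not part of NIM; the checkpoint path is hypothetical), you can compare the product of the two parallel sizes against the number of per-rank files before launching:

```python
import os
from pathlib import Path

# Parallel sizes as they would be passed to the NIM container.
tp = int(os.environ.get("NIM_TENSOR_PARALLEL_SIZE", "1"))
pp = int(os.environ.get("NIM_PIPELINE_PARALLEL_SIZE", "1"))

# Count the rank*.safetensors files in the layout shown above
# (count rank*.engine instead for a serialized TRTLLM engine).
ranks = len(list(Path("/models/my-model/trtllm_ckpt").glob("rank*.safetensors")))

assert tp * pp == ranks, f"TP*PP={tp * pp} does not match {ranks} checkpoint ranks"
```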
Model Precisions#
All full-precision models can be deployed in NVIDIA NIM for LLMs, provided that the architecture and model format are supported. For quantized models, refer to the following section.
Quantization Formats#
NVIDIA NIM for LLMs supports the following quantization formats:
| Quantization algorithm | vLLM | TRTLLM | SGLang |
|---|---|---|---|
| | Yes | Yes | Yes |
| | Yes | Yes | Yes |
| | No | Yes | No |