Supported Models, Hardware, and OS#

The GPUs available on TensorRT-Cloud change over time as we enable more configurations. To view the currently supported hardware and software, run:

$ trt-cloud info

This will produce output similar to:

[I] Available runners:
┌─────────┬────────────┬──────────────────────────────────┬────────────────────┬───────────────────────────────┐
│ OS       GPU         TRT Versions (for ONNX builds)    TRT-LLM Versions    Command                       │
├─────────┼────────────┼──────────────────────────────────┼────────────────────┼───────────────────────────────┤
│ Linux    A100        10.0.1, 10.2.0                    0.11.0              --os=linux --gpu=A100         │
│ Linux    H100        10.0.1, 10.2.0                    0.11.0              --os=linux --gpu=H100         │
│ Windows  GTX1660TI   10.0.1, 10.2.0                    0.11.0              --os=windows --gpu=GTX1660TI  │
│ Windows  RTX30508GB  10.0.1, 10.2.0                    0.11.0              --os=windows --gpu=RTX30508GB │
│ Windows  RTX3070     10.0.1, 10.2.0                    0.11.0              --os=windows --gpu=RTX3070    │
│ Windows  RTX4000SFF  10.0.1, 10.2.0                    0.11.0              --os=windows --gpu=RTX4000SFF │
│ Windows  RTX4060TI   10.0.1, 10.2.0                    0.11.0              --os=windows --gpu=RTX4060TI  │
│ Windows  RTX4070     10.0.1, 10.2.0                    0.11.0              --os=windows --gpu=RTX4070    │
│ Windows  RTX4090     10.0.1, 10.2.0                    0.11.0              --os=windows --gpu=RTX4090    │
│ Windows  RTXA5000    10.0.1, 10.2.0                    0.11.0              --os=windows --gpu=RTXA5000   │
└─────────┴────────────┴──────────────────────────────────┴────────────────────┴───────────────────────────────┘
[I] Available TRT-LLM Models:
meta-llama/Llama-2-70b-chat-hf
meta-llama/CodeLlama-34b-hf
meta-llama/CodeLlama-70b-hf
meta-llama/CodeLlama-7b-Python-hf
meta-llama/Meta-Llama-3-8B
meta-llama/Meta-Llama-3.1-8B-Instruct
meta-llama/Llama-3.1-8B
meta-llama/Llama-3.2-1B-Instruct
meta-llama/Llama-3.3-70B-Instruct
...
google/gemma-2b
google/gemma-1.1-2b-it
google/gemma-2b-it
google/gemma-2-2b
google/gemma-2-27b-it
mistralai/Mistral-7B-v0.1
mistralai/Mistral-7B-Instruct-v0.3
mistralai/Mixtral-8x7B-Instruct-v0.1
microsoft/phi-2
bigcode/starcoder2-7b
deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Qwen/Qwen2.5-0.5B
Qwen/Qwen2.5-Coder-32B-Instruct
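
The flags shown in the Command column can be passed to a build request to select a runner. For example, a build targeting the Linux A100 runner might look like the following (the build onnx subcommand and the --model flag are illustrative assumptions; only the --os and --gpu flags come from the table above, so check the CLI help for the exact syntax):

$ trt-cloud build onnx --model model.onnx --os=linux --gpu=A100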

Quantization Modes Supported by Model Family#

Warning

Some models have weight shapes that are not divisible by the specified tp_size or by the default awq_block_size of 128 (the awq_block_size is currently not modifiable). This can cause errors during quantization. If you encounter such errors, reduce the tp_size until it is compatible with the model's architecture: the smaller the tp_size, the more likely the model's weight dimensions are divisible by it.

Supported Weights Quantization Modes for Sweeps (sweep_config, trtllm_build, quantization, qformat) and Building LLM Models On-Demand (--quantization)#

Each column is a model weights quantization mode.

| Model Family                         | full_prec | int4_wo | int4_awq | fp8 | int8_wo | w4a8_awq |
|--------------------------------------|-----------|---------|----------|-----|---------|----------|
| Llama 2, 3, 3.1, 3.2, 3.3, CodeLlama | Yes       | Yes     | Yes      | Yes | Yes     | Yes      |
| Gemma, Gemma 2                       | Yes       | Yes     | Yes      | Yes | Yes     | No       |
| Mistral, Mixtral                     | Yes       | Yes     | No       | Yes | Yes     | No       |
| Qwen                                 | Yes       | No      | No       | Yes | No      | No       |
| Phi-2                                | Yes       | No      | No       | No  | No      | No       |
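
For example, an on-demand LLM build that picks a mode marked Yes in the table above might look like the following (the build llm subcommand and the --model flag are illustrative assumptions; only --quantization, --os, and --gpu are documented on this page, so check the CLI help for the exact syntax):

$ trt-cloud build llm --model meta-llama/Meta-Llama-3-8B --quantization int4_awq --os=windows --gpu=RTX4090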

In a sweep config (sweep_config), the kv_cache_dtype field under trtllm_build, quantization accepts the values fp8, int8, or null, or it can be left unset. Setting the value to null disables KV cache quantization. If the field is not set, the sweep assigns it a default value based on the model and hardware. When KV cache quantization is enabled, an int8 qformat should use an int8 kv_cache_dtype, and an fp8 qformat should use an fp8 kv_cache_dtype.

For the build command, passing --quantize-kv-cache enables KV cache quantization, and TensorRT-Cloud automatically picks the KV cache data type that matches the weights quantization mode.
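
For example, extending the illustrative command from the previous section with a weights and KV cache combination marked Yes in the table below (the build llm subcommand and the --model flag remain assumptions; check the CLI help for the exact syntax):

$ trt-cloud build llm --model meta-llama/Meta-Llama-3-8B --quantization fp8 --quantize-kv-cache --os=linux --gpu=H100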

The following table indicates whether KV cache quantization is supported for each model family and weights quantization mode.

KV Cache Quantization for Sweeps (sweep_config, trtllm_build, quantization, kv_cache_dtype) and Building LLM Models On-Demand (--quantize-kv-cache)#

Each column is a model weights quantization mode.

| Model Family                         | full_prec | int4_wo | int4_awq | fp8 | int8_wo | w4a8_awq |
|--------------------------------------|-----------|---------|----------|-----|---------|----------|
| Llama 2, 3, 3.1, 3.2, 3.3, CodeLlama | No        | No      | Yes      | Yes | No      | No       |
| Gemma, Gemma 2                       | Yes       | No      | Yes      | Yes | No      | No       |
| Mistral, Mixtral                     | No        | No      | No       | Yes | No      | No       |
| Qwen                                 | No        | No      | No       | Yes | No      | No       |
| Phi-2                                | No        | No      | No       | No  | No      | No       |

Note

Building a TensorRT-LLM engine for any of these models with a chosen GPU and data type is subject to VRAM availability on the target GPU. For example, a 70B-parameter model quantized to 4-bit weights still needs roughly 35 GB for the weights alone, which exceeds the memory of most desktop GPUs.