Supported Models, Hardware, and OS#
The GPUs available on TensorRT-Cloud change over time as we enable more configurations. To view the latest supported hardware and software, run:
$ trt-cloud info
This will produce output similar to:
[I] Available runners:
┌─────────┬────────────┬──────────────────────────────────┬────────────────────┬───────────────────────────────┐
│ OS │ GPU │ TRT Versions (for ONNX builds) │ TRT-LLM Versions │ Command │
├─────────┼────────────┼──────────────────────────────────┼────────────────────┼───────────────────────────────┤
│ Linux │ A100 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=linux --gpu=A100 │
│ Linux │ H100 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=linux --gpu=H100 │
│ Windows │ GTX1660TI │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=GTX1660TI │
│ Windows │ RTX30508GB │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX30508GB │
│ Windows │ RTX3070 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX3070 │
│ Windows │ RTX4000SFF │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX4000SFF │
│ Windows │ RTX4060TI │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX4060TI │
│ Windows │ RTX4070 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX4070 │
│ Windows │ RTX4090 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX4090 │
│ Windows │ RTXA5000 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTXA5000 │
└─────────┴────────────┴──────────────────────────────────┴────────────────────┴───────────────────────────────┘
[I] Available TRT-LLM Models:
meta-llama/Llama-2-70b-chat-hf
meta-llama/CodeLlama-34b-hf
meta-llama/CodeLlama-70b-hf
meta-llama/CodeLlama-7b-Python-hf
meta-llama/Meta-Llama-3-8B
meta-llama/Meta-Llama-3.1-8B-Instruct
meta-llama/Llama-3.1-8B
meta-llama/Llama-3.2-1B-Instruct
meta-llama/Llama-3.3-70B-Instruct
...
google/gemma-2b
google/gemma-1.1-2b-it
google/gemma-2b-it
google/gemma-2-2b
google/gemma-2-27b-it
mistralai/Mistral-7B-v0.1
mistralai/Mistral-7B-Instruct-v0.3
mistralai/Mixtral-8x7B-Instruct-v0.1
microsoft/phi-2
bigcode/starcoder2-7b
deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Qwen/Qwen2.5-0.5B
Qwen/Qwen2.5-Coder-32B-Instruct
Quantization Modes Supported by Model Family#
Warning
Some models may have weight shapes that are not divisible by the specified `tp_size` or by the default `awq_block_size` of 128 (currently, the `awq_block_size` is not modifiable). This can lead to errors during quantization. If you encounter such issues, try reducing the `tp_size` so that it is compatible with the model's architecture; the smaller the value, the more likely the model's weights are evenly divisible by it.
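If a quantized build fails this way, a lower tensor-parallel size is the first thing to try. The sketch below assumes `tp_size` is exposed as a `--tp-size` option on the LLM build command, that `--hf-repo` and `--quantization` are the relevant flags, and that `int4_awq` names the AWQ mode; check `trt-cloud build llm --help` for the exact names:

```
# Hedged sketch: retry a failed int4_awq quantization with a smaller tp_size.
# Reducing tp_size makes each per-GPU weight shard larger, so it is more likely
# to divide evenly by the fixed awq_block_size of 128.
$ trt-cloud build llm --hf-repo="meta-llama/Llama-3.1-8B" --gpu=H100 --os=linux \
      --quantization="int4_awq" --tp-size=1
```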
| Model Family | Model Weights Quantization Mode | | | | | |
|---|---|---|---|---|---|---|
| Llama 2, 3, 3.1, 3.2, 3.3, CodeLlama | Yes | Yes | Yes | Yes | Yes | Yes |
| Gemma, Gemma 2 | Yes | Yes | Yes | Yes | Yes | No |
| Mistral, Mixtral | Yes | Yes | No | Yes | Yes | No |
| Qwen | Yes | No | No | Yes | No | No |
| Phi-2 | Yes | No | No | No | No | No |
In the sweep config, the `kv_cache_dtype` field under `trtllm_build` > `quantization` takes one of the following values: `fp8`, `int8`, `null`, or empty (omitted). Setting it to `null` disables KV cache quantization. If the field is omitted, the sweep assigns a default value based on the model and hardware. When KV cache quantization is enabled, an `int8` qformat should be paired with an `int8` `kv_cache_dtype`, and an `fp8` qformat with an `fp8` `kv_cache_dtype`.
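As a rough sketch, the corresponding sweep config entry might look like the following (shown as YAML; the exact file format, nesting, and the `qformat` key name are assumptions based on the description above rather than the authoritative schema):

```
# Hypothetical sweep config fragment: fp8 weights paired with an fp8 KV cache.
trtllm_build:
  quantization:
    qformat: fp8         # weights quantization mode
    kv_cache_dtype: fp8  # match the weights qformat; use null to disable, or omit for the default
```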
For the build command, passing `--quantize-kv-cache` enables KV cache quantization, and TensorRT-Cloud automatically picks the KV cache data type that matches the weights quantization type.
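For illustration, a build invocation might look like the following. `--gpu`, `--os`, and `--quantize-kv-cache` come from this page; `trt-cloud build llm`, `--hf-repo`, and `--quantization` are assumptions about the CLI surface, so check `trt-cloud build llm --help` for the exact options:

```
# Hedged sketch: request an fp8-quantized Llama 3 engine on a Linux H100 runner.
# With --quantize-kv-cache, TensorRT-Cloud should pick an fp8 KV cache to match
# the fp8 weights quantization.
$ trt-cloud build llm --hf-repo="meta-llama/Meta-Llama-3-8B" --gpu=H100 --os=linux \
      --quantization="fp8" --quantize-kv-cache
```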
The following table indicates whether KV cache quantization is supported for each model family, given the model's weights quantization mode.
| Model Family | Model Weights Quantization Mode | | | | | |
|---|---|---|---|---|---|---|
| Llama 2, 3, 3.1, 3.2, 3.3, CodeLlama | No | No | Yes | Yes | No | No |
| Gemma, Gemma 2 | Yes | No | Yes | Yes | No | No |
| Mistral, Mixtral | No | No | No | Yes | No | No |
| Qwen | No | No | No | Yes | No | No |
| Phi-2 | No | No | No | No | No | No |
Note
Building a TensorRT-LLM engine for any of these models with the GPU and data type of your choice is subject to VRAM availability on the target GPU.