Supported Models, Hardware, and OS#
The GPUs available on TensorRT-Cloud change over time as we enable more configurations. To view the latest supported hardware and software, run:
$ trt-cloud info
This will produce output similar to:
[I] Available runners:
┌─────────┬────────────┬──────────────────────────────────┬────────────────────┬───────────────────────────────┐
│ OS │ GPU │ TRT Versions (for ONNX builds) │ TRT-LLM Versions │ Command │
├─────────┼────────────┼──────────────────────────────────┼────────────────────┼───────────────────────────────┤
│ Linux │ A100 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=linux --gpu=A100 │
│ Linux │ H100 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=linux --gpu=H100 │
│ Windows │ GTX1660TI │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=GTX1660TI │
│ Windows │ RTX30508GB │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX30508GB │
│ Windows │ RTX3070 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX3070 │
│ Windows │ RTX4000SFF │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX4000SFF │
│ Windows │ RTX4060TI │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX4060TI │
│ Windows │ RTX4070 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX4070 │
│ Windows │ RTX4090 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTX4090 │
│ Windows │ RTXA5000 │ 10.0.1, 10.2.0 │ 0.11.0 │ --os=windows --gpu=RTXA5000 │
└─────────┴────────────┴──────────────────────────────────┴────────────────────┴───────────────────────────────┘
[I] Available TRT-LLM Models:
meta-llama/Llama-2-70b-chat-hf
meta-llama/CodeLlama-34b-hf
meta-llama/CodeLlama-70b-hf
meta-llama/CodeLlama-7b-Python-hf
meta-llama/Meta-Llama-3-8B
meta-llama/Meta-Llama-3.1-8B-Instruct
meta-llama/Llama-3.1-8B
meta-llama/Llama-3.2-1B-Instruct
meta-llama/Llama-3.3-70B-Instruct
...
google/gemma-2b
google/gemma-1.1-2b-it
google/gemma-2b-it
google/gemma-2-2b
google/gemma-2-27b-it
mistralai/Mistral-7B-v0.1
mistralai/Mistral-7B-Instruct-v0.3
mistralai/Mixtral-8x7B-Instruct-v0.1
microsoft/phi-2
bigcode/starcoder2-7b
deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Qwen/Qwen2.5-0.5B
Qwen/Qwen2.5-Coder-32B-Instruct
Quantization Modes Supported by Model Family#
Warning
Some models may have weight shapes that are not divisible by the specified `tp_size` or by the default `awq_block_size` of 128 (currently, the `awq_block_size` is not modifiable). This can lead to errors during quantization. If you encounter such issues, try reducing the `tp_size` so that it is compatible with the model's architecture; the smaller the value, the more likely the model's weights are evenly divisible by it.
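If a quantized build fails this way, a lower tensor-parallel size is the first thing to try. The sketch below assumes `tp_size` is exposed as a `--tp-size` option on the LLM build command, that `--hf-repo` and `--quantization` are the relevant flags, and that `int4_awq` names the AWQ mode; check `trt-cloud build llm --help` for the exact names:

```
# Hedged sketch: retry a failed int4_awq quantization with a smaller tp_size.
# Reducing tp_size makes each per-GPU weight shard larger, so it is more likely
# to divide evenly by the fixed awq_block_size of 128.
$ trt-cloud build llm --hf-repo="meta-llama/Llama-3.1-8B" --gpu=H100 --os=linux \
      --quantization="int4_awq" --tp-size=1
```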
| Model Family | Model Weights Quantization Mode | | | | | |
|---|---|---|---|---|---|---|
| Llama 2, 3, 3.1, 3.2, 3.3, CodeLlama | Yes | Yes | Yes | Yes | Yes | Yes |
| Gemma, Gemma 2 | Yes | Yes | Yes | Yes | Yes | No |
| Mistral, Mixtral | Yes | Yes | No | Yes | Yes | No |
| Qwen | Yes | No | No | Yes | No | No |
| Phi-2 | Yes | No | No | No | No | No |
In the sweep config, the `kv_cache_dtype` field under `trtllm_build` > `quantization` takes one of the following values: `fp8`, `int8`, `null`, or empty (omitted). Setting it to `null` disables KV cache quantization. If the field is omitted, the sweep assigns a default value based on the model and hardware. When KV cache quantization is enabled, an `int8` qformat should be paired with an `int8` `kv_cache_dtype`, and an `fp8` qformat with an `fp8` `kv_cache_dtype`.
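As a rough sketch, the corresponding sweep config entry might look like the following (shown as YAML; the exact file format, nesting, and the `qformat` key name are assumptions based on the description above rather than the authoritative schema):

```
# Hypothetical sweep config fragment: fp8 weights paired with an fp8 KV cache.
trtllm_build:
  quantization:
    qformat: fp8         # weights quantization mode
    kv_cache_dtype: fp8  # match the weights qformat; use null to disable, or omit for the default
```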
For the build command, passing `--quantize-kv-cache` enables KV cache quantization, and TensorRT-Cloud automatically picks the KV cache data type that matches the weights quantization type.
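For illustration, a build invocation might look like the following. `--gpu`, `--os`, and `--quantize-kv-cache` come from this page; `trt-cloud build llm`, `--hf-repo`, and `--quantization` are assumptions about the CLI surface, so check `trt-cloud build llm --help` for the exact options:

```
# Hedged sketch: request an fp8-quantized Llama 3 engine on a Linux H100 runner.
# With --quantize-kv-cache, TensorRT-Cloud should pick an fp8 KV cache to match
# the fp8 weights quantization.
$ trt-cloud build llm --hf-repo="meta-llama/Meta-Llama-3-8B" --gpu=H100 --os=linux \
      --quantization="fp8" --quantize-kv-cache
```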
The following table indicates whether KV cache quantization is supported for each model family, given the model's weights quantization mode.
| Model Family | Model Weights Quantization Mode | | | | | |
|---|---|---|---|---|---|---|
| Llama 2, 3, 3.1, 3.2, 3.3, CodeLlama | No | No | Yes | Yes | No | No |
| Gemma, Gemma 2 | Yes | No | Yes | Yes | No | No |
| Mistral, Mixtral | No | No | No | Yes | No | No |
| Qwen | No | No | No | Yes | No | No |
| Phi-2 | No | No | No | No | No | No |
Note
Building a TensorRT-LLM engine for any of these models with the GPU and data type of your choice is subject to VRAM availability on the target GPU.