Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Deploy NeMo Models by Exporting to Inference-Optimized Libraries
NeMo Framework offers scripts and APIs to export models to two inference-optimized libraries, TensorRT-LLM and vLLM, and to deploy the exported models with the NVIDIA Triton Inference Server. Check the table below to see which models are supported; a minimal export-and-deploy sketch follows the table.
Supported LLMs
The following table lists the supported LLMs (in the distributed NeMo checkpoint format) and the inference-optimized libraries each model can be exported to.
| Model Name | Model Parameters | TensorRT-LLM | vLLM |
|---|---|---|---|
| GPT | 2B, 8B, 43B | ✓ | ✗ |
| Nemotron | 8B, 22B | ✓ | ✗ |
| Llama 2 | 7B, 13B, 70B | ✓ | ✓ |
| Llama 3 | 8B, 70B | ✓ | ✓ |
| Llama 3.1 | 8B, 70B, 405B | ✓ | ✗ |
| Falcon | 7B, 40B | ✓ | ✗ |
| Gemma | 2B, 7B | ✓ | ✓ |
| StarCoder1 | 15B | ✓ | ✗ |
| StarCoder2 | 3B, 7B, 15B | ✓ | ✓ |
| MISTRAL | 7B | ✓ | ✓ |
| MIXTRAL | 8x7B | ✓ | ✓ |
You can find details about the TensorRT-LLM- and vLLM-based deployment options below.
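For orientation, the export-and-deploy flow typically looks like the following minimal sketch. It assumes you are running inside the NeMo container with a distributed NeMo checkpoint on disk, and it uses the `TensorRTLLM` exporter and `DeployPyTriton` classes from the `nemo.export` and `nemo.deploy` modules; the checkpoint path, model type, port, and Triton model name are illustrative placeholders, and argument names can differ between container versions, so treat this as a sketch rather than a drop-in script.

```python
from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Export the distributed NeMo checkpoint to a TensorRT-LLM engine.
# The paths and model_type below are placeholders for your own checkpoint.
exporter = TensorRTLLM(model_dir="/opt/checkpoints/trt_llm_engine/")
exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/llama3-8b.nemo",
    model_type="llama",
    n_gpus=1,
)

# Serve the exported engine with the NVIDIA Triton Inference Server.
nm = DeployPyTriton(model=exporter, triton_model_name="llama3-8b", port=8000)
nm.deploy()
nm.serve()
```

Once `nm.serve()` is running, the model can be queried through the Triton endpoint. The vLLM path follows the same export-then-deploy pattern with the vLLM exporter, as described in the deployment options below.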