Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Deploy NeMo Models by Exporting to Inference Optimized Libraries
The NeMo Framework offers scripts and APIs to export models to two inference-optimized libraries, TensorRT-LLM and vLLM, and to deploy the exported models with the NVIDIA Triton Inference Server. Check the table below to see which models are supported.
Supported LLMs
The following table shows which LLMs in the distributed NeMo checkpoint format are supported by each inference-optimized library.
| Model Name | Model Parameters | TensorRT-LLM | vLLM |
|---|---|---|---|
| GPT | 2B, 8B, 43B | ✓ | ✗ |
| Nemotron | 8B, 22B | ✓ | ✗ |
| Llama 2 | 7B, 13B, 70B | ✓ | ✓ |
| Llama 3 | 8B, 70B | ✓ | ✓ |
| Llama 3.1 | 8B, 70B, 405B | ✓ | ✗ |
| Falcon | 7B, 40B | ✓ | ✗ |
| Gemma | 2B, 7B | ✓ | ✓ |
| StarCoder1 | 15B | ✓ | ✗ |
| StarCoder2 | 3B, 7B, 15B | ✓ | ✓ |
| Mistral | 7B | ✓ | ✓ |
| Mixtral | 8x7B | ✓ | ✓ |
You can find details about the TensorRT-LLM-based and vLLM-based deployment options below.
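As a quick orientation, here is a minimal sketch of the export-and-deploy flow using the NeMo Python APIs for TensorRT-LLM export and Triton deployment. The checkpoint path, engine directory, and model name are illustrative, and exact import paths and argument names may vary between NeMo versions; refer to the detailed sections below for the authoritative options.

```python
# Minimal sketch: export a NeMo checkpoint to TensorRT-LLM and serve it with Triton.
# Paths and the model name are illustrative assumptions, not real artifacts.
from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Export a distributed NeMo checkpoint to a TensorRT-LLM engine directory.
exporter = TensorRTLLM(model_dir="/opt/checkpoints/llama_trt_llm_engine")
exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/llama-2-7b.nemo",  # illustrative path
    model_type="llama",
)

# Deploy the exported engine with the NVIDIA Triton Inference Server.
nm = DeployPyTriton(model=exporter, triton_model_name="llama")
nm.deploy()
nm.serve()
```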