<no title> — NVIDIA Triton Inference Server

Perf Analyzer documentation has been relocated to here.


Copyright © 2018-2025, NVIDIA Corporation.