NVIDIA Triton Inference Server

GenAI Performance Analyzer

Copyright © 2018-2026, NVIDIA Corporation.