NVIDIA Triton Inference Server

Getting Started

  • Quickstart

User Guide

  • Deploying your trained model using Triton
  • Triton Architecture
  • Model Repository
  • Repository Agent
  • Model Configuration
  • Request Cancellation
  • Optimization
  • Ragged Batching
  • Rate Limiter
  • Model Analyzer
  • Model Management
  • Custom Operations
  • Decoupled Backends and Models
  • Triton Response Cache
  • Metrics
  • Triton Server Trace
  • Triton Inference Server Support for Jetson and JetPack
  • Version 1 to Version 2 Migration
  • Secure Deployment Considerations

Debugging

  • Debugging Guide
  • FAQ

Protocol Guides

  • HTTP/REST and GRPC Protocol
  • Inference Protocols and APIs
  • Binary Tensor Data Extension
  • Classification Extension
  • Generate Extension
  • Logging Extension
  • Model Configuration Extension
  • Model Repository Extension
  • Schedule Policy Extension
  • Sequence Extension
  • Shared-Memory Extension
  • Statistics Extension
  • Trace Extension
  • Parameters Extension

Customization Guide

  • Building Triton
  • Customize Triton Container
  • Testing Triton

Examples

  • Using Triton Inference Server as a shared library for execution on Jetson
  • Concurrent inference and dynamic batching

Client

  • Triton Client Libraries and Examples
  • Python tritonclient Package API (a usage sketch follows this section)
    • tritonclient
      • tritonclient.grpc
        • tritonclient.grpc.aio
        • tritonclient.grpc.auth
      • tritonclient.http
        • tritonclient.http.aio
        • tritonclient.http.auth
      • tritonclient.utils
        • tritonclient.utils.cuda_shared_memory
        • tritonclient.utils.shared_memory
  • Triton Java API
  • Generate Go Client stubs
  • Example JavaScript Client Using Generated GRPC API
  • Example Java and Scala Client Using Generated GRPC API
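
The client packages above all follow the same request flow: connect to the server, describe the input tensors, call infer, and read back the outputs. As a minimal sketch of the HTTP path (the server address, model name, tensor names, shape, and datatype here are placeholder assumptions, not taken from this page):

    import numpy as np
    import tritonclient.http as httpclient

    # Connect to a local Triton server's HTTP endpoint (default port 8000).
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Describe and fill the request tensor. "my_model", "INPUT0", the shape,
    # and "FP32" are hypothetical; substitute your model's configuration.
    input0 = httpclient.InferInput("INPUT0", [1, 16], "FP32")
    input0.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

    # Run inference and read back the (hypothetical) output tensor.
    response = client.infer(model_name="my_model", inputs=[input0])
    print(response.as_numpy("OUTPUT0"))

The tritonclient.grpc package mirrors this interface, and the .aio variants expose the same calls as asyncio coroutines.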

Performance Analyzer

  • Triton Performance Analyzer
  • Perf Analyzer Documentation
  • Recommended Installation Method
  • Quick Start
  • Perf Analyzer CLI
  • Inference Load Modes
  • Input Data
  • Measurement Modes
  • Benchmarking Triton via HTTP or gRPC endpoint
  • GenAI-Perf
  • GenAI-Perf Compare Subcommand
  • Profile Embeddings Models with GenAI-Perf
  • Generated File Structures
  • Profile Multiple LoRA Adapters
  • Profile Vision-Language Models with GenAI-Perf
  • Profile Ranking Models with GenAI-Perf
  • Profile Large Language Models with GenAI-Perf

Python Backend

  • Python Backend (a minimal model sketch follows this section)
  • Using Triton with Inferentia 1
  • Auto-Complete Example
  • BLS Example
  • Example of using BLS with decoupled models
  • Custom Metrics Example
  • Decoupled Model Examples
  • Model Instance Kind Example
  • JAX Example
  • Preprocessing Using Python Backend Example
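
The examples above all implement the same Python backend interface: a model.py exposing a TritonPythonModel class whose execute method maps a batch of requests to responses. A minimal identity-model sketch, assuming placeholder tensor names "INPUT0" and "OUTPUT0":

    import triton_python_backend_utils as pb_utils

    class TritonPythonModel:
        def execute(self, requests):
            # One response per request; this model simply echoes its input.
            responses = []
            for request in requests:
                in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
                out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy())
                responses.append(
                    pb_utils.InferenceResponse(output_tensors=[out0]))
            return responses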

Python Module Index

  • tritonclient
  • tritonclient.grpc
  • tritonclient.grpc.aio
  • tritonclient.grpc.aio.auth
  • tritonclient.grpc.auth
  • tritonclient.http
  • tritonclient.http.aio
  • tritonclient.http.aio.auth
  • tritonclient.http.auth
  • tritonclient.utils
  • tritonclient.utils.cuda_shared_memory
  • tritonclient.utils.shared_memory

© Copyright 2018-2024, NVIDIA Corporation.

Last updated on Oct 25, 2024.
