
NVIDIA Triton Inference Server



Navigator Package#

A model graph or checkpoint alone is not enough to deploy a model successfully. When deploying a model for inference, you also need to know the model's input and output definitions, the maximal batch size that can be used for inference, and other metadata.

For that purpose, we created the Navigator Package: an artifact containing the serialized model, model metadata, and optimization details.

The Navigator Package is the recommended way of sharing an optimized model for deployment on PyTriton or the Triton Inference Server, or for re-running the optimize method on different hardware.

Save#

The package created during model optimization can be saved as a ZIP file using the API method:

import model_navigator as nav

nav.package.save(
    package=package,
    path="/path/to/package.nav"
)

The save method collects the generated models from the workspace, selecting:

  • base formats - the first available serialization formats exported from the source model

  • max throughput format - the model that achieved the highest throughput during profiling

  • min latency format - the model that achieved the lowest latency during profiling

Additionally, the package contains:

  • a status file with optimization details

  • logs from the optimize execution

  • a reproduction script for each model format

  • input and output data samples in the form of NumPy files

Read more in the save method API specification.
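Because the saved package is a regular ZIP archive, its contents can be inspected with standard tools. Below is a minimal sketch using Python's `zipfile` module; the in-memory archive built here is a stand-in for a real `.nav` file, and the entry names inside it are illustrative only, not the exact package layout:

```python
import io
import zipfile


def list_package_contents(path_or_file):
    """Return the sorted file names stored in a .nav package (a ZIP archive)."""
    with zipfile.ZipFile(path_or_file) as archive:
        return sorted(archive.namelist())


# Build a stand-in package in memory; real packages are written by
# nav.package.save(). The entry names below are illustrative only.
fake_package = io.BytesIO()
with zipfile.ZipFile(fake_package, "w") as archive:
    archive.writestr("status.yaml", "optimization: details\n")
    archive.writestr("navigator.log", "optimize execution log\n")

print(list_package_contents(fake_package))  # ['navigator.log', 'status.yaml']
```

The same call works on a path string, so `list_package_contents("/path/to/package.nav")` can be used to check what a received package actually contains before loading it.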

Load#

Packages saved to a file can be loaded for further processing:

import model_navigator as nav

package = nav.package.load(
    path="/path/to/package.nav"
)

Once the package is loaded, you can obtain the desired information from it, or use it to re-run the optimize or profile process. Read more in the load method API specification.

Optimize#

The loaded package object can be used to re-run the optimize process. In comparison to the framework-dedicated API, the package optimize process starts from the serialized models inside the package and reproduces the available optimization paths. This step can be used to reproduce the process on different hardware without access to the model sources.

The optimization from the package can be run using:

import model_navigator as nav

optimized_package = nav.package.optimize(
    package=package
)

At the end of the process, new optimized models are generated. Be aware that the workspace is overwritten in this step. Read more in the optimize method API specification.

Profile#

The optimize process uses a single sample from the dataloader for profiling. The process focuses on selecting the best model format, which requires a single, consistent sample for performance comparison.

In some cases, you may want to profile the models on a different dataset. For that purpose, the Triton Model Navigator exposes an API that profiles every sample in the dataset against each model:

import torch
import model_navigator as nav

profiling_results = nav.package.profile(
    package=package,
    dataloader=[torch.randn(1, 3, 256, 256), torch.randn(1, 3, 512, 512)],
)

The results contain profiling information for each model and sample. You can use them to perform any desired analysis. Read more in the profile method API specification.
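The exact structure of the returned profiling results is described in the profile method API specification. As a rough sketch of such an analysis, per-sample measurements can be aggregated into a mean latency per model format. The flattened record layout below is a hypothetical stand-in for illustration, not the library's actual return type:

```python
from statistics import mean

# Hypothetical flattened view of profiling results: one record per
# (model format, sample) pair. The real nav.package.profile() return
# type differs; consult the profile method API specification.
records = [
    {"model": "onnx", "sample": 0, "latency_ms": 4.0},
    {"model": "onnx", "sample": 1, "latency_ms": 5.0},
    {"model": "torchscript", "sample": 0, "latency_ms": 6.0},
    {"model": "torchscript", "sample": 1, "latency_ms": 7.0},
]


def mean_latency_per_model(records):
    """Average the latency over all samples for each model format."""
    by_model = {}
    for rec in records:
        by_model.setdefault(rec["model"], []).append(rec["latency_ms"])
    return {model: mean(values) for model, values in by_model.items()}


print(mean_latency_per_model(records))  # {'onnx': 4.5, 'torchscript': 6.5}
```

The same grouping pattern extends to other per-sample metrics (for example throughput), which makes it easy to compare model formats across an entire dataset rather than the single sample used during optimize.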


Copyright © 2018-2025, NVIDIA Corporation.