NVIDIA Maxine Triton Inference Server User Guide#
The NVIDIA Triton Inference Server (https://developer.nvidia.com/nvidia-triton-inference-server) is software that helps deploy AI features in the cloud at scale. The Triton-enabled NVIDIA Maxine Augmented Reality (AR) and Video Effects (VFX) SDKs are well suited for servers and microservices because the SDKs use several Triton features to deliver the following:

- High throughput and resource utilization through dynamic batching, batched inference, and concurrent model execution
- Multistream support to process multiple video streams concurrently
- Multi-GPU and MIG (Multi-Instance GPU) support
With the Triton-enabled SDKs, the Maxine SDK features run entirely on a Triton server, and the SDK library provides C APIs that the user's application uses to communicate with the server over gRPC. Because the server supports multistream processing, several applications can send workloads to a single server at the same time, and each application can in turn request processing on more than one input video stream.
The following features are included in the Triton-enabled AR SDK:

- Face Detection
- Facial Landmark Detection
- Eye Contact
- Video Live Portrait
- Speech Live Portrait
- LipSync
The following features are included in the Triton-enabled VFX SDK:

- AI Green Screen
- Background Blur
- Video Relighting
Image data is transferred between the application and the server as raw frames over gRPC, or through shared memory when the server and the application run on the same machine. Because sending raw frames over gRPC incurs high latency, real-time use of the SDK is limited when the server and the application are on separate machines connected by a network.
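To illustrate why raw frames over a network are costly, the following back-of-the-envelope calculation estimates the bandwidth of one uncompressed 720p stream at 30 FPS. The 3-bytes-per-pixel format is an assumption for the sketch; the actual pixel formats the SDK uses may differ.

```python
# Rough bandwidth estimate for one uncompressed 720p video stream.
# Assumes 3 bytes per pixel (e.g. BGR24); actual Maxine formats may differ.
width, height, bytes_per_pixel = 1280, 720, 3
fps = 30

bytes_per_frame = width * height * bytes_per_pixel   # 2,764,800 bytes (~2.6 MiB)
bits_per_second = bytes_per_frame * fps * 8          # ~663.5 Mbit/s per stream

print(f"{bits_per_second / 1e6:.1f} Mbit/s")
```

At roughly 663 Mbit/s per stream, a handful of concurrent streams saturates a gigabit link, which is why shared memory is preferred when client and server are co-located.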
Performance Reference#
The following tables list the multistream performance of the Triton-enabled SDKs compared with the performance of the non-Triton (organic) SDKs.
AR SDK - Face Detection

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 75 | 136 | 81% |
| A40 | 114 | 180 | 58% |
| L40 | 138 | 237 | 72% |
| B40 | 138 | 299 | 116% |
AR SDK - Facial Landmark Detection

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 36 | 98 | 172% |
| A40 | 54 | 150 | 177% |
| L40 | 75 | 156 | 108% |
| B40 | 60 | 276 | 360% |
AR SDK - Eye Contact

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 8 | 13 | 62% |
| A40 | 11 | 23 | 109% |
| L40 | 16 | 26 | 62% |
| B40 | 19 | 42 | 121% |
AR SDK - LipSync

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 0 | 0 | NA |
| A40 | 0 | 0 | NA |
| L40 | 1 | 1 | NA |
| B40 | 1 | 1 | NA |
VFX SDK - AI Green Screen

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 13 | 12 | -7% |
| A40 | 27 | 32 | 18% |
| L40 | 44 | 54 | 23% |
| B40 | 37 | 57 | 54% |
VFX SDK - AIGS Relighting

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 0 | 0 | NA |
| A40 | 0 | 0 | NA |
| L40 | 1 | 1 | NA |
| B40 | 1 | 1 | NA |
Note
The input video resolution is 720p. AI Green Screen uses Performance mode. Facial Landmark Detection is computed with 126 landmark points detected on the face in Performance mode.
Face Detection, Facial Landmark Detection, and Eye Contact are run with temporal flags enabled. The server and the client are on the same machine, and the data transfer uses CUDA Shared Memory.
Throughput is defined as the maximum number of concurrent streams that can be processed at 30 FPS in real time.
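The throughput-gain column in the tables above can be reproduced from the two throughput columns. A minimal sketch (the helper function name is illustrative, not part of the SDK):

```python
def throughput_gain(organic: int, triton: int) -> int:
    """Percentage gain of the Triton-enabled SDK over the organic SDK,
    rounded to the nearest whole percent."""
    return round((triton / organic - 1) * 100)

# Face Detection on T4 from the table above: 75 -> 136 streams
print(throughput_gain(75, 136))  # 81
```

Note that the gain can be negative when the Triton-enabled SDK processes fewer streams, as with AI Green Screen on T4.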