NVIDIA Maxine Triton Inference Server User Guide#
The NVIDIA Triton Inference Server (https://developer.nvidia.com/nvidia-triton-inference-server) is software that helps deploy AI features in the cloud at scale. The Triton-enabled NVIDIA Maxine Augmented Reality (AR) and Video Effects (VFX) SDKs are well suited for servers and microservices because the SDKs use several Triton features to deliver the following:

- High throughput and resource utilization through dynamic batching, batched inference, and concurrent model execution
- Multistream support to process multiple video streams concurrently
- Multi-GPU and MIG (Multi-Instance GPU) support
With the Triton-enabled SDKs, the Maxine SDK features run entirely on a Triton server, and the SDK library provides C APIs that the user's application uses to communicate with the server over gRPC. Because the server supports multistream processing, several applications can send workloads to a single server at the same time, and each application can in turn request processing on more than one input video stream.
The following features are included in the Triton-enabled AR SDK:

- Face Detection
- Facial Landmark Detection
- Eye Contact
- Video Live Portrait
- Speech Live Portrait
- LipSync
The following features are included in the Triton-enabled VFX SDK:

- AI Green Screen
- Background Blur
- Video Relighting
Image data is transferred between the application and the server as raw frames over gRPC, or through shared memory when the server and the application run on the same machine. Because sending raw frames over gRPC incurs high latency, real-time use of the SDK is limited when the server and the application are on separate machines connected by a network.
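To illustrate why raw frames over a network are costly, the following back-of-the-envelope calculation estimates the bandwidth of one uncompressed 720p stream at 30 FPS. The 3-bytes-per-pixel format is an assumption for the sketch; the actual pixel formats the SDK uses may differ.

```python
# Rough bandwidth estimate for one uncompressed 720p video stream.
# Assumes 3 bytes per pixel (e.g. BGR24); actual Maxine formats may differ.
width, height, bytes_per_pixel = 1280, 720, 3
fps = 30

bytes_per_frame = width * height * bytes_per_pixel   # 2,764,800 bytes (~2.6 MiB)
bits_per_second = bytes_per_frame * fps * 8          # ~663.5 Mbit/s per stream

print(f"{bits_per_second / 1e6:.1f} Mbit/s")
```

At roughly 663 Mbit/s per stream, a handful of concurrent streams saturates a gigabit link, which is why shared memory is preferred when client and server are co-located.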
Performance Reference#
The following tables list the multistream performance of the Triton-enabled SDKs compared with the performance of the non-Triton (organic) SDKs.
AR SDK - Face Detection

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 75 | 136 | 81% |
| A40 | 114 | 180 | 58% |
| L40 | 138 | 237 | 72% |
| B40 | 138 | 299 | 116% |
AR SDK - Facial Landmark Detection

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 36 | 98 | 172% |
| A40 | 54 | 150 | 177% |
| L40 | 75 | 156 | 108% |
| B40 | 60 | 276 | 360% |
AR SDK - Eye Contact

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 8 | 13 | 62% |
| A40 | 11 | 23 | 109% |
| L40 | 16 | 26 | 62% |
| B40 | 19 | 42 | 121% |
AR SDK - LipSync

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 0 | 0 | NA |
| A40 | 0 | 0 | NA |
| L40 | 1 | 1 | NA |
| B40 | 1 | 1 | NA |
VFX SDK - AI Green Screen

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 13 | 12 | -7% |
| A40 | 27 | 32 | 18% |
| L40 | 44 | 54 | 23% |
| B40 | 37 | 57 | 54% |
VFX SDK - AIGS Relighting

| GPU | Organic SDK Throughput | Triton-enabled SDK Throughput | Throughput Gain |
|---|---|---|---|
| T4 | 0 | 0 | NA |
| A40 | 0 | 0 | NA |
| L40 | 1 | 1 | NA |
| B40 | 1 | 1 | NA |
Note
The input video resolution is 720p. AI Green Screen uses Performance mode. Facial Landmark Detection is computed with 126 landmark points detected on the face in Performance mode.
Face Detection, Facial Landmark Detection, and Eye Contact are run with temporal flags enabled. The server and the client are on the same machine, and the data transfer uses CUDA Shared Memory.
Throughput is defined as the maximum number of concurrent streams that can be processed at 30 FPS in real time.
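The throughput-gain column in the tables above can be reproduced from the two throughput columns. A minimal sketch (the helper function name is illustrative, not part of the SDK):

```python
def throughput_gain(organic: int, triton: int) -> int:
    """Percentage gain of the Triton-enabled SDK over the organic SDK,
    rounded to the nearest whole percent."""
    return round((triton / organic - 1) * 100)

# Face Detection on T4 from the table above: 75 -> 136 streams
print(throughput_gain(75, 136))  # 81
```

Note that the gain can be negative when the Triton-enabled SDK processes fewer streams, as with AI Green Screen on T4.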