NVIDIA Maxine Triton Inference Server User Guide#

The NVIDIA Triton Inference Server (https://developer.nvidia.com/nvidia-triton-inference-server) is software that helps deploy AI features in the cloud at scale. The Triton-enabled NVIDIA Maxine Augmented Reality (AR) and Video Effects (VFX) SDKs are well suited for servers and microservices because the SDKs use several Triton features to deliver

  • High throughput and resource utilization through dynamic batching, batched inference, and concurrent model execution.

  • Multistream support to process multiple video streams concurrently.

  • MultiGPU and MIG support.

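The throughput features listed above correspond to standard Triton model-configuration options. As an illustration only (the actual Maxine model configurations ship with the SDK and may differ), a Triton `config.pbtxt` fragment that enables dynamic batching and two concurrent model instances per GPU might look like:

```
# Hypothetical config.pbtxt fragment -- the real Maxine model configs may differ.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]        # batch queued requests up to these sizes
  max_queue_delay_microseconds: 100     # wait briefly to form larger batches
}
instance_group [
  {
    count: 2          # two concurrent execution instances of this model
    kind: KIND_GPU    # placed on a GPU; MIG partitions appear as separate GPUs
  }
]
```

Dynamic batching trades a small queueing delay for larger batches and higher GPU utilization, while multiple instances let Triton execute several requests concurrently on one device.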
With the Triton-enabled SDKs, the Maxine SDK features run fully on a Triton server, and the SDK library provides C APIs for the user’s application to communicate with the server via gRPC. Because the server supports multi-stream processing, several applications can send workloads to a single server at the same time, each in turn requesting processing on more than one input video stream.

The following features are included in the Triton-enabled AR SDK:

  • Face Detection

  • Facial Landmark Detection

  • Eye Contact

  • Video Live Portrait

  • Speech Live Portrait

  • LipSync

The following features are included in the Triton-enabled VFX SDK:

  • AI Green Screen

  • Background Blur

  • Video Relighting

The image data is transferred between the application and the server as raw frames using gRPC, or using shared memory when the server and the application run on the same machine. This limits the SDK's usefulness for real-time processing when the server and the application are on separate machines connected over a network, because sending raw frames over gRPC incurs high latency.

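The network limitation above follows from a quick bandwidth estimate. Assuming a 720p stream in a 3-byte-per-pixel format such as BGR (an assumption for illustration; the actual pixel format depends on the feature), sending raw frames at 30 FPS requires:

```python
# Back-of-the-envelope bandwidth for one raw 720p video stream over gRPC.
# Assumes 3 bytes per pixel (e.g., BGR); actual Maxine formats may differ.
width, height, bytes_per_pixel, fps = 1280, 720, 3, 30

frame_bytes = width * height * bytes_per_pixel   # one raw frame
stream_bytes_per_sec = frame_bytes * fps         # one 30 FPS stream
stream_mbps = stream_bytes_per_sec * 8 / 1e6     # megabits per second

print(f"{frame_bytes / 1e6:.1f} MB per frame")   # 2.8 MB per frame
print(f"{stream_mbps:.0f} Mbps per stream")      # 664 Mbps per stream
```

A single raw stream already consumes roughly two thirds of a gigabit link, before gRPC framing overhead, which is why shared memory is the practical transport when the client and server are co-located.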
Performance Reference#

The following tables compare the multistream performance of the Triton-enabled SDKs with that of the non-Triton (organic) SDKs.

AR SDK - Face Detection

| GPU | Organic SDK throughput | Triton-enabled SDK throughput | Throughput gain |
| --- | --- | --- | --- |
| T4  | 75  | 136 | 81%  |
| A40 | 114 | 180 | 58%  |
| L40 | 138 | 237 | 72%  |
| B40 | 138 | 299 | 116% |

AR SDK - Facial Landmark Detection

| GPU | Organic SDK throughput | Triton-enabled SDK throughput | Throughput gain |
| --- | --- | --- | --- |
| T4  | 36 | 98  | 172% |
| A40 | 54 | 150 | 177% |
| L40 | 75 | 156 | 108% |
| B40 | 60 | 276 | 360% |

AR SDK - Eye Contact

| GPU | Organic SDK throughput | Triton-enabled SDK throughput | Throughput gain |
| --- | --- | --- | --- |
| T4  | 8  | 13 | 62%  |
| A40 | 11 | 23 | 109% |
| L40 | 16 | 26 | 62%  |
| B40 | 19 | 42 | 121% |

AR SDK - LipSync

| GPU | Organic SDK throughput | Triton-enabled SDK throughput | Throughput gain |
| --- | --- | --- | --- |
| T4  | 0 | 0 | NA |
| A40 | 0 | 0 | NA |
| L40 | 1 | 1 | NA |
| B40 | 1 | 1 | NA |

VFX SDK - AI Green Screen

| GPU | Organic SDK throughput | Triton-enabled SDK throughput | Throughput gain |
| --- | --- | --- | --- |
| T4  | 13 | 12 | -7% |
| A40 | 27 | 32 | 18% |
| L40 | 44 | 54 | 23% |
| B40 | 37 | 57 | 54% |

VFX SDK - AIGS Relighting

| GPU | Organic SDK throughput | Triton-enabled SDK throughput | Throughput gain |
| --- | --- | --- | --- |
| T4  | 0 | 0 | NA |
| A40 | 0 | 0 | NA |
| L40 | 1 | 1 | NA |
| B40 | 1 | 1 | NA |

Note

The input video resolution is 720p. AI Green Screen uses Performance mode. Facial Landmark Detection is computed with 126 landmark points detected per face in Performance mode.

Face Detection, Facial Landmark Detection, and Eye Contact have temporal flags enabled. The server and the client are on the same machine, and the data transfer uses CUDA Shared Memory.

Throughput is defined as the maximum number of concurrent streams that can each be processed at 30 FPS in real time.
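The throughput-gain column in the tables above is the relative increase of the Triton-enabled figure over the organic figure. A minimal sketch of that calculation (the table values appear rounded, so a reproduced gain can differ from the published one by a percentage point):

```python
# Relative throughput gain, as reported in the performance tables above.
def throughput_gain(organic: int, triton: int) -> float:
    """Percentage increase of Triton-enabled throughput over the organic SDK."""
    return (triton - organic) / organic * 100

# Face Detection on T4: 75 -> 136 streams
print(f"{throughput_gain(75, 136):.0f}%")  # 81%
```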