Overview#

NVIDIA LipSync NIM is an AI-powered service that synchronizes lip movements in a video with input speech. The service takes a video stream and a speech stream as input and generates an output video in which the subject’s lip movements match the speech.

This NIM is designed for live broadcast and production workflows that ingest and egress uncompressed video and audio over SMPTE ST 2110 using NVIDIA Rivermax. It is deployed with Helm charts on a Holoscan for Media cluster and can optionally be managed through a Kubernetes operator.

Lip-synchronized SMPTE ST 2110-20 video from the NIM is transmitted downstream to the receiver. In the receiver pipeline, the video is muxed with SMPTE ST 2110-30 audio received from the sender, producing the final output.

Architecture#

The LipSync service is built on the following NVIDIA software components:

  • Holoscan for Media: Media platform for SMPTE ST 2110–oriented live pipelines and Kubernetes-native orchestration.

  • NVIDIA Rivermax: GPU-accelerated SMPTE ST 2110 streaming.

  • NVIDIA DeepStream: GStreamer-based media pipelines and accelerated video preprocessing.

  • CUDA: GPU-accelerated processing.

  • TensorRT: Optimized neural network inference.

  • Triton Inference Server: Efficient model serving and scaling.

  • NVIDIA Augmented Reality (AR) SDK: Face detection, temporal sequence processing, frame feature extraction, and LipSync inference.

Processing Flow#

Pipeline#

The following diagram summarizes how media flows through the LipSync NIM on Holoscan for Media:

Figure: LipSync Holoscan for Media architecture and processing flow.

The NIM ingests a video stream along with a corresponding translated (target) audio stream and generates a lip-synchronized output video aligned to the provided audio.

The service supports the following operating modes:

  • Mode 1: Accepts video and audio inputs and produces a lip-synchronized video output.

  • Mode 2: Accepts video, audio, and ancillary data (for example, active-speaker bounding boxes for multi-speaker scenarios) and produces a lip-synchronized video output.

This document focuses on Mode 1 (single-speaker). Mode 2 requires integration with the NVIDIA Active Speaker Detection NIM to provide ancillary inputs such as speaker bounding boxes via the SMPTE ST 2110-40 ancillary data channel.

For the binary layout of optional ancillary data, including field offsets and sizes, refer to Ancillary Data Payload (SMPTE ST 2110-40).

Inputs and Outputs#

| Direction | Stream Type | Transport | Format | Supported Configurations |
|---|---|---|---|---|
| Input | Video | SMPTE ST 2110-20 | Uncompressed YCbCr 4:2:2 10-bit | Resolutions: 720p, 1080p, 4K UHD. Frame rates: 23.976 (24000/1001), 25, 29.97 (30000/1001), 30, 50, 59.94 (60000/1001), and 60 fps. |
| Input | Audio | SMPTE ST 2110-30 | Mono PCM L24, 48 kHz | One stream. |
| Input (optional) | Ancillary data | SMPTE ST 2110-40 | SMPTE 291M | Per-frame bounding boxes, speaker IDs, and confidence scores (required for multi-speaker). |
| Output | Video | SMPTE ST 2110-20 | Matches input raster | Lip-synchronized video output. |
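For network capacity planning, the payload bit rate of each stream follows directly from the formats in the table. The sketch below is an illustrative calculation, not part of the NIM: it assumes standard active-picture sizes (1280x720, 1920x1080, 3840x2160) and counts media payload only, so RTP/UDP/IP header overhead comes on top of these figures. The function and constant names are hypothetical.

```python
from fractions import Fraction

# Active picture dimensions assumed for the rasters named in the table.
RASTERS = {"720p": (1280, 720), "1080p": (1920, 1080), "4K UHD": (3840, 2160)}

# Exact frame rates; fractional rates are kept as ratios to avoid rounding.
FRAME_RATES = {
    "23.976": Fraction(24000, 1001),
    "25": Fraction(25),
    "29.97": Fraction(30000, 1001),
    "30": Fraction(30),
    "50": Fraction(50),
    "59.94": Fraction(60000, 1001),
    "60": Fraction(60),
}

# 4:2:2 sampling at 10 bits: each pixel pair carries Y0, Y1, Cb, Cr (40 bits),
# which averages to 20 bits per pixel.
BITS_PER_PIXEL = 20


def video_payload_bps(raster: str, rate: str) -> Fraction:
    """Payload bit rate of uncompressed 4:2:2 10-bit active video."""
    width, height = RASTERS[raster]
    return width * height * BITS_PER_PIXEL * FRAME_RATES[rate]


def audio_payload_bps(channels: int = 1, bit_depth: int = 24,
                      sample_rate_hz: int = 48_000) -> int:
    """Payload bit rate of linear PCM audio (defaults: mono L24 at 48 kHz)."""
    return channels * bit_depth * sample_rate_hz


if __name__ == "__main__":
    for raster in RASTERS:
        for rate in FRAME_RATES:
            gbps = float(video_payload_bps(raster, rate)) / 1e9
            print(f"{raster:>7} @ {rate:>6} fps: {gbps:6.3f} Gbit/s")
    print(f"audio: {audio_payload_bps() / 1e6:.3f} Mbit/s")
```

Under these assumptions, 1080p at 59.94 fps needs roughly 2.49 Gbit/s of video payload in each direction, while the mono audio stream needs about 1.15 Mbit/s.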

Deployment#

Three deployment patterns are supported:

| Deployment Option | Description |
|---|---|
| NIM Service Chart | Deploys the LipSync NIM as a standalone Kubernetes deployment for integration into a custom media pipeline. |
| Kubernetes Operator | Manages LipSync NIM workloads through a NvidiaLipsyncMediaFunction custom resource. Suitable for declarative, production-grade management. |
| End-to-End Demo Chart | Deploys a complete demo pipeline (sender, LipSync NIM service, and receiver) as a single Helm release. Demonstrates the full sender-to-receiver flow and is recommended for initial evaluation. |

For installation instructions, see Installation.

Supported Operating Modes#

The solution supports the following industry-standard modes for configuring and connecting media streams.

ST 2110 Static Mode#

Media streams such as video, audio, and ancillary data are transported over IP networks using predefined multicast IP addresses and ports. Configuration is performed manually, and IP address and port settings must match between sending and receiving devices to ensure proper media flow.
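Because static mode relies on manually entered values, a mismatch between sender and receiver is an easy mistake to make. The following sketch shows one way to check a sender/receiver pair before deployment. The `StreamEndpoint` class and `check_static_pair` function are hypothetical helpers for illustration; they do not reflect the NIM's actual configuration schema.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamEndpoint:
    """Hypothetical description of one side of an ST 2110 stream."""
    name: str
    multicast_ip: str
    udp_port: int


def check_static_pair(sender: StreamEndpoint, receiver: StreamEndpoint) -> list[str]:
    """Return a list of problems; an empty list means the pair is consistent."""
    problems = []
    for endpoint in (sender, receiver):
        address = ipaddress.ip_address(endpoint.multicast_ip)
        if not address.is_multicast:
            problems.append(f"{endpoint.name}: {endpoint.multicast_ip} is not a multicast address")
        if not 1 <= endpoint.udp_port <= 65535:
            problems.append(f"{endpoint.name}: port {endpoint.udp_port} is out of range")
    # Compare parsed addresses so equivalent spellings are treated as equal.
    if ipaddress.ip_address(sender.multicast_ip) != ipaddress.ip_address(receiver.multicast_ip):
        problems.append("multicast IP addresses do not match")
    if sender.udp_port != receiver.udp_port:
        problems.append("UDP ports do not match")
    return problems


if __name__ == "__main__":
    # Example values only; use the addresses assigned on your media network.
    tx = StreamEndpoint("sender-video", "239.1.1.1", 5004)
    rx = StreamEndpoint("lipsync-video-in", "239.1.1.1", 5006)
    for problem in check_static_pair(tx, rx):
        print(problem)
```

Running a check like this for every video, audio, and ancillary stream pair catches address and port mismatches before any media is sent.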

NMOS Mode#

Devices and media streams are automatically discovered and connected using AMWA NMOS standards. This mode simplifies system setup by removing the need for manual IP address and port configuration.
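As a rough illustration of what discovery looks like in practice, AMWA IS-04 defines a Query API through which a controller can list registered senders. The sketch below builds the Query API URL for senders and filters a list of sender resources for multicast RTP transport. It assumes a registry that serves Query API version v1.3; the registry address, the sample resources, and the helper names are hypothetical, and nothing here describes how the LipSync NIM itself registers with NMOS.

```python
QUERY_API_VERSION = "v1.3"  # assumption: the registry serves this IS-04 version
MULTICAST_RTP = "urn:x-nmos:transport:rtp.mcast"


def senders_url(registry_base: str) -> str:
    """Build the IS-04 Query API URL that lists registered senders."""
    return f"{registry_base.rstrip('/')}/x-nmos/query/{QUERY_API_VERSION}/senders"


def multicast_senders(senders: list[dict]) -> list[dict]:
    """Keep only sender resources that use multicast RTP transport."""
    return [s for s in senders if s.get("transport") == MULTICAST_RTP]


if __name__ == "__main__":
    # Hypothetical registry address and sender resources, shown in place of
    # a live HTTP GET against senders_url(...).
    print(senders_url("http://registry.example:8080/"))
    sample = [
        {"id": "sender-a", "label": "camera-1 video", "transport": MULTICAST_RTP},
        {"id": "sender-b", "label": "test source", "transport": "urn:x-nmos:transport:rtp.ucast"},
    ]
    for sender in multicast_senders(sample):
        print(sender["id"], sender["label"])
```

In a live system the sender list would come from an HTTP GET on the URL above, and the connection itself would then be made through the NMOS connection management API rather than by editing addresses and ports by hand.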

See Also#