Overview#

NVIDIA LipSync NIM is an AI-powered service that synchronizes lip movements in video with input speech, creating natural speech animations. The service processes both video and speech input to generate a seamlessly synchronized output video in which the subject's lip movements accurately match the speech.

Key Features#

  • Flexible Input Handling: Supports separate video and speech inputs.

  • Configurable Output: Offers customizable video bitrates and audio codecs for output video.

  • Speaker Data Support: Accepts per-frame speaker bounding-box information via JSON to specify facial regions for targeted lip synchronization, including multi-speaker scenarios.

  • Alignment Options:

    • Video extension modes for handling audio longer than video.

    • Audio extension modes for silence padding.

  • Background Audio Mixing: Mix background audio with the output to preserve ambient sounds.

  • Head Movement Speed: A configurable parameter that tunes synthesis for static or fast-moving heads.

  • Output Audio Codec: Configurable output audio codec (OPUS or MP3).

  • High-Quality Output: Maintains original video quality while adjusting only lip movements. Supports lossless, lossy, and custom video-encoding options.
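
The per-frame speaker data mentioned above can be sketched as JSON. The field names below (`frames`, `frame`, `speakers`, `speaker_id`, `bbox`) and the `[x, y, width, height]` bounding-box layout are illustrative assumptions, not the NIM's actual schema:

```python
import json

# Hypothetical per-frame speaker bounding-box data for a two-speaker video.
# Field names and bbox layout are illustrative, not the NIM schema.
speaker_data = {
    "frames": [
        {
            "frame": 0,
            "speakers": [
                # bbox as [x, y, width, height] in pixels (assumed layout)
                {"speaker_id": 0, "bbox": [120, 80, 200, 200]},
                {"speaker_id": 1, "bbox": [480, 90, 190, 190]},
            ],
        },
        {
            "frame": 1,
            "speakers": [
                {"speaker_id": 0, "bbox": [122, 81, 200, 200]},
                {"speaker_id": 1, "bbox": [478, 90, 190, 190]},
            ],
        },
    ]
}

# Serialize for upload alongside the video and speech inputs.
payload = json.dumps(speaker_data)
```

Supplying one entry per frame lets the service restrict lip synchronization to the listed facial regions, which is what makes multi-speaker scenarios possible.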

Architecture#

The LipSync service is built on NVIDIA’s software platform:

  • CUDA for GPU-accelerated processing.

  • TensorRT for optimized neural network inference.

  • Triton Inference Server for efficient model serving and scaling.

Architecture Type

  • Recurrent Neural Networks (RNNs) for temporal sequence processing.

  • Convolutional Neural Networks (CNNs) for frame feature extraction.

  • Generative Adversarial Networks (GANs) for realistic lip movement synthesis.

Network Architecture

The system uses an encoder-decoder architecture that performs the following actions:

  1. Processes input video frames and audio samples.

  2. Synchronizes the timing between audio and video.

  3. Generates natural lip movements that match the audio input.

The service maintains frame-accurate synchronization while preserving the original video quality, making it ideal for content creation, dubbing, and video-localization applications.
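
The three encoder-decoder steps above can be sketched as a minimal pipeline. Every function here is an illustrative stand-in, not the NIM implementation; the point is the data flow: encode both streams, align them frame-for-frame, then decode to output frames:

```python
# Minimal sketch of the encoder-decoder flow; all functions are
# illustrative stand-ins, not the actual LipSync model.

def encode_frames(frames):
    """Step 1: extract per-frame visual features (stand-in: tag each frame)."""
    return [("video_feat", f) for f in frames]

def encode_audio(samples, samples_per_frame):
    """Step 1: chunk audio samples into per-frame windows."""
    return [samples[i:i + samples_per_frame]
            for i in range(0, len(samples), samples_per_frame)]

def align(video_feats, audio_feats):
    """Step 2: pair each video frame with its audio window,
    truncating to the shorter stream for frame-accurate sync."""
    return list(zip(video_feats, audio_feats))

def decode(aligned):
    """Step 3: generate output frames whose lip region follows the audio."""
    return [{"frame": v, "lips_from": a} for v, a in aligned]

frames = ["f0", "f1", "f2"]
samples = list(range(12))  # dummy audio: 12 samples, 4 per frame
out = decode(align(encode_frames(frames), encode_audio(samples, 4)))
```

Truncating to the shorter stream in the alignment step corresponds to the default behavior before any of the video or audio extension modes are applied.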

Input Modes#

The LipSync NIM supports two modes of operation:

  • Streaming mode (recommended)

    In this mode, the input files are streamed to the NIM in chunks. As the chunks arrive, NIM runs inference incrementally and streams the output back to the client in chunks, even before the whole input file is received by the NIM. The NIM automatically detects streamable videos and enables this mode. For best performance, use streamable videos as input.

  • Transactional mode

    In this mode, whole input files are uploaded to the NIM before the inference begins. The inference is run on the input files to obtain the output file, which is then sent back to the client. Use this mode for non-streamable input videos or applications that require complete file upload before processing.
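
The difference between the two modes is essentially how the client delivers the input file. The sketch below illustrates that difference generically; the chunk size and helper names are assumptions and not part of the NIM client API:

```python
import io

CHUNK_SIZE = 64 * 1024  # illustrative chunk size, not a NIM requirement

def iter_chunks(fileobj, chunk_size=CHUNK_SIZE):
    """Streaming mode: yield the input in chunks so the server can
    begin inference before the full upload completes."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def read_whole(fileobj):
    """Transactional mode: buffer the entire file before sending."""
    return fileobj.read()

data = b"x" * (3 * CHUNK_SIZE + 10)  # dummy input file
chunks = list(iter_chunks(io.BytesIO(data)))
whole = read_whole(io.BytesIO(data))
```

In streaming mode each yielded chunk would be sent as it is produced, with output chunks arriving back concurrently; in transactional mode the full buffer is uploaded first and the complete output file is returned once inference finishes.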

For more details, refer to the Input Modes section of Basic Inference.