Overview

NVIDIA Audio2Face-2D NIM uses generative models to create facial animations from a portrait photo and driving audio. The resulting animation synchronizes the mouth movements in the photo with the speech in the audio.

The model processes the input audio to estimate landmark motions that represent the mouth movements articulating the words in the audio. These landmarks are encoded into latent representations, which are passed to a generative model to animate the input portrait.
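As a concrete illustration of this audio-processing step, the snippet below extracts MFCC features (the audio representation named in the Architecture section) from a driving-audio file with librosa. This is only an illustrative sketch of the kind of feature extraction involved; the NIM handles feature extraction internally, and the file name and parameter values here are placeholder assumptions.

```python
# Illustrative only: extract MFCC features from driving audio with librosa.
# The NIM performs its own feature extraction internally; the audio path and
# parameter values below are placeholders.
import librosa

audio, sample_rate = librosa.load("driving_audio.wav", sr=16000, mono=True)

# 2D array of shape (n_mfcc, n_frames): one MFCC vector per audio frame.
mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)

print(mfcc.shape)  # e.g. (13, number_of_audio_frames)
```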

NVIDIA Audio2Face-2D NIM models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer out-of-the-box GPU acceleration.

Architecture

NVIDIA Audio2Face-2D uses audio as input to drive a 2D portrait image. MFCC features are extracted from the raw audio and passed to an LSTM network, which produces 2D facial landmarks that animate mouth articulation. These landmarks can be further manipulated to add natural behaviors such as blinking, gaze changes, and additional mouth movements. The manipulated 2D landmarks, together with a given head pose, are processed by another LSTM network to obtain a 3D latent representation that encapsulates the mouth and head movements as well as information from the input image. This latent representation is then passed through a generative model, which produces a photo-realistic animation whose facial and mouth movements match the given audio. A simplified sketch of this flow follows the architecture summary below.

Architecture Type: Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), Generative Adversarial Network (GAN)

Network Architecture: Encoder-Decoder
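To make the flow above concrete, the following is a minimal PyTorch sketch that mirrors the described stages: audio features pass through an LSTM to produce 2D landmarks, the (optionally edited) landmarks and a head pose pass through a second LSTM to produce a latent code, and a convolutional generator renders frames from that code. All module choices, layer sizes, and dimensions (number of landmarks, pose parameters, latent size, output resolution) are illustrative assumptions, not the actual Audio2Face-2D networks.

```python
# Illustrative sketch of the described pipeline; all dimensions and module
# choices are assumptions, not the actual Audio2Face-2D networks.
import torch
import torch.nn as nn

N_MFCC = 13          # audio features per frame (assumed)
N_LANDMARKS = 68     # 2D facial landmarks (assumed)
POSE_DIM = 6         # head pose parameters (assumed)
LATENT_DIM = 256     # latent representation size (assumed)

class AudioToLandmarks(nn.Module):
    """LSTM that maps per-frame MFCC features to 2D landmark positions."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(N_MFCC, 128, batch_first=True)
        self.head = nn.Linear(128, N_LANDMARKS * 2)

    def forward(self, mfcc):                   # (batch, frames, N_MFCC)
        hidden, _ = self.lstm(mfcc)
        return self.head(hidden)               # (batch, frames, N_LANDMARKS * 2)

class LandmarksToLatent(nn.Module):
    """LSTM that fuses (possibly edited) landmarks and head pose into a latent code."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(N_LANDMARKS * 2 + POSE_DIM, 256, batch_first=True)
        self.head = nn.Linear(256, LATENT_DIM)

    def forward(self, landmarks, pose):        # pose: (batch, frames, POSE_DIM)
        x = torch.cat([landmarks, pose], dim=-1)
        hidden, _ = self.lstm(x)
        return self.head(hidden)               # (batch, frames, LATENT_DIM)

class Generator(nn.Module):
    """Convolutional decoder that renders one frame per latent code
    (a stand-in for the generative model)."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(LATENT_DIM, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, latent):                 # (frames, LATENT_DIM)
        x = self.fc(latent).view(-1, 128, 8, 8)
        return self.deconv(x)                  # (frames, 3, 32, 32) toy frames

# Wire the stages together (randomly initialized; shape demonstration only).
mfcc = torch.randn(1, 100, N_MFCC)             # 100 audio frames
pose = torch.zeros(1, 100, POSE_DIM)           # fixed head pose
landmarks = AudioToLandmarks()(mfcc)
latent = LandmarksToLatent()(landmarks, pose)
frames = Generator()(latent.squeeze(0))        # one image per frame
print(frames.shape)                            # torch.Size([100, 3, 32, 32])
```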

Try It Out

Try the NVIDIA Audio2Face-2D NIM at build.nvidia.com/nvidia/audio2face-2d.