Advanced Usage#

Model Caching#

When the container launches for the first time, it downloads the required models from NGC. To avoid downloading the models on subsequent runs, you can cache them locally by using a cache directory:

# Create the cache directory on the host machine
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod 777 "$LOCAL_NIM_CACHE"

# Choose manifest profile ID based on target architecture.
export MANIFEST_PROFILE_ID=<enter_valid_manifest_profile_id>

# Run the container with the cache directory mounted in the appropriate location
docker run -it --rm --name=lipsync-nim \
  --runtime=nvidia \
  --gpus all \
  --shm-size=8GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_MANIFEST_PROFILE=$MANIFEST_PROFILE_ID \
  -e NIM_HTTP_API_PORT=8000 \
  -p 8000:8000 \
  -p 8001:8001 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  nvcr.io/nim/nvidia/lipsync:latest

For more information about MANIFEST_PROFILE_ID, refer to Model Manifest Profiles.

SSL Enablement#

Lipsync NIM provides an SSL mode to ensure secure communication between clients and the server by encrypting data in transit. To enable SSL, you must provide the path to the SSL certificate and key files in the container. The following example shows how to do this:

export NGC_API_KEY=<add-your-api-key>
SSL_CERT=<path-to-directory-containing-ssl-files>

docker run -it --rm --name=lipsync-nim \
  --runtime=nvidia \
  --gpus all \
  --shm-size=8GB \
  -v $SSL_CERT:/opt/nim/crt/:ro \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  -p 8001:8001 \
  -e NIM_SSL_MODE="mtls" \
  -e NIM_SSL_CA_CERTS_PATH="/opt/nim/crt/ssl_ca.pem" \
  -e NIM_SSL_CERT_PATH="/opt/nim/crt/ssl_cert_server.pem" \
  -e NIM_SSL_KEY_PATH="/opt/nim/crt/ssl_key_server.pem" \
  nvcr.io/nim/nvidia/lipsync:latest

NIM_SSL_MODE can be set to mtls, tls, or disabled. If set to mtls, the server requires mutual TLS authentication, in which clients must also present certificates. If set to tls, only the server presents a certificate. For more information, refer to Environment Variables.

Be sure to verify the permissions of the SSL certificate and key files on the host machine. The container cannot access the files if they are not readable by the user running the container.

NIM Service Configuration Parameters via Client#

The following options can be configured in each request to the Lipsync NIM:

Video-Audio Alignment Options#

  • extend_audio: Controls how to handle cases where video is longer than audio. This parameter is useful when working with content where the video track extends beyond the audio duration, such as when processing silent video segments, incomplete audio recordings, or when synchronizing content with varying durations.

    • EXTEND_AUDIO_UNSPECIFIED (default): Truncates video to match audio length.

    • EXTEND_AUDIO_SILENCE: Adds silent audio padding to keep audio and video synchronized.

    This option can be configured by setting the extend_audio parameter to EXTEND_AUDIO_SILENCE in the configuration message when making requests to the NIM.

    The following example uses the sample Python client to extend audio with custom options for handling mismatched input video and audio durations:

    python lipsync.py --target 127.0.0.1:8001 --video-input /path/to/video.mp4 --audio-input /path/to/audio.wav --extend-audio silence
    
  • extend_video: Controls how to handle cases where audio is longer than video. This parameter is useful when working with content where the audio track extends beyond the video duration, such as when adding voiceovers or dubbing content.

    • EXTEND_VIDEO_UNSPECIFIED (default): Keeps original video length and truncates longer audio.

    • EXTEND_VIDEO_FORWARD: Extends video duration by repeating the final 5 seconds of frames in forward order until matching audio length.

    • EXTEND_VIDEO_REVERSE: Extends video duration by repeating the final 5 seconds of frames in reverse order until matching audio length.

    This option can be configured by setting the extend_video parameter to EXTEND_VIDEO_FORWARD or EXTEND_VIDEO_REVERSE in the configuration message when making requests to the NIM.

    The following example uses the sample Python client to extend video with custom options for handling mismatched input video and audio durations:

    python lipsync.py --target 127.0.0.1:8001 --video-input /path/to/video.mp4 --audio-input /path/to/audio.wav --extend-video reverse
    

Encoding Options#

The output video encoding quality can be configured through the output_video_encoding parameter in the following ways:

  • Lossless encoding: Provides maximum quality with no compression artifacts, preserving the original video quality pixel-for-pixel. This mode maintains the highest fidelity but results in significantly larger file sizes. Use this mode when quality is the top priority.

    • To configure this option in the configuration message when making requests to the NIM, set output_video_encoding to VideoEncoding(lossless=True).

    • To run LipSync with lossless encoding (which overrides the bitrate setting) via the sample client, use the --lossless option:

      python lipsync.py --target 127.0.0.1:8001 --video-input /path/to/video.mp4 --audio-input /path/to/audio.wav --lossless
      
  • Bitrate control: Allows balancing quality and file size by specifying the video bitrate in megabits per second (Mbps). The default is 30 Mbps. Higher bitrates produce better quality but larger file sizes; lower bitrates reduce file size at the cost of quality.

    • To configure this option in the configuration message when making requests to the NIM, set output_video_encoding to VideoEncoding(lossy=LossyEncoding(bitrate_mbps=<desired-bitrate-in-mbps>)).

    • To run LipSync with a desired output bitrate via the sample client, use the --bitrate option:

      python lipsync.py --target 127.0.0.1:8001 --video-input /path/to/video.mp4 --audio-input /path/to/audio.wav --bitrate 20
      
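As a rough guide, the size of the encoded video stream scales linearly with bitrate and duration. The helper below is an illustrative sketch (not part of the sample client) for estimating output video size at a chosen bitrate:

```python
def estimated_video_size_mb(bitrate_mbps: float, duration_s: float) -> float:
    """Approximate encoded video payload in MB: Mbit/s * seconds / 8 bits per byte."""
    return bitrate_mbps * duration_s / 8

# A 60-second clip at the default 30 Mbps carries roughly 225 MB of video data;
# reducing the bitrate to 20 Mbps cuts that to about 150 MB.
print(estimated_video_size_mb(30, 60))  # 225.0
print(estimated_video_size_mb(20, 60))  # 150.0
```

Actual file sizes also include audio and container overhead, so treat this only as a ballpark when choosing a bitrate.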
  • IDR interval control: Sets the interval between Instantaneous Decoder Refresh (IDR) frames (default: 8) for seeking and random-access capabilities. Lower values improve seeking accuracy, random access, and overall encoding quality but increase file size; higher values reduce file size but can impact seeking performance and quality.

    • To configure this option in the configuration message when making requests to the NIM, set output_video_encoding to VideoEncoding(lossy=LossyEncoding(idr_interval=<desired_idr_interval>)).

    • To run LipSync with a desired IDR interval via the sample client, use the --idr-interval option:

      python lipsync.py --target 127.0.0.1:8001 --video-input /path/to/video.mp4 --audio-input /path/to/audio.wav --idr-interval 10
      
  • Custom encoding parameters: Provide fine-grained control for expert users via JSON configuration. These parameters configure properties of the DeepStream H264 encoder.

    To specify custom encoding parameters, set output_video_encoding to custom_encoding in the configuration message sent to the NIM with each request:

    custom_encoding = custom {
      bitrate: 5000000,
      idr_interval: 16,
      maxbitrate: 6000000
    }
    

    Note

    Custom encoding parameters override standard bitrate and IDR interval settings. Use with caution because incorrect values can affect output quality or encoding stability.

Speaker Data Option#

The speaker_data_input option specifies the path to a JSON file that defines per-frame speaker bounding box information for targeted lip synchronization. This option is specifically designed to support multi-speaker scenarios by letting you specify the active speaker to lip-sync on a per-frame basis. The file contains bounding-box coordinates (x, y, width, height), speaker identity, and speaking status for each frame. If not provided, the LipSync NIM automatically detects faces in each frame using its built-in face detection capabilities.

When providing speaker data, set is_speaker_info_provided to True in the configuration message when making requests to the NIM.

The SpeakerInfo structure contains the following fields:

| Field | Type | Description |
|--------------|-------------|-------------|
| speaker_bbox | BoundingBox | Defines the bounding box coordinates for the speaker’s face region. |
| speaker_id | int32 | Unique identifier for this speaker across frames. |
| is_speaking | bool | Flag indicating whether the speaker is currently speaking. |

The BoundingBox structure contains the following fields:

| Field | Type | Description |
|--------|-------|-------------|
| x | float | X-coordinate of the top-left corner of the bounding box. |
| y | float | Y-coordinate of the top-left corner of the bounding box. |
| width | float | Width of the bounding box in pixels. |
| height | float | Height of the bounding box in pixels. |

The SpeakerInfoPerFrame structure wraps per-frame data:

| Field | Type | Description |
|---------------|----------------------|-------------|
| frame_id | uint32 | Frame index in the video. |
| speaker_infos | repeated SpeakerInfo | List of all speakers in this frame. |
| bypass | bool (optional) | If set to true, LipSync processing is bypassed for this frame and the original frame is returned unchanged. If false or unset, the frame is processed normally. This field is useful for cutscenes or frames that contain no subject. |

When providing speaker data, you must send a SpeakerInfoPerFrame object for every frame in the input video. For multi-speaker scenarios, each frame can contain multiple speakers.

JSON File Format for Speaker Data

When creating a JSON file for speaker data, use the following format. The file must contain a top-level frames array with one entry per frame, as in the following example:

{
  "frames": [
    {
      "speakers": [
        {
          "bbox": [186, 191, 175, 254],
          "speaker_id": 1,
          "is_speaking": false
        },
        {
          "bbox": [815, 188, 263, 357],
          "speaker_id": 2,
          "is_speaking": true
        }
      ]
    },
    {
      "speakers": [
        {
          "bbox": [188, 191, 174, 254],
          "speaker_id": 1,
          "is_speaking": false
        }
      ]
    }
  ]
}

Each frame entry contains a speakers array with the following values:

  • bbox: Array of four values [x, y, width, height] defining the bounding box of the speaker’s face in pixel coordinates.

  • speaker_id: Integer identifier for tracking this face across frames.

  • is_speaking: Boolean indicating whether the speaker is currently speaking.

If the speakers array is empty or absent for a given frame, the server auto-detects faces for that frame.

Obtaining Bounding Box Coordinates

To create the speaker data JSON file, you need to obtain bounding-box coordinates for each frame. You can use an external face-detection system to generate these coordinates.

Note

The quality and accuracy of your face detection directly affects the LipSync results. For optimal lip-synchronization performance, ensure that your bounding boxes accurately encompass the facial regions.
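For example, if your face-detection system yields per-frame boxes as (x, y, width, height) tuples, the speaker-data file can be assembled with a few lines of Python. The detection values and output filename below are illustrative assumptions, not part of the NIM client:

```python
import json

# Hypothetical per-frame detections: (x, y, width, height, speaker_id, is_speaking).
detections = [
    [(186, 191, 175, 254, 1, False), (815, 188, 263, 357, 2, True)],  # frame 0
    [(188, 191, 174, 254, 1, False)],                                 # frame 1
]

# Convert each frame's detections into the speakers array the NIM expects.
frames = [
    {
        "speakers": [
            {"bbox": [x, y, w, h], "speaker_id": sid, "is_speaking": speaking}
            for (x, y, w, h, sid, speaking) in frame
        ]
    }
    for frame in detections
]

# Write the top-level "frames" array to a JSON file for --speaker-data-input.
with open("speaker_data.json", "w") as f:
    json.dump({"frames": frames}, f, indent=2)
```

Remember that the file needs one frame entry for every frame of the input video; frames with an empty speakers array fall back to automatic face detection.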

To run LipSync NIM with a speaker data file via the sample client, use the --speaker-data-input option:

python lipsync.py --target 127.0.0.1:8001 --video-input /path/to/video.mp4 --audio-input /path/to/audio.wav --speaker-data-input /path/to/speaker_data.json

Head Movement Speed#

The head_movement_speed parameter controls the expected speed of head movement in the input video:

  • 0: Static or slow-moving head (default).

  • 1: Fast-moving head.

This parameter helps the model optimize lip synchronization based on the dynamics of the subject’s head movement.

To run LipSync with head movement speed via the sample client, use the --head-movement-speed option:

python lipsync.py --target 127.0.0.1:8001 --video-input /path/to/video.mp4 --audio-input /path/to/audio.wav --head-movement-speed 1

Output Audio Codec#

The output_audio_codec parameter specifies the audio codec used in the output video file:

  • opus: Opus audio codec (default).

  • mp3: MP3 audio codec.

To run LipSync with a specific output audio codec via the sample client, use the --output-audio-codec option:

python lipsync.py --target 127.0.0.1:8001 --video-input /path/to/video.mp4 --audio-input /path/to/audio.wav --output-audio-codec mp3

Background Audio Mixing#

The LipSync NIM supports mixing background audio with the output to preserve ambient sounds. This is useful when the original video contains background music or environmental sounds that should be retained in the output.

The background audio configuration includes the following options:

  • --mix-background-audio: Flag to enable background audio mixing.

  • --background-audio-input: Path to the background audio file (WAV or MP3).

  • --background-audio-volume: Volume level for the background audio (0.0 to 1.0; default: 0.5).

    • 0.0: Background audio is not included.

    • 1.0: Background audio is included at its full, original volume.

    • Values between 0.0 and 1.0: Background volume is reduced proportionally.

Note

The background audio and the speech audio must have the same sample rate.
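Conceptually, the volume parameter scales the background samples before they are summed with the speech track. The sketch below illustrates proportional mixing on normalized float samples; it is a simplified illustration under that assumption, not the NIM's actual mixer:

```python
def mix_tracks(speech, background, volume=0.5):
    """Scale background samples by `volume`, add them to speech, clamp to [-1.0, 1.0]."""
    n = min(len(speech), len(background))
    return [max(-1.0, min(1.0, speech[i] + volume * background[i])) for i in range(n)]

# At volume=0.5 the background contributes half its original amplitude.
print(mix_tracks([0.25, 0.5], [0.5, 1.0], volume=0.5))  # [0.5, 1.0]
```

At volume=0.0 the result is the speech track alone, and at volume=1.0 the background is added at full amplitude (with clamping preventing overflow).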

To run LipSync with background audio mixing via the sample client:

python lipsync.py --target 127.0.0.1:8001 --video-input /path/to/video.mp4 --audio-input /path/to/audio.wav --mix-background-audio --background-audio-input /path/to/background.wav --background-audio-volume 0.3

Python Configuration Example#

The following example shows how to set configuration parameters while sending an inference request from a Python client to the LipSync NIM:

import nvidia.ai4m.lipsync.v1.lipsync_pb2 as lipsync_pb2
import nvidia.ai4m.video.v1.video_pb2 as video_pb2
import nvidia.ai4m.audio.v1.audio_pb2 as audio_pb2

params = {
    "input_audio_codec": audio_pb2.AudioCodec.AUDIO_CODEC_WAV,
    "extend_audio": lipsync_pb2.ExtendAudio.EXTEND_AUDIO_SILENCE,
    "extend_video": lipsync_pb2.ExtendVideo.EXTEND_VIDEO_REVERSE,
    "output_video_encoding": video_pb2.VideoEncoding(lossless=True),
    "is_speaker_info_provided": True,
    "output_audio_codec": audio_pb2.AudioCodec.AUDIO_CODEC_OPUS,
    "head_movement_speed": 0,
}

yield lipsync_pb2.LipsyncRequest(config=lipsync_pb2.LipsyncConfig(**params))

Debug Mode#

The LipSync NIM includes a debug mode that provides visual feedback during processing. When enabled, diagnostic overlays are rendered directly onto each output video frame, making it easier to verify effect behavior and troubleshoot issues.

To enable debug mode, set the environment variable LIPSYNC_DEBUG_MODE=1 when launching the NIM container:

docker run -it --rm --name=lipsync-nim \
  --runtime=nvidia \
  --gpus all \
  --shm-size=8GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e LIPSYNC_DEBUG_MODE=1 \
  -p 8000:8000 \
  -p 8001:8001 \
  nvcr.io/nim/nvidia/lipsync:latest

When debug mode is enabled, the following overlays appear on each frame:

| Overlay | Description |
|-------------------------|-------------|
| Frame number | Displayed at the top-center of each frame with a white background. |
| LipSync effect status | A LIPSYNC ON or LIPSYNC OFF indicator below the frame number. The background is green when the effect is active and red when it is bypassed. |
| Activation bounding box | A square bounding box showing where the LipSync effect is being applied. The box is green when the effect is active (strength > 0) and red when bypassed (strength = 0). Box coordinates and dimensions are labeled above the box. |
| Speaker bounding boxes | White bounding boxes for each speaker, shown only when speaker data is provided. Each box is labeled with a speaker identifier (e.g., S0, S1) and shows [speaking] when the speaker is actively speaking. |

Debug mode is useful for:

  • Verifying that speaker bounding boxes are correctly positioned.

  • Confirming that the LipSync effect is being applied to the intended regions.

  • Troubleshooting cases where the effect appears inactive or misaligned.

For more information, refer to Runtime Parameters for the Container.

For more information about AI for Media NIM clients, refer to the GitHub repository NVIDIA-Maxine/nim-clients.