Basic Inference#
Perform a health check on the gRPC endpoint.
Install grpcurl from github.com/fullstorydev/grpcurl/releases. Example commands to run on Ubuntu:
wget https://github.com/fullstorydev/grpcurl/releases/download/v1.9.1/grpcurl_1.9.1_linux_amd64.deb
sudo dpkg -i grpcurl_1.9.1_linux_amd64.deb
Download the health checking proto:
wget https://raw.githubusercontent.com/grpc/grpc/master/src/proto/grpc/health/v1/health.proto
Run the health check:
grpcurl --plaintext --proto health.proto localhost:8001 grpc.health.v1.Health/Check
If the service is ready, you get a response similar to the following:
{ "status": "SERVING" }
Note
For using grpcurl with an SSL-enabled server, avoid the --plaintext argument, and use --cacert with a CA certificate, --key with a private key, or --cert with a certificate file. For more details, refer to grpcurl --help.
Download the LipSync client code by cloning the gRPC client repository (NVIDIA-Maxine/nim-clients):
git clone https://github.com/NVIDIA-Maxine/nim-clients.git
# Go to the 'lipsync' folder
cd nim-clients/lipsync/
Install the required dependencies.
The client requires Python 3.12 or later. For download and installation instructions, see Download Python.
For the Python client:
pip install -r requirements.txt
The following packages are required:
grpcio==1.67.1
grpcio-tools==1.67.1
tqdm==4.67.1
Compile the Protos (Optional)#
If you want to use the client code provided in the NIM clients GitHub repository (NVIDIA-Maxine/nim-clients), you can skip this step.
The proto files are available in the lipsync/protos/proto folder, organized into the following packages:
nvidia/ai4m/lipsync/v1/lipsync.proto: Main LipSync service definition.
nvidia/ai4m/audio/v1/audio.proto: Audio codec definitions.
nvidia/ai4m/video/v1/video.proto: Video encoding definitions.
nvidia/ai4m/common/v1/common.proto: Common types (BoundingBox).
You can compile them to generate client interfaces in your preferred programming language. For more details, refer to Supported languages in the gRPC documentation.
The following example shows how to compile the protos for Python on Linux and Windows.
Python#
The grpcio version required for compilation is listed in requirements.txt.
To compile protos on Linux, run the following commands:
# Go to lipsync/protos/linux/ folder
cd lipsync/protos/linux/
chmod +x compile_protos.sh
./compile_protos.sh
To compile protos on Windows, run the following commands:
# Go to lipsync/protos/windows/ folder
cd lipsync/protos/windows/
./compile_protos.bat
Running Inference via Python Script#
You can use the sample client script in the LipSync GitHub repo to send a gRPC request to the hosted NIM server:
Go to the Python scripts directory.
cd scripts
Run the command to send a gRPC request. (All command-line parameters are optional.)
python lipsync.py --target <server_ip:port> --video-input <input video file path> --audio-input <input audio file path> --output <output file path and the file name>
The LipSync NIM is optimized for streaming. When it detects a streamable video, it automatically enables streaming mode for lower latency and better resource efficiency. We recommend using streamable videos as input. For more information, see Streaming Mode.
python lipsync.py --target <server_ip:port> --video-input <streamable input video file path> --audio-input <input audio file path> --output <output file path and the file name>
To run with SSL enabled, add the SSL arguments.
python lipsync.py --target <server_ip:port> --video-input <input video file path> --audio-input <input audio file path> --output <output file path and the file name> --ssl-mode <ssl mode value> --ssl-key <ssl key file path> --ssl-cert <ssl cert filepath> --ssl-root-cert <ssl root cert filepath>
Note
The first inference is not indicative of the model’s actual performance because it includes the time taken by the Triton Inference Server to load the models for the first time. The subsequent inference requests reflect the actual processing performance.
For Blackwell GPUs, the initial inference might time out because of the time needed to load the models. If timeout occurs, send another request; subsequent inferences reflect the actual processing performance.
To view details of command-line arguments, run the following command:
python lipsync.py -h
You get a response similar to the following:
usage: lipsync.py [-h] [--ssl-mode {DISABLED,MTLS,TLS}] [--ssl-key SSL_KEY] [--ssl-cert SSL_CERT] [--ssl-root-cert SSL_ROOT_CERT] [--target TARGET] [--video-input VIDEO_INPUT] [--audio-input AUDIO_INPUT]
[--speaker-data-input SPEAKER_DATA_INPUT] [--extend-audio {unspecified,silence}] [--extend-video {unspecified,forward,reverse}] [--bitrate BITRATE] [--idr-interval IDR_INTERVAL]
[--lossless] [--custom-encoding-params CUSTOM_ENCODING_PARAMS] [--output OUTPUT] [--output-audio-codec OUTPUT_AUDIO_CODEC] [--head-movement-speed HEAD_MOVEMENT_SPEED]
[--mix-background-audio] [--background-audio-input BACKGROUND_AUDIO_INPUT] [--background-audio-volume BACKGROUND_AUDIO_VOLUME]
Run LipSync inference with input video and audio files
options:
-h, --help show this help message and exit
--ssl-mode {DISABLED,MTLS,TLS} Flag to set SSL mode, default is DISABLED (default: DISABLED)
--ssl-key SSL_KEY The path to ssl private key. (default: ../ssl_key/ssl_key_client.pem)
--ssl-cert SSL_CERT The path to ssl certificate chain. (default: ../ssl_key/ssl_cert_client.pem)
--ssl-root-cert SSL_ROOT_CERT The path to ssl root certificate. (default: ../ssl_key/ssl_ca_cert.pem)
--target TARGET IP:port of gRPC service, when hosted locally. (default: 127.0.0.1:8001)
--video-input VIDEO_INPUT The path to the input video file. (default: ../assets/sample_video.mp4)
--audio-input AUDIO_INPUT The path to the input audio file. (default: ../assets/sample_audio.wav)
--speaker-data-input SPEAKER_DATA_INPUT Path to JSON file containing speaker data (bounding boxes, speaker_id, is_speaking). (default: None)
--extend-audio {unspecified,silence} How to handle audio extension (default: unspecified) (default: unspecified)
--extend-video {unspecified,forward,reverse} How to handle video extension (default: unspecified) (default: unspecified)
--bitrate BITRATE Output video bitrate in Mbps (default: 30). This is only applicable when lossless mode is disabled. (default: 30)
--idr-interval IDR_INTERVAL The interval for IDR frames in the output video. This is only applicable when lossless mode is disabled. (default: 8) (default: 8)
--lossless Flag to enable lossless mode for video encoding. (default: False)
--custom-encoding-params CUSTOM_ENCODING_PARAMS Custom encoding parameters in JSON format. (default: None)
--output OUTPUT The path for the output video file. (default: lipsync_output.mp4)
--output-audio-codec OUTPUT_AUDIO_CODEC Audio codec for output video file (opus/mp3). (default: opus)
--head-movement-speed HEAD_MOVEMENT_SPEED Speed of head movement in input video. 0 for static/slow-moving head, 1 for fast-moving head. (default: None)
--mix-background-audio Mix background audio with the output audio. (default: False)
--background-audio-input BACKGROUND_AUDIO_INPUT Path to background audio file (wav or mp3). (default: None)
--background-audio-volume BACKGROUND_AUDIO_VOLUME Volume of the background audio (0.0 to 1.0). Default: 0.5. (default: 0.5)
To get more information about how to run the client with these parameters, refer to the NIM clients GitHub repository: NVIDIA-Maxine/nim-clients.
For more information about configuring advanced parameters such as bitrate, IDR interval, video and audio extension options, speaker data, background audio, and head movement speed, see Advanced Usage.
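The --speaker-data-input argument accepts a JSON file describing the speakers in the video (bounding boxes, speaker_id, is_speaking). The exact schema is documented in Advanced Usage; the sketch below only illustrates how you might assemble such a file, and the key names and layout are assumptions, not the confirmed format.

```python
import json

# Hypothetical layout: the real field names and structure expected by
# --speaker-data-input are defined in Advanced Usage. Only the three
# attributes named in the help text (bounding boxes, speaker_id,
# is_speaking) are taken from this page; everything else is illustrative.
speaker_data = {
    "speakers": [
        {
            "speaker_id": 0,
            "is_speaking": True,
            # Bounding box of the speaker's face in pixel coordinates
            "bounding_box": {"x": 120, "y": 80, "width": 256, "height": 256},
        }
    ]
}

# Write the file that would be passed via --speaker-data-input
with open("speaker_data.json", "w") as f:
    json.dump(speaker_data, f, indent=2)
```

You would then pass the file with `--speaker-data-input speaker_data.json`, after confirming the schema against the Advanced Usage page.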
For command-line arguments that aren’t specified, the script uses the following default values.
Default Command-Line Arguments#
| Argument | Default Value |
|---|---|
| --ssl-mode | DISABLED |
| --ssl-key | ../ssl_key/ssl_key_client.pem |
| --ssl-cert | ../ssl_key/ssl_cert_client.pem |
| --ssl-root-cert | ../ssl_key/ssl_ca_cert.pem |
| --target | 127.0.0.1:8001 |
| --video-input | ../assets/sample_video.mp4 |
| --audio-input | ../assets/sample_audio.wav |
| --speaker-data-input | None |
| --extend-audio | unspecified |
| --extend-video | unspecified |
| --bitrate | 30 |
| --idr-interval | 8 |
| --lossless | False |
| --custom-encoding-params | None |
| --output | lipsync_output.mp4 |
| --output-audio-codec | opus |
| --head-movement-speed | None |
| --mix-background-audio | False |
| --background-audio-input | None |
| --background-audio-volume | 0.5 |
Example Commands#
Basic inference with default arguments:
python3 lipsync.py --target 127.0.0.1:8001
Run inference with custom input files:
python lipsync.py --target 127.0.0.1:8001 --video-input /path/to/video.mp4 --audio-input /path/to/audio.wav
Run inference with advanced configuration:
python lipsync.py --target 127.0.0.1:8001 --bitrate 20 --idr-interval 10 --extend-audio silence --extend-video reverse
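If you want to process many video/audio pairs, you can assemble the client's argument list programmatically and invoke the script once per job. This is a minimal sketch; it assumes you run it from the scripts directory next to lipsync.py, and the input file names are placeholders.

```python
import subprocess

def build_lipsync_cmd(target, video, audio, output, extra=()):
    """Assemble the argv list for one invocation of the sample client."""
    cmd = [
        "python", "lipsync.py",
        "--target", target,
        "--video-input", video,
        "--audio-input", audio,
        "--output", output,
    ]
    cmd.extend(extra)  # e.g. ["--bitrate", "20"]
    return cmd

# Placeholder job list: (video, audio, output) triples
jobs = [
    ("clip1.mp4", "clip1.wav", "out1.mp4"),
    ("clip2.mp4", "clip2.wav", "out2.mp4"),
]
for video, audio, output in jobs:
    cmd = build_lipsync_cmd("127.0.0.1:8001", video, audio, output,
                            extra=["--bitrate", "20"])
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to run against a live server
```

The subprocess call is left commented out so the sketch can be read without a running NIM server.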
Supported Formats#
The supported formats for input audio are WAV and MP3. The supported format for input video is MP4 with H.264 video encoding. Videos must use Constant Frame Rate (CFR); Variable Frame Rate (VFR) is not supported.
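One common way to check whether a video is CFR is to compare the stream's r_frame_rate and avg_frame_rate as reported by ffprobe: for constant-frame-rate streams the two agree, while VFR streams typically show a diverging average. The sketch below implements only that comparison; treating a mismatch as "likely VFR" is a heuristic, not a guarantee.

```python
from fractions import Fraction

def looks_cfr(r_frame_rate: str, avg_frame_rate: str) -> bool:
    """Heuristic CFR check: ffprobe reports both rates as fraction
    strings (e.g. "30000/1001"); for CFR streams they match.

    Obtain the two values with, for example:
      ffprobe -v error -select_streams v:0 \
        -show_entries stream=r_frame_rate,avg_frame_rate \
        -of default=noprint_wrappers=1 input.mp4
    """
    return Fraction(r_frame_rate) == Fraction(avg_frame_rate)

print(looks_cfr("30000/1001", "30000/1001"))  # → True (CFR)
print(looks_cfr("30/1", "2997/100"))          # → False (likely VFR)
```

If the check reports VFR, re-encode the video to CFR (for example with FFmpeg's `-fps_mode cfr`) before sending it to the NIM.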
Input Modes#
The LipSync NIM provides two modes for processing input files: streaming and transactional.
| Aspect | Streaming Mode | Transactional Mode |
|---|---|---|
| Data Storage | Only the frames being processed are temporarily copied in memory. | The entire video and audio files are temporarily copied to disk. |
| Processing Start | The NIM starts processing as soon as the data chunk for the first frame arrives. | The NIM waits to receive the entire files before starting. |
| Processing Timing | Continuous processing without waiting for the complete input. | Processing begins after all data is received. |
| Output Delivery | Output frames are generated and returned immediately. | The complete output video is returned to the client after inference finishes for the whole video. |
Streaming Mode#
Streaming mode is the recommended way to use the LipSync NIM. It allows inference to begin before the whole video has been received from the client: video frames are processed incrementally, inference starts as soon as the first frame's data is available, and output frames are streamed back to the client as soon as they are processed.
The NIM automatically detects streamable videos and enables streaming mode. This mode delivers the lowest latency and best resource efficiency, and it scales well to large files.
Use streaming mode for these use cases:
Best overall performance—the NIM is optimized for this path.
Streamable video inputs.
Applications that benefit from receiving output as it is generated, without waiting for the entire file to be uploaded.
Large video files that benefit from incremental processing and reduced disk I/O.
Streaming mode works with streamable videos, where the metadata is positioned at the beginning of the file. Videos that are not streamable can easily be converted to a streamable format.
To make any video streamable, use FFmpeg with the following command:
ffmpeg -i sample_video.mp4 -movflags +faststart sample_video_streamable.mp4
You can then specify the streamable video as input to the NIM by using the --video-input parameter.
python lipsync.py --target 127.0.0.1:8001 --video-input ../assets/sample_video_streamable.mp4 --audio-input ../assets/sample_audio.wav
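The faststart flag works by moving the MP4 metadata box (moov) ahead of the media data (mdat). If you want to verify a file's layout without re-encoding it, you can scan its top-level boxes yourself. This is a minimal sketch: it assumes 32-bit box sizes and ignores other MP4 edge cases, so treat it as a quick check rather than a full parser.

```python
def is_streamable(path: str) -> bool:
    """Return True if the top-level 'moov' box precedes 'mdat'.

    ffmpeg's -movflags +faststart produces exactly this layout. The
    sketch assumes 32-bit box sizes and skips other MP4 edge cases.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                return False  # reached EOF without seeing both boxes
            size = int.from_bytes(header[:4], "big")
            box_type = header[4:8]
            if box_type == b"moov":
                return True   # metadata first: streamable
            if box_type == b"mdat":
                return False  # media data first: not streamable
            if size < 8:
                return False  # malformed or 64-bit size; bail out
            f.seek(size - 8, 1)  # skip to the next top-level box
```

For example, `is_streamable("sample_video_streamable.mp4")` should return True for the output of the FFmpeg command above.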
Transactional Mode#
In transactional mode, the NIM must receive the entire video and audio files before processing can begin. The NIM falls back to this mode when the input video is not streamable.
Transactional mode is suitable for the following use cases:
Videos that are not optimized for streaming (such as non-streamable MP4 files where metadata is located at the end of the file, requiring the entire file to be downloaded before playback can begin).
Processing of small video and audio files where streaming overhead is unnecessary.
Applications that can wait for complete processing before receiving output.
To run LipSync in transactional mode, provide a non-streamable video as input:
python lipsync.py --target 127.0.0.1:8001 --video-input ../assets/sample_video.mp4 --audio-input ../assets/sample_audio.wav
Tip
For best performance, convert your videos to a streamable format so that the NIM can use streaming mode. See the FFmpeg command earlier on this page.