Basic Inference#
Perform a health check on the gRPC endpoint.
Install grpcurl from github.com/fullstorydev/grpcurl/releases. Example commands to run on Ubuntu:

wget https://github.com/fullstorydev/grpcurl/releases/download/v1.9.1/grpcurl_1.9.1_linux_amd64.deb
sudo dpkg -i grpcurl_1.9.1_linux_amd64.deb
Download the health checking proto:
wget https://raw.githubusercontent.com/grpc/grpc/master/src/proto/grpc/health/v1/health.proto

Run the health check:
grpcurl --plaintext --proto health.proto localhost:8001 grpc.health.v1.Health/Check
If the service is ready, you get a response similar to the following:
{ "status": "SERVING" }
Note
When using grpcurl with an SSL-enabled server, omit the --plaintext argument and instead use --cacert with a CA certificate, --key with a private key, or --cert with a certificate file. For more details, refer to grpcurl --help.

Download the Active Speaker Detection NIM client code by cloning the gRPC client repository (NVIDIA-Maxine/nim-clients):
git clone https://github.com/NVIDIA-Maxine/nim-clients.git
cd nim-clients/active-speaker-detection/
For the Python client, install the required dependencies:
# Install pip on Linux
sudo apt-get install python3-pip
pip install -r requirements.txt
Compile the Protos (Optional)#
If you want to use the client code provided in the Maxine NIM clients GitHub repository (NVIDIA-Maxine/nim-clients), you can skip this step.
The proto files are available in the active-speaker-detection/protos folder. You can compile them to generate client interfaces in your preferred programming language. For more details, refer to Supported languages in the gRPC documentation.
The following example shows how to compile the protos for Python on Linux.
Refer to requirements.txt for the grpcio version required for compilation.
To compile protos on Linux:
cd active-speaker-detection/protos/linux
chmod +x compile_protos.sh
./compile_protos.sh
Input and Output#
The NVIDIA Active Speaker Detection NIM takes three types of input and produces per-frame detection results.
Inputs#
Video: An MP4 file with H.264 video encoding.
Audio: One of the following choices.
A separate audio file encoded as WAV, MP3, or Opus.
The --skip-audio option, which uses the embedded audio in the video container. (Refer to Audio Source Configuration.)
Diarization: A file in a supported diarization format. (Refer to Diarization Input Format.)
Output#
The NVIDIA Active Speaker Detection NIM returns per-frame detection results containing the following information:
Frame ID: The frame number in the video.
Speaker Data: A list of detected speakers for each frame, each containing the following:
speaker_bbox: Bounding box coordinates (x, y, width, height) around the detected speaker.
diarized_speaker_id: The audio track–based speaker ID from the diarization data.
face_id: A unique face ID assigned to each detected face.
is_speaking: Boolean flag indicating whether the speaker is actively speaking in this frame.
face_detection_confidence: Face detection confidence score (0.0 to 1.0).
The sample client visualizes results as bounding boxes overlaid on the output video. Speaking faces are shown with green boxes; non-speaking faces are shown with blue boxes when a diarized audio track is assigned, or red boxes otherwise. Each box is labeled with the face ID and speaker ID. (Refer to Output Format.)
Diarization Input Format#
The client package ships with two JSON parsers in scripts/diarization.py:
RIVADiarizationParser: NVIDIA RIVA ASR diarized output.
SampleDiarizationParser: Sample JSON with a top-level words array (see below).
The helper load_diarization tries registered parsers in order (RIVADiarizationParser first, then SampleDiarizationParser) and uses the first parser whose can_parse accepts the file content. If you pass diarization through code paths that call load_diarization, RIVA exports and the sample format are both recognized automatically.
To add support for additional diarization formats (such as other JSON layouts, CSV, or plain text), subclass the DiarizationParser base class in diarization.py, implement can_parse and parse, and register your parser in _PARSERS if you want it picked up by load_diarization.
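As an illustration, a parser for a simple CSV layout might look like the following sketch. The `DiarizationParser` stand-in below only mirrors the interface described above (`can_parse` / `parse`) so the example is self-contained; in practice, subclass the real base class in scripts/diarization.py, and the CSV column layout shown is a hypothetical example, not a format the client supports.

```python
# Sketch of a custom diarization parser for a hypothetical CSV layout:
#   start,end,speaker_id,text
# The DiarizationParser class below is a stand-in for the real base class
# in scripts/diarization.py; import that one instead in real code.
import csv
import io


class DiarizationParser:
    """Stand-in mirroring the interface in scripts/diarization.py."""

    def can_parse(self, content: str) -> bool:
        raise NotImplementedError

    def parse(self, content: str) -> list:
        raise NotImplementedError


class CSVDiarizationParser(DiarizationParser):
    HEADER = ["start", "end", "speaker_id", "text"]

    def can_parse(self, content: str) -> bool:
        # Accept files whose first line is exactly the expected CSV header.
        if not content.strip():
            return False
        return content.splitlines()[0].strip().split(",") == self.HEADER

    def parse(self, content: str) -> list:
        # Convert each CSV row into the word-entry shape used by the client.
        words = []
        for row in csv.DictReader(io.StringIO(content)):
            words.append({
                "text": row["text"],
                "start": float(row["start"]),
                "end": float(row["end"]),
                "speaker_id": row["speaker_id"],
            })
        return words
```

To have `load_diarization` pick the parser up, register an instance in the `_PARSERS` list in diarization.py.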
RIVA JSON Shape (Illustrative)#
RIVA responses use nested results / alternatives / words with millisecond times and speakerTag:
{
"results": [
{
"alternatives": [
{
"transcript": "hello world",
"words": [
{
"word": "hello",
"startTime": 120,
"endTime": 380,
"speakerTag": 1,
"languageCode": "en-US",
"confidence": 0.95
}
]
}
]
}
]
}
Sample JSON Format#
The sample diarization JSON format uses a top-level object with a words array. Each entry contains text, start (seconds), end (seconds), and speaker_id. Optional top-level fields include text (full transcript) and language_code.
{
"language_code": "eng",
"text": "Perfect. Hey, welcome to the show.",
"words": [
{
"text": "Perfect.",
"start": 0.219,
"end": 0.62,
"speaker_id": "speaker_0"
},
{
"text": "Hey,",
"start": 0.62,
"end": 1.019,
"speaker_id": "speaker_0"
}
]
}
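The two shapes carry the same information, so converting between them is mechanical. The sketch below is illustrative only (the shipped client already recognizes both formats via `load_diarization`); it flattens a RIVA-style response into the sample `words` layout, converting millisecond times to seconds. Mapping the integer `speakerTag` to a `speaker_N` string is an assumption made for this example.

```python
# Sketch: flatten a RIVA-style response (nested results/alternatives/words,
# millisecond times, integer speakerTag) into the sample format's top-level
# "words" array (seconds, string speaker_id).
def riva_to_sample(riva: dict) -> dict:
    words = []
    for result in riva.get("results", []):
        for alt in result.get("alternatives", []):
            for w in alt.get("words", []):
                words.append({
                    "text": w["word"],
                    "start": w["startTime"] / 1000.0,  # ms -> seconds
                    "end": w["endTime"] / 1000.0,
                    # speaker_id naming is an assumption for illustration
                    "speaker_id": f"speaker_{w['speakerTag']}",
                })
    return {"words": words}
```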
For generating diarization data from an audio stream, refer to NVIDIA Riva ASR Overview.
Running Inference via Python Script#
You can use the sample client script to send a gRPC request to the hosted NIM server:
Go to the Python scripts directory.
Run the command to send a gRPC request:
python active_speaker_detection.py \
    --target <server_ip:port> \
    --video-input <input video file path> \
    --audio-input <input audio file path> \
    --diarization-input <diarization JSON file path> \
    --output <output file path>
Note
The first inference is not indicative of the model’s actual performance because it includes the time taken by the Triton Inference Server to load the models for the first time. The subsequent inference requests reflect the actual processing performance.
To view details of command-line arguments, run the following command:
python active_speaker_detection.py -h
You get a response similar to the following:
usage: active_speaker_detection.py [-h] [--ssl-mode {DISABLED,MTLS,TLS}] [--ssl-key SSL_KEY] [--ssl-cert SSL_CERT] [--ssl-root-cert SSL_ROOT_CERT] [--target TARGET]
[--preview-mode] [--api-key API_KEY] [--function-id FUNCTION_ID] [--video-input VIDEO_INPUT] [--audio-input AUDIO_INPUT]
[--diarization-input DIARIZATION_INPUT] [--skip-audio] [--skip-diarization] [--output OUTPUT]
Run Active Speaker Detection inference with video, audio, and diarization files
Command-Line Arguments#

| Argument | Description | Default |
|---|---|---|
| --target | IP:port of gRPC service. | |
| --video-input | Path to input video file (MP4 format). | |
| --audio-input | Path to input audio file (WAV, MP3, or Opus format). | |
| --diarization-input | Path to diarization JSON file with word-level speaker info. | |
| --output | Path for the output video file with speaker bounding boxes. | |
| --skip-audio | Skip sending separate audio; use audio embedded in the video stream. | |
| --preview-mode | Send request to the preview NVCF NIM server. | |
| --api-key | NGC API key for authentication. Required in preview mode. | |
| --function-id | NVCF function ID for the service. Required in preview mode. | |
| --ssl-mode | SSL mode: DISABLED, MTLS, or TLS. | |
| --ssl-key | Path to SSL private key (required for MTLS). | |
| --ssl-cert | Path to SSL certificate chain (required for MTLS). | |
| --ssl-root-cert | Path to SSL root certificate (required for TLS/MTLS). | |
Example Commands#
Basic inference with default arguments:
python active_speaker_detection.py

Run inference with custom input files:
python active_speaker_detection.py \
    --target 127.0.0.1:8001 \
    --video-input ../assets/sample_video_streamable.mp4 \
    --audio-input ../assets/sample_audio.wav \
    --diarization-input ../assets/sample_diarization.json \
    --output speaker_detection_output.mp4
Run inference using embedded audio (skip separate audio file):
python active_speaker_detection.py \
    --target 127.0.0.1:8001 \
    --video-input ../assets/sample_video_streamable.mp4 \
    --diarization-input ../assets/sample_diarization.json \
    --skip-audio
Supported Formats#
| Input Type | Supported Formats |
|---|---|
| Video | MP4 with H.264 encoding |
| Audio | Linear PCM WAV, MP3, or Opus |
| Diarization | JSON (Simple format or RIVA ASR format) |
Input Modes#
The Active Speaker Detection NIM provides two modes for processing input files: streaming and transactional.
| Aspect | Transactional Mode | Streaming Mode |
|---|---|---|
| Data Storage | Entire video and audio files are temporarily copied to disk. | Only the frames being processed are temporarily held in memory. |
| Processing Start | The NIM waits to receive the entire files before starting. | The NIM starts processing as soon as the data chunk for the first frame arrives. |
| Processing Timing | Processing begins after all data is received. | Processing is continuous, without waiting for complete input. |
| Diarization Data | Complete diarization input is received with the video and audio before processing begins. | Diarization must be available at the time of the first inference request (refer to Streaming Mode and Diarization). |
| Output Delivery | Complete results are returned to the client after inference finishes for the whole video. | Results are generated and returned immediately, per frame. |
Streaming Mode#
Streaming mode is the recommended way to use the Active Speaker Detection NIM. It allows inference to begin without receiving the whole video from the client. Video frames are processed incrementally, and inference begins as soon as the data for the first frame is available. Output frames are then streamed back to the client immediately after inference.
The NIM automatically detects streamable videos and enables streaming mode. This mode delivers the lowest latency and best resource efficiency, and it scales well to large files.
Use streaming mode for these use cases:
Best overall performance (the NIM is optimized for this path).
Streamable video inputs.
Applications that benefit from receiving output as it is generated, without waiting for the entire file to be uploaded.
Large video files that benefit from incremental processing and reduced disk I/O.
Streaming mode works with streamable videos, in which metadata is positioned at the beginning of the file. Videos that are not streamable can be converted to a streamable format.
To make any video streamable, use FFmpeg with the following command:
ffmpeg -i sample_video.mp4 -movflags +faststart sample_video_streamable.mp4
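An MP4 is streamable when its moov metadata box precedes the mdat media box. The following sketch (not part of the NIM client) scans the top-level MP4 boxes to check this before uploading; it assumes 32-bit box sizes, which covers typical faststart files.

```python
# Sketch: check whether an MP4 is streamable by verifying that the "moov"
# box appears before the "mdat" box among the top-level boxes. Assumes
# 32-bit box sizes (no 64-bit "largesize" boxes).
import struct


def is_streamable(data: bytes) -> bool:
    offset = 0
    while offset + 8 <= len(data):
        # Each top-level box starts with a 4-byte big-endian size
        # followed by a 4-byte type code.
        size, box_type = struct.unpack_from(">I4s", data, offset)
        if box_type == b"moov":
            return True   # metadata found before any media data
        if box_type == b"mdat":
            return False  # media data found first: not streamable
        if size < 8:      # malformed or 64-bit size; stop scanning
            break
        offset += size
    return False
```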
Streaming Mode and Diarization#
In streaming mode, diarization must be available at the time of the first inference request. If diarization is not supplied for that request, the session is treated as having no diarization, which can result in incorrect outputs.
Note
We recommend that you provide diarization data by interleaving it with stream data or by sending the diarization payload before the first inference call.
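As an illustration of this "diarization first" ordering, the generator below uses plain dictionaries as stand-ins for the actual gRPC request messages; the real message types come from the compiled protos in active-speaker-detection/protos, and the field names here are hypothetical.

```python
# Sketch of the recommended request ordering for streaming mode: the
# diarization payload is yielded before any media chunks, so it is present
# at the first inference request. Plain dicts stand in for the real
# protobuf messages (hypothetical field names).
def request_stream(diarization_json: bytes, media_chunks):
    yield {"diarization": diarization_json}  # must precede media data
    for chunk in media_chunks:
        yield {"media": chunk}
```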
Transactional Mode#
In transactional mode, the video and audio files must be completely received by the NIM before processing can begin. This is the default mode; no flag needs to be set.
Transactional mode is suitable for the following use cases:
Processing of small video and audio files.
Applications that can wait for complete processing before receiving output.
Videos that are not optimized for streaming (such as non-streamable MP4 files in which metadata is located at the end of the file, requiring the entire file to be downloaded before playback can begin).
To run Active Speaker Detection in transactional mode, run the sample client:
python active_speaker_detection.py --target 127.0.0.1:8001 \
    --video-input ../assets/sample_streamable.mp4 \
    --audio-input ../assets/sample_audio.wav
Usage for Preview API Request#
python active_speaker_detection.py --preview-mode \
--target grpc.nvcf.nvidia.com:443 \
--function-id <FUNCTION_ID> \
--api-key <NVCF_API_KEY> \
--video-input ../assets/sample_streamable.mp4 \
--output out.mp4
Replace <FUNCTION_ID> and <NVCF_API_KEY> with your assigned function ID and API key. In preview mode, the client connects over a secure gRPC channel to the NVCF endpoint.
Output Format#
The sample client produces an output video with speaker bounding boxes overlaid on the original frames. A three-color scheme is used to indicate speaker state:
| Color | State |
|---|---|
| Green | Face is actively speaking. |
| Blue | Face has an assigned audio track (diarized) but is not speaking. |
| Red | Face is tracked but has no audio track assigned. |
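The color decision follows directly from two of the per-frame fields described in Output: `is_speaking` and whether a `diarized_speaker_id` is assigned. The sketch below mirrors the scheme; treating an empty or missing `diarized_speaker_id` as "no audio track assigned" is an assumption about the field's encoding, not confirmed client code.

```python
# Sketch: choose an overlay color from per-frame speaker fields, mirroring
# the three-color scheme above. Treats a falsy diarized_speaker_id as
# "no audio track assigned" (an assumption for this example).
def box_color(is_speaking: bool, diarized_speaker_id) -> str:
    if is_speaking:
        return "green"  # actively speaking
    if diarized_speaker_id:
        return "blue"   # diarized audio track assigned, but silent
    return "red"        # tracked face with no audio track
```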
For more information about configuring advanced parameters such as maximum speakers, audio source mode, and debug mode, see Advanced Usage.