Basic Inference#
Perform a health check on the gRPC endpoint.
Install grpcurl from github.com/fullstorydev/grpcurl/releases. Example commands to run on Ubuntu:

wget https://github.com/fullstorydev/grpcurl/releases/download/v1.9.1/grpcurl_1.9.1_linux_amd64.deb
sudo dpkg -i grpcurl_1.9.1_linux_amd64.deb
Download the health checking proto:
wget https://raw.githubusercontent.com/grpc/grpc/master/src/proto/grpc/health/v1/health.proto

Run the health check:
grpcurl --plaintext --proto health.proto localhost:8001 grpc.health.v1.Health/Check
If the service is ready, you get a response similar to the following:
{ "status": "SERVING" }
Note
When using grpcurl with an SSL-enabled server, omit the --plaintext argument and instead use --cacert with a CA certificate, --key with a private key, or --cert with a certificate file. For more details, refer to grpcurl --help.

Download the Active Speaker Detection NIM client code by cloning the gRPC client repository (NVIDIA-Maxine/nim-clients):
git clone https://github.com/NVIDIA-Maxine/nim-clients.git
cd nim-clients/active-speaker-detection/
For the Python client, install the required dependencies:
# Install pip on Linux
sudo apt-get install python3-pip
pip install -r requirements.txt
Compile the Protos (Optional)#
If you want to use the client code provided in the Maxine NIM clients GitHub repository (NVIDIA-Maxine/nim-clients), you can skip this step.
The proto files are available in the active-speaker-detection/protos folder. You can compile them to generate client interfaces in your preferred programming language. For more details, refer to Supported languages in the gRPC documentation.
The following example shows how to compile the protos for Python on Linux.
Refer to requirements.txt for the grpcio version required for compilation.
To compile protos on Linux:
cd active-speaker-detection/protos/linux
chmod +x compile_protos.sh
./compile_protos.sh
Input and Output#
The NVIDIA Active Speaker Detection NIM takes three types of input and produces per-frame detection results.
Inputs#
Video: An MP4 file with H.264 video encoding.
Audio: One of the following choices.
A separate audio file encoded as WAV, MP3, or Opus.
The --skip-audio option, which uses the embedded audio in the video container. (Refer to Audio Source Configuration.)
Diarization: A file in a supported diarization format. (Refer to Diarization Input Format.)
Output#
The NVIDIA Active Speaker Detection NIM returns per-frame detection results containing the following information:
Frame ID: The frame number in the video.
Speaker Data: A list of detected speakers for each frame, each containing the following:
speaker_bbox: Bounding box coordinates (x, y, width, height) around the detected speaker.
diarized_speaker_id: The audio track–based speaker ID from the diarization data.
face_id: A unique face ID assigned to each detected face.
is_speaking: Boolean flag indicating whether the speaker is actively speaking in this frame.
face_detection_confidence: Face detection confidence score (0.0 to 1.0).
The sample client visualizes results as bounding boxes overlaid on the output video. Speaking faces are shown with green boxes; non-speaking faces are shown with blue boxes when a diarized audio track is assigned, or red boxes otherwise. Each box is labeled with the face ID and speaker ID. (Refer to Output Format.)
Diarization Input Format#
The client package ships with two JSON parsers in scripts/diarization.py:
RIVADiarizationParser: NVIDIA RIVA ASR diarized output.
SampleDiarizationParser: Sample JSON with a top-level words array (see below).
The helper load_diarization tries registered parsers in order (RIVADiarizationParser first, then SampleDiarizationParser) and uses the first parser whose can_parse accepts the file content. If you pass diarization through code paths that call load_diarization, RIVA exports and the sample format are both recognized automatically.
To add support for additional diarization formats (such as other JSON layouts, CSV, or plain text), subclass the DiarizationParser base class in diarization.py, implement can_parse and parse, and register your parser in _PARSERS if you want it picked up by load_diarization.
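As an illustration, a parser for a simple CSV layout might look like the following sketch. The `DiarizationParser` stand-in below only mirrors the interface described above (`can_parse` / `parse`) so the example is self-contained; in practice, subclass the real base class in scripts/diarization.py, and the CSV column layout shown is a hypothetical example, not a format the client supports.

```python
# Sketch of a custom diarization parser for a hypothetical CSV layout:
#   start,end,speaker_id,text
# The DiarizationParser class below is a stand-in for the real base class
# in scripts/diarization.py; import that one instead in real code.
import csv
import io


class DiarizationParser:
    """Stand-in mirroring the interface in scripts/diarization.py."""

    def can_parse(self, content: str) -> bool:
        raise NotImplementedError

    def parse(self, content: str) -> list:
        raise NotImplementedError


class CSVDiarizationParser(DiarizationParser):
    HEADER = ["start", "end", "speaker_id", "text"]

    def can_parse(self, content: str) -> bool:
        # Accept files whose first line is exactly the expected CSV header.
        if not content.strip():
            return False
        return content.splitlines()[0].strip().split(",") == self.HEADER

    def parse(self, content: str) -> list:
        # Convert each CSV row into the word-entry shape used by the client.
        words = []
        for row in csv.DictReader(io.StringIO(content)):
            words.append({
                "text": row["text"],
                "start": float(row["start"]),
                "end": float(row["end"]),
                "speaker_id": row["speaker_id"],
            })
        return words
```

To have `load_diarization` pick the parser up, register an instance in the `_PARSERS` list in diarization.py.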
RIVA JSON Shape (Illustrative)#
RIVA responses use nested results / alternatives / words with millisecond times and speakerTag:
{
"results": [
{
"alternatives": [
{
"transcript": "hello world",
"words": [
{
"word": "hello",
"startTime": 120,
"endTime": 380,
"speakerTag": 1,
"languageCode": "en-US",
"confidence": 0.95
}
]
}
]
}
]
}
Sample JSON Format#
The sample diarization JSON format uses a top-level object with a words array. Each entry contains text, start (seconds), end (seconds), and speaker_id. Optional top-level fields include text (full transcript) and language_code.
{
"language_code": "eng",
"text": "Perfect. Hey, welcome to the show.",
"words": [
{
"text": "Perfect.",
"start": 0.219,
"end": 0.62,
"speaker_id": "speaker_0"
},
{
"text": "Hey,",
"start": 0.62,
"end": 1.019,
"speaker_id": "speaker_0"
}
]
}
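The two shapes carry the same information, so converting between them is mechanical. The sketch below is illustrative only (the shipped client already recognizes both formats via `load_diarization`); it flattens a RIVA-style response into the sample `words` layout, converting millisecond times to seconds. Mapping the integer `speakerTag` to a `speaker_N` string is an assumption made for this example.

```python
# Sketch: flatten a RIVA-style response (nested results/alternatives/words,
# millisecond times, integer speakerTag) into the sample format's top-level
# "words" array (seconds, string speaker_id).
def riva_to_sample(riva: dict) -> dict:
    words = []
    for result in riva.get("results", []):
        for alt in result.get("alternatives", []):
            for w in alt.get("words", []):
                words.append({
                    "text": w["word"],
                    "start": w["startTime"] / 1000.0,  # ms -> seconds
                    "end": w["endTime"] / 1000.0,
                    # speaker_id naming is an assumption for illustration
                    "speaker_id": f"speaker_{w['speakerTag']}",
                })
    return {"words": words}
```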
For generating diarization data from an audio stream, refer to NVIDIA Riva ASR Overview.
Running Inference via Python Script#
You can use the sample client script to send a gRPC request to the hosted NIM server:
Go to the Python scripts directory.
Run the command to send a gRPC request:
python active_speaker_detection.py \
    --target <server_ip:port> \
    --video-input <input video file path> \
    --audio-input <input audio file path> \
    --diarization-input <diarization JSON file path> \
    --output <output file path>
Note
The first inference is not indicative of the model’s actual performance because it includes the time taken by the Triton Inference Server to load the models for the first time. The subsequent inference requests reflect the actual processing performance.
To view details of command-line arguments, run the following command:
python active_speaker_detection.py -h
You get a response similar to the following:
usage: active_speaker_detection.py [-h] [--ssl-mode {DISABLED,MTLS,TLS}] [--ssl-key SSL_KEY] [--ssl-cert SSL_CERT] [--ssl-root-cert SSL_ROOT_CERT] [--target TARGET]
[--preview-mode] [--api-key API_KEY] [--function-id FUNCTION_ID] [--video-input VIDEO_INPUT] [--audio-input AUDIO_INPUT]
[--diarization-input DIARIZATION_INPUT] [--skip-audio] [--skip-diarization] [--output OUTPUT]
Run Active Speaker Detection inference with video, audio, and diarization files
Command-Line Arguments#

| Argument | Description | Default |
|---|---|---|
| --target | IP:port of gRPC service. | |
| --video-input | Path to input video file (MP4 format). | |
| --audio-input | Path to input audio file (WAV, MP3, or Opus format). | |
| --diarization-input | Path to diarization JSON file with word-level speaker info. | |
| --output | Path for the output video file with speaker bounding boxes. | |
| --skip-audio | Skip sending separate audio; use audio embedded in the video stream. | |
| --preview-mode | Send request to the preview NVCF NIM server. | |
| --api-key | NGC API key for authentication. Required in preview mode. | |
| --function-id | NVCF function ID for the service. Required in preview mode. | |
| --ssl-mode | SSL mode: DISABLED, MTLS, or TLS. | |
| --ssl-key | Path to SSL private key (required for MTLS). | |
| --ssl-cert | Path to SSL certificate chain (required for MTLS). | |
| --ssl-root-cert | Path to SSL root certificate (required for TLS/MTLS). | |
Example Commands#
Basic inference with default arguments:
python active_speaker_detection.py

Run inference with custom input files:
python active_speaker_detection.py \
    --target 127.0.0.1:8001 \
    --video-input ../assets/sample_video_streamable.mp4 \
    --audio-input ../assets/sample_audio.wav \
    --diarization-input ../assets/sample_diarization.json \
    --output speaker_detection_output.mp4
Run inference using embedded audio (skip separate audio file):
python active_speaker_detection.py \
    --target 127.0.0.1:8001 \
    --video-input ../assets/sample_video_streamable.mp4 \
    --diarization-input ../assets/sample_diarization.json \
    --skip-audio
Supported Formats#
| Input Type | Supported Formats |
|---|---|
| Video | MP4 with H.264 encoding |
| Audio | Linear PCM WAV, MP3, or Opus |
| Diarization | JSON (Simple format or RIVA ASR format) |
Input Modes#
The Active Speaker Detection NIM provides two modes for processing input files: streaming and transactional.
| Aspect | Transactional Mode | Streaming Mode |
|---|---|---|
| Data Storage | Entire video and audio files are temporarily copied to disk. | Only the frames being processed are temporarily held in memory. |
| Processing Start | The NIM waits to receive the entire files before starting. | The NIM starts processing as soon as the data chunk for the first frame arrives. |
| Processing Timing | Processing begins after all data is received. | Processing is continuous, without waiting for complete input. |
| Diarization Data | Complete diarization input is received with the video and audio before processing begins. | Diarization must be available at the time of the first inference request (refer to Streaming Mode and Diarization). |
| Output Delivery | Complete results are returned to the client after inference finishes for the whole video. | Results are generated and returned immediately, per frame. |
Streaming Mode#
Streaming mode is the recommended way to use the Active Speaker Detection NIM. It allows inference to begin without receiving the whole video from the client. Video frames are processed incrementally, and inference begins as soon as the data for the first frame is available. Output frames are then streamed back to the client immediately after inference.
The NIM automatically detects streamable videos and enables streaming mode. This mode delivers the lowest latency and best resource efficiency, and it scales well to large files.
Use streaming mode for these use cases:
Best overall performance (the NIM is optimized for this path).
Streamable video inputs.
Applications that benefit from receiving output as it is generated, without waiting for the entire file to be uploaded.
Large video files that benefit from incremental processing and reduced disk I/O.
Streaming mode works with streamable videos, in which metadata is positioned at the beginning of the file. Videos that are not streamable can be converted to a streamable format.
To make any video streamable, use FFmpeg with the following command:
ffmpeg -i sample_video.mp4 -movflags +faststart sample_video_streamable.mp4
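An MP4 is streamable when its moov metadata box precedes the mdat media box. The following sketch (not part of the NIM client) scans the top-level MP4 boxes to check this before uploading; it assumes 32-bit box sizes, which covers typical faststart files.

```python
# Sketch: check whether an MP4 is streamable by verifying that the "moov"
# box appears before the "mdat" box among the top-level boxes. Assumes
# 32-bit box sizes (no 64-bit "largesize" boxes).
import struct


def is_streamable(data: bytes) -> bool:
    offset = 0
    while offset + 8 <= len(data):
        # Each top-level box starts with a 4-byte big-endian size
        # followed by a 4-byte type code.
        size, box_type = struct.unpack_from(">I4s", data, offset)
        if box_type == b"moov":
            return True   # metadata found before any media data
        if box_type == b"mdat":
            return False  # media data found first: not streamable
        if size < 8:      # malformed or 64-bit size; stop scanning
            break
        offset += size
    return False
```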
Streaming Mode and Diarization#
In streaming mode, diarization must be available at the time of the first inference request. If diarization is not supplied for that request, the session is treated as having no diarization, which can result in incorrect outputs.
Note
We recommend that you provide diarization data by interleaving it with stream data or by sending the diarization payload before the first inference call.
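As an illustration of this "diarization first" ordering, the generator below uses plain dictionaries as stand-ins for the actual gRPC request messages; the real message types come from the compiled protos in active-speaker-detection/protos, and the field names here are hypothetical.

```python
# Sketch of the recommended request ordering for streaming mode: the
# diarization payload is yielded before any media chunks, so it is present
# at the first inference request. Plain dicts stand in for the real
# protobuf messages (hypothetical field names).
def request_stream(diarization_json: bytes, media_chunks):
    yield {"diarization": diarization_json}  # must precede media data
    for chunk in media_chunks:
        yield {"media": chunk}
```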
Transactional Mode#
In transactional mode, the video and audio files must be completely received by the NIM before processing can begin. This is the default mode; no flag needs to be set.
Transactional mode is suitable for the following use cases:
Processing of small video and audio files.
Applications that can wait for complete processing before receiving output.
Videos that are not optimized for streaming (such as non-streamable MP4 files in which metadata is located at the end of the file, requiring the entire file to be downloaded before playback can begin).
To run Active Speaker Detection in transactional mode, run the sample client:
python active_speaker_detection.py --target 127.0.0.1:8001 \
    --video-input ../assets/sample_streamable.mp4 \
    --audio-input ../assets/sample_audio.wav
Usage for Preview API Request#
python active_speaker_detection.py --preview-mode \
--target grpc.nvcf.nvidia.com:443 \
--function-id <FUNCTION_ID> \
--api-key <NVCF_API_KEY> \
--video-input ../assets/sample_streamable.mp4 \
--output out.mp4
Replace <FUNCTION_ID> and <NVCF_API_KEY> with your assigned function ID and API key. In preview mode, the client connects over a secure gRPC channel to the NVCF endpoint.
Output Format#
The sample client produces an output video with speaker bounding boxes overlaid on the original frames. A three-color scheme is used to indicate speaker state:
| Color | State |
|---|---|
| Green | Face is actively speaking. |
| Blue | Face has an assigned audio track (diarized) but is not speaking. |
| Red | Face is tracked but has no audio track assigned. |
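The color decision follows directly from two of the per-frame fields described in Output: `is_speaking` and whether a `diarized_speaker_id` is assigned. The sketch below mirrors the scheme; treating an empty or missing `diarized_speaker_id` as "no audio track assigned" is an assumption about the field's encoding, not confirmed client code.

```python
# Sketch: choose an overlay color from per-frame speaker fields, mirroring
# the three-color scheme above. Treats a falsy diarized_speaker_id as
# "no audio track assigned" (an assumption for this example).
def box_color(is_speaking: bool, diarized_speaker_id) -> str:
    if is_speaking:
        return "green"  # actively speaking
    if diarized_speaker_id:
        return "blue"   # diarized audio track assigned, but silent
    return "red"        # tracked face with no audio track
```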
For more information about configuring advanced parameters such as maximum speakers, audio source mode, and debug mode, see Advanced Usage.