Overview#
NVIDIA Active Speaker Detection NIM uses state-of-the-art AI models to detect and identify active speakers within a video stream by analyzing visual data together with diarized audio. The feature can track multiple speakers across scene changes and camera cuts.
The feature processes video, audio, and diarization data inputs to produce per-frame detection results. Each result includes bounding boxes for the detected speakers, speaker identifiers for tracking within a session, the active speaking state, and face-detection confidence scores.
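To make the result structure concrete, the following is an illustrative sketch of what one per-frame result might look like. The field names and layout here are assumptions for illustration, not the NIM's actual response schema; refer to the API reference for the authoritative format.

```python
# Hypothetical per-frame result; field names are illustrative, not the real schema.
frame_result = {
    "frame_index": 42,
    "detections": [
        {
            # Face bounding box in pixel coordinates
            "bbox": {"x": 312, "y": 148, "width": 96, "height": 96},
            "speaker_id": 1,       # stable ID for tracking the speaker across frames
            "is_speaking": True,   # active speaking state for this frame
            "confidence": 0.97,    # face-detection confidence score
        },
    ],
}

# A client could filter for the speakers active in this frame:
active = [d for d in frame_result["detections"] if d["is_speaking"]]
```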
NVIDIA Active Speaker Detection NIM models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to offer ready-to-use GPU acceleration.
Architecture#
The Active Speaker Detection service is built on NVIDIA’s software platform:
CUDA for GPU-accelerated processing.
TensorRT for optimized neural network inference.
Triton Inference Server for efficient model serving and scaling.
NVIDIA Augmented Reality (AR) SDK backend for face detection, facial landmark computation, and speaker detection inference.
The system uses an architecture that performs the following actions:
Decodes input video frames using GPU-accelerated GStreamer (NVDEC).
Extracts and processes frame-accurate audio data aligned with video frames.
Processes diarization timelines to identify per-frame speaker masks.
Runs inference through the AR SDK backend on Triton to detect active speakers.
Returns per-frame results with bounding boxes, speaker IDs, and confidence scores.
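The diarization step above maps a timeline of speaker segments onto individual video frames. A minimal sketch of that mapping, assuming diarization segments given as `(start_s, end_s, speaker)` tuples (the NIM's internal representation may differ):

```python
def speaker_mask_per_frame(segments, num_frames, fps):
    """Map diarization segments (start_s, end_s, speaker) to the set of
    speakers active at each frame. A simplified sketch of the per-frame
    speaker-mask step; not the NIM's actual implementation."""
    masks = []
    for i in range(num_frames):
        t = i / fps  # timestamp of frame i in seconds
        masks.append({spk for start, end, spk in segments if start <= t < end})
    return masks

# Two overlapping speakers sampled at 2 frames per second:
segments = [(0.0, 1.0, "A"), (0.5, 2.0, "B")]
masks = speaker_mask_per_frame(segments, num_frames=4, fps=2.0)
# frames fall at t = 0.0, 0.5, 1.0, 1.5
```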
Input Processing Modes#
The Active Speaker Detection NIM supports two modes of operation:
Streaming Mode (recommended): Input files are streamed to the NIM in chunks. As the chunks arrive, the NIM runs inference incrementally and streams the output back to the client in chunks, before the whole input file has been received. The NIM automatically detects streamable videos and enables this mode. For best performance, use streamable videos as input.
Transactional Mode: Input files are uploaded in full and processed as a complete unit by the NIM before results are returned to the client.
For detailed information about when to use each mode and how to convert between file formats, refer to the Input Modes section of Basic Inference.
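On the client side, streaming mode amounts to reading the input file in fixed-size chunks and sending each chunk as it is read. The sketch below shows only that chunking pattern with an in-memory buffer standing in for a video file; the chunk size and the transport for delivering chunks to the NIM are assumptions, not part of the documented API.

```python
import io


def stream_chunks(fileobj, chunk_size=1 << 20):
    """Yield a file-like object's contents in fixed-size chunks, as a
    client might when feeding a streaming-mode request. The 1 MiB
    default chunk size is an illustrative choice, not a requirement."""
    while chunk := fileobj.read(chunk_size):
        yield chunk


# In-memory stand-in for a streamable video file:
data = b"fake-video-bytes" * 10  # 160 bytes
chunks = list(stream_chunks(io.BytesIO(data), chunk_size=64))
```

Reassembling the chunks (`b"".join(chunks)`) yields the original bytes, which is what lets the NIM begin inference on early chunks while later ones are still in transit.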
Try It Out#
Try the NVIDIA Active Speaker Detection NIM at build.nvidia.com/nvidia/active-speaker-detection.
To experience the NVIDIA Active Speaker Detection NIM API without having to host your own servers, use the Try API feature, which uses the NVIDIA Cloud Function backend.