Overview#

NVIDIA Active Speaker Detection on Holoscan for Media uses AI models to detect and identify active speakers in a live video stream by analyzing the video together with one or more diarized audio streams. It can track multiple speakers across scene cuts in the video.

This NIM is designed for live broadcast and production workflows that ingest and egress uncompressed video and audio over SMPTE ST 2110 using NVIDIA Rivermax. It is deployed with Helm charts on a Holoscan for Media cluster and can optionally be managed through a Kubernetes operator.

The NIM processes live video and one or more diarized audio streams to produce per-frame detection results. These results comprise:

  • Bounding boxes delineating the detected speakers.

  • Speaker identifiers for session tracking.

  • Active speaking state per speaker.

  • Face-detection confidence scores.

Detection results are transmitted downstream as SMPTE ST 2110-40 ancillary data. Optionally, the NIM can emit an output video stream with bounding box overlays for preview or validation purposes.
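As a rough illustration of the per-frame result described above, the following Python sketch models one frame's detections. All class and field names are hypothetical, and the confidence range is an assumption. The actual layout of the ST 2110-40 payload is not specified in this overview.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpeakerDetection:
    """One detected face in a frame (hypothetical field names)."""
    speaker_id: int          # identifier used for session tracking
    x: int                   # bounding box top-left corner, in pixels
    y: int
    width: int               # bounding box size, in pixels
    height: int
    is_speaking: bool        # active speaking state
    face_confidence: float   # face-detection confidence, assumed to lie in [0, 1]


@dataclass
class FrameResult:
    """All detections reported for a single video frame."""
    frame_index: int
    detections: List[SpeakerDetection] = field(default_factory=list)

    def active_speakers(self) -> List[int]:
        """Return the IDs of speakers marked as currently speaking."""
        return [d.speaker_id for d in self.detections if d.is_speaking]


frame = FrameResult(
    frame_index=120,
    detections=[
        SpeakerDetection(1, 400, 220, 180, 180, True, 0.97),
        SpeakerDetection(2, 1100, 240, 170, 170, False, 0.91),
    ],
)
print(frame.active_speakers())  # prints [1]
```

A downstream consumer, such as an automated camera switcher, would typically only need the list returned by `active_speakers()` plus the matching bounding boxes.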

Architecture#

The Active Speaker Detection NIM on Holoscan for Media is built on the following NVIDIA software components:

  • Holoscan for Media — media platform for SMPTE ST 2110–oriented live pipelines and Kubernetes-native orchestration.

  • NVIDIA Rivermax — GPU-accelerated SMPTE ST 2110 streaming.

  • NVIDIA DeepStream — GStreamer-based media pipelines and accelerated video preprocessing.

  • CUDA — GPU-accelerated processing.

  • TensorRT — optimized neural network inference.

  • Triton Inference Server — efficient model serving.

  • NVIDIA Augmented Reality SDK — face detection, facial landmark computation, and speaker detection inference.

Processing Flow#

The following diagram shows how media flows through the Active Speaker Detection NIM on Holoscan for Media:

[Figure: Active Speaker Detection pipeline]

The pipeline performs the following actions in sequence:

  1. Receives uncompressed video and one or more audio streams via SMPTE ST 2110 over the high-speed network interface.

  2. Preprocesses video frames with GPU-accelerated GStreamer (DeepStream).

  3. Aligns frame-accurate audio samples with the corresponding video frames.

  4. Processes diarization timelines to identify per-frame speaker-to-audio-stream assignments.

  5. Runs inference through the NVIDIA Augmented Reality SDK backend on Triton Inference Server to detect active speakers.

  6. Emits per-frame results—bounding boxes, speaker IDs, and confidence scores—as SMPTE ST 2110-40 ancillary data.

  7. Optionally emits an output video stream with bounding box overlays when testFrameOverlayMode is enabled.
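Steps 3 and 4 above can be sketched in a few lines of Python. This is a minimal illustration, not the NIM's implementation. It assumes video frames and audio samples are counted from a common start time (a real ST 2110 receiver would derive timing from RTP timestamps), and the diarization segment format `(start_s, end_s, stream_id)` and both function names are hypothetical.

```python
import math
from fractions import Fraction

AUDIO_RATE = 48_000  # Hz, matching the ST 2110-30 input format


def audio_span(frame_index: int, fps: Fraction) -> tuple[int, int]:
    """Half-open [start, end) range of audio samples covering one video frame."""
    start = math.floor(frame_index * AUDIO_RATE / fps)
    end = math.floor((frame_index + 1) * AUDIO_RATE / fps)
    return start, end


def speaking_streams(frame_index: int, fps: Fraction, segments) -> list[int]:
    """Audio streams whose diarization segments cover the frame's start time.

    `segments` is an iterable of (start_s, end_s, stream_id) tuples
    (a hypothetical format chosen for this sketch).
    """
    t = frame_index / fps  # frame start time in seconds, kept exact
    return sorted({sid for start, end, sid in segments if start <= t < end})


# At 29.97 fps a frame spans 1601.6 samples, so spans alternate in length.
fps = Fraction(30000, 1001)
for i in range(5):
    start, end = audio_span(i, fps)
    print(i, start, end, end - start)

# Two speakers on separate audio streams, overlapping between 1.2 s and 1.5 s.
segments = [(0.0, 1.5, 0), (1.2, 3.0, 1)]
print(speaking_streams(35, Fraction(25), segments))  # frame at 1.4 s
```

Exact rational frame rates avoid drift at fractional rates: over five frames at 30000/1001 fps, the spans add up to exactly 8008 samples.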

Inputs and Outputs#

| Direction | Stream Type | Transport | Format | Supported Configurations |
|---|---|---|---|---|
| Input | Video | SMPTE ST 2110-20 | Uncompressed YCbCr 4:2:2 10-bit | Resolutions: 720p, 1080p, 4K UHD. Frame rates: 23.976 (24000/1001), 25, 29.97 (30000/1001), 30, 50, 59.94 (60000/1001), and 60 fps. See also GPU encode/decode support matrix. |
| Input | Audio | SMPTE ST 2110-30 | Mono PCM L24, 48 kHz | One stream per speaker; multiple streams supported |
| Output | Ancillary data | SMPTE ST 2110-40 | SMPTE 291M | Per-frame bounding boxes, speaker IDs, and confidence scores |
| Output | Video (optional) | SMPTE ST 2110-20 | Matches input raster | Bounding box overlay; requires testFrameOverlayMode: true |
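For network planning, the formats above translate into approximate stream bit rates. The sketch below is a back-of-the-envelope estimate that counts active-picture payload only and ignores RTP/UDP/IP headers and any blanking. The raster sizes are the conventional pixel dimensions for the listed resolutions, and the function names are illustrative.

```python
from fractions import Fraction

# Conventional pixel dimensions for the supported resolutions.
RASTERS = {"720p": (1280, 720), "1080p": (1920, 1080), "4K UHD": (3840, 2160)}

# 4:2:2 10-bit: 10 bits of luma per pixel plus two 10-bit chroma samples
# shared between each pair of pixels, which averages 20 bits per pixel.
BITS_PER_PIXEL = 20


def video_payload_bps(raster: str, fps: Fraction) -> Fraction:
    """Active-picture payload rate in bits per second (no packet overhead)."""
    width, height = RASTERS[raster]
    return width * height * BITS_PER_PIXEL * fps


def audio_payload_bps(channels: int = 1, sample_rate: int = 48_000, bits: int = 24) -> int:
    """PCM payload rate in bits per second for one ST 2110-30 stream."""
    return channels * sample_rate * bits


for raster, fps in [("720p", Fraction(50)),
                    ("1080p", Fraction(60000, 1001)),
                    ("4K UHD", Fraction(60))]:
    gbps = float(video_payload_bps(raster, fps)) / 1e9
    print(f"{raster} @ {float(fps):.2f} fps: {gbps:.2f} Gbit/s")

print(f"Mono L24 48 kHz audio: {audio_payload_bps() / 1e6:.3f} Mbit/s")
```

By this estimate a 1080p stream at 59.94 fps carries about 2.49 Gbit/s of video payload and a 4K UHD stream at 60 fps about 9.95 Gbit/s, while each mono audio stream needs only 1.152 Mbit/s.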

Deployment#

Three deployment patterns are supported:

| Deployment Option | Description |
|---|---|
| NIM service chart | Deploys the Active Speaker Detection NIM as a standalone Kubernetes deployment for integration into a custom media pipeline. |
| Kubernetes operator | Manages Active Speaker Detection NIM workloads through a NvidiaActiveSpeakerDetectionMediaFunction custom resource. Suitable for declarative, production-grade management. |
| End-to-end demo chart | Deploys a complete demo pipeline (sender, NIM service, and receiver) as a single Helm release. Demonstrates the full sender-to-receiver flow and is recommended for initial evaluation. |

For installation instructions, see Installation.

Supported Operating Modes#

The Active Speaker Detection solution supports the following industry-standard operating modes when deployed with the Helm charts or operator.

ST 2110 Static Mode#

Media streams such as video, audio, and ancillary data are transported over IP networks using predefined multicast IP addresses and ports. Configuration is performed manually, and IP and port settings must match between sending and receiving devices to ensure proper media flow.
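Because static mode depends on both ends agreeing, a small pre-deployment check can catch mismatches early. The following sketch is illustrative only: the `StreamEndpoint` type, its field names, and the example addresses are hypothetical and do not correspond to the Helm chart's actual value names.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamEndpoint:
    """Hypothetical description of one side of an ST 2110 stream."""
    name: str
    multicast_ip: str
    port: int


def check_static_pair(sender: StreamEndpoint, receiver: StreamEndpoint) -> list[str]:
    """Return a list of human-readable problems; an empty list means the pair matches."""
    problems = []
    for ep in (sender, receiver):
        if not ipaddress.ip_address(ep.multicast_ip).is_multicast:
            problems.append(f"{ep.name}: {ep.multicast_ip} is not a multicast address")
        if not 1 <= ep.port <= 65535:
            problems.append(f"{ep.name}: port {ep.port} is out of range")
    if sender.multicast_ip != receiver.multicast_ip:
        problems.append("multicast IP addresses differ between sender and receiver")
    if sender.port != receiver.port:
        problems.append("ports differ between sender and receiver")
    return problems


video_tx = StreamEndpoint("sender-video", "239.10.10.1", 5004)
video_rx = StreamEndpoint("nim-video-in", "239.10.10.1", 5006)
for problem in check_static_pair(video_tx, video_rx):
    print(problem)  # reports the port mismatch
```

The same check would be repeated for each video, audio, and ancillary stream in the pipeline.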

NMOS Mode#

Devices and media streams are automatically discovered and connected using AMWA NMOS standards. This mode simplifies system setup by removing the need for manual IP address and port configuration.
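In NMOS deployments, discovery is typically done through the AMWA IS-04 Query API and connections through the IS-05 Connection API. The sketch below only builds the requests an NMOS controller would issue, without sending them. The host names, IDs, and API versions are placeholder assumptions, and in practice an NMOS controller application normally performs these steps.

```python
import json


def query_senders_url(registry: str, version: str = "v1.3") -> str:
    """URL that lists registered senders via the IS-04 Query API."""
    return f"http://{registry}/x-nmos/query/{version}/senders"


def staged_receiver_request(node: str, receiver_id: str, sender_id: str,
                            sdp: str, version: str = "v1.1") -> tuple[str, str]:
    """Build the URL and JSON body for an IS-05 PATCH that connects a receiver."""
    url = (f"http://{node}/x-nmos/connection/{version}"
           f"/single/receivers/{receiver_id}/staged")
    body = {
        "sender_id": sender_id,
        "master_enable": True,
        "activation": {"mode": "activate_immediate"},
        "transport_file": {"type": "application/sdp", "data": sdp},
    }
    return url, json.dumps(body)


# Placeholder values for illustration only.
print(query_senders_url("registry.example:8080"))
url, body = staged_receiver_request(
    node="nim-node.example:8081",
    receiver_id="RECEIVER-UUID",
    sender_id="SENDER-UUID",
    sdp="v=0\r\n",  # the sender's SDP file would go here
)
print(url)
print(body)
```

The controller fetches the sender's SDP from the sender's manifest location and passes it in `transport_file`, so no multicast address or port has to be entered by hand.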

For configuration details, see Configuration Reference and the End-to-End Demo and NIM Service installation pages.

See Also#