Limitations and Known Behaviors

Audio Stream Requirements

Individual input audio streams must contain only a single speaker’s audio. During silence, audio samples must be zero. If background noise is present, enable audio thresholding via useAudioThresholdToDetectActiveAudioStream and audioThresholdDb. For details, refer to Configuration Reference.
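To check whether a given stream meets the zero-during-silence requirement, you can measure the level of a segment that you know contains no speech. The following is a minimal sketch, not part of the NIM: the helper names `rms_dbfs` and `needs_thresholding` are hypothetical, samples are assumed to be floats normalized to [-1.0, 1.0], and this section does not state how audioThresholdDb measures level (RMS or peak, or which reference), so treat the dBFS figure only as a rough guide when choosing a value.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """Return the RMS level of normalized samples in dBFS.

    Assumes samples are floats in [-1.0, 1.0]. Returns -inf when every
    sample is exactly zero (digital silence).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    # 10 * log10(mean square) equals 20 * log10(RMS).
    return 10.0 * math.log10(mean_square)


def needs_thresholding(silent_segment: Sequence[float]) -> bool:
    """True if a segment expected to be silent contains non-zero samples."""
    return rms_dbfs(silent_segment) != float("-inf")
```

If `needs_thresholding` returns True for segments without speech, the stream carries background noise, and enabling audio thresholding as described above is likely appropriate.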

Latency

The Active Speaker Detection model requires an initial 3-second processing window to produce its first result. A look-ahead queue then buffers output frames (30 frames by default, which is about 1 second at 30 fps) to ensure smooth downstream transmission. Together, the processing window and the queue give a fixed end-to-end latency of approximately 4 seconds at 30 fps with default settings.

The look-ahead queue depth is configurable via outputFrameBufferSize. For parameter details, refer to Configuration Reference.
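The latency arithmetic above can be sketched as a small helper for estimating the effect of a different queue depth or frame rate. This is an illustrative estimate only: the function name `estimated_latency_seconds` is hypothetical, and it assumes the 3-second processing window does not change with frame rate and that the two contributions simply add, as in the default case described above.

```python
MODEL_WINDOW_SECONDS = 3.0  # initial processing window stated in this section


def estimated_latency_seconds(output_frame_buffer_size: int = 30,
                              fps: float = 30.0) -> float:
    """Estimate end-to-end latency in seconds.

    latency = processing window + look-ahead queue depth / frame rate

    Assumption: the processing window is fixed at 3 seconds regardless
    of fps, and no other pipeline stage adds delay.
    """
    if output_frame_buffer_size < 0:
        raise ValueError("output_frame_buffer_size must be non-negative")
    if fps <= 0:
        raise ValueError("fps must be positive")
    return MODEL_WINDOW_SECONDS + output_frame_buffer_size / fps
```

With the defaults (30 frames at 30 fps) the estimate is 4.0 seconds. Halving outputFrameBufferSize to 15 at the same frame rate gives 3.5 seconds under the same assumptions.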

Throughput

The Active Speaker Detection NIM is optimized to adhere to SMPTE ST 2110 real-time processing standards. Performance scales with the number of tracked faces, diarized audio streams, and the GPU SKU used. Higher-end GPUs and lighter workloads achieve greater throughput, while more demanding combinations (more faces and more audio streams) require more compute.