Advanced Usage#
Model Caching#
When the container launches for the first time, it downloads the required models from NGC. To avoid downloading the models on subsequent runs, you can cache them locally by using a cache directory:
# Create the cache directory on the host machine.
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod 777 "$LOCAL_NIM_CACHE"
# Choose manifest profile ID based on target architecture.
export MANIFEST_PROFILE_ID=<enter_valid_manifest_profile_id>
# Run the container with the cache directory mounted in the appropriate location.
docker run -it --rm --name=active-speaker-detection-nim \
--runtime=nvidia \
--gpus all \
--shm-size=8GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_MANIFEST_PROFILE=$MANIFEST_PROFILE_ID \
-e NIM_HTTP_API_PORT=8000 \
-e NIM_GRPC_API_PORT=8001 \
-p 8000:8000 \
-p 8001:8001 \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
nvcr.io/nim/nvidia/active-speaker-detection:latest
For more information about MANIFEST_PROFILE_ID, refer to Model Manifest Profiles.
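After the first successful run, you can confirm that the cache was populated before relying on it for subsequent launches. Assuming the same LOCAL_NIM_CACHE path as above, a quick check might look like this:

```shell
# Re-create the same cache location used in the example above.
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# After the first container run completes, this directory should contain the
# downloaded model files; later runs reuse them instead of re-downloading.
ls -A "$LOCAL_NIM_CACHE"
du -sh "$LOCAL_NIM_CACHE"
```

If the directory is empty after a run, verify that the container mounted it at /opt/nim/.cache and that the directory is writable by the container user.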
SSL Enablement#
Active Speaker Detection NIM provides an SSL mode to ensure secure communication between clients and the server by encrypting data in transit. To enable SSL, you must provide the path to the SSL certificate and key files in the container. The following example shows how to do this:
export NGC_API_KEY=<add-your-api-key>
# Directory on the host that contains the SSL certificate and key files.
export SSL_CERT=<path-to-ssl-certificate-directory>
docker run -it --rm --name=active-speaker-detection-nim \
--runtime=nvidia \
--gpus all \
--shm-size=8GB \
-v $SSL_CERT:/opt/nim/crt/:ro \
-e NGC_API_KEY=$NGC_API_KEY \
-p 8000:8000 \
-p 8001:8001 \
-e NIM_SSL_MODE="mtls" \
-e NIM_SSL_CA_CERTS_PATH="/opt/nim/crt/ssl_ca.pem" \
-e NIM_SSL_CERT_PATH="/opt/nim/crt/ssl_cert_server.pem" \
-e NIM_SSL_KEY_PATH="/opt/nim/crt/ssl_key_server.pem" \
nvcr.io/nim/nvidia/active-speaker-detection:latest
NIM_SSL_MODE can be set to mtls, tls, or disabled. With mtls, the container uses mutual TLS authentication, so the client must also present a valid certificate. With tls, the container uses server-side TLS authentication only. With disabled, SSL is turned off.
For more information, refer to Environment Variables.
Be sure to verify the permissions of the SSL certificate and key files on the host machine. The container cannot access the files if they are not readable by the user running the container.
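As a sketch of that verification, the following commands stage a certificate directory and confirm each file is readable before the container starts. The /tmp/nim-certs path and the touch calls are placeholders for illustration only; point SSL_CERT at your real certificate directory instead.

```shell
# Placeholder certificate directory for illustration only.
export SSL_CERT=/tmp/nim-certs
mkdir -p "$SSL_CERT"
touch "$SSL_CERT/ssl_ca.pem" "$SSL_CERT/ssl_cert_server.pem" "$SSL_CERT/ssl_key_server.pem"

# Make the files readable inside the container. 644 is permissive; tighten
# ownership and permissions to match your security policy.
chmod 644 "$SSL_CERT"/*.pem

# Confirm each file is readable before launching the container.
for f in "$SSL_CERT"/*.pem; do
  [ -r "$f" ] && echo "readable: $f" || echo "NOT readable: $f"
done
```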
Multiple Concurrent Inputs#
To run the server in multi-input concurrent mode, set the NV_AI4M_MAX_CONCURRENCY_PER_GPU environment variable to an integer greater than 1 in the server container. The server then accepts up to that many concurrent inputs per GPU.
Because Triton distributes the workload equally across all GPUs, the total number of concurrent inputs supported by the server is the number of GPUs multiplied by NV_AI4M_MAX_CONCURRENCY_PER_GPU.
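For example, following the same docker run pattern as the earlier examples, a server that accepts four concurrent inputs per GPU could be started like this (the value 4 is illustrative; choose a value appropriate to your GPU):

```shell
docker run -it --rm --name=active-speaker-detection-nim \
    --runtime=nvidia \
    --gpus all \
    --shm-size=8GB \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e NV_AI4M_MAX_CONCURRENCY_PER_GPU=4 \
    -e NIM_HTTP_API_PORT=8000 \
    -e NIM_GRPC_API_PORT=8001 \
    -p 8000:8000 \
    -p 8001:8001 \
    nvcr.io/nim/nvidia/active-speaker-detection:latest
```

With this setting on a host with 2 GPUs, the server would accept up to 2 × 4 = 8 concurrent inputs, subject to the NVDEC session limits described in this section.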
The Active Speaker Detection NIM uses NVENC/NVDEC hardware acceleration for video decoding, so GPUs without NVDEC hardware might not be supported. In addition, some GPUs support only a limited number of concurrent NVDEC sessions; on those GPUs, the NIM can process at most that many concurrent inputs.
For details, refer to the Video Encode and Decode Support Matrix.
Note
If incoming requests to the NIM exceed the GPU’s maximum concurrent decode limit, the processing might fail.
Maximum Speakers Configuration#
The NV_AI4M_ASD_MAX_SPEAKERS environment variable controls the maximum number of audio speaker streams that the Active Speaker Detection NIM can handle simultaneously. This setting affects both the gRPC service and the Triton backend.
Default: 5
Range: 1 to 5
To configure the maximum number of speakers:
docker run -it --rm --name=active-speaker-detection-nim \
--runtime=nvidia \
--gpus all \
--shm-size=8GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NV_AI4M_ASD_MAX_SPEAKERS=2 \
-e NIM_HTTP_API_PORT=8000 \
-e NIM_GRPC_API_PORT=8001 \
-p 8000:8000 \
-p 8001:8001 \
nvcr.io/nim/nvidia/active-speaker-detection:latest
Note
The NV_AI4M_ASD_MAX_SPEAKERS value must match the number of speaker streams expected in your use case.
Audio Source Configuration#
The Active Speaker Detection NIM supports two modes for audio input feed, configured by the client at request time:
Separate Stream (AUDIO_SOURCE_CONFIG_SEPARATE_STREAM): The client provides a separate audio file alongside the video. This is the default behavior. Use the --audio-input argument in the client to specify the audio file path.
Embedded in Video (AUDIO_SOURCE_CONFIG_EMBEDDED_IN_VIDEO): Audio is demuxed from the supplied video container. No separate audio file is needed. Use the --skip-audio flag in the client.
Separate Audio Stream#
python active_speaker_detection.py --target 127.0.0.1:8001 \
--video-input /path/to/video.mp4 \
--audio-input /path/to/audio.wav \
--diarization-input /path/to/diarization.json
Embedded Audio#
python active_speaker_detection.py --target 127.0.0.1:8001 \
--video-input /path/to/video_with_audio.mp4 \
--diarization-input /path/to/diarization.json \
--skip-audio
Speaker Detection Threshold#
The speaker_detection_threshold field in the ActiveSpeakerDetectionConfig proto controls the minimum confidence score required to classify a detected face as actively speaking.
Valid range: (0, 1), an open interval (exclusive of both endpoints).
Proto field: optional float speaker_detection_threshold in ActiveSpeakerDetectionConfig.
A lower threshold is more permissive, so more faces are classified as speaking; a higher threshold is stricter, so fewer are. The face-detection bounding boxes themselves are not affected by this threshold; only the is_speaking classification changes.
To set the threshold, include the speaker_detection_threshold field in the ActiveSpeakerDetectionConfig message sent as the first request in the gRPC stream:
detection_config = ActiveSpeakerDetectionConfig(
input_video_config=video_config,
input_audio_config=audio_config,
audio_source_config=AUDIO_SOURCE_CONFIG_SEPARATE_STREAM,
speaker_detection_threshold=0.75,
)
If the field is omitted or set to a value less than or equal to zero, the server falls back to its configured default threshold. Values outside the valid range (such as 1.5 or -0.5) are rejected by the server with an INVALID_ARGUMENT gRPC error.
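As an illustrative client-side guard (not part of the shipped client), you can validate a threshold against the open interval (0, 1) before sending the request. This sketch uses awk for the floating-point comparison:

```shell
# Hypothetical pre-flight check: flag thresholds outside (0, 1) before
# building the ActiveSpeakerDetectionConfig message.
THRESHOLD=0.75
if awk -v t="$THRESHOLD" 'BEGIN { exit !(t > 0 && t < 1) }'; then
  echo "valid threshold: $THRESHOLD"
else
  echo "invalid threshold: $THRESHOLD"  # server would reject with INVALID_ARGUMENT
fi
```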
For more information about Maxine NIM clients, refer to the GitHub repository NVIDIA-Maxine/nim-clients.