About the Speaker Focus Effect

Important

The Speaker Focus effect is currently available under the Early Access program. For more details, refer to NVIDIA Maxine Early Access Program.

Note

In this guide, the names Speaker Separation Effect and Speaker Focus Effect are used interchangeably. The API refers to this effect as speaker_focus.

Audio in which other people are speaking in the background is often hard to understand. The Speaker Focus Effect identifies the primary speaker in the input audio and removes the speech of all other speakers, which significantly improves the intelligibility of the primary speaker.

This effect is robust to the following types of background noise and suppresses them to some extent:

  • AC noise

  • Clapping

  • Fan noise

  • Keyboard

  • Mouse clicks

  • PC noise

  • Vacuum cleaner noise

  • Tapping

To run the sample application on Windows for this effect, use the following commands:

# (One time, initial setup): Download models using models/download_models.ps1
powershell -ExecutionPolicy Bypass -File ./download_models.ps1 --gpu_architecture <gpu> --effects speaker_focus-16k,speaker_focus-48k

# Format: run_effects_demo.bat -g ^<architecture^> -e ^<effect^> -isr ^<input_sr^> -osr ^<output_sr^> -ir ^<intensity_ratio^> -ev ^<effect_version^> -vad ^<enable_vad^>

# 16k effect on turing GPU
run_effects_demo.bat -g turing -e speaker_focus -isr 16k -osr 16k

# 48k effect on ampere GPU
run_effects_demo.bat -g ampere -e speaker_focus -isr 48k -osr 48k

Note

For more information, see Use the Helper Script to Run the Sample Application.

To run the sample application on Linux for this effect, use the following commands:

# (One time, initial setup): Download models using models/download_models.sh
./download_models.sh --gpu <gpu> --effects speaker_focus-16k,speaker_focus-48k

# Refer to Section 3.2 for further details
# Format: ./run_effect.sh -g <gpu> -s <sample_rate> -e speaker_focus

# 16k effect
./run_effect.sh -g t4 -s 16 -e speaker_focus

# 48k effect
./run_effect.sh -g t4 -s 48 -e speaker_focus

Note

For more information, see Use the Helper Script to Run the Sample Application.
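The effect itself operates on mono, 32-bit float audio at 16 kHz or 48 kHz (see the characteristics listed in this section). If your audio starts out as interleaved 16-bit PCM, a conversion along the following lines may be needed before it reaches the effect. This is a minimal Python sketch using only the standard library. It is not part of the SDK: the function name pcm16_to_mono_float32 is hypothetical, and scaling samples into the range [-1.0, 1.0) is a common convention assumed here, not something this guide specifies. Resampling to a supported rate is not covered.

```python
import sys
from array import array

# Sample rates listed for this effect in this guide.
SUPPORTED_SAMPLE_RATES_HZ = (16_000, 48_000)


def pcm16_to_mono_float32(pcm_bytes: bytes, num_channels: int, sample_rate_hz: int) -> array:
    """Convert interleaved little-endian 16-bit PCM to mono 32-bit float samples.

    Hypothetical helper, not an SDK function.
    """
    if sample_rate_hz not in SUPPORTED_SAMPLE_RATES_HZ:
        raise ValueError(f"unsupported sample rate: {sample_rate_hz} Hz")
    if num_channels < 1:
        raise ValueError("num_channels must be at least 1")

    samples = array("h")          # signed 16-bit integers
    samples.frombytes(pcm_bytes)  # raises ValueError on a partial sample
    if sys.byteorder == "big":
        samples.byteswap()        # the input is little-endian
    if len(samples) % num_channels:
        raise ValueError("PCM data does not contain whole frames")

    mono = array("f")             # 32-bit floats
    scale = 32768.0 * num_channels
    for start in range(0, len(samples), num_channels):
        # Average the channels of one frame, then scale to [-1.0, 1.0)
        # (assumed convention, see the note above).
        mono.append(sum(samples[start:start + num_channels]) / scale)
    return mono
```

Calling pcm16_to_mono_float32(data, 2, 48000).tobytes() would then yield the raw float samples in the machine's native byte order.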

This effect has the following characteristics:

  • Supported input/output format is 32-bit float audio with a sampling rate of 16 kHz or 48 kHz.

  • Supports mono-channel input and output.

  • Supports audio with up to four speakers and the noises listed earlier.

  • In the Linux SDK, this effect has the following maximum throughput (the number of batches supported in real time):

    Architecture    Maximum Throughput for the 16K Effect    Maximum Throughput for the 48K Effect
    T4              3040                                     1350
    A100            15760                                    6700
    A10             6480                                     2900
    L40             13890                                    5120
    H100            18500                                    8030
    B100            25030                                    11500
    RTX PRO 6000    25030                                    10380
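
The throughput figures above can be used for a rough capacity estimate. The following Python sketch computes how many GPUs of one type a given number of concurrent streams might require. It rests on assumptions this guide does not state: that one unit of throughput corresponds to one real-time stream, and that capacity scales linearly across GPUs. The function name gpus_needed and the target_utilization_percent parameter are hypothetical and only illustrate leaving some headroom.

```python
# Maximum throughput copied from the table above (Linux SDK).
MAX_THROUGHPUT = {
    "T4": {"16k": 3040, "48k": 1350},
    "A100": {"16k": 15760, "48k": 6700},
    "A10": {"16k": 6480, "48k": 2900},
    "L40": {"16k": 13890, "48k": 5120},
    "H100": {"16k": 18500, "48k": 8030},
    "B100": {"16k": 25030, "48k": 11500},
    "RTX PRO 6000": {"16k": 25030, "48k": 10380},
}


def gpus_needed(streams: int, gpu: str, effect: str,
                target_utilization_percent: int = 100) -> int:
    """Estimate the GPU count for `streams` concurrent real-time streams.

    Assumes one throughput unit equals one stream and linear scaling
    across GPUs; both are planning assumptions, not SDK guarantees.
    """
    if streams < 0:
        raise ValueError("streams must not be negative")
    if not 1 <= target_utilization_percent <= 100:
        raise ValueError("target_utilization_percent must be in 1..100")

    # Usable capacity per GPU after reserving headroom (integer math).
    usable = MAX_THROUGHPUT[gpu][effect] * target_utilization_percent // 100
    if usable == 0:
        raise ValueError("utilization target leaves no usable capacity")

    # Ceiling division without floating point.
    return -(-streams // usable)
```

For example, under these assumptions gpus_needed(5000, "T4", "16k") evaluates to 2, and lowering target_utilization_percent to 80 raises the estimate to 3. Measured throughput on your own hardware should take precedence over this kind of estimate.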