About the Speaker Focus Effect

Important

The Speaker Focus effect is currently available under the Early Access program. For more details, refer to NVIDIA Maxine Early Access Program.

Note

In this guide, the terms Speaker Separation Effect and Speaker Focus Effect are used interchangeably. The API refers to this effect as speaker_focus.

Speech is often hard to understand when other people are talking in the background. The Speaker Focus Effect identifies the primary speaker and removes the speech of all other speakers from the input audio, which significantly improves the intelligibility of the primary speaker.

This effect is also robust against the following types of background noise and suppresses them to a certain extent:

AC noise
Clapping
Fan noise
Keyboard
Mouse clicks
PC noise
Vacuum cleaner noise
Tapping

To run the sample application on Windows for this effect, use the following commands:

# (One time, initial setup): Download models using models/download_models.ps1
powershell -ExecutionPolicy Bypass -File ./download_models.ps1 --gpu_architecture <gpu> --effects speaker_focus-16k,speaker_focus-48k

# Format: run_effects_demo.bat -g ^<architecture^> -e ^<effect^> -isr ^<input_sr^> -osr ^<output_sr^> -ir ^<intensity_ratio^> -ev ^<effect_version^> -vad ^<enable_vad^>

# 16k effect on a Turing GPU
run_effects_demo.bat -g turing -e speaker_focus -isr 16k -osr 16k

# 48k effect on an Ampere GPU
run_effects_demo.bat -g ampere -e speaker_focus -isr 48k -osr 48k

Note

For more information, see Use the Helper Script to Run the Sample Application.

To run the sample application on Linux for this effect, use the following commands:

# (One time, initial setup): Download models using models/download_models.sh
./download_models.sh --gpu <gpu> --effects speaker_focus-16k,speaker_focus-48k

# Refer to Section 3.2 for further details
# Format: ./run_effect.sh -g <gpu> -s <sample_rate> -e speaker_focus

# 16k effect
./run_effect.sh -g t4 -s 16 -e speaker_focus

# 48k effect
./run_effect.sh -g t4 -s 48 -e speaker_focus

Note

For more information, see Use the Helper Script to Run the Sample Application.
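The two Linux invocations above differ only in the value passed to -s. As a minimal sketch, the following hypothetical helper (build_speaker_focus_cmd is not part of the SDK) validates the sample rate and prints the corresponding command line instead of running it. It uses only the -g, -s, and -e options shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the SDK: prints the run_effect.sh command
# for the Speaker Focus effect without executing it (dry run).
build_speaker_focus_cmd() {
    local gpu="$1" rate="$2"
    case "$rate" in
        16|48) ;;  # the only sample rates listed for this effect
        *) echo "unsupported sample rate: $rate (expected 16 or 48)" >&2
           return 1 ;;
    esac
    printf './run_effect.sh -g %s -s %s -e speaker_focus\n' "$gpu" "$rate"
}

# Print the commands for both supported sample rates on a T4.
for rate in 16 48; do
    build_speaker_focus_cmd t4 "$rate"
done
```

To run a command rather than print it, copy the printed line into the shell from the directory that contains run_effect.sh.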

This effect has the following characteristics:

Supports 32-bit float audio input and output at a sampling rate of 16 kHz or 48 kHz.
Supports mono-channel input and output.
Supports audio with up to four speakers and the noises listed earlier.

In the Linux SDK, this effect has the following maximum throughput (the number of batches supported in real time):

Architecture	Maximum Throughput for the 16K Effect	Maximum Throughput for the 48K Effect
T4	3040	1350
A100	15760	6700
A10	6480	2900
L40	13890	5120
H100	18500	8030
B100	25030	11500
RTX PRO 6000	25030	10380
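
The throughput figures above can be used for rough capacity planning. The following sketch estimates how many GPUs of one type would be needed for a target number of real-time batches, using ceiling division. It assumes that throughput scales linearly across GPUs and that each GPU reaches the maximum listed in the table. Neither assumption is stated in this guide, so treat the result as a lower bound. The gpus_needed function and the target of 10000 are illustrative only.

```shell
#!/usr/bin/env bash
# Illustrative helper: GPUs needed to serve a target number of real-time
# batches, assuming linear scaling and the table's maximum throughput per GPU.
gpus_needed() {
    local target="$1" per_gpu="$2"
    # Integer ceiling division: round up so the target is fully covered.
    echo $(( (target + per_gpu - 1) / per_gpu ))
}

# T4 figures from the table: 3040 (16K effect) and 1350 (48K effect).
gpus_needed 10000 3040   # prints 4
gpus_needed 10000 1350   # prints 8
```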