About the Speaker Focus Effect
Important
The Speaker Focus effect is currently available under the Early Access program. For more details, refer to NVIDIA Maxine Early Access Program.
Note
In this guide, the terms Speaker Separation Effect and Speaker Focus Effect are used interchangeably. The effect is referred to as speaker_focus in the API.
When other people are speaking in the background, the primary speaker is often difficult to understand. The Speaker Focus Effect identifies the primary speaker and removes the speech of all other speakers from the input audio. This significantly improves the intelligibility of the primary speaker's speech.
The effect is also robust to the following types of background noise and partially suppresses them:
AC noise
Clapping
Fan noise
Keyboard
Mouse clicks
PC noise
Sounds of a vacuum cleaner
Tapping
To run the sample application on Windows for this effect, use the following commands:
# (One time, initial setup): Download models using models/download_models.ps1
powershell -ExecutionPolicy Bypass -File ./download_models.ps1 --gpu_architecture <gpu> --effects speaker_focus-16k,speaker_focus-48k
# Format: run_effects_demo.bat -g ^<architecture^> -e ^<effect^> -isr ^<input_sr^> -osr ^<output_sr^> -ir ^<intensity_ratio^> -ev ^<effect_version^> -vad ^<enable_vad^>
# 16k effect on turing GPU
run_effects_demo.bat -g turing -e speaker_focus -isr 16k -osr 16k
# 48k effect on ampere GPU
run_effects_demo.bat -g ampere -e speaker_focus -isr 48k -osr 48k
Note
For more information, see Use the Helper Script to Run the Sample Application.
To run the sample application on Linux for this effect, use the following commands:
# (One time, initial setup): Download models using models/download_models.sh
./download_models.sh --gpu <gpu> --effects speaker_focus-16k,speaker_focus-48k
# Refer to Section 3.2 for further details
# Format: ./run_effect.sh -g <gpu> -s <sample_rate> -e speaker_focus
# 16k effect
./run_effect.sh -g t4 -s 16 -e speaker_focus
# 48k effect
./run_effect.sh -g t4 -s 48 -e speaker_focus
Note
For more information, see Use the Helper Script to Run the Sample Application.
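The two Linux invocations above differ only in the value passed to -s. As a minimal sketch, a small wrapper function could reject unsupported sample rates before the helper script is called. The function name speaker_focus_cmd is hypothetical and not part of the SDK; it only prints the command line (using the flags shown in the format line above) and assumes run_effect.sh is in the current directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the SDK: builds the run_effect.sh command
# line for the Speaker Focus effect and rejects unsupported sample rates.
speaker_focus_cmd() {
    local gpu="$1" rate="$2"
    if [ -z "$gpu" ]; then
        echo "usage: speaker_focus_cmd <gpu> <16|48>" >&2
        return 2
    fi
    case "$rate" in
        16|48) ;;   # the effect supports 16 kHz and 48 kHz only
        *)
            echo "unsupported sample rate: ${rate} (expected 16 or 48)" >&2
            return 1
            ;;
    esac
    # Print the command instead of running it, so it can be reviewed first.
    echo "./run_effect.sh -g ${gpu} -s ${rate} -e speaker_focus"
}

speaker_focus_cmd t4 48   # prints: ./run_effect.sh -g t4 -s 48 -e speaker_focus
```

To run the command rather than print it, the output could be passed to a shell after review, for example with eval "$(speaker_focus_cmd t4 48)".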
This effect has the following characteristics:
Supported input/output format is 32-bit float audio with a sampling rate of 16 kHz or 48 kHz.
Supports mono-channel input and output.
Supports audio with up to four speakers and the noises listed earlier.
In the Linux SDK, this effect has the following maximum throughput (the number of batches supported in real time):
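| Architecture | Maximum Throughput for the 16K Effect | Maximum Throughput for the 48K Effect |
|--------------|---------------------------------------|---------------------------------------|
| T4           | 3040                                  | 1350                                  |
| A100         | 15760                                 | 6700                                  |
| A10          | 6480                                  | 2900                                  |
| L40          | 13890                                 | 5120                                  |
| H100         | 18500                                 | 8030                                  |
| B100         | 25030                                 | 11500                                 |
| RTX PRO 6000 | 25030                                 | 10380                                 |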
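The throughput figures above can feed a rough capacity estimate. The sketch below is illustrative only: the function names max_throughput and gpus_needed are hypothetical, and it assumes that load divides evenly across identical GPUs and that each GPU can be driven at the maximum listed in the table, with no headroom reserved. Real deployments would likely want a safety margin.

```shell
#!/usr/bin/env bash
# Maximum throughput values copied from the table above (Linux SDK).
# $1: architecture name, $2: effect sample rate (16 or 48)
max_throughput() {
    case "$1:$2" in
        T4:16)   echo 3040 ;;   T4:48)   echo 1350 ;;
        A100:16) echo 15760 ;;  A100:48) echo 6700 ;;
        A10:16)  echo 6480 ;;   A10:48)  echo 2900 ;;
        L40:16)  echo 13890 ;;  L40:48)  echo 5120 ;;
        H100:16) echo 18500 ;;  H100:48) echo 8030 ;;
        B100:16) echo 25030 ;;  B100:48) echo 11500 ;;
        "RTX PRO 6000:16") echo 25030 ;;
        "RTX PRO 6000:48") echo 10380 ;;
        *) echo "unknown architecture or sample rate: $1 $2" >&2; return 1 ;;
    esac
}

# Hypothetical estimate: GPUs needed for a target number of real-time batches,
# computed as a ceiling division by the per-GPU maximum.
# $1: target batches, $2: architecture, $3: sample rate (16 or 48)
gpus_needed() {
    local target="$1" per_gpu
    per_gpu="$(max_throughput "$2" "$3")" || return 1
    echo $(( (target + per_gpu - 1) / per_gpu ))
}

gpus_needed 10000 T4 48   # prints: 8
```

For example, 10000 batches of the 48k effect on T4 works out to 10000 / 1350, which rounds up to 8 GPUs under these assumptions.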