About the Voice Font Effect

Important

The Voice Font effect is currently available under the Early Access program. For more details, refer to NVIDIA Maxine Early Access Program.

The Voice Font effect is an any-to-any voice conversion system that converts an input voice to match a reference speaker's voice, using only 30 seconds of the reference speaker's speech. The processed output retains the linguistic content and prosody of the original input.

This effect has two variations:

Voice Font High Quality
Voice Font Low Latency

Both variations support the following configurations:

16-kHz input/reference audio and 16-kHz output audio.
48-kHz input/reference audio and 48-kHz output audio.

Note

This effect has been tested mainly with English speech and might not work well with other languages. For support with other languages, contact maxine-support@nvidia.com.

To ensure that the quality of the output is optimal, follow these guidelines when recording reference audio:

Use high-quality external microphones or wired headsets to record audio. Audio recorded on low-end Bluetooth or built-in laptop microphones can introduce artifacts and should be avoided.
When recording audio, position the microphone 6 to 10 inches from the speaker’s mouth for clear and loud audio. If available, enable automatic gain control (AGC). Excessively loud or quiet recordings can lead to suboptimal audio output quality.
Record audio in a quiet environment with minimal background noise and reverberation.
Disable noise-cancellation plugins or software during recording, because they might negatively affect the quality of the output.
Adjust the input levels on the recording device so that the audio is neither too loud nor too soft (the level should be above -10 dB). For optimal levels, refer to the volume levels in the sample files provided in the SDK.
Record reference files with a neutral emotional tone. For best results, use dedicated audio software such as Audacity or Adobe Audition.
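As a rough illustration of the guidelines above, the following Python sketch inspects a recorded reference file before it is handed to the effect. It is not part of the SDK, and the function names are hypothetical. It rests on three assumptions: the file is a 16-bit PCM WAV, the -10 dB guideline is read as a peak level relative to full scale, and a recording shorter than 30 seconds is treated as too short.

```python
import array
import math
import sys
import wave

SUPPORTED_RATES = (16000, 48000)  # sample rates listed in the configurations above
MIN_PEAK_DB = -10.0               # assumption: guideline read as peak level in dBFS


def inspect_reference_wav(path):
    """Return sample rate, duration, and peak level of a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        n_frames = wav.getnframes()
        samples = array.array("h", wav.readframes(n_frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    peak_db = 20 * math.log10(peak / 32768) if peak else float("-inf")
    return {"sample_rate": rate, "duration_s": n_frames / rate, "peak_db": peak_db}


def reference_problems(info, target_seconds=30.0):
    """List the guideline checks (as interpreted in this sketch) that fail."""
    problems = []
    if info["sample_rate"] not in SUPPORTED_RATES:
        problems.append("sample rate is not 16 kHz or 48 kHz")
    if info["duration_s"] < target_seconds:
        problems.append("shorter than 30 seconds")
    if info["peak_db"] < MIN_PEAK_DB:
        problems.append("peak level below -10 dB")
    return problems
```

A check like this only covers the measurable guidelines. Microphone quality, background noise, and emotional tone still need to be judged by listening.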

Voice Font High Quality

This variation is intended for offline or batch use cases and provides better output quality.

This effect has the following limitations:

Each input audio packet must be 800 ms long, at either 16 kHz or 48 kHz.
Reference audio of 30 seconds must be set only once before streaming input audio (either 16 kHz or 48 kHz, depending on how the effect is configured).
On Linux, this effect supports a maximum batch size of 4.
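To make the packet-size requirement concrete, the following Python sketch converts the 800 ms packet duration into a sample count and splits a mono buffer into packets of that size. The helper names are hypothetical and not part of the SDK. Zero-padding the final partial packet is an assumption of this sketch; the SDK may expect the tail to be handled differently.

```python
def samples_per_packet(sample_rate_hz, packet_ms=800):
    """Number of samples in one packet of the given duration."""
    return sample_rate_hz * packet_ms // 1000


def split_into_packets(samples, sample_rate_hz, packet_ms=800):
    """Split a mono sample sequence into fixed-size packets.

    Assumption: the final partial packet is padded with zeros.
    """
    size = samples_per_packet(sample_rate_hz, packet_ms)
    packets = []
    for start in range(0, len(samples), size):
        packet = list(samples[start:start + size])
        packet.extend([0.0] * (size - len(packet)))  # pad the tail, if any
        packets.append(packet)
    return packets


print(samples_per_packet(16000))  # 12800 samples per 800 ms packet at 16 kHz
print(samples_per_packet(48000))  # 38400 samples per 800 ms packet at 48 kHz
```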

To run the effect on Windows, use the following command:

# (One time, initial setup): Download models using models/download_models.ps1
powershell -ExecutionPolicy Bypass -File ./download_models.ps1 --gpu_architecture <gpu> --effects voice_font

:: Format: run_effects_demo.bat -g ^<architecture^> -e ^<effect^> -isr ^<input_sr^> -osr ^<output_sr^> -ir ^<intensity_ratio^> -ev ^<effect_version^> -vad ^<enable_vad^>

:: 16k effect on turing GPU
run_effects_demo.bat -g turing -e voice_font_high_quality -isr 16k -osr 16k

:: 48k effect on ampere GPU
run_effects_demo.bat -g ampere -e voice_font_high_quality -isr 48k -osr 48k

Note

For more information, see Use the Helper Script to Run the Sample Application.

To run the sample application on Linux for this effect, use the following command:

# (One time, initial setup): Download models using models/download_models.sh
./download_models.sh --gpu <gpu> --effects voice_font

# Refer to Section 3.2 for further details
# Format: ./run_effect.sh -g <gpu> -s <sample_rate> -e voice_font_high_quality

# 16k effect
./run_effect.sh -g t4 -s 16 -e voice_font_high_quality -b 2

# 48k effect
./run_effect.sh -g t4 -s 48 -e voice_font_high_quality -b 2

Note

For more information, see Use the Helper Script to Run the Sample Application.

Voice Font Low Latency

The Voice Font Low Latency effect is a low-latency variation intended for real-time use cases. It has a latency of 170 ms, compared to 1.6 seconds for Voice Font High Quality.

This effect has the following limitations:

Each input audio packet must be 160 ms long, at either 16 kHz or 48 kHz.
Reference audio of 30 seconds must be set only once before streaming input audio (either 16 kHz or 48 kHz, depending on how the effect is configured).
On Linux, this effect supports a maximum batch size of 4. This effect is not currently supported on Windows.

Note

This effect is optimized specifically for real-time use cases, where input audio is provided in small chunks and output audio is expected back with minimal delay.

This effect is not suitable for offline use cases, such as post-processing, because it processes audio in very small chunks. Use the high-quality effect for such cases: its larger input chunk size means that each processing call covers more of the input, so fewer calls are needed and overall processing is faster than with the low-latency effect.
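The difference in the number of processing calls can be worked out from the packet durations stated for the two variations (800 ms and 160 ms). The following Python sketch does the arithmetic for a hypothetical 60-second clip. The function name is an illustration, not an SDK call, and the call count alone does not predict actual throughput.

```python
import math


def processing_calls(clip_ms, packet_ms):
    """Number of fixed-size packets needed to cover a clip of clip_ms milliseconds."""
    return math.ceil(clip_ms / packet_ms)


clip_ms = 60_000  # hypothetical 60-second clip

print(processing_calls(clip_ms, 800))  # 75 calls with 800 ms packets (high quality)
print(processing_calls(clip_ms, 160))  # 375 calls with 160 ms packets (low latency)
```

For the same clip, the low-latency variation needs five times as many calls, which is the overhead the note above refers to.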

To run the effect on Windows, use the following command:

:: For SDK Developer Package:

:: Format: run_effects_demo.bat -g ^<architecture^> -e ^<effect^> -isr ^<input_sr^> -osr ^<output_sr^> -ir ^<intensity_ratio^> -ev ^<effect_version^> -vad ^<enable_vad^>

:: 16k effect on turing GPU
run_effects_demo.bat -g turing -e voice_font_low_latency -isr 16k -osr 16k

:: 48k effect on ampere GPU
run_effects_demo.bat -g ampere -e voice_font_low_latency -isr 48k -osr 48k

Note

For more information, see Use the Helper Script to Run the Sample Application.

To run the sample application on Linux for this effect, use the following command:

# Format: ./run_effect.sh -g <gpu> -s <sample_rate> -e voice_font_low_latency

# 16k effect
./run_effect.sh -g t4 -s 16 -e voice_font_low_latency -b 2

# 48k effect
./run_effect.sh -g t4 -s 48 -e voice_font_low_latency -b 2

Note

For more information, see Use the Helper Script to Run the Sample Application.
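When the Linux helper script is launched from another program, the constraints described on this page can be checked before the command is built. The following Python sketch assembles the argument list for `run_effect.sh` using only the flags that appear in the examples above. The function is hypothetical, and reading `-b` as the batch size is an assumption inferred from the examples and the stated Linux limit of 4.

```python
EFFECTS = ("voice_font_high_quality", "voice_font_low_latency")
SAMPLE_RATES_KHZ = (16, 48)
MAX_BATCH_SIZE = 4  # Linux limit stated for both variations


def build_run_effect_command(gpu, sample_rate_khz, effect, batch_size=None):
    """Return the argv list for run_effect.sh after basic validation.

    Assumption: -b selects the batch size; it is omitted when batch_size is None.
    """
    if effect not in EFFECTS:
        raise ValueError(f"unknown effect: {effect}")
    if sample_rate_khz not in SAMPLE_RATES_KHZ:
        raise ValueError("sample rate must be 16 or 48 (kHz)")
    command = ["./run_effect.sh", "-g", gpu, "-s", str(sample_rate_khz), "-e", effect]
    if batch_size is not None:
        if not 1 <= batch_size <= MAX_BATCH_SIZE:
            raise ValueError(f"batch size must be between 1 and {MAX_BATCH_SIZE}")
        command += ["-b", str(batch_size)]
    return command


print(" ".join(build_run_effect_command("t4", 16, "voice_font_low_latency", 2)))
```

The resulting list can be passed to `subprocess.run` from the directory that contains the script.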