Overview

NVIDIA Studio Voice NIM uses state-of-the-art AI models to enhance speech recorded through low-quality microphones in noisy, reverberant environments, bringing it to studio-recorded quality. NVIDIA Studio Voice NIM models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to provide out-of-the-box GPU acceleration.

NVIDIA Studio Voice NIM has two modes:

Streaming: This mode uses real-time peer-to-peer audio processing with a continuous data flow, making it ideal for live applications like video conferencing and live broadcasting.
Transactional: This mode processes complete audio files in a single request-response interaction, making it better suited for offline audio enhancement and post-production workflows.
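The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the interaction patterns, not the NIM client API: `placeholder_enhance`, `transactional`, `streaming`, and `split_into_chunks` are hypothetical names, and the "model" here returns the audio unchanged.

```python
from typing import Callable, Iterable, Iterator, List

Samples = List[float]


def placeholder_enhance(samples: Samples) -> Samples:
    """Hypothetical stand-in for the model; returns the audio unchanged."""
    return list(samples)


def transactional(
    audio: Samples,
    enhance: Callable[[Samples], Samples] = placeholder_enhance,
) -> Samples:
    """One request carrying the complete recording, one response."""
    return enhance(audio)


def split_into_chunks(audio: Samples, chunk_size: int) -> Iterator[Samples]:
    """Cut a recording into fixed-size chunks to imitate a live source."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


def streaming(
    chunks: Iterable[Samples],
    enhance: Callable[[Samples], Samples] = placeholder_enhance,
) -> Iterator[Samples]:
    """Continuous flow: each chunk is processed and emitted as it arrives."""
    for chunk in chunks:
        yield enhance(chunk)


if __name__ == "__main__":
    audio = [0.0, 0.1, -0.2, 0.3, -0.4, 0.5, -0.6]
    whole = transactional(audio)
    pieces = list(streaming(split_into_chunks(audio, chunk_size=3)))
    print(len(whole), [len(piece) for piece in pieces])
```

In the transactional pattern the caller waits for the whole file to be processed. In the streaming pattern output becomes available chunk by chunk, which is what makes it suitable for live use.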

Architecture

NVIDIA Studio Voice uses a time-domain convolutional encoder-decoder network, with sequential modeling applied to the encoded latent representation. The encoder processes the input speech samples to create a latent speech representation. This representation is conditioned on a preset studio-quality embedding, using multi-head attention blocks for sequential modeling. The decoder is a waveform convolutional feed-forward network that upsamples the output of the sequential modeling block to produce the final studio-quality audio.
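To make the data flow concrete, the following sketch traces only the sequence lengths through such a pipeline: a strided convolutional encoder shortens the waveform into latent frames, attention-based sequential modeling keeps the frame count unchanged, and a transposed-convolution decoder upsamples back to waveform length. The layer count, kernel sizes, strides, padding, the use of transposed convolutions for upsampling, and the 48,000-sample input are all assumptions chosen for illustration, not the actual Studio Voice configuration.

```python
from typing import List, Tuple

# (kernel_size, stride, padding) per layer.
# Hypothetical values for illustration, not the real model's settings.
LAYERS: List[Tuple[int, int, int]] = [(4, 2, 1), (8, 4, 2), (8, 4, 2)]


def conv_out_len(n: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a 1-D convolution (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_transpose_out_len(n: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a 1-D transposed convolution (no output padding)."""
    return (n - 1) * stride - 2 * padding + kernel


def trace_lengths(
    num_samples: int,
    layers: List[Tuple[int, int, int]] = LAYERS,
) -> Tuple[int, int]:
    """Return (latent_frames, output_samples) for a given input length."""
    n = num_samples
    # Encoder: each strided convolution downsamples the sequence.
    for kernel, stride, padding in layers:
        n = conv_out_len(n, kernel, stride, padding)
    latent_frames = n
    # Sequential modeling: attention maps a sequence to one of equal length,
    # so the frame count is unchanged here.
    # Decoder: mirror the encoder to upsample back to waveform resolution.
    for kernel, stride, padding in reversed(layers):
        n = conv_transpose_out_len(n, kernel, stride, padding)
    return latent_frames, n


if __name__ == "__main__":
    latent, out = trace_lengths(48000)
    print(f"latent frames: {latent}, output samples: {out}")
```

With these assumed layers the total stride is 32, so 48,000 input samples become 1,500 latent frames and the decoder restores 48,000 output samples.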

Try It Out

Try out the NVIDIA Studio Voice NIM at nvidia/studiovoice. You can also use the Try API feature to experience the NVIDIA Studio Voice NIM API without hosting your own servers, because Try API uses the NVIDIA Cloud Functions backend.