Overview
NVIDIA Maxine Studio Voice NIM uses state-of-the-art AI models to enhance speech recorded with low-quality microphones in noisy, reverberant environments, bringing it to studio-recorded quality. The NVIDIA Maxine Studio Voice NIM models are built on the NVIDIA software platform, incorporating CUDA, TensorRT, and Triton to provide out-of-the-box GPU acceleration.
Architecture
NVIDIA Maxine Studio Voice uses a time-domain convolutional encoder-decoder network, with sequential modeling applied to the encoded latent representation. The encoder processes the input speech samples to create a latent speech representation. The sequential modeling stage uses multi-head attention blocks to condition this representation on a preset studio-quality embedding. The decoder is a convolutional feed-forward waveform network that upsamples the output of the sequential modeling block to produce the final studio-quality audio.
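To make the downsample-then-upsample structure concrete, here is a minimal sketch that tracks only sequence lengths through a strided convolutional encoder and a mirrored transposed-convolution decoder. The strides, kernel sizes, layer count, and the one-second 48 kHz input are illustrative assumptions, not the published Studio Voice configuration. The length formulas are the standard ones for 1-D convolution and transposed convolution with dilation 1 and no output padding. The attention stage is omitted because it leaves the sequence length unchanged.

```python
# Hypothetical layer settings, chosen only for illustration.
# Each layer uses kernel = 2 * stride and padding = stride // 2, so an
# encoder layer divides the length by its stride and the matching
# decoder layer multiplies it back.
STRIDES = (2, 4, 8)


def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a 1-D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def conv_transpose1d_out_len(length, kernel, stride, padding):
    """Output length of a 1-D transposed convolution (dilation 1, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel


def encoder_lengths(num_samples, strides=STRIDES):
    """Sequence length after each encoder layer, starting with the raw waveform."""
    lengths = [num_samples]
    for s in strides:
        lengths.append(conv1d_out_len(lengths[-1], kernel=2 * s, stride=s, padding=s // 2))
    return lengths


def decoder_lengths(num_frames, strides=STRIDES):
    """Sequence length after each decoder layer, mirroring the encoder strides."""
    lengths = [num_frames]
    for s in reversed(strides):
        lengths.append(
            conv_transpose1d_out_len(lengths[-1], kernel=2 * s, stride=s, padding=s // 2)
        )
    return lengths


if __name__ == "__main__":
    enc = encoder_lengths(48000)     # assumed: one second of 48 kHz audio
    dec = decoder_lengths(enc[-1])   # latent frames back to waveform samples
    print("encoder:", enc)
    print("decoder:", dec)
```

With these assumed settings, the latent sequence is 64 times shorter than the waveform, so the attention blocks operate on far fewer time steps than there are audio samples, and the decoder restores the original sample count.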
Try It Out
Try out the NVIDIA Maxine Studio Voice NIM at this link. The Try API feature also lets you call the NVIDIA Maxine Studio Voice NIM API without hosting your own servers, because requests are served by the NVIDIA Cloud Functions backend.