Overview#
NVIDIA Studio Voice on Holoscan for Media uses AI models to enhance speech recorded through low-quality microphones in noisy or reverberant environments, producing studio-quality speech. The NIM ingests a single PCM audio stream over SMPTE ST 2110-30, performs real-time AI-based speech enhancement (dereverberation, restoration of clarity, and noise suppression), and publishes the enhanced audio over SMPTE ST 2110-30 for downstream broadcast equipment.
This NIM is designed for live broadcast and production workflows that ingest and egress uncompressed audio over SMPTE ST 2110 using NVIDIA Rivermax. It is deployed with Helm charts on a Holoscan for Media cluster and can optionally be managed through a Kubernetes operator.
Architecture#
The Studio Voice NIM on Holoscan for Media is built on the following NVIDIA software components:
- Holoscan for Media: Media platform for SMPTE ST 2110-oriented live pipelines and Kubernetes-native orchestration.
- NVIDIA Rivermax: GPU-accelerated SMPTE ST 2110 and RTP ingest and egress.
- CUDA: GPU-accelerated audio processing.
- TensorRT: Optimized neural network inference.
- Triton Inference Server: Efficient model serving.
- NVIDIA AFX SDK: Speech enhancement inference backend for noise suppression, dereverberation, and clarity restoration.
Processing Flow#
The NIM performs the following actions in sequence:
1. Ingests a SMPTE ST 2110-30 audio RTP stream (Rivermax, GPU-accelerated GStreamer) and prepares decoded audio for inference.
2. Runs inference through the AFX SDK backend on Triton to enhance the speech, suppressing noise and reverberation.
3. Emits ST 2110-30 enhanced audio suitable for downstream broadcast equipment.
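To make the "prepares decoded audio for inference" step more concrete, the following minimal sketch shows how an L24 payload (24-bit signed samples in network byte order, as defined in RFC 3190) could be converted to floating-point samples. It is an illustration only: the NIM performs this work internally through Rivermax and GStreamer, and the function name `decode_l24` is hypothetical, not part of any NVIDIA API.

```python
def decode_l24(payload: bytes, channels: int = 1) -> list[list[float]]:
    """Decode big-endian 24-bit PCM into per-channel floats in [-1.0, 1.0).

    Hypothetical helper for illustration; not the NIM's actual decoder.
    """
    frame_bytes = 3 * channels  # one 3-byte sample per channel per frame
    if len(payload) % frame_bytes != 0:
        raise ValueError("payload is not a whole number of L24 frames")

    samples: list[list[float]] = [[] for _ in range(channels)]
    for offset in range(0, len(payload), 3):
        # L24 samples are signed and transmitted most significant byte first.
        value = int.from_bytes(payload[offset:offset + 3], "big", signed=True)
        # Samples are interleaved by channel; scale by 2**23 to normalize.
        samples[(offset // 3) % channels].append(value / 8388608.0)
    return samples
```

For a mono stream, `decode_l24(payload)[0]` is the list of normalized samples carried by one RTP payload.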
Logical Pipeline#
Sender ──(ST 2110-30 audio)──► Studio Voice NIM ──(ST 2110-30 enhanced audio)──► Receiver
| Component | Role |
|---|---|
| Sender | Example workload: TS file → demux → ST 2110-30 audio (demo chart). |
| Studio Voice NIM | Inference and NMOS registry; enhanced ST 2110-30 audio output. |
| Receiver | Consumes enhanced audio, optionally re-streams via SRT. |
Operator variant: The operator reconciles only the Studio Voice NIM deployment and SDP ConfigMap from an NvidiaStudioVoiceMediaFunction custom resource (CR).
Inputs and Outputs#
| Direction | Stream Type | Transport | Format | Notes |
|---|---|---|---|---|
| Input | Audio | SMPTE ST 2110-30 | PCM L24, 48 kHz, mono | Single mono speech stream per session. |
| Output | Audio | SMPTE ST 2110-30 | PCM L24, 48 kHz, mono | Enhanced audio, same format as input. |
The upstream sender and downstream receiver must use the same sample rate and channel layout as the NIM.
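As a rough worked example of what this format implies on the wire, the sketch below computes the per-packet payload size and the payload bitrate for PCM L24, 48 kHz, mono, and lists the fields on which two endpoints disagree. The 1 ms packet time is an assumption (a common ST 2110-30 setting, not stated in this document), and `AudioFormat` and `mismatches` are hypothetical names used only for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AudioFormat:
    """Hypothetical description of a PCM stream; defaults match the table above."""
    sample_rate_hz: int = 48_000
    bit_depth: int = 24
    channels: int = 1

    def samples_per_packet(self, packet_time_ms: float = 1.0) -> int:
        # Assumed 1 ms packet time: 48 000 samples/s * 0.001 s = 48 samples.
        return round(self.sample_rate_hz * packet_time_ms / 1000)

    def payload_bytes(self, packet_time_ms: float = 1.0) -> int:
        # RTP payload only; RTP, UDP, and IP headers are not included.
        return self.samples_per_packet(packet_time_ms) * self.channels * self.bit_depth // 8

    def payload_bitrate_bps(self) -> int:
        return self.sample_rate_hz * self.bit_depth * self.channels


def mismatches(expected: AudioFormat, actual: AudioFormat) -> list[str]:
    """Return the names of the fields on which two formats differ."""
    e, a = asdict(expected), asdict(actual)
    return [name for name in e if e[name] != a[name]]
```

Under these assumptions, `AudioFormat().payload_bytes()` is 144 bytes per packet and `AudioFormat().payload_bitrate_bps()` is 1,152,000 bit/s. Comparing against a stereo sender, `mismatches(AudioFormat(), AudioFormat(channels=2))` returns `["channels"]`.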
Note
Studio Voice expects a single mono speech stream per input session. For multi-channel mixes, downmix to mono on the sender or instantiate one NIM per speaker.
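One simple way to downmix on the sender is an equal-weight average of the channels in each frame. The sketch below assumes normalized floating-point samples and a hypothetical helper name, `downmix_to_mono`; real senders may apply different channel weights. Averaging does not separate speakers, so one NIM per speaker remains the option for multi-speaker content.

```python
def downmix_to_mono(frames):
    """Average the channels of each frame into a single mono sample.

    `frames` is a sequence of per-frame tuples, one sample per channel.
    Equal weighting is an assumption made for this illustration.
    """
    mono = []
    for frame in frames:
        if len(frame) == 0:
            raise ValueError("each frame needs at least one channel")
        mono.append(sum(frame) / len(frame))
    return mono
```

For example, the stereo frames `[(0.5, -0.5), (1.0, 0.0)]` downmix to `[0.0, 0.5]`.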
Deployment#
Studio Voice NIM supports three deployment patterns:
| Deployment Option | Description |
|---|---|
| NIM Service Chart | Deploys the Studio Voice NIM as a standalone Kubernetes Deployment for integration into a custom media pipeline. |
| Kubernetes Operator | Manages Studio Voice NIM workloads through an NvidiaStudioVoiceMediaFunction custom resource. |
| End-to-End Demo Chart | Deploys a complete demo pipeline (sender, NIM service, and receiver) as a single Helm release. Demonstrates the full sender-to-receiver flow and is recommended for initial evaluation. |
For installation instructions, see Installation.
Supported Operating Modes#
ST 2110 Static Mode#
Media streams are transported over IP networks using predefined multicast IP addresses and ports. Configuration is performed manually, and IP address and port settings must match between sending and receiving devices.
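In static mode, stream settings are typically exchanged as SDP. The sketch below shows a minimal example SDP for a mono L24, 48 kHz stream and a helper that reports which settings differ between a sender and a receiver description. The addresses, port, payload type, and function names are hypothetical, and a production ST 2110-30 SDP carries additional attributes (for example, clock references) that are omitted here for brevity.

```python
# Hypothetical SDP for illustration; all addresses and ports are placeholders.
EXAMPLE_SDP = """\
v=0
o=- 1 1 IN IP4 192.0.2.10
s=Example mono speech stream
c=IN IP4 239.10.10.1/64
t=0 0
m=audio 5004 RTP/AVP 97
a=rtpmap:97 L24/48000/1
a=ptime:1
"""


def stream_settings(sdp: str) -> dict:
    """Extract the multicast address, port, and audio format from an SDP."""
    settings = {}
    for line in sdp.splitlines():
        if line.startswith("c=IN IP4 "):
            # Connection line: address, optionally followed by /TTL.
            settings["address"] = line.split()[2].split("/")[0]
        elif line.startswith("m=audio "):
            settings["port"] = int(line.split()[1])
        elif line.startswith("a=rtpmap:"):
            # Format is <encoding>/<clock rate>[/<channels>].
            parts = line.split(" ", 1)[1].split("/")
            settings["encoding"] = parts[0]
            settings["sample_rate_hz"] = int(parts[1])
            settings["channels"] = int(parts[2]) if len(parts) > 2 else 1
    return settings


def differing_keys(sender_sdp: str, receiver_sdp: str) -> list[str]:
    """Return the setting names that do not match between the two SDPs."""
    s, r = stream_settings(sender_sdp), stream_settings(receiver_sdp)
    return sorted(k for k in s.keys() | r.keys() if s.get(k) != r.get(k))
```

An empty result from `differing_keys` means the parsed address, port, and format agree on both sides. A receiver configured with port 5006 instead of 5004 would be reported as `["port"]`.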
NMOS Mode#
Devices and media streams are discovered and connected automatically using the AMWA NMOS specifications (IS-04 for discovery and registration, IS-05 for connection management). This mode simplifies system setup by removing the need for manual IP address and port configuration.
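For orientation, an IS-05 connection is usually made by an NMOS controller that sends a PATCH request to a receiver's staged endpoint. The sketch below only builds the request path and JSON body and does not contact any service. The UUIDs are placeholders, the helper name `staged_patch` is hypothetical, and where the Studio Voice NIM exposes its Connection API is not specified here.

```python
import json

# Placeholder identifiers for illustration; real IDs come from the IS-04 registry.
RECEIVER_ID = "11111111-2222-3333-4444-555555555555"
SENDER_ID = "66666666-7777-8888-9999-000000000000"


def staged_patch(receiver_id: str, sender_id: str, sdp: str) -> tuple[str, str]:
    """Build the IS-05 path and JSON body that connect a receiver to a sender."""
    path = f"/x-nmos/connection/v1.1/single/receivers/{receiver_id}/staged"
    body = {
        "sender_id": sender_id,
        "master_enable": True,
        # Apply the staged parameters as soon as the request is accepted.
        "activation": {"mode": "activate_immediate"},
        # The sender's SDP tells the receiver which stream to subscribe to.
        "transport_file": {"type": "application/sdp", "data": sdp},
    }
    return path, json.dumps(body)
```

A controller would send the returned body with `Content-Type: application/json` to the returned path on the receiver's Connection API host.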
See Also#
Holoscan for Media documentation.
System Requirements for Holoscan for Media.
Rivermax and ST 2110 for Holoscan for Media.