Overview#

NVIDIA Studio Voice on Holoscan for Media uses AI models to enhance speech recorded through low-quality microphones in noisy, reverberant environments, producing studio-quality output. The NIM ingests a single PCM audio stream over SMPTE ST 2110-30, performs real-time AI-based speech enhancement (noise suppression, dereverberation, and clarity restoration), and publishes the enhanced audio over SMPTE ST 2110-30 for downstream broadcast equipment.

This NIM is designed for live broadcast and production workflows that ingest and egress uncompressed audio over SMPTE ST 2110 using NVIDIA Rivermax. It is deployed with Helm charts on a Holoscan for Media cluster and can optionally be managed through a Kubernetes operator.

Architecture#

The Studio Voice NIM on Holoscan for Media is built on the following NVIDIA software components:

Holoscan for Media — Media platform for SMPTE ST 2110–oriented live pipelines and Kubernetes-native orchestration.
NVIDIA Rivermax — GPU-accelerated SMPTE ST 2110 and RTP ingest and egress.
CUDA — GPU-accelerated audio processing.
TensorRT — Optimized neural network inference.
Triton Inference Server — Efficient model serving.
NVIDIA AFX SDK — Speech enhancement inference backend for noise suppression, dereverberation, and clarity restoration.

Processing Flow#

The NIM performs the following actions in sequence:

1. Ingests a SMPTE ST 2110-30 audio RTP stream (Rivermax, GPU-accelerated GStreamer) and prepares decoded audio for inference.
2. Runs inference through the AFX SDK backend on Triton to enhance the speech, suppressing noise and reverberation.
3. Emits ST 2110-30 enhanced audio suitable for downstream broadcast equipment.
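
To illustrate the "prepares decoded audio" part of the first step, the sketch below converts an L24 payload (24-bit signed, big-endian PCM, as carried by ST 2110-30) into floating-point samples. The function name is hypothetical, and this is not the NIM's internal implementation, which relies on Rivermax and GStreamer for ingest.

```python
def decode_l24_mono(payload: bytes) -> list[float]:
    """Convert big-endian signed 24-bit PCM bytes to floats in [-1.0, 1.0)."""
    if len(payload) % 3 != 0:
        raise ValueError("L24 payload length must be a multiple of 3 bytes")
    samples = []
    for i in range(0, len(payload), 3):
        # Each mono sample occupies 3 bytes in network (big-endian) order.
        value = int.from_bytes(payload[i:i + 3], byteorder="big", signed=True)
        samples.append(value / 2**23)  # scale by the 24-bit full-scale magnitude
    return samples


# Half of positive full scale, followed by negative full scale.
print(decode_l24_mono(bytes([0x40, 0x00, 0x00, 0x80, 0x00, 0x00])))  # [0.5, -1.0]
```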

Logical Pipeline#

Sender ──(ST 2110-30 audio)──► Studio Voice NIM ──(ST 2110-30 enhanced audio)──► Receiver

Component	Role
Sender	Example workload: TS file → demux → ST 2110-30 audio (demo chart).
Studio Voice NIM	Inference and NMOS registry; enhanced ST 2110-30 audio output.
Receiver	Consumes the enhanced audio and optionally re-streams it over SRT.

Operator variant: The operator reconciles only the Studio Voice NIM deployment and SDP ConfigMap from an `NvidiaStudioVoiceMediaFunction` custom resource (CR).

Inputs and Outputs#

Direction	Stream Type	Transport	Format	Notes
Input	Audio	SMPTE ST 2110-30	PCM L24, 48 kHz, mono	Single mono speech stream per session.
Output	Audio	SMPTE ST 2110-30	PCM L24, 48 kHz, mono	Enhanced audio, same format as input.
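
For capacity planning, the per-packet payload size and the payload bitrate follow directly from the sample rate, bit depth, and channel count in the table above. The sketch below assumes a 1 ms packet time, which is common for ST 2110-30 but is not stated in this document, and it excludes RTP, UDP, and IP header overhead. The function name is hypothetical.

```python
def pcm_stream_figures(sample_rate_hz=48_000, bits_per_sample=24,
                       channels=1, packet_time_ms=1.0):
    """Return (samples per packet, payload bytes per packet, payload bits per second)."""
    samples_per_packet = round(sample_rate_hz * packet_time_ms / 1000)
    payload_bytes = samples_per_packet * (bits_per_sample // 8) * channels
    payload_bitrate_bps = sample_rate_hz * bits_per_sample * channels
    return samples_per_packet, payload_bytes, payload_bitrate_bps


# PCM L24, 48 kHz, mono with the assumed 1 ms packet time.
print(pcm_stream_figures())  # (48, 144, 1152000)
```

Under these assumptions each packet carries 48 samples in 144 payload bytes, and each direction of the stream needs about 1.15 Mbit/s before header overhead.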

The upstream sender and downstream receiver must use the same sample rate and channel layout as the NIM.
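
One way to check that a sender matches this format is to inspect the `a=rtpmap` line of its SDP. The sketch below is a hypothetical helper and not part of the NIM. The sample SDP, including its payload type, addresses, and port, is made up for illustration.

```python
import re

# Format expected by the NIM per the table above: PCM L24, 48 kHz, mono.
EXPECTED = ("L24", 48000, 1)

EXAMPLE_SDP = """v=0
o=- 0 0 IN IP4 192.0.2.10
s=Example mono speech
c=IN IP4 239.0.0.1/64
t=0 0
m=audio 5004 RTP/AVP 97
a=rtpmap:97 L24/48000/1
a=ptime:1
"""


def parse_rtpmap(sdp: str):
    """Return (encoding, sample_rate, channels) from the first a=rtpmap line, or None."""
    pattern = re.compile(r"a=rtpmap:\d+ ([A-Za-z0-9]+)/(\d+)(?:/(\d+))?$")
    for line in sdp.splitlines():
        match = pattern.match(line.strip())
        if match:
            encoding, rate, channels = match.groups()
            # In SDP, an omitted channel count means one channel.
            return encoding, int(rate), int(channels or 1)
    return None


def matches_nim_format(sdp: str) -> bool:
    """True if the SDP advertises the sample format the NIM expects."""
    return parse_rtpmap(sdp) == EXPECTED


print(matches_nim_format(EXAMPLE_SDP))  # True
```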

Note

Studio Voice expects a single mono speech stream per input session. For multi-channel mixes, downmix to mono on the sender or instantiate one NIM per speaker.
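
A minimal sketch of a sender-side downmix, assuming interleaved floating-point samples and an equal-weight average across channels. The function name is hypothetical, and a production sender would normally downmix inside its own audio pipeline, possibly with different channel weights.

```python
def downmix_to_mono(interleaved, channels):
    """Average each frame of interleaved samples into a single mono sample."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    if len(interleaved) % channels != 0:
        raise ValueError("sample count must be a multiple of the channel count")
    mono = []
    for start in range(0, len(interleaved), channels):
        frame = interleaved[start:start + channels]
        mono.append(sum(frame) / channels)
    return mono


# Two stereo frames: (0.5, -0.5) and (1.0, 0.0).
print(downmix_to_mono([0.5, -0.5, 1.0, 0.0], channels=2))  # [0.0, 0.5]
```

Averaging keeps the result within the original sample range, but it also mixes every speaker into one stream. When speakers need to be enhanced independently, running one NIM per speaker is the option described in the note.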

Deployment#

Studio Voice NIM supports three deployment patterns:

Deployment Option	Description
NIM Service Chart	Deploys the Studio Voice NIM as a standalone Kubernetes Deployment for integration into a custom media pipeline.
Kubernetes Operator	Manages Studio Voice NIM workloads through an `NvidiaStudioVoiceMediaFunction` custom resource. Suitable for declarative, production-grade management.
End-to-End Demo Chart	Deploys a complete demo pipeline—sender, NIM service, and receiver—as a single Helm release. Demonstrates the full sender-to-receiver flow and is recommended for initial evaluation.

For installation instructions, see Installation.

Supported Operating Modes#

ST 2110 Static Mode#

Media streams are transported over IP networks using predefined multicast IP addresses and ports. Configuration is performed manually, and IP address and port settings must match between sending and receiving devices.
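
Because static mode depends on hand-entered values, a simple pre-flight check can catch mismatches before deployment. The sketch below compares two hypothetical configuration dictionaries. The keys `multicast_ip` and `port` are illustrative and are not taken from the Helm chart values.

```python
import ipaddress


def check_static_flow(sender: dict, receiver: dict) -> list[str]:
    """Return a list of problems found in a statically configured flow."""
    problems = []
    addresses = {}
    for name, cfg in (("sender", sender), ("receiver", receiver)):
        addr = ipaddress.ip_address(cfg["multicast_ip"])  # raises ValueError if malformed
        addresses[name] = addr
        if not addr.is_multicast:
            problems.append(f"{name}: {addr} is not a multicast address")
        if not 1 <= cfg["port"] <= 65535:
            problems.append(f"{name}: port {cfg['port']} is out of range")
    if addresses["sender"] != addresses["receiver"]:
        problems.append("multicast IP mismatch between sender and receiver")
    if sender["port"] != receiver["port"]:
        problems.append("port mismatch between sender and receiver")
    return problems


print(check_static_flow({"multicast_ip": "239.0.0.1", "port": 5004},
                        {"multicast_ip": "239.0.0.1", "port": 5006}))
# ['port mismatch between sender and receiver']
```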

NMOS Mode#

Devices and media streams are automatically discovered and connected using AMWA NMOS standards (IS-04/IS-05). This mode simplifies system setup by removing the need for manual IP address and port configuration.
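
For orientation, the sketch below builds the request that an NMOS controller typically sends to connect a receiver under IS-05: a `PATCH` to the receiver's `staged` endpoint in the Connection API. The host, port, API version (`v1.1`), and UUIDs are placeholders, and this document does not specify which component issues the request in a given deployment. The code only constructs the URL and JSON body; it does not contact any device.

```python
import json


def build_is05_receiver_patch(base_url, receiver_id, sender_id, sdp,
                              api_version="v1.1"):
    """Return (url, json_body) for staging and immediately activating a receiver."""
    url = (f"{base_url.rstrip('/')}/x-nmos/connection/{api_version}"
           f"/single/receivers/{receiver_id}/staged")
    body = {
        "sender_id": sender_id,
        "master_enable": True,
        "activation": {"mode": "activate_immediate"},
        # The sender's SDP tells the receiver which stream to join.
        "transport_file": {"type": "application/sdp", "data": sdp},
    }
    return url, json.dumps(body)


# Placeholder values for illustration only.
url, body = build_is05_receiver_patch(
    "http://receiver.example:8080",
    "11111111-1111-1111-1111-111111111111",
    "22222222-2222-2222-2222-222222222222",
    "v=0\r\n",
)
print(url)
```

In NMOS mode the multicast address and port travel inside the SDP in `transport_file`, which is why they do not need to be entered by hand on each device.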