Overview#

NVIDIA Studio Voice on Holoscan for Media uses AI models to enhance speech recorded through low-quality microphones in noisy or reverberant environments, producing studio-quality speech. The NIM ingests a single PCM audio stream over SMPTE ST 2110-30, performs real-time AI-based speech enhancement (dereverberation, clarity restoration, and noise suppression), and publishes the enhanced audio over SMPTE ST 2110-30 for downstream broadcast equipment.

This NIM is designed for live broadcast and production workflows that ingest and egress uncompressed audio over SMPTE ST 2110 using NVIDIA Rivermax. It is deployed with Helm charts on a Holoscan for Media cluster and can optionally be managed through a Kubernetes operator.

Architecture#

The Studio Voice NIM on Holoscan for Media is built on the following NVIDIA software components:

  • Holoscan for Media — Media platform for SMPTE ST 2110–oriented live pipelines and Kubernetes-native orchestration.

  • NVIDIA Rivermax — GPU-accelerated SMPTE ST 2110 and RTP ingest and egress.

  • CUDA — GPU-accelerated audio processing.

  • TensorRT — Optimized neural network inference.

  • Triton Inference Server — Efficient model serving.

  • NVIDIA AFX SDK — Speech enhancement inference backend for noise suppression, dereverberation, and clarity restoration.

Processing Flow#

The NIM performs the following actions in sequence:

  1. Ingests a SMPTE ST 2110-30 audio RTP stream through Rivermax and GPU-accelerated GStreamer, then prepares the decoded audio for inference.

  2. Runs inference through the AFX SDK backend on Triton to enhance the speech, suppressing noise and reverberation.

  3. Emits the enhanced audio as an ST 2110-30 stream suitable for downstream broadcast equipment.
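The "prepares decoded audio for inference" part of step 1 can be illustrated with a small sketch. It assumes the RTP header has already been stripped and that the payload carries L24 samples as big-endian signed 24-bit integers (the RTP L24 wire format). The function names `decode_l24` and `encode_l24` are hypothetical and are not part of the NIM; the NIM performs this conversion internally.

```python
FULL_SCALE = 8_388_608  # 2**23, magnitude of the most negative 24-bit sample


def decode_l24(payload: bytes) -> list[float]:
    """Convert big-endian signed 24-bit PCM bytes to floats in [-1.0, 1.0)."""
    if len(payload) % 3 != 0:
        raise ValueError("L24 payload length must be a multiple of 3 bytes")
    samples = []
    for i in range(0, len(payload), 3):
        value = int.from_bytes(payload[i:i + 3], byteorder="big", signed=True)
        samples.append(value / FULL_SCALE)
    return samples


def encode_l24(samples: list[float]) -> bytes:
    """Convert floats back to big-endian signed 24-bit PCM, clipping at full scale."""
    out = bytearray()
    for sample in samples:
        value = round(sample * FULL_SCALE)
        value = max(-FULL_SCALE, min(FULL_SCALE - 1, value))  # clip to 24-bit range
        out += value.to_bytes(3, byteorder="big", signed=True)
    return bytes(out)
```

Floating-point samples are a common input representation for speech models; whether the AFX backend consumes exactly this representation is an assumption of the sketch.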

Logical Pipeline#

Sender ──(ST 2110-30 audio)──► Studio Voice NIM ──(ST 2110-30 enhanced audio)──► Receiver

| Piece | Role |
| --- | --- |
| Sender | Example workload: TS file → demux → ST 2110-30 audio (demo chart). |
| Studio Voice NIM | Inference and NMOS registry; enhanced ST 2110-30 audio output. |
| Receiver | Consumes enhanced audio, optionally re-streams via SRT. |

Operator variant: Reconciles only the Studio Voice NIM deployment and SDP ConfigMap from an NvidiaStudioVoiceMediaFunction CR.

Inputs and Outputs#

| Direction | Stream Type | Transport | Format | Notes |
| --- | --- | --- | --- | --- |
| Input | Audio | SMPTE ST 2110-30 | PCM L24, 48 kHz, mono | Single mono speech stream per session. |
| Output | Audio | SMPTE ST 2110-30 | PCM L24, 48 kHz, mono | Enhanced audio, same format as input. |

The upstream sender and downstream receiver must use the same sample rate and channel layout as the NIM.
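To make the stream format concrete, the following sketch computes the payload size and payload bit rate implied by PCM L24, 48 kHz, mono. The 1 ms packet time is an assumption (it is one of the packet times defined by ST 2110-30; this page does not state which one the NIM uses), and the figures exclude RTP, UDP, and IP header overhead.

```python
SAMPLE_RATE_HZ = 48_000
BYTES_PER_SAMPLE = 3        # L24 = 24 bits per sample
CHANNELS = 1                # mono
PACKET_TIME_US = 1_000      # assumed 1 ms packet time


def samples_per_packet(sample_rate_hz: int, packet_time_us: int) -> int:
    """Number of samples per channel carried in one RTP packet."""
    return sample_rate_hz * packet_time_us // 1_000_000


def payload_bytes(sample_rate_hz: int, packet_time_us: int,
                  channels: int, bytes_per_sample: int) -> int:
    """Audio payload size of one packet, excluding all headers."""
    return samples_per_packet(sample_rate_hz, packet_time_us) * channels * bytes_per_sample


def payload_bitrate_bps(sample_rate_hz: int, channels: int, bytes_per_sample: int) -> int:
    """Audio payload bit rate, excluding all headers."""
    return sample_rate_hz * channels * bytes_per_sample * 8


print(samples_per_packet(SAMPLE_RATE_HZ, PACKET_TIME_US))                            # 48
print(payload_bytes(SAMPLE_RATE_HZ, PACKET_TIME_US, CHANNELS, BYTES_PER_SAMPLE))     # 144
print(payload_bitrate_bps(SAMPLE_RATE_HZ, CHANNELS, BYTES_PER_SAMPLE))               # 1152000
```

A sender or receiver configured with a different sample rate or channel count would produce different packet sizes, which is one reason the formats must match end to end.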

Note

Studio Voice expects a single mono speech stream per input session. For multi-channel mixes, downmix to mono on the sender or instantiate one NIM per speaker.
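As an illustration of the downmix option, the sketch below averages the channels of each frame with equal weights. This is an assumption made for simplicity: a production sender may apply different channel weights or perform the downmix in its audio pipeline, and `downmix_to_mono` is a hypothetical helper rather than part of any NVIDIA tooling.

```python
def downmix_to_mono(frames: list[tuple[float, ...]]) -> list[float]:
    """Average the channels of each frame into a single mono sample.

    Each frame holds one sample per channel, for example (left, right).
    """
    mono = []
    for frame in frames:
        if not frame:
            raise ValueError("each frame must contain at least one channel")
        mono.append(sum(frame) / len(frame))
    return mono


stereo = [(0.5, -0.5), (1.0, 0.0), (0.25, 0.25)]
print(downmix_to_mono(stereo))  # [0.0, 0.5, 0.25]
```

Averaging mixes all speakers into one stream, so when several people talk at once, running one NIM per speaker keeps each voice on its own mono input.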

Deployment#

Studio Voice NIM supports three deployment patterns:

| Deployment Option | Description |
| --- | --- |
| NIM Service Chart | Deploys the Studio Voice NIM as a standalone Kubernetes Deployment for integration into a custom media pipeline. |
| Kubernetes Operator | Manages Studio Voice NIM workloads through an NvidiaStudioVoiceMediaFunction custom resource. Suitable for declarative, production-grade management. |
| End-to-End Demo Chart | Deploys a complete demo pipeline (sender, NIM service, and receiver) as a single Helm release. Demonstrates the full sender-to-receiver flow and is recommended for initial evaluation. |

For installation instructions, see Installation.

Supported Operating Modes#

ST 2110 Static Mode#

Media streams are transported over IP networks using predefined multicast IP addresses and ports. Configuration is performed manually, and IP address and port settings must match between sending and receiving devices.
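The matching requirement can be sketched as a small check that compares the sender's and receiver's stream settings and renders them as an SDP description. The multicast address, ports, and payload type below are hypothetical example values, and `AudioStreamConfig`, `build_sdp`, and `mismatches` are illustrative names. The SDP is deliberately minimal: it omits the clock attributes (such as `a=ts-refclk` and `a=mediaclk`) that a complete ST 2110 SDP carries.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AudioStreamConfig:
    multicast_ip: str            # hypothetical example address
    port: int                    # hypothetical example port
    payload_type: int = 97       # dynamic RTP payload type, chosen for illustration
    sample_rate_hz: int = 48_000
    channels: int = 1
    ptime_ms: int = 1            # assumed packet time


def build_sdp(cfg: AudioStreamConfig, origin_ip: str, session_name: str) -> str:
    """Render a minimal SDP description for an L24 audio stream."""
    lines = [
        "v=0",
        f"o=- 1 1 IN IP4 {origin_ip}",
        f"s={session_name}",
        "t=0 0",
        f"m=audio {cfg.port} RTP/AVP {cfg.payload_type}",
        f"c=IN IP4 {cfg.multicast_ip}/64",
        f"a=rtpmap:{cfg.payload_type} L24/{cfg.sample_rate_hz}/{cfg.channels}",
        f"a=ptime:{cfg.ptime_ms}",
    ]
    return "\r\n".join(lines) + "\r\n"


def mismatches(sender: AudioStreamConfig, receiver: AudioStreamConfig) -> list[str]:
    """Return the names of settings that differ between sender and receiver."""
    return [
        f.name
        for f in fields(AudioStreamConfig)
        if getattr(sender, f.name) != getattr(receiver, f.name)
    ]


sender = AudioStreamConfig("239.0.0.1", 5004)
receiver = AudioStreamConfig("239.0.0.1", 5006)
print(mismatches(sender, receiver))  # ['port']
```

Running a check like this before deployment catches the most common static-mode failure, where one side listens on an address or port the other side never sends to.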

NMOS Mode#

Devices and media streams are automatically discovered and connected using AMWA NMOS standards (IS-04/IS-05). This mode simplifies system setup by removing the need for manual IP address and port configuration.
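For orientation, the sketch below builds the kind of request an NMOS controller typically sends to connect a receiver to a sender through the IS-05 Connection API. The endpoint path and body fields follow the IS-05 specification; the base URL, IDs, and API version are hypothetical placeholders, and this page does not document which IS-05 version the NIM exposes. The sketch only constructs the request and does not send it.

```python
import json


def staged_url(node_base_url: str, receiver_id: str, api_version: str = "v1.1") -> str:
    """Build the IS-05 staged-parameters URL for a single receiver."""
    base = node_base_url.rstrip("/")
    return f"{base}/x-nmos/connection/{api_version}/single/receivers/{receiver_id}/staged"


def build_receiver_patch(sender_id: str, sdp: str) -> dict:
    """Build a PATCH body that stages and immediately activates a connection."""
    return {
        "sender_id": sender_id,
        "master_enable": True,
        "activation": {"mode": "activate_immediate"},
        "transport_file": {"type": "application/sdp", "data": sdp},
    }


# Hypothetical placeholder values for illustration only.
url = staged_url("http://node.example:8080/", "11111111-2222-3333-4444-555555555555")
body = build_receiver_patch("aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee", "v=0\r\n")
print(url)
print(json.dumps(body, indent=2))
```

In NMOS mode the controller obtains the sender's SDP and the receiver's ID from the IS-04 registry, so none of these values need to be configured by hand.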

See Also#