Host the Nemotron Speech MCP Server
This page describes how to host the Nemotron Speech Model Context Protocol (MCP) server that gives agents a live control plane for customer-hosted Speech NIM deployments.
Use the MCP server to let your AI coding agents discover deployed automatic speech recognition (ASR), text-to-speech (TTS), and neural machine translation (NMT) NIMs, choose a configured backend, inspect available models, transcribe files, synthesize short speech responses, translate text, and receive connection instructions for the real-time WebSocket APIs used for live ASR and TTS.
The MCP server does not replace the native Speech NIM data planes. You can continue using HTTP, gRPC, and real-time WebSocket APIs directly. Use MCP for agent discovery and orchestration.
Download the MCP Server Files
Download the following files for setting up the MCP server:
| File | Description |
|---|---|
| riva_speech_mcp_server.py | Standalone HTTP MCP server for customer-hosted Speech NIMs. |
| riva_speech_mcp_config.example.yaml | Example backend configuration for ASR, TTS, and NMT NIMs. |
| requirements.txt | Python dependencies for the standalone MCP server. |
Keep the files in the same directory on the system that hosts the MCP server.
Install Dependencies
Create a Python environment and install dependencies:
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
The MCP server uses the NVIDIA Riva Python client to connect to Speech NIM gRPC endpoints and PyYAML to read the backend configuration file.
Configure Speech NIM Backends
Copy the example configuration file:
cp riva_speech_mcp_config.example.yaml riva_speech_mcp_config.yaml
Edit riva_speech_mcp_config.yaml to match deployed Speech NIMs in your environment. The MCP server does not assume which NIMs are running or which ports are exposed. Declare each ASR, TTS, or NMT backend in the config file.
The following example configures ASR, TTS, and NMT backends:
server:
  name: nemotron-speech-mcp
  title: Nemotron Speech MCP
  bearer_token_env: RIVA_MCP_BEARER_TOKEN
  allow_origins: []
  allow_local_files: false
  max_audio_bytes: 104857600
  max_request_bytes: 157286400
  streaming_chunk_bytes: 65536
backends:
  - id: asr-primary
    modalities: ["asr"]
    grpc_uri: "asr-nim:50051"
    realtime_url: "ws://asr-nim:9000/v1/realtime"
    public_realtime_url: "wss://speech.example.com/asr/v1/realtime"
    default_language_code: "en-US"
  - id: tts-primary
    modalities: ["tts"]
    grpc_uri: "tts-nim:50051"
    realtime_url: "ws://tts-nim:9000/v1/realtime"
    public_realtime_url: "wss://speech.example.com/tts/v1/realtime"
    default_language_code: "en-US"
  - id: nmt-primary
    modalities: ["nmt"]
    grpc_uri: "nmt-nim:50051"
Set grpc_uri and realtime_url to addresses that the MCP server can reach from its host or container network. Set public_realtime_url when agents need a different externally routed WebSocket URL for real-time ASR or TTS. Set max_request_bytes to cap incoming MCP request bodies. Keep it large enough for the base64 audio payloads allowed by max_audio_bytes.
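As a rough sizing check for the last point: base64 encodes every 3 raw bytes as 4 characters, so an audio payload grows by about one third before it reaches the MCP server. The sketch below uses the two limits from the example configuration; the helper name `base64_encoded_size` is illustrative and not part of the MCP server.

```python
def base64_encoded_size(raw_bytes: int) -> int:
    """Length in bytes of the padded base64 encoding of `raw_bytes` bytes."""
    # Every group of up to 3 input bytes becomes exactly 4 output characters.
    return (raw_bytes + 2) // 3 * 4


# Limits taken from the example configuration above.
max_audio_bytes = 104857600      # 100 MiB of raw audio
max_request_bytes = 157286400    # 150 MiB request body cap

encoded = base64_encoded_size(max_audio_bytes)
headroom = max_request_bytes - encoded

print(f"base64 size of the largest audio payload: {encoded} bytes")
print(f"headroom left for the JSON envelope:      {headroom} bytes")
```

With the example values, the largest allowed audio payload encodes to roughly 133 MiB, which leaves room inside the 150 MiB request cap for the surrounding JSON. If you raise max_audio_bytes, raise max_request_bytes by at least the same ratio.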
If a deployment includes multiple ASR, TTS, or NMT NIMs, give each backend a stable id. Agents can pass backend_id to tools. When an agent omits backend_id and more than one matching backend exists, the MCP server returns the valid backend IDs instead of guessing.
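The selection rule described above can be sketched as follows. This is an illustration only, not the server's implementation: the function `resolve_backend` and the shape of the returned dictionaries are hypothetical, and only the `id` and `modalities` keys come from the example configuration.

```python
from typing import Optional

# Backends as they might look after loading the YAML config (illustrative data).
BACKENDS = [
    {"id": "asr-primary", "modalities": ["asr"]},
    {"id": "asr-secondary", "modalities": ["asr"]},
    {"id": "nmt-primary", "modalities": ["nmt"]},
]


def resolve_backend(backends: list, modality: str,
                    backend_id: Optional[str] = None) -> dict:
    """Pick a backend for `modality`, or report the valid IDs instead of guessing."""
    candidates = [b for b in backends if modality in b.get("modalities", [])]
    valid_ids = [b["id"] for b in candidates]

    if backend_id is not None:
        # An explicit ID must match a backend that serves this modality.
        for backend in candidates:
            if backend["id"] == backend_id:
                return {"backend": backend}
        return {"error": "unknown backend_id", "valid_backend_ids": valid_ids}

    if len(candidates) == 1:
        # Unambiguous: exactly one backend serves this modality.
        return {"backend": candidates[0]}

    # Zero or several candidates: return the choices rather than picking one.
    return {"error": "backend_id required", "valid_backend_ids": valid_ids}


print(resolve_backend(BACKENDS, "nmt"))                   # single match is used
print(resolve_backend(BACKENDS, "asr"))                   # ambiguous: IDs returned
print(resolve_backend(BACKENDS, "asr", "asr-secondary"))  # explicit choice
```

Stable IDs matter here because agents may store a backend_id between calls; renaming a backend invalidates those stored choices.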
Run the MCP Server
Optionally set a bearer token, then start the MCP server:
export RIVA_MCP_BEARER_TOKEN="<token>"
python3 riva_speech_mcp_server.py \
  --config riva_speech_mcp_config.yaml \
  --host 0.0.0.0 \
  --port 9900
The MCP endpoint is available at:
http://<mcp-host>:9900/mcp
The health endpoint is available at:
http://<mcp-host>:9900/health
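A minimal way to probe the health endpoint from Python, using only the standard library, might look like the sketch below. The helper names are hypothetical. This page does not state whether /health requires the bearer token or what the response body contains, so the sketch sends the token only when one is supplied and returns the body as raw text.

```python
import urllib.request
from typing import Optional, Tuple


def build_health_request(base_url: str,
                         token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for <base_url>/health, adding the bearer token if given."""
    url = base_url.rstrip("/") + "/health"
    request = urllib.request.Request(url, method="GET")
    if token:
        # Assumption: the same bearer token that protects /mcp is accepted here.
        request.add_header("Authorization", f"Bearer {token}")
    return request


def check_health(base_url: str, token: Optional[str] = None,
                 timeout: float = 5.0) -> Tuple[int, str]:
    """Return (HTTP status, raw body text); raises URLError if unreachable."""
    request = build_health_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


# Example (replace the placeholder host before running):
# status, body = check_health("http://<mcp-host>:9900")
```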
In production deployments, run the MCP server in the same private network as the Speech NIMs and expose the MCP endpoint only through an authenticated reverse proxy or another controlled network boundary.
Connect an Agent
Configure an MCP-compatible agent to use the HTTP MCP endpoint:
{
  "mcpServers": {
    "nemotron-speech": {
      "url": "http://<mcp-host>:9900/mcp",
      "headers": {
        "Authorization": "Bearer <token>"
      }
    }
  }
}
Configuration syntax depends on the agent or MCP client. Use the client’s HTTP or Streamable HTTP server configuration option and point it to /mcp.
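If you want to see what a client sends to /mcp, the sketch below builds the first JSON-RPC message of an MCP session (`initialize`) as a Streamable HTTP POST, using only the standard library. The helper name `build_mcp_request`, the client name, and the protocol version string are assumptions for illustration; this page does not state which protocol revisions the server accepts, and a real MCP client library handles this handshake for you.

```python
import json
import urllib.request
from typing import Optional


def build_mcp_request(endpoint: str, method: str, params: dict,
                      request_id: int = 1,
                      token: Optional[str] = None) -> urllib.request.Request:
    """Wrap one JSON-RPC 2.0 call in an HTTP POST aimed at the MCP endpoint."""
    body = json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    }).encode("utf-8")
    request = urllib.request.Request(endpoint, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    # Streamable HTTP clients accept either a JSON reply or an SSE stream.
    request.add_header("Accept", "application/json, text/event-stream")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    return request


# Hypothetical values: replace the host, token, and client details with your own.
init_request = build_mcp_request(
    "http://mcp.example.internal:9900/mcp",
    "initialize",
    {
        "protocolVersion": "2025-03-26",  # assumed; use a revision the server supports
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
    token="example-token",
)
print(init_request.full_url)
print(init_request.data.decode("utf-8"))
```

Passing `init_request` to `urllib.request.urlopen` would send it. The sketch stops before that step so it can run without a live server.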
MCP Tools
The standalone MCP server exposes the following tools:
| Tool | Purpose |
|---|---|
| | Checks health for one configured backend or all backends. |
| | Lists configured ASR, TTS, and NMT backends. |
| | Queries live model inventory from configured Speech NIMs. |
| | Transcribes a file or audio blob with a configured ASR backend. |
| riva_asr_create_realtime_session | Returns real-time ASR WebSocket URL, events, and message templates. |
| | Synthesizes short text with a configured TTS backend. |
| riva_tts_create_realtime_session | Returns real-time TTS WebSocket URL and event flow. |
| | Translates text with a configured NMT backend. |
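Agents invoke these tools through the standard MCP `tools/call` method. As a hedged illustration, the sketch below builds the JSON-RPC body for one such call. The tool name and the `backend_id` argument appear elsewhere on this page. The helper `tool_call_body` is hypothetical, and the tool's full input schema is not documented here, so check it with `tools/list` before relying on any other argument.

```python
import json


def tool_call_body(tool_name: str, arguments: dict, request_id: int = 1) -> dict:
    """Build the JSON-RPC 2.0 body for an MCP `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Ask for real-time ASR connection details from a specific backend.
# "asr-primary" is the backend ID used in the example configuration.
body = tool_call_body(
    "riva_asr_create_realtime_session",
    {"backend_id": "asr-primary"},
    request_id=2,
)
print(json.dumps(body, indent=2))
```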
Real-Time ASR and TTS
For real-time ASR and TTS, MCP acts as the control plane only. Agents should not send real-time audio chunks through MCP tool calls.
For real-time ASR, the agent calls riva_asr_create_realtime_session. The tool returns a stream_url, event names, message templates, and a flow that instructs the agent to open a native Riva real-time WebSocket. The agent streams base64-encoded raw PCM16 chunks to the WebSocket, commits every 20 to 30 seconds or at a turn boundary, sends a final commit, then sends input_audio_buffer.done.
For real-time TTS, the agent calls riva_tts_create_realtime_session. The tool returns the WebSocket URL and event names for streaming text and receiving audio deltas.
This design keeps high-volume media on the native Riva real-time path while letting agents discover the correct endpoint and protocol through MCP.
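The real-time ASR flow described above can be sketched as a generator that yields the messages a client would send over the WebSocket, in order. Several details are assumptions: only `input_audio_buffer.done` is named on this page, so the `input_audio_buffer.append` and `input_audio_buffer.commit` event names, the `audio` field, and the 16 kHz mono sample rate are placeholders. Use the event names and message templates returned by riva_asr_create_realtime_session instead. The sketch does not open a socket; sending the messages requires a WebSocket client of your choice.

```python
import base64
import json
from typing import Iterator


def realtime_asr_messages(pcm16: bytes, sample_rate_hz: int = 16000,
                          chunk_ms: int = 100,
                          commit_every_s: float = 20.0) -> Iterator[str]:
    """Yield JSON messages: audio appends, periodic commits, final commit, done."""
    bytes_per_second = sample_rate_hz * 2            # mono PCM16 = 2 bytes per sample
    chunk_bytes = bytes_per_second * chunk_ms // 1000
    commit_bytes = int(bytes_per_second * commit_every_s)
    since_commit = 0                                 # audio bytes sent since last commit

    for offset in range(0, len(pcm16), chunk_bytes):
        chunk = pcm16[offset:offset + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",     # assumed event name
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
        since_commit += len(chunk)
        if since_commit >= commit_bytes:
            yield json.dumps({"type": "input_audio_buffer.commit"})  # assumed name
            since_commit = 0

    if since_commit > 0:
        # Final commit for audio not covered by a periodic commit.
        yield json.dumps({"type": "input_audio_buffer.commit"})
    yield json.dumps({"type": "input_audio_buffer.done"})  # named in the text above


# One second of silence, committed every half second so the commits are visible.
for message in realtime_asr_messages(b"\x00" * 32000, commit_every_s=0.5):
    print(json.loads(message)["type"])
```

In a real session, set commit_every_s between 20 and 30 as described above, and also commit at turn boundaries, which this byte-count sketch cannot detect.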
Deployment Topologies
Use one of the following deployment patterns:
| Topology | When to Use It |
|---|---|
| Same host as the NIMs | The MCP server can reach NIMs through |
| Same Docker Compose project or Kubernetes namespace | The MCP server can reach NIMs through service names such as |
| Remote gateway | The MCP server runs separately and connects to reachable NIM endpoints. |
For most self-hosted deployments, run the MCP server near the NIMs and expose only the MCP endpoint to agents.
Next Steps
Nemotron Speech Agent Skills: Install and use the Nemotron Speech agentic skill.
Prerequisites: Prepare hardware, drivers, and container runtime requirements.
NGC Access Setup: Generate registry credentials for self-hosted Speech NIM containers.
Install the NVIDIA Riva Python Client: Install the client package used in generated gRPC examples.