NVIDIA Speech NIM Microservices Release Notes#

This page lists changes, fixes, and known issues for each NVIDIA Speech NIM microservices release.

All Speech NIM microservice updates are released together as a collection and follow calendar versioning YY.MM.n, where YY is the year, MM is the month, and n is the patch number within that cycle.
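
As an illustration of the YY.MM.n scheme, the following minimal sketch parses a release string into its three components so that releases can be compared and sorted. The names `ReleaseVersion` and `parse_release_version` are hypothetical helpers for this example and are not part of any NVIDIA tooling.

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    year: int   # two-digit year (YY)
    month: int  # month (MM), 1-12
    patch: int  # patch number (n) within the release cycle


# Hypothetical pattern for the YY.MM.n format described above.
_VERSION_RE = re.compile(r"(\d{2})\.(\d{2})\.(\d+)")


def parse_release_version(text: str) -> ReleaseVersion:
    """Parse a YY.MM.n release string such as "26.05.0"."""
    match = _VERSION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a YY.MM.n version: {text!r}")
    year, month, patch = (int(part) for part in match.groups())
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {text!r}")
    return ReleaseVersion(year, month, patch)


# Named tuples compare field by field, so releases sort chronologically.
releases = sorted(parse_release_version(v) for v in ["26.05.0", "26.02.0"])
```

Because the fields are ordered year, month, patch, ordinary tuple comparison is enough to decide which release is newer.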


Release 26.05.0#

NIM Container Versions#

NIM                            Container Tag
Nemotron ASR Streaming         nemotron-asr-streaming:1.2.0
Chatterbox TTS Multilingual    chatterbox-tts-multilingual:1.0.0

Highlights#

  • Added runtime and deployment customization guidance for ASR word boosting, end-of-utterance handling, and TTS emotion exaggeration.

  • Added Chatterbox TTS Multilingual, a community text-to-speech model with streaming and offline inference support.

  • Added multilingual support to the Nemotron ASR Streaming NIM. Two model types are now available:

    • type=en-US: English (US) only (existing behavior).

    • type=multi: Supports 40 language locales with automatic language detection. Omit language_code for automatic detection, or pass a specific code (for example, fr-FR) to constrain decoding. See Supported Languages by Model Type for the full list.

  • Updated ASR and TTS performance data for supported GPUs and added new Chatterbox TTS Multilingual performance tables.
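
The language selection rule for the two Nemotron ASR Streaming model types can be sketched as a small helper that decides whether to send language_code at all. This is only an illustration: `build_recognition_params` and the dictionary it returns are hypothetical, and the sketch assumes that the type=en-US model is addressed with language_code=en-US. It does not call any real client API.

```python
from typing import Dict, Optional


def build_recognition_params(
    model_type: str, language_code: Optional[str] = None
) -> Dict[str, str]:
    """Return hypothetical request parameters for a given model type."""
    if model_type == "en-US":
        # The English-only model type handles US English only.
        if language_code not in (None, "en-US"):
            raise ValueError("type=en-US supports only en-US")
        return {"language_code": "en-US"}
    if model_type == "multi":
        # Omitting language_code requests automatic language detection;
        # passing a code constrains decoding to that locale.
        if language_code is None:
            return {}
        return {"language_code": language_code}
    raise ValueError(f"unknown model type: {model_type!r}")


auto_detect = build_recognition_params("multi")
french_only = build_recognition_params("multi", "fr-FR")
```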

ASR NIM#

Key Features#

  • Added global word boosting guidance for RNNT, TDT, and Nemotron ASR models, including setup instructions and latency guidance for large boosted word lists.

  • Added a word boosting latency comparison for per-stream and global word boosting.

  • Added guidance for detecting finalized streaming transcripts with is_final and for enabling interim transcripts with --show-intermediate.

  • Added force_eou runtime configuration guidance for Nemotron ASR Streaming, which lets clients force end-of-utterance finalization without closing the stream.

  • Updated Nemotron ASR Streaming deployment guidance to use name=nemotron-asr-streaming profile selection.

  • Doubled the maximum concurrent streams supported by Nemotron ASR Streaming on H100. The new batch_size=64 profile sustains up to 512 streams, compared to 256 streams in 1.0.0.

  • Added the type=multi Nemotron ASR Streaming variant, which covers 40 language locales across three quality tiers:

    • Transcription-ready (19 locales): Highest-accuracy ASR, ready out of the box: Arabic (ar-AR), Dutch (nl-NL), English UK (en-GB), English US (en-US), French Canada (fr-CA), French France (fr-FR), German (de-DE), Hindi India (hi-IN), Italian (it-IT), Japanese (ja-JP), Korean (ko-KR), Portuguese Brazil (pt-BR), Portuguese Portugal (pt-PT), Russian (ru-RU), Spanish Spain (es-ES), Spanish US (es-US), Turkish (tr-TR), Ukrainian (uk-UA), Vietnamese (vi-VN).

    • Broad-coverage (13 locales): Production ASR across an additional 13 locales: Bulgarian (bg-BG), Chinese Simplified (zh-CN), Croatian (hr-HR), Czech (cs-CZ), Danish (da-DK), Estonian (et-EE), Finnish (fi-FI), Hungarian (hu-HU), Norwegian Bokmål (nb-NO), Polish (pl-PL), Romanian (ro-RO), Slovak (sk-SK), Swedish (sv-SE).

    • Adaptation-ready (8 locales): Recognized by the tokenizer and designed for fine-tuning on in-domain data to unlock full transcription: Greek (el-GR), Hebrew (he-IL), Latvian (lv-LV), Lithuanian (lt-LT), Maltese (mt-MT), Norwegian Nynorsk (nn-NO), Slovenian (sl-SI), Thai (th-TH).
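
A client that needs to know how much to rely on a given locale could encode the three tiers above as a lookup table. The sketch below uses only the locale codes listed in these notes. The names `TIERS` and `tier_for` are hypothetical and exist only for this example.

```python
from typing import Optional

# Locale codes copied from the tier lists in these release notes.
TIERS = {
    "transcription-ready": [
        "ar-AR", "nl-NL", "en-GB", "en-US", "fr-CA", "fr-FR", "de-DE",
        "hi-IN", "it-IT", "ja-JP", "ko-KR", "pt-BR", "pt-PT", "ru-RU",
        "es-ES", "es-US", "tr-TR", "uk-UA", "vi-VN",
    ],
    "broad-coverage": [
        "bg-BG", "zh-CN", "hr-HR", "cs-CZ", "da-DK", "et-EE", "fi-FI",
        "hu-HU", "nb-NO", "pl-PL", "ro-RO", "sk-SK", "sv-SE",
    ],
    "adaptation-ready": [
        "el-GR", "he-IL", "lv-LV", "lt-LT", "mt-MT", "nn-NO", "sl-SI",
        "th-TH",
    ],
}

# Invert the table so a locale maps directly to its tier.
LOCALE_TO_TIER = {
    locale: tier for tier, locales in TIERS.items() for locale in locales
}


def tier_for(locale: str) -> Optional[str]:
    """Return the quality tier for a locale, or None if it is not listed."""
    return LOCALE_TO_TIER.get(locale)
```

For example, an application might accept transcripts directly for transcription-ready locales and route adaptation-ready locales to a fine-tuned deployment instead.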

TTS NIM#

Key Features and Enhancements#

  • Added Chatterbox TTS Multilingual, which supports 23 languages with a single built-in default speaker per locale.

  • Added per-request emotion exaggeration customization for Chatterbox TTS Multilingual with the exaggeration_factor custom configuration parameter.

  • Added Chatterbox TTS Multilingual performance benchmarks for A100, B200, H100, and L40 GPUs.

Known Issues#

  • Chatterbox TTS Multilingual quality is strongest for English. Non-English languages can have pronunciation issues, mixed-language output, hallucinated words, and speed-up artifacts.

Support Matrix and Compatibility Updates#

The following list summarizes the updated models and their support matrices:

  • Updated profiles, memory requirements, customization support, and performance data for the following ASR model:

  • Updated language support, voice catalog, customization support, and performance data for the following TTS models:

To find the latest support matrix for the NVIDIA Speech NIM microservices, refer to Support Matrix.


Release 26.02.0#

NIM Container Versions#

NIM                                Container Tag
Parakeet 1.1b CTC en-US            parakeet-1-1b-ctc-en-us:1.5.0
Parakeet 1.1b RNNT Multilingual    parakeet-1-1b-rnnt-multilingual:1.5.0
Parakeet 0.6b TDT en-US            parakeet-0.6b-tdt:1.3.0
Magpie TTS Multilingual            magpie-tts-multilingual:1.7.0

Highlights#

  • Consolidated the previously independent NVIDIA Riva ASR, TTS, and NMT NIMs into a single collection, NVIDIA Speech NIM Microservices, that follows unified calendar versioning (YY.MM.n).

  • Launched the new NVIDIA Speech NIM microservices documentation with the consolidation and renaming of the NVIDIA Riva ASR, TTS, and NMT NIMs. This documentation is a comprehensive guide to NVIDIA Speech NIM microservices for ASR, TTS, and NMT.

ASR NIM#

Key Features#

  • Renamed and expanded Parakeet 0.6b TDT to support two model types: English-only (type=default, parakeet-tdt-0.6b-v2) and multilingual (type=multi, parakeet-tdt-0.6b-v3) with 25 European languages. Use CONTAINER_ID=parakeet-0.6b-tdt and language code multi for auto language detection.

  • Added three model types to Parakeet 1.1b RNNT Multilingual: Default (auto language detection), Prompt (improved accuracy, client-specified language), and Indic (optimized for Indic languages). Expanded the language support table, including Bengali (bn-IN) and Tamil (ta-IN).

  • Extended word boosting to Parakeet TDT and Parakeet RNNT models. RNNT and TDT models use a boost score range of 0.5–2.0, whereas CTC models use a range of 20–100. Added custom pronunciation using word boosting with explicit tokenization for CTC models.

  • Improved latency and throughput for Silero VAD and Sortformer diarizer for Parakeet 1.1b CTC and Parakeet 1.1b RNNT NIMs.

  • Added VAD-based end-of-utterance detection for Parakeet 1.1b RNNT NIM.

  • Added Nemotron ASR Streaming NIM, which supports streaming mode only.
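
Because the boost score ranges differ so widely between model families, a client may want to validate scores before sending them. The following is a minimal sketch that uses the ranges stated above. The function name and the family labels ("ctc", "rnnt", "tdt") are hypothetical and are not taken from any client library.

```python
# Boost score ranges from these release notes, keyed by a hypothetical
# model family label.
BOOST_SCORE_RANGES = {
    "ctc": (20.0, 100.0),
    "rnnt": (0.5, 2.0),
    "tdt": (0.5, 2.0),
}


def validate_boost_score(model_family: str, score: float) -> float:
    """Return the score unchanged if it is within the family's range."""
    try:
        low, high = BOOST_SCORE_RANGES[model_family.lower()]
    except KeyError:
        raise ValueError(f"unknown model family: {model_family!r}") from None
    if not low <= score <= high:
        raise ValueError(
            f"boost score {score} is outside {low}-{high} for {model_family}"
        )
    return score
```

A check like this catches the common mistake of reusing a CTC score such as 40 with an RNNT or TDT model.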

Known Issues#

  • The Parakeet 1.1b RNNT Multilingual model generates spaces after every character in the transcript for languages such as Japanese. To generate output without spaces, pass the language_code=ja-JP parameter from the client.

  • The Parakeet 1.1b RNNT Multilingual model has speaker diarization enabled for all profiles. The mode=all profiles can require up to 50 GB of GPU memory. For GPUs with less memory, deploy only one or two modes instead of all modes.

  • Transducer models (Parakeet RNNT, Parakeet TDT) can emit identical start/end timestamps for words when multiple tokens share the same timestamp.

  • Punctuation output from the Nemotron ASR Streaming model is not always consistent. For strict punctuation, implement post-processing as needed.

  • Word-level confidence scores are not available for RNNT-based models, such as Parakeet 1.1b RNNT Multilingual and Nemotron ASR Streaming.

  • For best results with Nemotron ASR Streaming, include roughly 80 ms of leading silence in the input audio; starting with speech immediately can cause initial words to be missed.

  • The Nemotron ASR Streaming model is not supported on GB200 systems with NVLink fabric connectivity (sm100a).
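
The leading-silence recommendation for Nemotron ASR Streaming can be applied on the client side before audio is sent. The sketch below prepends roughly 80 ms of digital silence to a buffer. It assumes raw 16-bit little-endian PCM without a WAV header and a 16 kHz mono stream by default. The function name and these defaults are assumptions for illustration, not values taken from the NIM documentation.

```python
def prepend_silence(
    pcm16: bytes,
    sample_rate_hz: int = 16000,  # assumed sample rate
    silence_ms: int = 80,         # duration suggested in the known issue
    channels: int = 1,            # assumed mono audio
) -> bytes:
    """Prepend digital silence to raw 16-bit PCM audio."""
    bytes_per_sample = 2  # 16-bit samples
    num_frames = sample_rate_hz * silence_ms // 1000
    silence = b"\x00" * (num_frames * channels * bytes_per_sample)
    return silence + pcm16


# At 16 kHz mono, 80 ms is 1280 frames, or 2560 bytes of zeros.
padded = prepend_silence(b"\x01\x02")
```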

TTS NIM#

Key Features and Enhancements#

  • Extended language support for Magpie TTS Multilingual to Hindi (hi-IN) and Japanese (ja-JP).

  • Added emotional voice variants (Angry, Calm, Fearful, Happy, Neutral, Sad, PleasantSurprised, Disgusted) for Magpie TTS Multilingual across supported languages.

  • Added DGX Spark platform support for Magpie TTS Multilingual (support extended from Riva TTS NIM Release 1.10.0).

  • Changed the default model profile to batch_size=8 for all hardware and removed the batch_size=1 default for Blackwell.

  • Added performance benchmarks for Magpie TTS Multilingual on B200 and DGX Spark. Updated benchmarks for A100, H100, and L40.

Known Issues#

  • Audio prompts for zeroshot models (Magpie TTS Zeroshot and Magpie TTS Flow) must be mono, 16-bit WAV format at 22.05 kHz or higher, with a duration of 3–10 seconds.
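
The audio prompt constraints above can be checked locally before a request is made. The following sketch uses Python's standard wave module to verify channel count, sample width, sample rate, and duration. `check_audio_prompt` is a hypothetical helper for this example, and the usage lines build a silent prompt in memory purely for demonstration.

```python
import io
import wave

MIN_SAMPLE_RATE_HZ = 22050
MIN_DURATION_S = 3.0
MAX_DURATION_S = 10.0


def check_audio_prompt(source) -> list:
    """Return a list of problems; an empty list means the prompt passes.

    `source` is a file path or a binary file-like object containing WAV data.
    """
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        rate = wav.getframerate()
        frames = wav.getnframes()

    problems = []
    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {sample_width * 8}-bit")
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    duration = frames / rate
    if not MIN_DURATION_S <= duration <= MAX_DURATION_S:
        problems.append(f"duration {duration:.2f} s is outside 3-10 s")
    return problems


# Usage: build a 5-second silent mono prompt in memory and check it.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(22050)
    out.writeframes(b"\x00\x00" * 22050 * 5)
buffer.seek(0)
problems = check_audio_prompt(buffer)
```

Returning a list of problems rather than raising on the first failure lets a caller report every issue with a prompt at once.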

Support Matrix and Compatibility Updates#

The following list summarizes the updated models and their support matrices:

To find the latest support matrix for the NVIDIA Speech NIM microservices, refer to Support Matrix.

Archived Documentation#

With the introduction of the NVIDIA Speech NIM microservices documentation beginning with release 26.02.0, the previous NVIDIA Riva NIM documentation has been officially deprecated. To access the deprecated documentation, refer to the following links: