NVIDIA Speech NIM Microservices Release Notes#
This page lists changes, fixes, and known issues for each NVIDIA Speech NIM microservices release.
All Speech NIM microservice updates are released together as a collection and follow calendar versioning YY.MM.n, where YY is the year, MM is the month, and n is the patch number within that cycle.
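As an illustration of the YY.MM.n scheme described above, the following minimal sketch parses release strings so that they can be compared in order. The `CalVer` type and `parse_release` function are hypothetical names used only for this example; they are not part of any NVIDIA tooling.

```python
import re
from typing import NamedTuple


class CalVer(NamedTuple):
    """A release version in YY.MM.n form (hypothetical helper type)."""
    year: int   # two-digit year (YY)
    month: int  # month (MM), 1-12
    patch: int  # patch number within the cycle (n)


_CALVER_RE = re.compile(r"(\d{2})\.(\d{2})\.(\d+)")


def parse_release(version: str) -> CalVer:
    """Parse a YY.MM.n string and reject anything that does not fit the scheme."""
    match = _CALVER_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"not a YY.MM.n version: {version!r}")
    year, month, patch = (int(part) for part in match.groups())
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {version!r}")
    return CalVer(year, month, patch)


# Tuples compare field by field, so the newest release sorts last.
releases = ["26.02.0", "26.05.0"]
latest = max(releases, key=parse_release)
```

Pre-consolidation version strings such as 1.0.0 do not follow this scheme, so the sketch rejects them rather than guessing at an ordering.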
Release 26.05.0#
NIM Container Versions#
| NIM | Container Tag |
|---|---|
| Nemotron ASR Streaming | |
| Chatterbox TTS Multilingual | |
Highlights#
Added runtime and deployment customization guidance for ASR word boosting, end-of-utterance handling, and TTS emotion exaggeration.
Added Chatterbox TTS Multilingual, a community text-to-speech model with streaming and offline inference support.
Added multilingual support to the Nemotron ASR Streaming NIM. Two model types are now available:
`type=en-US`: English (US) only (existing behavior).
`type=multi`: Supports 40 language locales with automatic language detection. Omit `language_code` for automatic detection, or pass a specific code (for example, `fr-FR`) to constrain decoding. See Supported Languages by Model Type for the full list.
Updated ASR and TTS performance data for supported GPUs and added new Chatterbox TTS Multilingual performance tables.
ASR NIM#
Key Features#
Added global word boosting guidance for RNNT, TDT, and Nemotron ASR models, including setup instructions and latency guidance for large boosted word lists.
Added a word boosting latency comparison for per-stream and global word boosting.
Added guidance for detecting finalized streaming transcripts with `is_final` and for enabling interim transcripts with `--show-intermediate`.
Added `force_eou` runtime configuration guidance for Nemotron ASR Streaming, which lets clients force end-of-utterance finalization without closing the stream.
Updated Nemotron ASR Streaming deployment guidance to use `name=nemotron-asr-streaming` profile selection.
Doubled the maximum concurrent streams supported by Nemotron ASR Streaming on H100. The new `batch_size=64` profile sustains up to 512 streams, compared to 256 streams in 1.0.0.
Added the `type=multi` Nemotron ASR Streaming variant, which covers 40 language locales across three quality tiers:
Transcription-ready (19 locales): Highest-accuracy ASR, ready out of the box: Arabic (ar-AR), Dutch (nl-NL), English UK (en-GB), English US (en-US), French Canada (fr-CA), French France (fr-FR), German (de-DE), Hindi India (hi-IN), Italian (it-IT), Japanese (ja-JP), Korean (ko-KR), Portuguese Brazil (pt-BR), Portuguese Portugal (pt-PT), Russian (ru-RU), Spanish Spain (es-ES), Spanish US (es-US), Turkish (tr-TR), Ukrainian (uk-UA), Vietnamese (vi-VN).
Broad-coverage (13 locales): Production ASR across an additional 13 locales: Bulgarian (bg-BG), Chinese Simplified (zh-CN), Croatian (hr-HR), Czech (cs-CZ), Danish (da-DK), Estonian (et-EE), Finnish (fi-FI), Hungarian (hu-HU), Norwegian Bokmål (nb-NO), Polish (pl-PL), Romanian (ro-RO), Slovak (sk-SK), Swedish (sv-SE).
Adaptation-ready (8 locales): Recognized by the tokenizer and designed for fine-tuning on in-domain data to unlock full transcription: Greek (el-GR), Hebrew (he-IL), Latvian (lv-LV), Lithuanian (lt-LT), Maltese (mt-MT), Norwegian Nynorsk (nn-NO), Slovenian (sl-SI), Thai (th-TH).
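The three tiers above can be captured in a small lookup table, which a client might use to decide whether to constrain decoding to a locale or leave language detection automatic. The following is a sketch only: the locale codes are copied from the tier lists above, while `tier_for`, `language_code_options`, and the shape of the returned options dictionary are hypothetical and do not represent the NIM client API.

```python
from typing import Dict, Optional

# Locale codes per quality tier, as listed in these release notes.
LOCALE_TIERS = {
    "transcription-ready": [
        "ar-AR", "nl-NL", "en-GB", "en-US", "fr-CA", "fr-FR", "de-DE",
        "hi-IN", "it-IT", "ja-JP", "ko-KR", "pt-BR", "pt-PT", "ru-RU",
        "es-ES", "es-US", "tr-TR", "uk-UA", "vi-VN",
    ],
    "broad-coverage": [
        "bg-BG", "zh-CN", "hr-HR", "cs-CZ", "da-DK", "et-EE", "fi-FI",
        "hu-HU", "nb-NO", "pl-PL", "ro-RO", "sk-SK", "sv-SE",
    ],
    "adaptation-ready": [
        "el-GR", "he-IL", "lv-LV", "lt-LT", "mt-MT", "nn-NO", "sl-SI",
        "th-TH",
    ],
}

# Reverse index: locale code -> tier name.
TIER_BY_LOCALE: Dict[str, str] = {
    locale: tier
    for tier, locales in LOCALE_TIERS.items()
    for locale in locales
}


def tier_for(locale: str) -> Optional[str]:
    """Return the quality tier for a locale, or None if it is not listed."""
    return TIER_BY_LOCALE.get(locale)


def language_code_options(locale: Optional[str] = None) -> Dict[str, str]:
    """Hypothetical helper: build the language part of a request.

    With no locale, language_code is omitted so that the type=multi model
    detects the language automatically. With a locale, decoding is
    constrained to that code.
    """
    if locale is None:
        return {}
    if locale not in TIER_BY_LOCALE:
        raise ValueError(f"locale not listed for type=multi: {locale!r}")
    return {"language_code": locale}
```

A caller could, for example, warn when `tier_for(locale)` returns `"adaptation-ready"`, because those locales are described above as intended for fine-tuning before full transcription.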
TTS NIM#
Key Features and Enhancements#
Added Chatterbox TTS Multilingual, which supports 23 languages with a single built-in default speaker per locale.
Added per-request emotion exaggeration customization for Chatterbox TTS Multilingual with the `exaggeration_factor` custom configuration parameter.
Added Chatterbox TTS Multilingual performance benchmarks for A100, B200, H100, and L40 GPUs.
Known Issues#
Chatterbox TTS Multilingual quality is strongest for English. Non-English languages can have pronunciation issues, mixed-language output, hallucinated words, and speed-up artifacts.
Support Matrix and Compatibility Updates#
The following list summarizes the updated models and their support matrices:
Updated profiles, memory requirements, customization support, and performance data for the following ASR model:
Nemotron ASR Streaming: Added `type=multi` rows alongside the existing `type=en-US` rows.
Updated language support, voice catalog, customization support, and performance data for the following TTS models:
To find the latest support matrix for the NVIDIA Speech NIM microservices, refer to Support Matrix.
Release 26.02.0#
NIM Container Versions#
| NIM | Container Tag |
|---|---|
| Parakeet 1.1b CTC en-US | |
| Parakeet 1.1b RNNT Multilingual | |
| Parakeet 0.6b TDT en-US | |
| Magpie TTS Multilingual | |
Highlights#
Consolidated the previously independent NVIDIA Riva ASR, TTS, and NMT NIMs into a single collection, NVIDIA Speech NIM Microservices, that follows unified calendar versioning (YY.MM.n).
Launched the new NVIDIA Speech NIM microservices documentation with the consolidation and renaming of the NVIDIA Riva ASR, TTS, and NMT NIMs. This documentation is a comprehensive guide to NVIDIA Speech NIM microservices for ASR, TTS, and NMT.
ASR NIM#
Key Features#
Renamed and expanded Parakeet 0.6b TDT to support two model types: English-only (`type=default`, parakeet-tdt-0.6b-v2) and multilingual (`type=multi`, parakeet-tdt-0.6b-v3) with 25 European languages. Use `CONTAINER_ID=parakeet-0.6b-tdt` and language code `multi` for automatic language detection.
Added three model types to Parakeet 1.1b RNNT Multilingual: Default (automatic language detection), Prompt (improved accuracy, client-specified language), and Indic (optimized for Indic languages). Expanded the language support table, including Bengali (bn-IN) and Tamil (ta-IN).
Extended word boosting to Parakeet TDT and Parakeet RNNT models. RNNT and TDT models use a boost score range of 0.5–2.0; CTC models use 20–100. Added custom pronunciation using word boosting with explicit tokenization for CTC models.
Improved latency and throughput for Silero VAD and Sortformer diarizer for Parakeet 1.1b CTC and Parakeet 1.1b RNNT NIMs.
Added VAD-based end-of-utterance detection for Parakeet 1.1b RNNT NIM.
Added Nemotron ASR Streaming NIM, which supports streaming mode only.
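Because the boost score ranges noted above differ by two orders of magnitude between CTC and transducer models, a score tuned for one family is out of range for the other. The following sketch checks a score against the ranges stated in these notes before it is sent. The `check_boost_score` function and the decoder labels `"ctc"`, `"rnnt"`, and `"tdt"` are hypothetical names for this example.

```python
# Boost score ranges as stated in these release notes (inclusive bounds assumed).
BOOST_SCORE_RANGES = {
    "ctc": (20.0, 100.0),
    "rnnt": (0.5, 2.0),
    "tdt": (0.5, 2.0),
}


def check_boost_score(decoder: str, score: float) -> float:
    """Return the score unchanged if it fits the decoder's range, else raise."""
    try:
        low, high = BOOST_SCORE_RANGES[decoder.lower()]
    except KeyError:
        raise ValueError(f"unknown decoder type: {decoder!r}") from None
    if not low <= score <= high:
        raise ValueError(
            f"boost score {score} is outside {low}-{high} for {decoder}"
        )
    return score
```

For example, `check_boost_score("rnnt", 1.5)` passes, while reusing a CTC-style score such as 40 with `"tdt"` raises an error instead of silently applying an out-of-range boost.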
Known Issues#
The Parakeet 1.1b RNNT Multilingual model generates spaces after every character in the transcript for languages such as Japanese. To generate output without spaces, pass the `language_code=ja-JP` parameter from the client.
The Parakeet 1.1b RNNT Multilingual model has speaker diarization enabled for all profiles. The `mode=all` profiles can use up to 50 GB of GPU memory. For GPUs with less memory, deploy only one or two modes instead of all modes.
Transducer models (Parakeet RNNT, Parakeet TDT) can emit identical start/end timestamps for words when multiple tokens share the same timestamp.
Punctuation output from the Nemotron ASR Streaming model is not always consistent. For strict punctuation, implement post-processing as needed.
Word-level confidence scores are not available for RNNT-based models (Parakeet 1.1b RNNT Multilingual, Nemotron ASR Streaming, and similar).
For best results with Nemotron ASR Streaming, include roughly 80 ms of leading silence in the input audio; starting with speech immediately can cause initial words to be missed.
The Nemotron ASR Streaming model is not supported on GB200 systems with NVLink fabric connectivity (sm100a).
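For the leading-silence recommendation above, a client can pad the audio before streaming it. The following is a minimal sketch that assumes raw signed 16-bit PCM input, where zero-valued samples are silence; `prepend_silence` is a hypothetical helper, not part of the NIM client.

```python
def prepend_silence(
    pcm: bytes,
    sample_rate_hz: int,
    silence_ms: int = 80,
    sample_width_bytes: int = 2,
    channels: int = 1,
) -> bytes:
    """Prepend silence to raw signed PCM audio.

    Assumes signed PCM (for example, 16-bit), where a zero sample is silent.
    Unsigned 8-bit PCM uses a different silence value and is not handled here.
    """
    frames = sample_rate_hz * silence_ms // 1000
    padding = b"\x00" * (frames * sample_width_bytes * channels)
    return padding + pcm


# At 16 kHz mono 16-bit, 80 ms is 1280 frames, or 2560 bytes of padding.
padded = prepend_silence(b"\x01\x02", sample_rate_hz=16000)
```

The padding is applied once at the start of the stream, before the first audio chunk is sent.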
TTS NIM#
Key Features and Enhancements#
Extended language support for Magpie TTS Multilingual to Hindi (hi-IN) and Japanese (ja-JP).
Added emotional voice variants (Angry, Calm, Fearful, Happy, Neutral, Sad, PleasantSurprised, Disgusted) for Magpie TTS Multilingual across supported languages.
Magpie TTS Multilingual supports the DGX Spark platform (support extended from Riva TTS NIM Release 1.10.0).
Default model profile is now `batch_size=8` for all hardware (removed `batch_size=1` default for Blackwell).
Added performance benchmarks for Magpie TTS Multilingual on B200 and DGX Spark. Updated benchmarks for A100, H100, and L40.
Known Issues#
Audio prompts for zeroshot models (Magpie TTS Zeroshot and Magpie TTS Flow) must be mono, 16-bit WAV format at 22.05 kHz or higher, with a duration of 3–10 seconds.
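The audio prompt constraints above can be checked on the client before a request is sent. The following sketch uses Python's standard `wave` module to read the WAV header; `check_audio_prompt` is a hypothetical helper, and treating the 3–10 second bounds as inclusive is an assumption.

```python
import wave
from typing import List


def check_audio_prompt(wav_file) -> List[str]:
    """Return a list of problems with a WAV audio prompt.

    An empty list means the file is mono, 16-bit, sampled at 22.05 kHz or
    higher, and 3-10 seconds long. Accepts a path or a binary file object.
    """
    with wave.open(wav_file, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {sample_width * 8}-bit")
    if rate < 22050:
        problems.append(f"sample rate {rate} Hz is below 22050 Hz")
    if not 3.0 <= duration <= 10.0:
        problems.append(f"duration {duration:.2f} s is outside 3-10 s")
    return problems
```

Returning every problem at once, rather than raising on the first, lets a caller report all the fixes a prompt needs in a single pass.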
Support Matrix and Compatibility Updates#
The following list summarizes the updated models and their support matrices:
Updated profiles, memory requirements, and customization support for the following ASR models:
Updated language support, voice catalog with emotional variants, batch size defaults, and DGX Spark support for the following TTS models:
To find the latest support matrix for the NVIDIA Speech NIM microservices, refer to Support Matrix.
Archived Documentation#
With the introduction of the NVIDIA Speech NIM microservices documentation beginning with release 26.02.0, the previous NVIDIA Riva NIM documentation has been officially deprecated. To access the deprecated documentation, refer to the following links: