Speech-to-Text Translation (S2T) Overview#

The NVIDIA Riva Speech-to-Text Translation (S2T) service translates audio in a source language into text in a target language. S2T takes an audio stream or audio buffer as input and returns text in the target language. Internally, the Riva S2T service is composed of Riva ASR and NMT pipelines and supports streaming mode. Bilingual and multilingual models are trained using NVIDIA NeMo, a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR) and Neural Machine Translation (NMT) models.
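The sketch below illustrates a streaming S2T request over gRPC in Python. The message and field names follow riva_nmt proto conventions, but the import paths, the stub class, the asr_config/translation_config layout, the language codes, and the response handling are assumptions; check them against the proto files and examples shipped with your Riva client.

```python
# Minimal sketch of a streaming S2T request over gRPC.
# Names not stated on this page (stub class, nested config fields, response
# layout) are assumptions -- verify against your Riva client's proto files.
import grpc
import riva.client.proto.riva_nmt_pb2 as riva_nmt
import riva.client.proto.riva_nmt_pb2_grpc as riva_nmt_grpc
import riva.client.proto.riva_asr_pb2 as riva_asr
import riva.client.proto.riva_audio_pb2 as riva_audio

channel = grpc.insecure_channel("localhost:50051")   # Riva server address
stub = riva_nmt_grpc.RivaTranslationStub(channel)     # assumed stub name

config = riva_nmt.StreamingTranslateSpeechToTextConfig(
    asr_config=riva_asr.StreamingRecognitionConfig(
        config=riva_asr.RecognitionConfig(
            encoding=riva_audio.AudioEncoding.LINEAR_PCM,
            sample_rate_hertz=16000,
            language_code="es-US",            # source (audio) language, illustrative
            enable_automatic_punctuation=True,
        ),
        interim_results=False,
    ),
    translation_config=riva_nmt.TranslationConfig(
        source_language_code="es-US",
        target_language_code="en-US",
    ),
)

def request_generator(audio_path, chunk_size=16000):
    # The first request carries the config; the rest carry raw audio chunks.
    yield riva_nmt.StreamingTranslateSpeechToTextRequest(config=config)
    with open(audio_path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield riva_nmt.StreamingTranslateSpeechToTextRequest(audio_content=chunk)

for response in stub.StreamingTranslateSpeechToText(request_generator("sample_es.wav")):
    for result in response.results:           # assumed ASR-style result layout
        if result.alternatives:
            print(result.alternatives[0].transcript)
```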

Language Pairs Supported#

The NVIDIA Riva S2T service supports models for the following language pairs:

  1. Spanish (es) to English (en)

  2. German (de), Spanish (es), French (fr) to English (en)

  3. Simplified Chinese (zh) to English (en)

  4. Russian (ru) to English (en)

  5. German (de) to English (en)

  6. French (fr) to English (en)

Models Supported#

The S2T feature supports the following models for [ASR](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/models/asr.html) and [NMT](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/models/nmt.html).

Model Deployment#

Like all Riva models, Riva S2T requires the following steps (a rough sketch of steps 2 and 3 follows the list):

  1. Create .riva files for each model (ASR, NMT) from a .nemo file as outlined in the NeMo section.

  2. Create .rmir files for each Riva Speech AI Skill (ASR, NMT) using riva-build.

  3. Create model directories using riva-deploy.

  4. Deploy the model directory using riva_server.
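The following is a rough sketch of steps 2 and 3, written as a small Python wrapper around the ServiceMaker tools and typically run inside the Riva ServiceMaker container. The pipeline names passed to riva-build, the file paths, the model name, and the encryption key are illustrative placeholders rather than values prescribed by this page.

```python
# Rough sketch of steps 2-3. Pipeline names, paths, the model name, and the
# encryption key are illustrative placeholders -- substitute your own values.
import subprocess

KEY = "tlt_encode"  # key used when the .riva files were exported (placeholder)

# Step 2: build an .rmir for each skill (ASR and NMT) with riva-build.
subprocess.run([
    "riva-build", "speech_recognition",
    f"/servicemaker-dev/asr_pipeline.rmir:{KEY}",
    f"/servicemaker-dev/asr_model.riva:{KEY}",
    "--name=conformer-es-US-asr-streaming",   # referenced later via model_name
], check=True)

subprocess.run([
    "riva-build", "translation",
    f"/servicemaker-dev/nmt_pipeline.rmir:{KEY}",
    f"/servicemaker-dev/nmt_model.riva:{KEY}",
], check=True)

# Step 3: expand each .rmir into a model directory with riva-deploy.
for rmir in ("asr_pipeline.rmir", "nmt_pipeline.rmir"):
    subprocess.run([
        "riva-deploy", "-f", f"/servicemaker-dev/{rmir}:{KEY}", "/data/models",
    ], check=True)
```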

ASR models can be customized as shown in [ASR Customization Best Practices](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-customizing.html), and NMT models as shown in [NMT Customization](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/translation/translation-customization.html).

Multiple Deployed Models#

The Riva server supports multiple models deployed simultaneously for the S2T service, up to the limit of your GPU's memory. As such, a single server process can host models for a variety of language pairs, as outlined above.

The ASR model is determined by the --model_name parameter of the client request. This value must match the name assigned to the model with the corresponding riva-build parameter when the ASR model was built.

If a model name is not provided, the default ASR model is used. If you deploy a model under a name that is already in use, the previously deployed model is overwritten. To get the language pairs available on the server, use the ListSupportedLanguagePairs API. When receiving requests from the client application, the Riva server selects the deployed ASR model to use based on the StreamingTranslateSpeechToTextConfig protobuf object of the client request. If multiple models can fulfill the client request, one of them is selected at random. You can also explicitly select which ASR model to use by setting the model_name field of StreamingTranslateSpeechToTextConfig. This enables you to deploy multiple S2T pipelines concurrently and select which one to use at runtime.
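A minimal sketch of both runtime controls described above, querying the server with ListSupportedLanguagePairs and pinning a specific ASR model through model_name, might look as follows. The request message name, the exact location of the model_name field in the proto, and the model name string are assumptions.

```python
# Sketch: query available language pairs, then pin a specific deployed ASR model.
# ListSupportedLanguagePairsRequest and the field nesting are assumptions based
# on riva_nmt proto conventions -- verify against your client's proto files.
import grpc
import riva.client.proto.riva_nmt_pb2 as riva_nmt
import riva.client.proto.riva_nmt_pb2_grpc as riva_nmt_grpc

stub = riva_nmt_grpc.RivaTranslationStub(grpc.insecure_channel("localhost:50051"))

# Ask the server which language pairs (and models) are currently deployed.
pairs = stub.ListSupportedLanguagePairs(riva_nmt.ListSupportedLanguagePairsRequest())
print(pairs)

# Explicitly select one deployed pipeline instead of letting the server choose.
config = riva_nmt.StreamingTranslateSpeechToTextConfig()
config.translation_config.source_language_code = "es-US"
config.translation_config.target_language_code = "en-US"
# Illustrative name matching the riva-build --name value used at deploy time.
config.translation_config.model_name = "conformer-es-US-asr-streaming"
```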

Punctuation and Inverse Text Normalization (ITN) with S2T#

The S2T service supports punctuation and Inverse Text Normalization (ITN). Both can be enabled or disabled through the following client options: --automatic_punctuation, when set to true (the default), enables punctuation; --verbatim_transcripts, when set to false, enables ITN.
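As a sketch, these client flags correspond to boolean fields on the ASR portion of the S2T request config; the field names and nesting shown here follow the standard Riva ASR RecognitionConfig and are assumptions as far as this page is concerned.

```python
# Sketch: how the client flags map onto the ASR config inside the S2T request.
# Field placement under asr_config.config is an assumption.
import riva.client.proto.riva_nmt_pb2 as riva_nmt

config = riva_nmt.StreamingTranslateSpeechToTextConfig()
config.asr_config.config.enable_automatic_punctuation = True  # --automatic_punctuation=true (default)
config.asr_config.config.verbatim_transcripts = False         # --verbatim_transcripts=false -> ITN applied
```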

BLEU Metric#

The [BLEU score](https://en.wikipedia.org/wiki/BLEU) is used to evaluate the quality of the Riva S2T pipeline. The S2T pipeline achieves a BLEU score of 27 with punctuation and ITN enabled, and 21.5 with punctuation and ITN disabled.
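For context, BLEU scores such as these are corpus-level comparisons of the pipeline output against reference translations. A minimal sketch using the external sacrebleu package (not part of Riva), with toy strings standing in for real output and references:

```python
# Minimal sketch of computing a corpus-level BLEU score with sacrebleu
# (an external package, not part of Riva). Strings are toy examples.
import sacrebleu

hypotheses = ["the cat sat on the mat"]            # S2T pipeline output
references = [["the cat is sitting on the mat"]]   # one reference translation set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus BLEU on a 0-100 scale
```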