Speech-to-Speech Translation (S2S) Overview#

The NVIDIA Riva Speech-to-Speech Translation (S2S) service translates audio between language pairs, that is, from a source language to a target language. S2S takes an audio stream or audio buffer as input and returns generated audio. Internally, the Riva S2S service is composed of the Riva ASR, NMT, and TTS pipelines. The Riva S2S service supports streaming mode. Bilingual and multilingual models are trained using NVIDIA NeMo, a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS) models.
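The ASR → NMT → TTS composition can be pictured with the following conceptual sketch. The three stage functions are hypothetical placeholders standing in for the Riva services, not the Riva API itself:

```python
# Conceptual sketch of the S2S pipeline stages (not the Riva API).
# asr, nmt, and tts are hypothetical placeholders for the real services.

def asr(audio: bytes) -> str:
    """Placeholder: transcribe source-language audio to text."""
    return "hola mundo"  # pretend Spanish transcript

def nmt(text: str, source: str, target: str) -> str:
    """Placeholder: translate text between a language pair."""
    translations = {("es", "en"): {"hola mundo": "hello world"}}
    return translations[(source, target)][text]

def tts(text: str) -> bytes:
    """Placeholder: synthesize target-language audio from text."""
    return text.encode("utf-8")  # stand-in for generated audio

def speech_to_speech(audio: bytes, source: str, target: str) -> bytes:
    """Chain ASR -> NMT -> TTS, mirroring how Riva composes S2S."""
    return tts(nmt(asr(audio), source, target))

print(speech_to_speech(b"\x00\x01", "es", "en"))  # b'hello world'
```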

Language Pairs Supported#

The NVIDIA Riva S2S service supports models for the following language pairs:

  1. Spanish (es) to English (en)

  2. German (de), Spanish (es), French (fr) to English (en)

  3. Simplified Chinese (zh) to English (en)

  4. Russian (ru) to English (en)

  5. German (de) to English (en)

  6. French (fr) to English (en)
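A client can check a pair against this list before issuing a request. The sketch below encodes the pairs from the list above in plain Python; the helper name is illustrative only (at runtime, the server's ListSupportedLanguagePairs API is the authoritative source):

```python
# Lookup sketch of the language pairs listed above.
# Pair data comes from this document; the helper name is illustrative.

SUPPORTED_PAIRS = {
    ("es", "en"),  # Spanish to English
    ("de", "en"),  # German to English
    ("fr", "en"),  # French to English
    ("zh", "en"),  # Simplified Chinese to English
    ("ru", "en"),  # Russian to English
}

def is_supported(source: str, target: str) -> bool:
    """Return True if the (source, target) pair appears in the list above."""
    return (source, target) in SUPPORTED_PAIRS

print(is_supported("ru", "en"))  # True
print(is_supported("en", "ru"))  # False
```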

Models Supported#

The S2S feature supports the following models for [ASR](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/models/asr.html), [NMT](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/models/nmt.html), and [TTS](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/models/tts.html).

Model Deployment#

Like all Riva models, Riva S2S requires the following steps:

  1. Create .riva files for each model (ASR, NMT, TTS) from the corresponding .nemo file, as outlined in the NeMo section.

  2. Create .rmir files for each Riva Speech AI Skill (ASR, NMT, TTS) using riva-build.

  3. Create model directories using riva-deploy.

  4. Deploy the model directories using riva_server.
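The steps above might look like the following. File names, model names, and paths are illustrative placeholders, and the exact riva-build options depend on the pipeline and Riva version; consult the riva-build documentation before running these:

```shell
# Sketch of the deployment flow; file names and paths are hypothetical,
# and real invocations take additional pipeline-specific options.

# 2. Build an .rmir for each skill from its .riva file.
riva-build speech_recognition asr_es.rmir asr_es.riva
riva-build translation nmt_es_en.rmir nmt_es_en.riva
riva-build speech_synthesis tts_en.rmir tts_en.riva

# 3. Generate the model directories from each .rmir.
riva-deploy asr_es.rmir /data/models
riva-deploy nmt_es_en.rmir /data/models
riva-deploy tts_en.rmir /data/models

# 4. Start the server (in the Quick Start scripts this is riva_start.sh).
bash riva_start.sh
```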

ASR models can be customized as shown in [ASR Customization Best Practices](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-customizing.html), NMT models can be customized as shown in [Translation Customization](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/translation/translation-customization.html), and TTS models can be customized as shown in [Custom Models](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-custom.html).

Multiple Deployed Models#

The Riva server supports multiple models deployed simultaneously for the S2S service, up to the limit of your GPU's memory. As such, a single server process can host models for a variety of language pairs, as outlined above.

The ASR model is determined by the model_name field of the client request. This value must match the name assigned with the riva-build parameter used to create the ASR model. If you deploy a model under a name that is already in use, the existing model is overwritten.

When it receives a request from a client application, the Riva server selects which deployed ASR model to use based on the StreamingTranslateSpeechToSpeechConfig protobuf object of the request. If no model name is provided, the default ASR model is used. If multiple deployed models can fulfill the request, one of them is selected at random. To select a specific ASR model explicitly, set the model_name field of the StreamingTranslateSpeechToSpeechConfig; this enables you to deploy multiple S2S pipelines concurrently and choose which one to use at runtime. To list the language pairs available on the server, use the ListSupportedLanguagePairs API.
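As a sketch, the relevant part of the request can be pictured as the nested structure below. The field names mirror the StreamingTranslateSpeechToSpeechConfig message named above, but the exact proto layout is an assumption here; consult the riva_nmt proto definition for the authoritative schema:

```python
# Hypothetical dict mirroring StreamingTranslateSpeechToSpeechConfig.
# The field layout is assumed for illustration; check the riva_nmt proto.

def make_s2s_config(source: str, target: str, model_name: str = "") -> dict:
    """Build a request config; an empty model_name selects the default model."""
    return {
        "translation_config": {
            "source_language_code": source,
            "target_language_code": target,
            # Must match the name given at riva-build time to pick a
            # specific deployed pipeline; empty means "server default".
            "model_name": model_name,
        },
    }

cfg = make_s2s_config("es", "en", model_name="s2s_es_en")
print(cfg["translation_config"]["model_name"])  # s2s_es_en
```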

Punctuation and Inverse Text Normalization (ITN) with S2S#

The S2S service supports punctuation and ITN; both can be enabled or disabled through the following parameters in the client options:

--automatic_punctuation, when set to true (the default), enables punctuation, and --verbatim_transcripts, when set to false, enables inverse text normalization.
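The mapping from these client flags to request fields can be sketched as follows. The field names enable_automatic_punctuation and verbatim_transcripts follow Riva's ASR RecognitionConfig message; treating them as the flags' destination inside the S2S request is an assumption here:

```python
# Sketch: translate the client flags into ASR config fields.
# Field names follow Riva's RecognitionConfig; their placement inside
# the S2S request is assumed for illustration.

def asr_feature_config(automatic_punctuation: bool = True,
                       verbatim_transcripts: bool = True) -> dict:
    """Map the client options to ASR feature fields of the request."""
    return {
        # True (the default) turns punctuation on.
        "enable_automatic_punctuation": automatic_punctuation,
        # False turns inverse text normalization (ITN) on: a verbatim
        # transcript keeps "twenty five" as spoken, while ITN would
        # render it as "25".
        "verbatim_transcripts": verbatim_transcripts,
    }

cfg = asr_feature_config(verbatim_transcripts=False)  # enable ITN
print(cfg["enable_automatic_punctuation"], cfg["verbatim_transcripts"])
```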