Speech-to-Speech Translation (S2S) Overview
The NVIDIA Riva Speech-to-Speech Translation (S2S) service translates audio from a source language to a target language. S2S takes an audio stream or audio buffer as input and returns a generated audio file. Internally, the Riva S2S service is composed of the Riva ASR, NMT, and TTS pipelines, and it supports streaming mode. Bilingual and multilingual models are trained using NVIDIA NeMo, a toolkit for building state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS) models.
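The ASR, NMT, and TTS composition described above can be pictured as three chained stages. The following is a minimal sketch of that idea only; the class, the type aliases, and the stub stages are hypothetical and are not part of the Riva API, and a real deployment would stream audio rather than pass whole buffers.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; none of these names come from the Riva API.
Recognizer = Callable[[bytes], str]    # ASR: source audio -> source text
Translator = Callable[[str], str]      # NMT: source text -> target text
Synthesizer = Callable[[str], bytes]   # TTS: target text -> target audio


@dataclass
class SpeechToSpeechPipeline:
    """Chains the three stages in the order ASR -> NMT -> TTS."""

    recognize: Recognizer
    translate: Translator
    synthesize: Synthesizer

    def run(self, audio: bytes) -> bytes:
        source_text = self.recognize(audio)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Stub stages standing in for real models, for illustration only.
def fake_asr(audio: bytes) -> str:
    # Pretend the audio bytes already are the transcript.
    return audio.decode("utf-8")


def fake_nmt(text: str) -> str:
    # Tiny lookup table in place of a translation model.
    return {"hola": "hello"}.get(text, text)


def fake_tts(text: str) -> bytes:
    # Pretend the encoded text is the synthesized audio.
    return text.encode("utf-8")


pipeline = SpeechToSpeechPipeline(fake_asr, fake_nmt, fake_tts)
translated_audio = pipeline.run(b"hola")
```

Keeping each stage behind a plain callable mirrors the fact that the three models are built and customized independently.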
Language Pairs Supported#
The NVIDIA Riva S2S service supports models for the following language pairs:
Spanish (es) to English (en)
German (de), Spanish (es), French (fr) to English (en)
Simplified Chinese (zh) to English (en)
Russian (ru) to English (en)
German (de) to English (en)
French (fr) to English (en)
Models Supported#
The S2S feature supports the models documented for [ASR](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/models/asr.html), [NMT](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/models/nmt.html), and [TTS](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/models/tts.html).
Model Deployment#
Like all Riva models, Riva S2S requires the following steps:
Create .riva files for each model (ASR, NMT, TTS) from a .nemo file as outlined in the NeMo section.
Create .rmir files for each Riva Speech AI Skill (ASR, NMT, TTS) using riva-build.
Create model directories using riva-deploy.
Deploy the model directory using riva_server.
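The deployment steps above form a chain in which each step consumes the artifact produced by the previous one. The sketch below only models that ordering as data so it can be printed as a checklist; it does not invoke any tool, the "NeMo export" label is a placeholder for the .nemo to .riva conversion rather than a command name, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentStep:
    tool: str       # tool or stage responsible for the step
    consumes: str   # input artifact
    produces: str   # output artifact


# Order follows the steps listed above. "NeMo export" is a placeholder label.
DEPLOYMENT_STEPS = (
    DeploymentStep("NeMo export", ".nemo file", ".riva file"),
    DeploymentStep("riva-build", ".riva file", ".rmir file"),
    DeploymentStep("riva-deploy", ".rmir file", "model directory"),
    DeploymentStep("riva_server", "model directory", "running S2S service"),
)

SKILLS = ("ASR", "NMT", "TTS")


def deployment_checklist(skills=SKILLS):
    """Return one line per skill and build step, plus one shared serving step."""
    *per_skill_steps, serve_step = DEPLOYMENT_STEPS
    lines = []
    for skill in skills:
        for step in per_skill_steps:
            lines.append(f"{skill}: {step.tool}: {step.consumes} -> {step.produces}")
    # The server is started once for all deployed model directories.
    lines.append(f"all: {serve_step.tool}: {serve_step.consumes} -> {serve_step.produces}")
    return lines


for line in deployment_checklist():
    print(line)
```

The three build steps repeat once per skill, while the final serving step is shared across all of them.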
ASR models can be customized as shown in [ASR Customization Best Practices](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-customizing.html), NMT models can be customized as described in the [NMT customization guide](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/translation/translation-customization.html), and TTS models can be customized using [Custom Models](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-custom.html).
Multiple Deployed Models#
The Riva server supports multiple models deployed simultaneously for the S2S service, up to the limit of your GPU's memory. As such, a single server process can host models for a variety of language pairs as outlined above.
The ASR model is determined by the --model_name parameter of the client request. This value must match the value of the riva-build parameter used to create the ASR model.
If a model name is not provided, the default ASR model is used. If you specify a model name that is already in use, the existing model is overwritten.
To get language pairs available on the server, use the ListSupportedLanguagePairs API.
When the Riva server receives a request from the client application, it selects the deployed ASR model to use based on the StreamingTranslateSpeechToSpeechConfig protobuf object in the request. If multiple models can fulfill the request, one is selected at random. You can also explicitly select which ASR model to use by setting the model_name field of StreamingTranslateSpeechToSpeechConfig. This enables you to deploy multiple S2S pipelines concurrently and select which one to use at runtime.
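The selection rule described above (an explicit model_name wins, otherwise one eligible model is picked at random) can be sketched as follows. This is an illustration of the described behavior, not Riva's server code: the DeployedAsrModel type, the select_asr_model function, the example model names, and the assumption that eligibility is decided by matching the source language are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DeployedAsrModel:
    name: str            # hypothetical deployed model name
    language_code: str   # source language the model recognizes


def select_asr_model(
    models: Sequence[DeployedAsrModel],
    source_language: str,
    model_name: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> DeployedAsrModel:
    """Pick a model by explicit name, or at random among eligible models."""
    if model_name:
        for model in models:
            if model.name == model_name:
                return model
        raise LookupError(f"no deployed ASR model named {model_name!r}")

    # Assumption: a model is eligible when its language matches the request.
    candidates = [m for m in models if m.language_code == source_language]
    if not candidates:
        raise LookupError(f"no deployed ASR model for {source_language!r}")
    chooser = rng if rng is not None else random
    return chooser.choice(candidates)


deployed = [
    DeployedAsrModel("es-asr-a", "es-US"),
    DeployedAsrModel("es-asr-b", "es-US"),
    DeployedAsrModel("de-asr", "de-DE"),
]

explicit = select_asr_model(deployed, "es-US", model_name="es-asr-b")
implicit = select_asr_model(deployed, "es-US", rng=random.Random(0))
```

Passing a seeded random.Random makes the otherwise random choice reproducible in tests. In production, setting model_name is the only way to guarantee which pipeline handles a request.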
Punctuation and Inverse Text Normalization (ITN) with S2S#
The S2S service supports punctuation and ITN, which can be enabled or disabled through the following parameters in the client options:
--automatic_punctuation, when set to true (default), enables punctuation.
--verbatim_transcripts, when set to false, enables Inverse Text Normalization.
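Because --verbatim_transcripts works in the opposite sense to ITN, it can help to see how the two options map to behavior. The sketch below uses a standalone argparse parser, not the Riva client's own option handling; the helper names are hypothetical, and the default of true for --verbatim_transcripts is an assumption, since only the default for --automatic_punctuation is stated above.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse common true/false spellings from the command line."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Illustrative S2S text options")
    # Default of true is stated in the text above.
    parser.add_argument("--automatic_punctuation", type=str_to_bool, default=True)
    # Assumption: the default is true, which leaves ITN disabled.
    parser.add_argument("--verbatim_transcripts", type=str_to_bool, default=True)
    return parser


def text_processing(argv):
    """Translate the two options into the features they switch on."""
    args = build_parser().parse_args(argv)
    return {
        "punctuation": args.automatic_punctuation,
        # ITN is enabled only when verbatim transcripts are turned off.
        "inverse_text_normalization": not args.verbatim_transcripts,
    }


defaults = text_processing([])
with_itn = text_processing(["--verbatim_transcripts", "false"])
```

The inversion in the last dictionary entry is the point of the example: requesting ITN means passing false, not true.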