Translation Overview

NVIDIA Riva translation is a neural-network-based framework that translates text between language pairs, that is, from one language (the source language) to the corresponding text in another language (the target language). The bilingual, multilingual, and Megatron models are trained using NVIDIA NeMo, a toolkit for building state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models.

For more information about the translation model architecture and training, refer to the NeMo Machine Translation documentation.

Language Pairs Supported

The NVIDIA Riva translation service supports models for these language pairs:

  1. English (en) to German (de), Spanish (es), French (fr)

  2. German (de), Spanish (es), French (fr) to English (en)

  3. English (en) to Simplified Chinese (zh)

  4. Simplified Chinese (zh) to English (en)

  5. English (en) to Russian (ru)

  6. Russian (ru) to English (en)

  7. English (en) to German (de)

  8. German (de) to English (en)

  9. English (en) to Spanish (es)

  10. Spanish (es) to English (en)

  11. English (en) to French (fr)

  12. French (fr) to English (en)

  13. English (en) to Danish (da)

  14. Danish (da) to English (en)

  15. English (en) to Greek (el)

  16. Greek (el) to English (en)

  17. English (en) to Finnish (fi)

  18. Finnish (fi) to English (en)

  19. English (en) to Hungarian (hu)

  20. Hungarian (hu) to English (en)

  21. English (en) to Italian (it)

  22. Italian (it) to English (en)

  23. English (en) to Lithuanian (lt)

  24. Lithuanian (lt) to English (en)

  25. English (en) to Latvian (lv)

  26. Latvian (lv) to English (en)

  27. English (en) to Dutch (nl)

  28. Dutch (nl) to English (en)

  29. English (en) to Norwegian (no)

  30. Norwegian (no) to English (en)

  31. English (en) to Polish (pl)

  32. Polish (pl) to English (en)

  33. English (en) to Portuguese (pt)

  34. Portuguese (pt) to English (en)

  35. English (en) to Romanian (ro)

  36. Romanian (ro) to English (en)

  37. English (en) to Slovak (sk)

  38. Slovak (sk) to English (en)

  39. English (en) to Swedish (sv)

  40. Swedish (sv) to English (en)

  41. English (en) to Japanese (ja)

  42. Japanese (ja) to English (en)

  43. English (en) to Hindi (hi)

  44. Hindi (hi) to English (en)

  45. English (en) to Korean (ko)

  46. Korean (ko) to English (en)

  47. English (en) to Estonian (et)

  48. Estonian (et) to English (en)

  49. English (en) to Slovenian (sl)

  50. Slovenian (sl) to English (en)

  51. English (en) to Bulgarian (bg)

  52. Bulgarian (bg) to English (en)

  53. English (en) to Ukrainian (uk)

  54. Ukrainian (uk) to English (en)

  55. English (en) to Croatian (hr)

  56. Croatian (hr) to English (en)

  57. English (en) to Arabic (ar)

  58. Arabic (ar) to English (en)

  59. English (en) to Vietnamese (vi)

  60. Vietnamese (vi) to English (en)

  61. English (en) to Turkish (tr)

  62. Turkish (tr) to English (en)

  63. English (en) to Indonesian (id)

  64. Indonesian (id) to English (en)

  65. English (en) to Czech (cs)

  66. Czech (cs) to English (en)

Translation Features

Riva translation currently provides an API to translate between language pairs using models trained in NeMo Machine Translation. Three types of models are supported:

  • Multilingual models like en_deesfr, which translates from English to German, Spanish, and French. Multilingual models have several language codes in their name. By default, use 24x6 multilingual models. Use a multilingual model if you need to support multiple languages or want to optimize resource utilization: a single multilingual model translates along multiple language pairs, which avoids the overhead of loading several bilingual models.

  • Bilingual models like en_ru, which translates from English to Russian. Bilingual models have a single pair of language codes in their name. Use a bilingual model when you want the best possible translation quality for a specific language pair direction. Bilingual models produce better-quality translations than the current multilingual models.

  • Megatron models like en_any and any_en, which translate between English and 32 other languages: en_any translates from English into those languages, and any_en translates from them into English. The 32 languages are Danish, German, Greek, Spanish, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Swedish, Chinese, Japanese, Hindi, Korean, Estonian, Slovenian, Bulgarian, Ukrainian, Croatian, Arabic, Vietnamese, Turkish, Indonesian, and Czech.
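The naming convention in the list above can be illustrated with a small parser. This is only a sketch: it assumes that every language code in a model name is exactly two letters, that codes are concatenated without separators (as in en_deesfr), and that "any" is the wildcard used by the Megatron names. The function parse_model_name is hypothetical and is not part of Riva.

```python
from typing import List, Tuple


def parse_model_name(name: str) -> Tuple[List[str], List[str]]:
    """Split a name such as 'en_deesfr' into (source codes, target codes).

    Hypothetical helper: assumes '<source>_<target>' with two-letter codes.
    """
    source_part, separator, target_part = name.partition("_")
    if not separator or not source_part or not target_part:
        raise ValueError(f"expected '<source>_<target>', got {name!r}")

    def split_codes(part: str) -> List[str]:
        # 'any' is treated as a wildcard rather than as language codes.
        if part == "any":
            return ["any"]
        if len(part) % 2 != 0:
            raise ValueError(f"cannot split {part!r} into two-letter codes")
        return [part[i:i + 2] for i in range(0, len(part), 2)]

    return split_codes(source_part), split_codes(target_part)


if __name__ == "__main__":
    for model in ("en_deesfr", "en_ru", "any_en"):
        print(model, parse_model_name(model))
```

A name with several codes on either side indicates a multilingual model, a name with one code on each side indicates a bilingual model, and a name containing "any" indicates a Megatron model.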

You can use a 12x2 multilingual model instead of a 24x6 model if you need to further reduce resource consumption and latency and can accept some degradation in translation quality.

Note

You can switch to bilingual models if you feel the multilingual model's performance is lacking for a specific language direction.

Riva translation enables you to batch multiple sentences together to provide a faster translation experience. Using the translation client, you can batch together up to 8 sentences to be translated in a single request. The batch size, which defaults to 8, can be adjusted using the batch_size parameter in the client.
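The batching behavior described above can be sketched on the client side as follows. The helper names batched and translate_all are hypothetical, and translate_batch stands in for whatever call your client uses to send one request. It is not a Riva API. The sketch only shows how a longer list of sentences could be split into requests of at most 8 sentences each.

```python
from typing import Callable, Iterator, List, Sequence

MAX_BATCH_SIZE = 8  # upper limit and default stated in the text above


def batched(sentences: Sequence[str],
            batch_size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size sentences."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    for start in range(0, len(sentences), batch_size):
        yield list(sentences[start:start + batch_size])


def translate_all(sentences: Sequence[str],
                  translate_batch: Callable[[List[str]], List[str]],
                  batch_size: int = MAX_BATCH_SIZE) -> List[str]:
    """Translate every sentence, one request per batch, preserving order."""
    results: List[str] = []
    for batch in batched(sentences, batch_size):
        results.extend(translate_batch(batch))
    return results


if __name__ == "__main__":
    # Stand-in for a real translation request (assumption for illustration).
    fake_translate = lambda batch: [text.upper() for text in batch]
    print(translate_all([f"sentence {i}" for i in range(20)], fake_translate))
```

With 20 input sentences and the default batch size, the sketch issues three requests containing 8, 8, and 4 sentences.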

Speech-to-Speech Translation (S2S)

The NVIDIA Riva Speech-to-Speech Translation (S2S) service translates audio between language pairs, that is, from a source language to a target language. S2S takes an audio stream or audio buffer as input and returns a generated audio file. Internally, the Riva S2S service is composed of the Riva ASR, NMT, and TTS pipelines, and it supports streaming mode. Bilingual and multilingual models are trained using NVIDIA NeMo, a toolkit for building state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS) models.

S2S Models Supported

The S2S feature supports the following models for ASR, NMT, and TTS.

Speech-to-Text Translation (S2T)

The NVIDIA Riva Speech-to-Text Translation (S2T) service converts audio in a source language into text in a target language for a given language pair. S2T takes an audio stream or audio buffer as input and returns the translated transcription. Internally, the Riva S2T service is composed of the Riva ASR and NMT pipelines, and it supports streaming mode. Bilingual and multilingual models are trained using NVIDIA NeMo, a toolkit for building state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR) and Neural Machine Translation (NMT) models.

S2T Models Supported

The S2T feature supports the following models for ASR and NMT.

Model Deployment

As with all Riva models, deploying Riva S2S and S2T models requires the following steps:

  1. Create .riva files for each model from a .nemo file as outlined in the NeMo section.

  2. Create .rmir files for each Riva Speech AI Skill using riva-build.

  3. Create model directories using riva-deploy.

  4. Deploy the model directory using riva_server.
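The flow of artifacts through the four steps above can be sketched as a small planning function. This is an illustration only: plan_deployment, the skill name, the file names, and the model_repo directory are all hypothetical, and the sketch does not invoke riva-build, riva-deploy, or riva_server. It only shows that each .nemo model yields one .riva file, that each skill yields one .rmir file, and that the deployed model directory is what the server loads.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def plan_deployment(skill: str,
                    nemo_files: List[str],
                    model_repo: str = "model_repo") -> Dict[str, List[str]]:
    """Return the artifacts expected after each deployment step.

    Hypothetical naming: '<model>.riva' per model and '<skill>.rmir' per skill.
    """
    riva_files = []
    for nemo_file in nemo_files:
        path = PurePosixPath(nemo_file)
        if path.suffix != ".nemo":
            raise ValueError(f"expected a .nemo file, got {nemo_file!r}")
        riva_files.append(path.with_suffix(".riva").name)

    return {
        "1. export from NeMo": riva_files,          # one .riva per model
        "2. riva-build": [f"{skill}.rmir"],         # one .rmir per skill
        "3. riva-deploy": [f"{model_repo}/"],       # model directory created
        "4. riva_server": [f"{model_repo}/"],       # model directory served
    }


if __name__ == "__main__":
    plan = plan_deployment("nmt", ["checkpoints/en_ru.nemo"])
    for step, artifacts in plan.items():
        print(step, "->", artifacts)
```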

Models can be customized as shown in ASR Customization, NMT Custom Models, and TTS Custom Models.

Multiple Deployed Models

The Riva server supports multiple models deployed simultaneously, up to the limit of your GPU's memory. As such, a single server process can host models for a variety of language pairs as outlined above.

For the text translation client, the model name can be provided by using the --model_name parameter of the client request. This value must match the value of the riva-build parameter used to create the model. If a model name is not provided, it will be derived automatically from the provided source and target language pair.

To get models and language pairs available on the server, use the ListSupportedLanguagePairs API.

When it receives a request from the client application, the Riva server selects the deployed models to use based on the protobuf object in the request: StreamingTranslateSpeechToSpeechConfig for S2S or StreamingTranslateSpeechToTextConfig for S2T. If multiple models can fulfill the client request, one model is selected at random.
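The selection rule described above can be sketched as follows. This is not the server's implementation: select_model and the deployed mapping are hypothetical, and the sketch matches on the language pair alone, whereas the real server inspects the full configuration object. It illustrates only the last point, that when several deployed models qualify, the choice among them is random.

```python
import random
from typing import Mapping, Optional, Set, Tuple

LanguagePair = Tuple[str, str]  # (source code, target code)


def select_model(deployed: Mapping[str, Set[LanguagePair]],
                 source: str,
                 target: str,
                 rng: Optional[random.Random] = None) -> str:
    """Pick one deployed model that supports the requested language pair."""
    candidates = sorted(
        name for name, pairs in deployed.items() if (source, target) in pairs
    )
    if not candidates:
        raise LookupError(f"no deployed model supports {source} -> {target}")
    if rng is None:
        rng = random.Random()
    # Several models may qualify; one is chosen at random.
    return rng.choice(candidates)


if __name__ == "__main__":
    # Hypothetical set of deployed models and the pairs each one supports.
    deployed = {
        "en_deesfr": {("en", "de"), ("en", "es"), ("en", "fr")},
        "en_de": {("en", "de")},
        "en_ru": {("en", "ru")},
    }
    print(select_model(deployed, "en", "de"))
```

Because the choice is random, a client that needs a specific model should not rely on which of several qualifying models is picked.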

Punctuation and Inverse Text Normalization (ITN) with S2S and S2T

The S2S and S2T services support punctuation and ITN. They can be enabled or disabled with the following parameters in the client options:

Setting --automatic_punctuation to true (the default) enables punctuation, and setting --verbatim_transcripts to false enables ITN.
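The two options have opposite polarity, which is easy to get wrong: punctuation is on when its flag is true, while ITN is on when the verbatim flag is false. The minimal sketch below makes that mapping explicit. The function postprocessing_state is hypothetical and does not call any Riva client. Because the default of --verbatim_transcripts is not stated here, the sketch requires it to be passed explicitly.

```python
from typing import Dict


def postprocessing_state(*,
                         verbatim_transcripts: bool,
                         automatic_punctuation: bool = True) -> Dict[str, bool]:
    """Map the two client options to the features they switch on or off."""
    return {
        # Punctuation follows the flag directly (true is the default).
        "punctuation": automatic_punctuation,
        # ITN is the inverse of verbatim output.
        "itn": not verbatim_transcripts,
    }


if __name__ == "__main__":
    print(postprocessing_state(verbatim_transcripts=False))
    print(postprocessing_state(verbatim_transcripts=True,
                               automatic_punctuation=False))
```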

BLEU Metric

The BLEU score evaluates the quality of the Riva pipeline.

The pipeline has a BLEU score of 27 when punctuation and ITN are enabled.

The pipeline has a BLEU score of 21.5 when punctuation and ITN are disabled.