How to perform synthetic data generation using a Riva NMT Multilingual model with NVIDIA NeMo#

This tutorial walks you through how to perform synthetic data generation using a Riva NMT Multilingual model with NVIDIA NeMo. The generated synthetic data can, in turn, be used to further fine-tune the models.

NVIDIA Riva Overview#

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automated speech recognition (ASR)

  • Text-to-Speech synthesis (TTS)

  • Neural Machine Translation (NMT)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will perform data generation using a Riva NMT Multilingual model with NVIDIA NeMo.
To understand the basics of Riva NMT APIs, refer to the “How do I perform Language Translation using Riva NMT APIs with out-of-the-box models?” tutorial in Riva NMT Tutorials.

For more information about Riva, refer to the Riva developer documentation.
For more information about Riva NMT, refer to the Riva NMT documentation.

NVIDIA NeMo Overview#

NVIDIA NeMo is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures.

For more information about NeMo, refer to the NeMo product page and documentation. The open-source NeMo repository can be found here.

Generating synthetic data using Riva NMT Multilingual model with NVIDIA NeMo#

For this tutorial, we will be using the Riva NMT Multilingual En-to-Any model on the Scielo English-Spanish dataset to generate data in French.

The process of synthetic data generation here can be split into the following steps:

  1. Requirements and Setup.

  2. Data preprocessing (this may vary based on the actual data you use; refer to the fine-tuning tutorial for more detailed preprocessing).

  3. Running inference using the NMT model with NeMo.

  4. Refer to the fine-tuning tutorial for using this data to customize the OOTB model.

Let’s walk through each of these steps in detail.

Step 1. Requirements and Setup#

This tutorial needs to be run from inside a NeMo Docker container. If you are not running this tutorial through a NeMo Docker container, refer to the Riva NMT Tutorials to get started.

Before we get into the Requirements and Setup, let us create a base directory for our work here.

import os

base_dir = "NMTSynDataGeneration"
!mkdir $base_dir
# Store the absolute path so later cells can reference it unambiguously.
base_dir = os.path.abspath(base_dir)
  1. Clone the NeMo GitHub repository.

NeMoBranch = "v1.17.0_pt_23.04"
!git clone -b $NeMoBranch https://github.com/NVIDIA/NeMo $base_dir/NeMo
!apt-get update && apt-get install -y libsndfile1 ffmpeg
!pip3 install "cython<3.0.0" wheel && pip3 install pyyaml==5.4.1 --no-build-isolation
%cd $base_dir/NeMo
!./reinstall.sh
!pip install torchmetrics==0.11.4
%cd ..
  2. Check the CUDA installation.

import torch

# This should return True when a CUDA-capable GPU is visible to PyTorch.
torch.cuda.is_available()
  3. Install Apex (only needed if you are not using the NeMo container).

!git clone https://github.com/NVIDIA/apex.git
%cd apex
!git checkout a32d7a6dddcf4e39d241b0d139c222a97c91887d
!pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./
%cd ..

Data download#

Let us download the Scielo English-Spanish dataset. Specifically, we are going to download the Moses version of the dataset, which consists of two files, en_es.en and en_es.es. Each newline-separated entry in the en_es.en file is a translation of the corresponding entry in the en_es.es file, and vice versa.

data_dir = base_dir + "/data"
!mkdir $data_dir

# Download the Scielo dataset
!wget -P $data_dir https://figshare.com/ndownloader/files/14019287
# Untar the downloaded Scielo dataset
!tar -xvf $data_dir/14019287 -C $data_dir
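
To confirm that the two files are line-aligned, here is an optional sanity check (a minimal sketch using the files extracted above) that compares line counts and prints the first parallel pair:

# Optional sanity check: both files should have the same number of lines,
# and line N of one file should be the translation of line N of the other.
with open(data_dir + "/en_es.en") as f_en, open(data_dir + "/en_es.es") as f_es:
    en_lines = f_en.readlines()
    es_lines = f_es.readlines()
print(len(en_lines), len(es_lines))  # counts should match
print("EN:", en_lines[0].strip())
print("ES:", es_lines[0].strip())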

Step 2. Data preprocessing#

Data preprocessing consists of multiple steps to improve the quality of the dataset. The NeMo documentation provides detailed instructions about the 8-step data preprocessing for NMT. NeMo also provides a Jupyter notebook that takes users programmatically through the different preprocessing steps. Note that depending on the dataset, some or all preprocessing steps can be skipped.

To simplify the process in the Riva NMT program, we perform only language ID filtering before data generation, to remove any noise that may be present in the raw dataset. The input to these scripts is a pair of parallel corpus (i.e., source and target language) data files. In this tutorial, we are using the Moses version of the Scielo dataset, which directly provides us the source (en_es.en) and target (en_es.es) data files. If a dataset does not directly provide these files, then we first need to generate these two files from the dataset before using the preprocessing scripts, as sketched below.
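
For example, here is a minimal sketch of how the two parallel files could be produced from a hypothetical tab-separated corpus (the file names corpus.tsv, corpus.en, and corpus.es are illustrative assumptions, with one source<TAB>target pair per line):

# Minimal sketch (hypothetical file names): split a tab-separated parallel
# corpus into the two files the preprocessing scripts expect.
with open("corpus.tsv") as f_in, \
        open("corpus.en", "w") as f_src, \
        open("corpus.es", "w") as f_tgt:
    for line in f_in:
        src, tgt = line.rstrip("\n").split("\t", 1)
        f_src.write(src + "\n")
        f_tgt.write(tgt + "\n")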

Language filtering#

The language filtering preprocessing script is used to verify the language of each line in a machine translation dataset, using the fastText language identification model. When used on a parallel corpus, it verifies both the source and the target language. Filtered data is stored in the files specified by output-src and output-tgt, and the removed lines are put into the files specified by removed-src and removed-tgt. If a line's language cannot be detected (for example, a date), the line is removed.

This script exposes a number of parameters, the most common of which are:

  • input-src: Path to the input file which contains text in source language.

  • input-tgt: Path to the input file which contains text in target language.

  • output-src: File path where the source language’s filtered data is to be saved.

  • output-tgt: File path where the target language’s filtered data is to be saved.

  • removed-src: File path where the discarded data from source language is to be saved.

  • removed-tgt: File path where the discarded data from target language is to be saved.

  • source-lang: Source language’s language code.

  • target-lang: Target language’s language code.

  • fasttext-model: Path to the fastText model. The description and download links are here.

# Let us first download the fasttext model.
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -O $data_dir/lid.176.bin
# Running the language filtering preprocessing script.
!python $base_dir/NeMo/scripts/neural_machine_translation/filter_langs_nmt.py \
    --input-src $data_dir/en_es.en \
    --input-tgt $data_dir/en_es.es \
    --output-src $data_dir/en_es_preprocessed.en \
    --output-tgt $data_dir/en_es_preprocessed.es \
    --removed-src $data_dir/en_es_garbage.en \
    --removed-tgt $data_dir/en_es_garbage.es \
    --source-lang en \
    --target-lang es \
    --fasttext-model $data_dir/lid.176.bin
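
To see what the filter checks per line, here is an optional minimal sketch, assuming the fasttext Python package is installed (for example, via pip install fasttext): it loads the same lid.176.bin model, predicts the language of a sample sentence, and then compares how many sentence pairs were kept versus discarded.

# Minimal sketch: the per-line check the filtering script applies, assuming
# the fasttext Python package is installed (pip install fasttext).
import fasttext

lid_model = fasttext.load_model(data_dir + "/lid.176.bin")
labels, probs = lid_model.predict("This sentence should be identified as English.")
print(labels[0], probs[0])  # expect __label__en with high confidence

# Compare how many sentence pairs were kept vs. filtered out.
!wc -l $data_dir/en_es_preprocessed.en $data_dir/en_es_garbage.en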

Download the OOTB model to perform data generation#

# Create directory to hold model
model_dir = base_dir + "/model"
!mkdir $model_dir

# Download the NMT model from NGC using wget command
!wget -O $model_dir/megatronnmt_en_any_500m_1.0.0.zip --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/megatronnmt_en_any_500m/versions/1.0.0/zip 

# Unzip the downloaded model zip file.
!unzip -o $model_dir/megatronnmt_en_any_500m_1.0.0.zip -d $model_dir/pretrained_ckpt

# Alternate way to download the model from NGC using the NGC CLI (make sure the NGC CLI is installed and set up):
#!cd $model_dir && ngc registry model download-version "nvidia/nemo/megatronnmt_en_any_500m:1.0.0"
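
Optionally, verify that the .nemo checkpoint was extracted to the location the inference step below expects:

# Optional: confirm the extracted checkpoint is in place.
!ls -lh $model_dir/pretrained_ckpt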

Step 3. Running inference using the NMT model with NeMo for data generation#

!python $base_dir/NeMo/examples/nlp/machine_translation/nmt_transformer_infer_megatron.py \
     model_file=$model_dir/pretrained_ckpt/megatronnmt_en_any_500m.nemo \
     srctext=$data_dir/en_es_preprocessed.en \
     tgtout=$data_dir/en_fr2.fr \
     source_lang=en \
     target_lang=fr \
     batch_size=10 \
     trainer.precision=32
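
Once inference completes, en_fr2.fr contains the synthetic French text, line-aligned with the English source file. As an optional check, here is a minimal sketch that previews a few of the resulting English-French pairs:

# Optional: preview a few source/target pairs from the synthetic corpus.
with open(data_dir + "/en_es_preprocessed.en") as f_src, \
        open(data_dir + "/en_fr2.fr") as f_tgt:
    for _ in range(3):
        print("EN:", next(f_src).strip())
        print("FR:", next(f_tgt).strip())
        print()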

Step 4. Refer to the fine-tuning tutorial for using this data to customize the OOTB model.#

Lastly, follow the steps in the fine-tuning tutorial in Riva NMT Tutorials to use this data for customizing the OOTB model.