How to fine-tune a Riva NMT Bilingual model with NVIDIA NeMo#

This tutorial walks you through how to fine-tune a Riva NMT Bilingual model with NVIDIA NeMo.

NVIDIA Riva Overview#

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automated speech recognition (ASR)

  • Text-to-Speech synthesis (TTS)

  • Neural Machine Translation (NMT)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will fine-tune a Riva NMT Bilingual model with NVIDIA NeMo.
To understand the basics of Riva NMT APIs, refer to the “How do I perform Language Translation using Riva NMT APIs with out-of-the-box models?” tutorial in Riva NMT Tutorials.

For more information about Riva, refer to the Riva developer documentation.
For more information about Riva NMT, refer to the Riva NMT documentation.

NVIDIA NeMo Overview#

NVIDIA NeMo is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures.

For more information about NeMo, refer to the NeMo product page and documentation. The open-source NeMo repository can be found here.

Fine-tuning Riva NMT Bilingual model with NVIDIA NeMo#

For this tutorial, we will be fine-tuning the Riva NMT Bilingual English-to-Spanish model on the Scielo English-Spanish dataset.

This tutorial covers fine-tuning only the NMT Bilingual model. Fine-tuning a Multilingual model involves additional challenges, such as assembling a balanced dataset that covers multiple languages, and a tutorial covering it will be published in a future release.

The process of fine-tuning here can be split into six steps:

  1. Data download.

  2. Data preprocessing.

  3. Fine-tuning the NMT model with NeMo.

  4. Evaluating the fine-tuned NMT model with NeMo.

  5. Exporting the NeMo model.

  6. Deploying the fine-tuned NeMo NMT model on the Riva Speech Skills server.

Let’s walk through each of these steps in detail.

Requirements and Setup#

This tutorial needs to be run from inside a NeMo docker container. If you are not running this tutorial through a NeMo docker container, please refer to the Riva NMT Tutorials' README.md to get started.

Before we get into the Requirements and Setup, let us create a base directory for our work here.

base_dir = "NMTFinetuning"
!mkdir $base_dir

  1. Clone the NeMo GitHub repository.

NeMoBranch = "main"
!git clone -b $NeMoBranch https://github.com/NVIDIA/NeMo $base_dir/NeMo

Check CUDA installation.

import torch
torch.cuda.is_available()
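If this returns True, we can also check which GPU will be used (a quick optional check; it assumes at least one CUDA device is visible):

# Print the name of the first visible CUDA device.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))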
WARNING: You may need to install `apex`.

!git clone https://github.com/ericharper/apex.git
!cd apex && git checkout nm_v1.15.0 && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./

  2. Install the nemo2riva library from the Riva Quick Start Guide.

# Install the `nemo2riva` library
!python3 -m pip install nemo2riva

  3. Install additional libraries required for this tutorial.

!python3 -m pip install scikit-learn

Step 1. Data download#

Let us download the Scielo English-Spanish dataset. Specifically, we are going to download the Moses version of the dataset, which consists of two files, en_es.en and en_es.es. Each newline-separated entry in the en_es.en file is a translation of the corresponding entry in the en_es.es file, and vice versa.

data_dir = base_dir + "/data"
!mkdir $data_dir

# Download the Scielo dataset
!wget -P $data_dir https://figshare.com/ndownloader/files/14019287
# Untar the downloaded Scielo dataset
!tar -xvf $data_dir/14019287 -C $data_dir
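
To confirm the line-by-line alignment described above, we can print the first few parallel entries (a quick sanity check; it assumes the archive extracted en_es.en and en_es.es directly into the data directory, as the commands below expect):

from itertools import islice

# Print the first few parallel entries to verify that en_es.en and en_es.es are aligned line by line.
with open(data_dir + "/en_es.en") as f_en, open(data_dir + "/en_es.es") as f_es:
    for en_line, es_line in islice(zip(f_en, f_es), 3):
        print("EN:", en_line.strip())
        print("ES:", es_line.strip())
        print("---")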

Step 2. Data preprocessing#

Data preprocessing consists of multiple steps to improve the quality of the dataset. The NeMo documentation provides detailed instructions about the 8-step data preprocessing for NMT. NeMo also provides a Jupyter notebook that takes users programmatically through the different preprocessing steps. Note that depending on the dataset, some or all preprocessing steps can be skipped.

To simplify the fine-tuning process in the Riva NMT program, three preprocessing scripts are provided through the NeMo repository. The input to these scripts is a pair of parallel corpus data files (one for the source and one for the target language). In this tutorial, we are using the Moses version of the Scielo dataset, which directly provides the source (en_es.en) and target (en_es.es) data files. If a dataset does not directly provide these files, they first need to be generated from the dataset before the preprocessing scripts can be used.

Language filtering#

The language filtering preprocessing script verifies the language of the sentences in a machine translation dataset, using the fastText language identification model. When used on a parallel corpus, it verifies both the source and the target language. Filtered data is stored in the files specified by output-src and output-tgt, and the removed lines are put into the files specified by removed-src and removed-tgt. If the language of a line cannot be detected (for example, when the line contains only a date), the line is removed.

This script exposes a number of parameters, the most common of which are:

  • input-src: Path to the input file which contains text in source language.

  • input-tgt: Path to the input file which contains text in target language.

  • output-src: File path where the source language’s filtered data is to be saved.

  • output-tgt: File path where the target language’s filtered data is to be saved.

  • removed-src: File path where the discarded data from source language is to be saved.

  • removed-tgt: File path where the discarded data from target language is to be saved.

  • source-lang: Source language’s language code.

  • target-lang: Target language’s language code.

  • fasttext-model: Path to fasttext model. The description and download links are here.

# Let us first download the fasttext model.
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -O $data_dir/lid.176.bin
# Running the language filtering preprocessing script.
!python $base_dir/NeMo/scripts/neural_machine_translation/filter_langs_nmt.py \
    --input-src $data_dir/en_es.en \
    --input-tgt $data_dir/en_es.es \
    --output-src $data_dir/en_es_preprocessed1.en \
    --output-tgt $data_dir/en_es_preprocessed1.es \
    --removed-src $data_dir/en_es_garbage1.en \
    --removed-tgt $data_dir/en_es_garbage1.es \
    --source-lang en \
    --target-lang es \
    --fasttext-model $data_dir/lid.176.bin
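
After the script finishes, a quick line count shows how many sentence pairs were kept and how many were removed (assuming the output files above were written successfully):

# Compare the number of kept vs. removed entries after language filtering.
!wc -l $data_dir/en_es_preprocessed1.en $data_dir/en_es_garbage1.en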

Length filtering#

The length filtering script is a multi-processed script that filters a parallel corpus, removing sentences that are shorter than a minimum length or longer than a maximum length. It also filters based on the length ratio between source and target sentences.

This script exposes a number of parameters, the most common of which are:

  • input-src: Path to the input file which contains text in source language.

  • input-tgt: Path to the input file which contains text in target language.

  • output-src: File path where the source language’s filtered data is to be saved.

  • output-tgt: File path where the target language’s filtered data is to be saved.

  • removed-src: File path where the discarded data from source language is to be saved.

  • removed-tgt: File path where the discarded data from target language is to be saved.

  • min-length: Minimum sequence length.

  • max-length: Maximum sequence length.

  • ratio: Ratio of the length of the source sentence to the length of the target sentence.

# Running the length filtering preprocessing script.
!python $base_dir/NeMo/scripts/neural_machine_translation/length_ratio_filter.py \
    --input-src $data_dir/en_es_preprocessed1.en \
    --input-tgt $data_dir/en_es_preprocessed1.es \
    --output-src $data_dir/en_es_preprocessed2.en \
    --output-tgt $data_dir/en_es_preprocessed2.es \
    --removed-src $data_dir/en_es_garbage2.en \
    --removed-tgt $data_dir/en_es_garbage2.es \
    --min-length 1 \
    --max-length 512 \
    --ratio 1.3
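
For intuition, the ratio check above roughly corresponds to the following comparison on a single sentence pair. This is an illustrative sketch only (the passes_length_ratio helper is hypothetical); the actual script also applies the min/max length bounds and processes the corpus in parallel.

# Illustrative only: a pair is kept when neither side is more than `ratio` times longer than the other.
def passes_length_ratio(src_tokens, tgt_tokens, ratio=1.3):
    src_len, tgt_len = len(src_tokens), len(tgt_tokens)
    return src_len / tgt_len <= ratio and tgt_len / src_len <= ratio

print(passes_length_ratio("this is a short sentence".split(), "esta es una frase corta".split()))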

Tokenization and Normalization#

The tokenization and normalization script normalizes and tokenizes the input source and target language data.

This script exposes a number of parameters, the most common of which are:

  • input-src: Path to the input file which contains text in source language.

  • input-tgt: Path to the input file which contains text in target language.

  • output-src: File path where the normalized and tokenized source language’s data is to be saved.

  • output-tgt: File path where the normalized and tokenized target language’s data is to be saved.

  • source-lang: Source language’s language code.

  • target-lang: Target language’s language code.

!python $base_dir/NeMo/scripts/neural_machine_translation/preprocess_tokenization_normalization.py \
    --input-src $data_dir/en_es_preprocessed2.en \
    --input-tgt $data_dir/en_es_preprocessed2.es \
    --output-src $data_dir/en_es_final.en \
    --output-tgt $data_dir/en_es_final.es \
    --source-lang en \
    --target-lang es
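
We can compare a sentence before and after this step to see what normalization and tokenization change (a quick check; the script processes the corpus line by line, so the files stay aligned):

# Compare the first sentence before and after normalization/tokenization.
with open(data_dir + "/en_es_preprocessed2.en") as f_before, open(data_dir + "/en_es_final.en") as f_after:
    print("Before:", f_before.readline().strip())
    print("After :", f_after.readline().strip())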

Training, validation, and test split#

For the last step of data preprocessing, we are going to split our dataset into training, validation, and test sets.
This is an optional step - many datasets already come with predefined splits, but the Scielo dataset used in this tutorial does not, so we will use scikit-learn to split it ourselves.

"""
    Read en_es_final.en and en_es_final.es files into memory
"""
def read_data_from_file(filename):
    with open(filename) as f:
        lines = f.readlines()
    return lines
    
en_es_final_en = read_data_from_file(data_dir + "/en_es_final.en")
en_es_final_es = read_data_from_file(data_dir + "/en_es_final.es")

print("Number of entries in the final Scielo English-Spanish dataset = ", len(en_es_final_en))
"""
    Split the dataset into train, test and val using scikit learn's train_test_split
"""
from sklearn.model_selection import train_test_split

test_ratio = 0.10
validation_ratio = 0.05
train_ratio = 1.0 - validation_ratio - test_ratio

en_es_final_en_trainval, en_es_final_en_test, en_es_final_es_trainval, en_es_final_es_test = \
    train_test_split(en_es_final_en, en_es_final_es, test_size=test_ratio, random_state=1)

# The validation split is taken from the remaining train+val data, so rescale the
# validation fraction to keep it at validation_ratio of the full dataset.
en_es_final_en_train, en_es_final_en_val, en_es_final_es_train, en_es_final_es_val = \
    train_test_split(en_es_final_en_trainval, en_es_final_es_trainval,
                     test_size=validation_ratio / (1.0 - test_ratio), random_state=1)

print("Number of entries in the final Scielo English-Spanish training dataset = ", len(en_es_final_en_train))
print("Number of entries in the final Scielo English-Spanish validation dataset = ", len(en_es_final_en_val))
print("Number of entries in the final Scielo English-Spanish testing dataset = ", len(en_es_final_en_test))
"""
    Write the train, test and val data into files
"""
en_es_final_en_train_filename = "en_es_final_train.en"
en_es_final_en_val_filename = "en_es_final_val.en"
en_es_final_en_test_filename = "en_es_final_test.en"
en_es_final_es_train_filename = "en_es_final_train.es"
en_es_final_es_val_filename = "en_es_final_val.es"
en_es_final_es_test_filename = "en_es_final_test.es"

en_es_final_en_train_filepath = data_dir + "/" + en_es_final_en_train_filename
en_es_final_en_val_filepath = data_dir + "/" + en_es_final_en_val_filename
en_es_final_en_test_filepath = data_dir + "/" + en_es_final_en_test_filename
en_es_final_es_train_filepath = data_dir + "/" + en_es_final_es_train_filename
en_es_final_es_val_filepath = data_dir + "/" + en_es_final_es_val_filename
en_es_final_es_test_filepath = data_dir + "/" + en_es_final_es_test_filename

def write_data_to_file(data, filename):
    # Use a context manager so the file is always closed after writing.
    with open(filename, "w") as f:
        for data_entry in data:
            f.write(data_entry)
    
write_data_to_file(en_es_final_en_train, en_es_final_en_train_filepath)
write_data_to_file(en_es_final_en_val, en_es_final_en_val_filepath)
write_data_to_file(en_es_final_en_test, en_es_final_en_test_filepath)
write_data_to_file(en_es_final_es_train, en_es_final_es_train_filepath)
write_data_to_file(en_es_final_es_val, en_es_final_es_val_filepath)
write_data_to_file(en_es_final_es_test, en_es_final_es_test_filepath)    
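
As a sanity check, each source split and its corresponding target split should contain the same number of entries, since the NMT training scripts expect the files to stay line-aligned:

# The source (.en) and target (.es) files of each split must remain line-aligned.
assert len(en_es_final_en_train) == len(en_es_final_es_train)
assert len(en_es_final_en_val) == len(en_es_final_es_val)
assert len(en_es_final_en_test) == len(en_es_final_es_test)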

Step 3. Fine-tuning the NMT model with NeMo.#

NeMo provides the finetuning script needed to fine-tune a bilingual NMT NeMo model. We can use this script to launch training.

We start by downloading the out-of-the-box (OOTB) English-to-Spanish NMT NeMo model from NGC. It is this model that we will fine-tune on the Scielo dataset.

# Create directory to hold model
model_dir = base_dir + "/model"
!mkdir $model_dir

# Download the NMT model from NGC using wget command
!wget -O $model_dir/nmt_en_es_transformer24x6_1.5.zip --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/nmt_en_es_transformer24x6/versions/1.5/zip
# Unzip the downloaded model zip file.
!unzip $model_dir/nmt_en_es_transformer24x6_1.5.zip -d $model_dir/pretrained_ckpt

# Alternate way to download the model from NGC using the NGC CLI (please make sure to install and set up the NGC CLI):
#!cd $model_dir && ngc registry model download-version "nvidia/nemo/nmt_en_es_transformer24x6:1.5"

The NeMo NMT finetuning script exposes a number of parameters:

  • model_path: Path to the local OOTB .nemo model.

  • trainer.devices: Number of GPUs to allocate for finetuning.

  • trainer.max_epochs: The maximum number of epochs to run finetuning for.

  • trainer.max_steps: The maximum number of steps to run finetuning for. max_steps can override max_epochs, as we do in this tutorial.

  • trainer.val_check_interval: This parameter decides the number of training steps to perform before running validation on the entire validation dataset.

  • model.train_ds.tgt_file_name: Path to the training dataset’s target language’s data file. In our case, this is the en_es_final_train.es file.

  • model.train_ds.src_file_name: Path to the training dataset’s source language’s data file. In our case, this is the en_es_final_train.en file.

  • model.train_ds.tokens_in_batch: Number of tokens in a single training batch. Please note that this is not the number of data entries in a training batch, but the number of tokens.

  • model.validation_ds.tgt_file_name: Path to the validation dataset’s target language’s data file. In our case, this is the en_es_final_val.es file.

  • model.validation_ds.src_file_name: Path to the validation dataset’s source language’s data file. In our case, this is the en_es_final_val.en file.

  • model.validation_ds.tokens_in_batch: Number of tokens in a single batch during validation. Please note that validation runs over the entire validation dataset; this parameter only specifies the number of tokens in a single batch, and multiple batches may be run to cover the whole dataset.

  • model.test_ds.tgt_file_name: Path to the test dataset’s target language’s data file. In our case, this is the en_es_final_test.es file.

  • model.test_ds.src_file_name: Path to the test dataset’s source language’s data file. In our case, this is the en_es_final_test.en file.

  • exp_manager.exp_dir: Path to the experiment directory, which serves as the working directory for NeMo finetuning.

  • exp_manager.checkpoint_callback_params.monitor: The metric to monitor.

  • exp_manager.checkpoint_callback_params.mode: The mode of the metrics to monitor.

  • exp_manager.checkpoint_callback_params.save_best_model: Flag to indicate whether the best model must be saved after each training step.

!python $base_dir/NeMo/examples/nlp/machine_translation/enc_dec_nmt_finetune.py \
      model_path=$model_dir/pretrained_ckpt/en_es_24x6.nemo \
      trainer.devices=1 \
      ~trainer.max_epochs \
      +trainer.max_steps=1 \
      +trainer.val_check_interval=1 \
      model.train_ds.tgt_file_name=$en_es_final_es_train_filepath \
      model.train_ds.src_file_name=$en_es_final_en_train_filepath \
      model.train_ds.tokens_in_batch=1280 \
      model.validation_ds.tgt_file_name=$en_es_final_es_val_filepath \
      model.validation_ds.src_file_name=$en_es_final_en_val_filepath \
      model.validation_ds.tokens_in_batch=2000 \
      model.test_ds.tgt_file_name=$en_es_final_es_test_filepath \
      model.test_ds.src_file_name=$en_es_final_en_test_filepath \
      +exp_manager.exp_dir=$model_dir/results/finetune-test \
      +exp_manager.create_checkpoint_callback=True \
      +exp_manager.checkpoint_callback_params.monitor=val_sacreBLEU \
      +exp_manager.checkpoint_callback_params.mode=max \
      +exp_manager.checkpoint_callback_params.save_best_model=true
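
Once fine-tuning completes, the resulting .nemo checkpoint is written under the experiment directory. The timestamped run directory differs from run to run, so a quick way to locate the checkpoint (assuming the default experiment manager layout) is:

# List the .nemo checkpoints produced by fine-tuning.
!find $model_dir/results/finetune-test -name "*.nemo"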

Step 4. Evaluating the fine-tuned NMT model with NeMo.#

Now that we have a fine-tuned model, we need to check how well it performs.
We run inference with the NeMo-provided script nmt_transformer_infer.py on a small subset of the test dataset, first with the OOTB model and then with the fine-tuned model, and then compare the translations from the two models.

The NeMo inference script nmt_transformer_infer.py supports multiple input parameters, the most important of which are:

  • model: Path to the .nemo model to run inference on.

  • srctext: Path to the text file containing newline-separated input samples to run inference on.

  • tgtout: Path to the text file where translations are to be saved.

  • source_lang: Source language’s language code.

  • target_lang: Target language’s language code.

  • batch_size: Batch size for inference.

In this section, we learn to run inference with this script.

First, let us create a working directory for evaluation.

eval_dir = base_dir + "/eval"
!mkdir $eval_dir

We pick a small subset of the test data for inference and write it into a file.

infer_input_data_en = en_es_final_en_test[:10]
infer_input_data_es = en_es_final_es_test[:10]

infer_input_data_en_filename = "infer_input_data_en.en"
infer_input_data_en_filepath = eval_dir + "/" + infer_input_data_en_filename

# Write the English inference inputs to a file, one entry per line.
with open(infer_input_data_en_filepath, "w") as f:
    for infer_input_data_en_entry in infer_input_data_en:
        f.write(infer_input_data_en_entry)

Let us run inference on the NeMo NMT OOTB model.

infer_ootbmodel_output_data_es_filename = "infer_ootbmodel_output_data_es.es"
infer_ootbmodel_output_data_es_filepath = eval_dir + "/" + infer_ootbmodel_output_data_es_filename

!python $base_dir/NeMo/examples/nlp/machine_translation/nmt_transformer_infer.py \
    --model $model_dir/pretrained_ckpt/en_es_24x6.nemo \
    --srctext $infer_input_data_en_filepath \
    --tgtout $infer_ootbmodel_output_data_es_filepath \
    --source_lang en \
    --target_lang es \
    --batch_size 10

Now we run inference with the fine-tuned NeMo NMT model.
Please be sure to set the model parameter below to point to the fine-tuned .nemo checkpoint, which can be found under the $model_dir/results directory.

infer_finetuned_output_data_es_filename = "infer_finetuned_output_data_es.es"
infer_finetuned_output_data_es_filepath = eval_dir + "/" + infer_finetuned_output_data_es_filename

# Path to the fine-tuned .nemo checkpoint produced in Step 3. The timestamped run directory
# differs from run to run, so update this path to match your own fine-tuning results.
finetuned_nemo_ckpt = model_dir + "/results/finetune-test/AAYNBaseFineTune/<your-run-timestamp>/checkpoints/AAYNBaseFineTune.nemo"

!python $base_dir/NeMo/examples/nlp/machine_translation/nmt_transformer_infer.py \
    --model $finetuned_nemo_ckpt \
    --srctext $infer_input_data_en_filepath \
    --tgtout $infer_finetuned_output_data_es_filepath \
    --source_lang en \
    --target_lang es \
    --batch_size 10

Let us display the translations from both OOTB and finetuned models for our inference test subset.

with open(infer_ootbmodel_output_data_es_filepath) as f:
    infer_ootbmodel_output_data_es = f.readlines()

with open(infer_finetuned_output_data_es_filepath) as f:
    infer_finetuned_output_data_es = f.readlines()
    
for infer_input_data_en_entry, infer_input_data_es_entry, infer_ootbmodel_output_data_es_entry, infer_finetuned_output_data_es_entry in \
    zip(infer_input_data_en, infer_input_data_es, infer_ootbmodel_output_data_es, infer_finetuned_output_data_es):
    print("English: ", infer_input_data_en_entry)
    print("Spanish Translation - Ground Truth: ", infer_input_data_es_entry)
    print("Spanish Translation - OOTB model Generated:     ", infer_ootbmodel_output_data_es_entry)
    print("Spanish Translation - Finetuned model Generated:", infer_finetuned_output_data_es_entry)
    print("------------------------")

As can be seen above, the fine-tuned NMT model generates more accurate translations than the OOTB model on this subset of the Scielo test set.

Step 5. Exporting the NeMo model#

NeMo and Riva allow you to export your fine-tuned model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multimodal conversational AI services using GPUs.

Export to Riva#

Riva provides the nemo2riva tool, which can be used to convert a .nemo model to a .riva model. This tool is available through the Riva Quick Start Guide and was installed during the Requirements and Setup step above.

!nemo2riva --out $model_dir/en_es_24x6.riva $model_dir/results/finetune-test/AAYNBaseFineTune/2023-02-24_06-43-56/checkpoints/AAYNBaseFineTune.nemo
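
Note that the timestamped run directory in the command above comes from the original run of this tutorial. Replace the checkpoint path with the one produced by your own fine-tuning run in Step 3 (for example, the path returned by the find command shown earlier).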

Step 6. Deploying the fine-tuned NeMo NMT model on the Riva Speech Skills server.#

The NeMo-finetuned NMT model needs to be deployed on the Riva Speech Skills server for inference.
Please follow the “How to deploy a NeMo-finetuned NMT model on Riva Speech Skills server?” tutorial from Riva NMT Tutorials, which covers deploying the .riva file obtained in Step 5 on the Riva Speech Skills server.