How to Fine-Tune a Riva ASR Acoustic Model with NVIDIA NeMo#

This tutorial walks you through how to fine-tune an NVIDIA Riva ASR acoustic model with NVIDIA NeMo.

Important: If you plan to fine-tune an ASR acoustic model using the same tokenizer with which the model was trained, skip this tutorial and refer to the “Sub-word Encoding CTC Model” section (starting with the “Load pre-trained model” subsection) of the NeMo ASR Language Finetuning tutorial.

NVIDIA Riva Overview#

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding (NLU) services such as:

  • Automatic speech recognition (ASR).

  • Text-to-speech (TTS) synthesis.

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will fine-tune a Riva ASR acoustic model with NeMo.
To understand the basics of Riva ASR APIs, refer to Getting started with Riva ASR in Python.
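
For a quick taste of those APIs, here is a minimal sketch of offline transcription with the Riva Python client. It is an illustration only: it assumes a running Riva server at localhost:50051, the nvidia-riva-client package, and a 16 kHz mono WAV file named sample.wav; refer to the tutorial linked above for the full walkthrough.

import riva.client

# Connect to a running Riva server (adjust the URI for your deployment).
auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,  # assumes 16 kHz audio
    language_code="en-US",
    max_alternatives=1,
)
with open("sample.wav", "rb") as f:  # hypothetical input file
    response = asr_service.offline_recognize(f.read(), config)
print(response.results[0].alternatives[0].transcript)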

For more information about Riva, refer to the Riva developer documentation.

NeMo (Neural Modules)#

NVIDIA NeMo is an open-source framework for building, training, and fine-tuning GPU-accelerated speech AI and NLU models with a simple Python interface. For information about how to set up NeMo, refer to the NeMo GitHub instructions.

"""
You can run this tutorial either locally (if you have all the dependencies and a GPU) or on Google Colab.

Perform the following steps to set up the tutorial in Google Colab:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub.
   a. Click **File** > **Upload Notebook** > **GITHUB** tab > copy/paste the GitHub URL.
3. Connect to an instance with a GPU.
   a. Click **Runtime** > **Change runtime type** > select **GPU** for the hardware accelerator.
4. Run this cell to set up the dependencies.
5. Restart the runtime.
   a. Click **Runtime** > **Restart runtime** for any upgraded packages to take effect.
"""

# Install Dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install text-unidecode
!pip install "matplotlib>=3.3.2"
!pip install Cython

## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, in the case where you want to use the "Run All Cells" (or similar) option, 
uncomment `exit()` below to crash and restart the kernel.
"""
# exit()

Fine-Tuning an ASR model with NeMo#

Download the Data#

In this tutorial, we will use the popular AN4 dataset, a small corpus recorded at Carnegie Mellon University of people spelling out addresses, names, and other alphanumeric strings. Let’s download it.

! wget https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz  # for the original source, please visit http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz

After downloading, untar the dataset and move it to the correct directory.

import os
DATA_DIR = os.getcwd()
os.environ["DATA_DIR"] = DATA_DIR
! tar -xvf an4_sphere.tar.gz 
! mv an4 $DATA_DIR

Pre-Processing#

This step converts the .sph files into .wav files and splits the data into training and test sets. It also generates the manifest (metadata) files consumed by the data loader during training and testing.

import json, librosa, os, glob
import subprocess


source_data_dir = f"{DATA_DIR}/an4"
target_data_dir = f"{DATA_DIR}/an4_converted"

def an4_build_manifest(transcripts_path, manifest_path, target_wavs_dir):
    """Build an AN4 manifest from a given transcript file."""
    with open(transcripts_path, 'r') as fin:
        with open(manifest_path, 'w') as fout:
            for line in fin:
                # Lines look like this:
                # <s> transcript </s> (fileID)
                transcript = line[: line.find('(') - 1].lower()
                transcript = transcript.replace('<s>', '').replace('</s>', '')
                transcript = transcript.strip()

                file_id = line[line.find('(') + 1 : -2]  # e.g. "cen4-fash-b"
                audio_path = os.path.join(target_wavs_dir, file_id + '.wav')

                # librosa >= 0.10 uses `path=` (older releases used `filename=`)
                duration = librosa.get_duration(path=audio_path)

                # Write the metadata to the manifest
                metadata = {"audio_filepath": audio_path, "duration": duration, "text": transcript}
                json.dump(metadata, fout)
                fout.write('\n')

"""Process AN4 dataset."""
if not os.path.exists(source_data_dir):
    link = 'http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz'
    raise ValueError(
        f"Data not found at `{source_data_dir}`. Please download the AN4 dataset from `{link}` "
        f"and extract it into the folder specified by the `source_data_dir` argument."
    )

# Convert SPH files to WAV files
sph_list = glob.glob(os.path.join(source_data_dir, '**/*.sph'), recursive=True)
target_wavs_dir = os.path.join(target_data_dir, 'wavs')
if not os.path.exists(target_wavs_dir):
    print(f"Creating directories for {target_wavs_dir}.")
    os.makedirs(target_wavs_dir)

for sph_path in sph_list:
    wav_path = os.path.join(target_wavs_dir, os.path.splitext(os.path.basename(sph_path))[0] + '.wav')
    cmd = ["sox", sph_path, wav_path]
    subprocess.run(cmd, check=True)

# Build AN4 manifests
train_transcripts = os.path.join(source_data_dir, 'etc/an4_train.transcription')
train_manifest = os.path.join(target_data_dir, 'train_manifest.json')
an4_build_manifest(train_transcripts, train_manifest, target_wavs_dir)

test_transcripts = os.path.join(source_data_dir, 'etc/an4_test.transcription')
test_manifest = os.path.join(target_data_dir, 'test_manifest.json')
an4_build_manifest(test_transcripts, test_manifest, target_wavs_dir)
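
Optionally, print the first few entries of the training manifest to verify that each line is a JSON record with the expected audio_filepath, duration, and text fields:

# Sanity check: inspect the first few manifest entries.
with open(train_manifest, 'r') as f:
    for _ in range(3):
        print(json.loads(f.readline()))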

Let’s listen to a sample audio file.

# change path of the file here
import os
import IPython.display as ipd
path = os.environ["DATA_DIR"] + '/an4_converted/wavs/an268-mbmg-b.wav'
ipd.Audio(path)

Training#

Create Tokenizer#

Before we can run the actual training, we need to create a tokenizer, because this ASR model uses word-piece encoding. (Character-based models don’t need one, since each character is its own element of the vocabulary.) We can use NeMo’s process_asr_text_tokenizer.py script to build the tokenizer and generate the subword vocabulary used in training. The vocabulary size (vocab_size) should match the vocabulary size of the ASR model. We will clone the NeMo GitHub repository to use the scripts and examples available there.

# clone NeMo locally
NEMO_DIR = 'FIX_ME/path/to/NeMo'
! git clone https://github.com/NVIDIA/NeMo $NEMO_DIR

# create the tokenizer
!python $NEMO_DIR/scripts/tokenizers/process_asr_text_tokenizer.py \
         --manifest=$DATA_DIR/an4_converted/train_manifest.json \
         --data_root=$DATA_DIR/an4 \
         --vocab_size=128 \
         --tokenizer=spe \
         --spe_type=unigram
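
Optionally, you can inspect the resulting tokenizer with the sentencepiece library. The tokenizer.model path below assumes the default output layout of process_asr_text_tokenizer.py (the same directory passed to training later):

import sentencepiece as spm

# Load the freshly built SentencePiece model and tokenize a sample phrase.
sp = spm.SentencePieceProcessor(model_file=f"{DATA_DIR}/an4/tokenizer_spe_unigram_v128/tokenizer.model")
print(sp.vocab_size())  # should match --vocab_size=128
print(sp.encode("september twenty fifth", out_type=str))  # subword pieces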

Training Conformer-CTC#

NeMo uses .yaml files to configure the training parameters. You can update them directly by editing the configuration file, or override them from the command-line interface. For example, to change the number of epochs along with the learning rate, add trainer.max_epochs=100 and model.optim.lr=0.02 to the training command.
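
As a minimal sketch of the configuration-file route, you can load and edit the YAML programmatically with OmegaConf. This assumes the NeMo repository cloned above; the CLI overrides in the command that follows achieve the same result without touching the file.

from omegaconf import OmegaConf

# Load the Conformer-CTC config, tweak two parameters, and write it back.
config_path = f"{NEMO_DIR}/examples/asr/conf/conformer/conformer_ctc_bpe.yaml"
cfg = OmegaConf.load(config_path)
cfg.trainer.max_epochs = 100  # equivalent to trainer.max_epochs=100 on the CLI
cfg.model.optim.lr = 0.02     # equivalent to model.optim.lr=0.02 on the CLI
OmegaConf.save(cfg, config_path)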

The following sample command uses the speech_to_text_ctc_bpe.py script in the examples folder to train/fine-tune a Conformer-CTC ASR model for 1 epoch. For other ASR models like Citrinet, you may find the appropriate config files in the NeMo GitHub repo under examples/asr/conf/.

# To fully train the model from scratch, you'll need to increase trainer.max_epochs from 1.
# Empirical evidence suggests that around 200 epochs should suffice.
!python $NEMO_DIR/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
    --config-path=../conf/conformer/ --config-name=conformer_ctc_bpe \
    +init_from_pretrained_model=stt_en_conformer_ctc_large \
    model.train_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
    model.validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
    model.tokenizer.dir=$DATA_DIR/an4/tokenizer_spe_unigram_v128 \
    trainer.devices=1 \
    trainer.max_epochs=1 \
    model.optim.name="adamw" \
    model.optim.lr=1.0 \
    model.optim.weight_decay=0.001 \
    model.optim.sched.warmup_steps=2000 \
    ++exp_manager.exp_dir=$DATA_DIR/checkpoints \
    ++exp_manager.version=test \
    ++exp_manager.use_datetime_version=False

!ls $DATA_DIR/checkpoints/Conformer-CTC-BPE/test/checkpoints/
nemo_file_path = os.path.join(DATA_DIR, 'checkpoints/Conformer-CTC-BPE/test/checkpoints/Conformer-CTC-BPE.nemo')
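
As a quick sanity check, you can restore the fine-tuned checkpoint in Python and transcribe the sample file we listened to earlier:

import nemo.collections.asr as nemo_asr

# Restore the fine-tuned checkpoint and run it on one AN4 utterance.
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(nemo_file_path)
print(asr_model.transcribe([f"{DATA_DIR}/an4_converted/wavs/an268-mbmg-b.wav"]))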

ASR Evaluation#

Now that we have a trained model, let’s check how well it performs.

!python $NEMO_DIR/examples/asr/speech_to_text_eval.py \
    model_path=$nemo_file_path \
    dataset_manifest=$DATA_DIR/an4_converted/test_manifest.json \
    output_filename=./test_manifest_predictions.json \
    batch_size=32 \
    amp=True
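
You can also recompute the word error rate (WER) yourself from the predictions manifest. This sketch assumes the script writes its hypotheses under the pred_text key, its default behavior:

from nemo.collections.asr.metrics.wer import word_error_rate

# Collect reference and hypothesis transcripts from the predictions manifest.
refs, hyps = [], []
with open("test_manifest_predictions.json", 'r') as f:
    for line in f:
        entry = json.loads(line)
        refs.append(entry["text"])
        hyps.append(entry["pred_text"])  # assumed output key of speech_to_text_eval.py
print(f"WER: {word_error_rate(hypotheses=hyps, references=refs):.2%}")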

ASR Model Export#

With NeMo, you can also export your model in a format that can be deployed using NVIDIA Riva: a highly performant application framework for multi-modal conversational AI services using GPUs. The nemo2riva tool used below converts the trained .nemo checkpoint into a .riva file ready for deployment.

Install the Packages#

We will now install the NeMo and nemo2riva packages. nemo2riva is available on NVIDIA NGC. Make sure you install the NGC CLI before running the following commands.

# version.py is provided alongside this notebook in the Riva tutorials repository
from version import __riva_version__
print(__riva_version__)
!pip install nvidia-pyindex
!ngc registry resource download-version "nvidia/riva/riva_quickstart:"$__riva_version__
!pip install nemo2riva
!pip install protobuf==3.20.0

Convert to Riva#

Convert the downloaded model to the .riva format. We will set the encryption key with --key=nemotoriva. Choose a different encryption key value when generating .riva models for production.

riva_file_path = nemo_file_path[:-5] + ".riva"
!nemo2riva --out {riva_file_path} --key=nemotoriva {nemo_file_path}
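
A quick check that the export produced the expected artifact:

# Confirm that the .riva file was written and report its size.
assert os.path.exists(riva_file_path), f"{riva_file_path} not found"
print(f"{riva_file_path}: {os.path.getsize(riva_file_path) / 1e6:.1f} MB")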

More Resources#

You can find more information about working with NeMo’s ASR models in the ASR section of the NeMo tutorials.

What’s Next?#

You can use NeMo to build custom models for your own applications, and deploy them with NVIDIA Riva! Refer to the Conformer-CTC deployment tutorial.