How to Improve the Accuracy on Noisy Speech by Fine-Tuning the Acoustic Model (Conformer-CTC) in the Riva ASR Pipeline

This tutorial walks you through some of the advanced customization features of the Riva ASR pipeline by fine-tuning the acoustic model (Conformer-CTC). These customization features improve accuracy on specific speech scenarios, like background noise and different acoustic environments.

NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automatic speech recognition (ASR)

  • Text-to-Speech synthesis (TTS)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will show how to augment your training data (with background noise data) for fine-tuning the acoustic model (Conformer-CTC) to improve accuracy on audio with background noise.
To understand the basics of Riva ASR APIs, refer to Getting started with Riva ASR in Python.

For more information about Riva, refer to the Riva product page and documentation.

Data Preprocessing

For fine-tuning, we need audio data with background noise. If you already have such data, you can use it directly.
Otherwise, follow this tutorial, which takes the AN4 dataset and augments it with noise from the Room Impulse Response and Noise Database hosted on OpenSLR.

We use NVIDIA NeMo for the data preprocessing step.

NVIDIA NeMo Overview

NVIDIA NeMo is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train conversational AI models on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures. For more information about NeMo, refer to the NeMo product page and documentation. The open-source NeMo repository can be found here.

Requirements and Setup for Data Preprocessing

We will be using NVIDIA NeMo for this data preprocessing step. While we have provided the code necessary to clone the NeMo GitHub repo and install the NeMo Python modules in our recommended virtual environment, you might find it more convenient to install and run NeMo through NVIDIA’s PyTorch or NeMo Docker container. Pulling either image requires access to NGC. Refer to the instructions here to set up an appropriate Docker container.

Download and Process the AN4 Dataset

AN4 is a small dataset recorded and distributed by Carnegie Mellon University (CMU). It consists of recordings of people spelling out addresses, names, etc. Information about this dataset can be found on the official CMU site.

Let’s download the AN4 dataset tar file.

# Install the necessary dependencies
!pip install wget
!apt-get install -y sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install text-unidecode
# Quote the requirement so that the shell does not treat ">" as a redirect
!pip install "matplotlib>=3.3.2"
!pip install Cython

# Import the necessary dependencies.
import wget
import glob
import os
import subprocess
import tarfile
# This is the working directory for this part of the tutorial. 
working_dir = 'am_finetuning/'
!mkdir -p $working_dir

# The AN4 directory will be created in `data_dir`. It is currently set to the `working_dir`.
data_dir = os.path.abspath(working_dir)

# Download the AN4 dataset if it doesn't already exist in `data_dir`. 
# This will take a few moments...
# We also set `an4_path` which points to the downloaded AN4 dataset
if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):
    an4_url = ''
    an4_path = wget.download(an4_url, data_dir)
    print(f"AN4 dataset downloaded at: {an4_path}")
else:
    print("AN4 dataset tarfile already exists. Proceed to the next step.")
    an4_path = data_dir + '/an4_sphere.tar.gz'

Now, let’s untar the tar file to give us the dataset audio files in .sph format. Then, we’ll convert the .sph files to 16kHz .wav files using the SoX library.
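
The conversion step can also be written with an argument list instead of a shell string, which keeps paths that contain spaces intact. The following is a minimal sketch rather than the tutorial's own code: the helper names `sox_convert_cmd` and `convert_all` are hypothetical, and it assumes that the `sox` binary is on the PATH.

```python
import glob
import os
import subprocess


def sox_convert_cmd(src_path, sample_rate=16000):
    """Build the sox argument list that resamples `src_path` into a .wav file."""
    wav_path = os.path.splitext(src_path)[0] + ".wav"
    # Same form as the command used in this tutorial: sox <input> -r <rate> <output>
    return ["sox", src_path, "-r", str(sample_rate), wav_path]


def convert_all(root_dir, pattern="**/*.sph"):
    """Convert every matching file under `root_dir` and return the paths that failed."""
    failed = []
    for src in glob.glob(os.path.join(root_dir, pattern), recursive=True):
        result = subprocess.run(sox_convert_cmd(src), capture_output=True)
        if result.returncode != 0:
            failed.append(src)
    return failed
```

Returning the list of failed paths, instead of ignoring the exit code, makes it easier to spot files that were not converted before the manifests are built.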

if not os.path.exists(data_dir + '/an4/'):
    # Untar
    tar = tarfile.open(an4_path)
    tar.extractall(path=data_dir)
    print("Completed untarring the AN4 tarfile")
    # Convert .sph to .wav (using sox)
    print("Converting .sph to .wav...")
    sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)
    for sph_path in sph_list:
        wav_path = sph_path[:-4] + '.wav'
        # Convert to 16 kHz wav
        cmd = f"sox {sph_path} -r 16000 {wav_path}"
        subprocess.run(cmd, shell=True)
    print("Finished converting the .sph files to .wav files")
else:
    print("AN4 dataset directory already exists. Proceed to the next step.")

Next, let’s build the manifest files for the AN4 dataset. A manifest is a .json file with one JSON object per line; each object maps a .wav clip to its duration and its corresponding text.

Each entry in the AN4 dataset’s manifest .json file follows the template:
{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "text": "<text from the .wav file>"}
Example: {"audio_filepath": "/tutorials/am_finetuning/an4/wav/an4_clstk/fash/an251-fash-b.wav", "duration": 1.0, "text": "yes"}
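
Because a manifest holds one JSON object per line, it cannot be parsed with a single `json.load` call. The sketch below shows one way to read such a file back and sanity-check it; the function names `read_manifest` and `total_hours` are hypothetical helpers, not part of NeMo or Riva.

```python
import json

REQUIRED_KEYS = {"audio_filepath", "duration", "text"}


def read_manifest(manifest_path):
    """Parse a line-delimited JSON manifest into a list of dicts, skipping blank lines."""
    entries = []
    with open(manifest_path, "r", encoding="utf-8") as fin:
        for line_no, line in enumerate(fin, start=1):
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            missing = REQUIRED_KEYS - entry.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            entries.append(entry)
    return entries


def total_hours(entries):
    """Sum the `duration` fields (in seconds) and convert the total to hours."""
    return sum(entry["duration"] for entry in entries) / 3600.0
```

Calling `total_hours(read_manifest(path))` on a manifest gives a quick check that every entry was written and that the durations look plausible.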

# Import the necessary libraries.
import json
import subprocess

# Method to build a manifest.
def build_manifest(transcripts_path, manifest_path, wav_path):
    with open(transcripts_path, 'r') as fin:
        with open(manifest_path, 'w') as fout:
            for line in fin:
                # Lines look like this:
                # <s> transcript </s> (fileID)
                transcript = line[: line.find('(')-1].lower()
                transcript = transcript.replace('<s>', '').replace('</s>', '')
                transcript = transcript.strip()

                file_id = line[line.find('(')+1 : -2]  # e.g. "cen4-fash-b"
                audio_path = os.path.join(
                    data_dir, wav_path,
                    file_id[file_id.find('-')+1 : file_id.rfind('-')],
                    file_id + '.wav')

                duration = float(subprocess.check_output(
                      "soxi -D {0}".format(audio_path), shell=True))
                #duration = WAVE(filename=audio_path).info.length

                # Write the metadata to the manifest
                metadata = {
                    "audio_filepath": audio_path,
                    "duration": duration,
                    "text": transcript
                }
                fout.write(json.dumps(metadata) + '\n')

# Building the manifest files.
print("***Building manifest files***")

# Building manifest files for the training data
train_transcripts = data_dir + '/an4/etc/an4_train.transcription'
train_manifest = data_dir + '/an4/train_manifest.json'
if not os.path.isfile(train_manifest):
    build_manifest(train_transcripts, train_manifest, 'an4/wav/an4_clstk')
    print("Training manifest created at", train_manifest)
else:
    print("Training manifest already exists at", train_manifest)

# Building manifest files for the test data
test_transcripts = data_dir + '/an4/etc/an4_test.transcription'
test_manifest = data_dir + '/an4/test_manifest.json'
if not os.path.isfile(test_manifest):
    build_manifest(test_transcripts, test_manifest, 'an4/wav/an4test_clstk')
    print("Test manifest created at", test_manifest)
else:
    print("Test manifest already exists at", test_manifest)


Download and Process the Background Noise Dataset

For background noise, we will use the isotropic noise samples from the Room Impulse Response and Noise Database hosted on OpenSLR. For each 30-second isotropic noise sample in the dataset, we use the first 15 seconds for training and the last 15 seconds for evaluation.

First, let’s download the dataset.

# Download the background noise dataset if it doesn't already exist in `data_dir`. 
# This will take a few moments...
# We also set `noise_path` which points to the downloaded background noise dataset.

if not os.path.exists(data_dir + '/'):
    slr28_url = ''
    noise_path = wget.download(slr28_url, data_dir)
    print("Background noise dataset download complete.")
else:
    print("Background noise dataset already exists. Proceed to the next step.")
    noise_path = data_dir + '/'

Now, we are going to unzip the .zip file, which gives us the dataset audio files as 8-channel .wav files, sampled at 16kHz. The format and sample rate suit our purposes, but we need to convert these files to mono-channel to match the files in the AN4 dataset. Fortunately, the SoX library provides tools for that as well.

Note: The conversion will take several minutes.
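
To make clear what keeping only the first channel means, here is a pure-Python sketch of the same operation using the standard `wave` module. It is an illustration, not a replacement for SoX: the function name `extract_first_channel` is hypothetical, and the sketch assumes plain PCM .wav input that `wave` can open, which may not hold for every multi-channel file.

```python
import wave


def extract_first_channel(src_path, dst_path):
    """Write channel 1 of a multi-channel PCM .wav file to a mono .wav file."""
    with wave.open(src_path, "rb") as src:
        n_channels = src.getnchannels()
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    frame_size = n_channels * sample_width
    # Each frame stores one sample per channel, interleaved; keep the first sample.
    mono = b"".join(
        frames[i:i + sample_width] for i in range(0, len(frames), frame_size)
    )

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(sample_width)
        dst.setframerate(frame_rate)
        dst.writeframes(mono)
```

The sample rate and sample width are copied unchanged, so the output keeps the 16 kHz rate of the source files.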

# Extract noise data
from zipfile import ZipFile
if not os.path.exists(data_dir + '/RIRS_NOISES'):
    try:
        with ZipFile(noise_path, "r") as zipObj:
            zipObj.extractall(data_dir)
            print("Extracting noise data complete")
        # Convert 8-channel audio files to mono-channel
        wav_list = glob.glob(data_dir + '/RIRS_NOISES/**/*.wav', recursive=True)
        for wav_path in wav_list:
            mono_wav_path = wav_path[:-4] + '_mono.wav'
            # "remix 1" keeps only the first input channel
            cmd = f"sox {wav_path} {mono_wav_path} remix 1"
            subprocess.run(cmd, shell=True)
        print("Finished converting the 8-channel noise data .wav files to mono-channel")
    except Exception:
        print("Not extracting. Extracted noise data might already exist.")
else:
    print("Extracted noise data already exists. Proceed to the next step.")

Next, let’s build the manifest files for the noise data. The manifest file is a .json file that maps the .wav clip to its corresponding text.

Each entry in the noise data’s manifest .json file follows the template:
{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "offset": <offset value>, "text": "-"}
Example: {"audio_filepath": "/tutorials/am_finetuning/RIRS_NOISES/real_rirs_isotropic_noises/RVB2014_type1_noise_largeroom1_1_mono.wav", "duration": 30.0, "offset": 0, "text": "-"}

import json
iso_path = os.path.join(data_dir,"RIRS_NOISES/real_rirs_isotropic_noises")
iso_noise_list = os.path.join(iso_path, "noise_list")

# Edit the noise_list file so that it lists the *_mono.wav files instead of the original *.wav files
with open(iso_noise_list) as f:
    if '_mono.wav' in f.read():
        print(f"{iso_noise_list} has already been processed")
    else:
        cmd = f"sed -i 's|.wav|_mono.wav|g' {iso_noise_list}"
        subprocess.run(cmd, shell=True)
        print(f"Finished processing {iso_noise_list}")

# Create the manifest files from noise files
def process_row(row, offset, duration):
    # Note: the `duration` argument is overwritten by the file length that soxi reports.
    newfile = row['wav_filename']
    try:
        entry = {}
        duration = subprocess.check_output('soxi -D {0}'.format(newfile), shell=True)
        entry['audio_filepath'] = newfile
        entry['duration'] = float(duration)
        entry['offset'] = offset
        entry['text'] = row['transcript']
        return entry
    except Exception:
        print(f"Error processing {newfile} file!!!")

train_rows = []
test_rows = []

with open(iso_noise_list,"r") as in_f:
    for line in in_f:
        row = {}
        data = line.rstrip().split()
        row['wav_filename'] = os.path.join(data_dir,data[-1])
        row['transcript'] = "-"
        train_rows.append(process_row(row, 0 , 15))
        test_rows.append(process_row(row, 15 , 15))

# Writing manifest files
def write_manifest(manifest_file, manifest_lines):
    with open(manifest_file, 'w') as fout:
        for m in manifest_lines:
            fout.write(json.dumps(m) + '\n')
        print("Writing manifest file to", manifest_file, "complete")

# Writing training and test manifest files
test_noise_manifest  = os.path.join(data_dir, "test_noise_manifest.json")
train_noise_manifest = os.path.join(data_dir, "train_noise_manifest.json")
if not os.path.exists(test_noise_manifest):
    write_manifest(test_noise_manifest, test_rows)
else:
    print('Test noise manifest file already exists. Proceed to the next step.')
if not os.path.exists(train_noise_manifest):
    write_manifest(train_noise_manifest, train_rows)
else:
    print('Train noise manifest file already exists. Proceed to the next step.')

Create the Noise-Augmented Dataset

Finally, let’s create a noise-augmented dataset by adding noise to the AN4 dataset with the NeMo script. This script generates the noise-augmented audio clips as well as the manifest files.

Each entry in the noise-augmented data’s manifest file follows the template:
{"audio_filepath": "<.wav file location>", "duration": <duration of the .wav file>, "text": "<text from the .wav file>"}
Example: {"audio_filepath": "/tutorials/am_finetuning/noise_data/train_manifest/train_noise_0db/an251-fash-b.wav", "duration": 1.0, "text": "yes"}


Install the NeMo Python module and clone the NeMo GitHub repo locally. In the rest of this tutorial, we’ll use scripts from the NeMo repo which need the NeMo Python module in order to run.

## Install NeMo
BRANCH = 'main'
!python -m pip install git+$BRANCH#egg=nemo_toolkit[all]

# Clone NeMo locally
nemo_dir = os.path.join(os.getcwd(), 'NeMo')
!git clone $nemo_dir

Training Dataset

Let’s create a noise-augmented training dataset from the AN4 training dataset. Using a NeMo script, we’ll add noise at four signal-to-noise ratios (SNRs): 0, 5, 10, and 15 dB. Note that a 0 dB SNR means that the noise and the signal in the given audio file have equal power.
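
The relationship between an SNR value and the noise level can be illustrated with a few lines of plain Python. This is a sketch of the general technique, not the NeMo script's implementation: the function names `mean_power` and `mix_at_snr` are hypothetical, samples are assumed to be plain lists of floats, and the noise is assumed to be non-silent.

```python
import math


def mean_power(samples):
    """Average power (mean of squared amplitudes) of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Return signal plus scaled noise so that the mix has the requested SNR in dB."""
    if len(noise) < len(signal):
        raise ValueError("noise must be at least as long as the signal")
    noise = noise[:len(signal)]
    # SNR_dB = 10 * log10(P_signal / P_noise), so solve for the target noise power.
    target_noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(signal, noise)]
```

At 0 dB the scaled noise has the same power as the signal; each additional 10 dB of SNR reduces the noise power by a factor of 10.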

final_data_dir = os.path.join(data_dir, 'noise_data')

train_manifest = os.path.join(data_dir, 'an4/train_manifest.json')
test_manifest  = os.path.join(data_dir, 'an4/test_manifest.json')

train_noise_manifest = os.path.join(data_dir, 'train_noise_manifest.json')
test_noise_manifest  = os.path.join(data_dir, 'test_noise_manifest.json')

!python $nemo_dir/scripts/dataset_processing/ \
    --input_manifest=$train_manifest \
    --noise_manifest=$train_noise_manifest \
    --snrs 0 5 10 15 \
    --out_dir=$final_data_dir

The script above generates one .json manifest file per SNR value, that is, one each for 0, 5, 10, and 15 dB SNR.

First, let’s give these manifest files less cumbersome names.

noisy_train_manifest_files = os.listdir(os.path.join(final_data_dir, 'manifests'))
for filename in noisy_train_manifest_files:
    new_filename = filename.replace('train_manifest_train_noise_manifest', 'noisy_train_manifest')
    new_filepath = os.path.join(final_data_dir, 'manifests', new_filename)
    filepath = os.path.join(final_data_dir, 'manifests', filename)
    os.rename(filepath, new_filepath)

Now, let’s combine all the manifests for noise-augmented training data into a single manifest.

!cat $final_data_dir/manifests/noisy* > $final_data_dir/manifests/noisy_train_manifest.json

print("Combined manifest for noise-augmented training dataset created at", final_data_dir + "/manifests/noisy_train_manifest.json")
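
The same concatenation can be done in Python, which also makes it easy to keep the combined file out of its own input when the cell is run more than once. This is a minimal sketch; the function name `combine_manifests` and its parameters are hypothetical.

```python
import glob
import os


def combine_manifests(manifest_dir, pattern, output_name):
    """Concatenate the manifests matching `pattern` into `output_name`; return the line count."""
    output_path = os.path.join(manifest_dir, output_name)
    sources = sorted(
        path for path in glob.glob(os.path.join(manifest_dir, pattern))
        if os.path.abspath(path) != os.path.abspath(output_path)
    )
    # Read every source before writing, so the output is never read while open for writing.
    lines = []
    for path in sources:
        with open(path, "r", encoding="utf-8") as fin:
            lines.extend(line.rstrip("\n") for line in fin if line.strip())
    with open(output_path, "w", encoding="utf-8") as fout:
        for line in lines:
            fout.write(line + "\n")
    return len(lines)
```

For example, `combine_manifests(os.path.join(final_data_dir, 'manifests'), 'noisy*', 'noisy_train_manifest.json')` would mirror the `cat` command above while skipping any earlier copy of the combined file.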

Test Dataset

Let’s create a noise-augmented evaluation dataset using the AN4 test dataset, by adding noise at 5 dB, using the same NeMo script with which we augmented the training dataset.

!python $nemo_dir/scripts/dataset_processing/ \
    --input_manifest=$test_manifest \
    --noise_manifest=$test_noise_manifest \
    --snrs=5 \
    --out_dir=$final_data_dir

print("Noise-augmented testing dataset created at", final_data_dir+"/test_manifest")

Again, let’s give the manifest file for the noise-augmented test data a less cumbersome name.

noisy_test_manifest_files = glob.glob(os.path.join(final_data_dir, 'manifests/test*'))
for filepath in noisy_test_manifest_files:
    # glob returns full paths, so rename only the file name component
    new_filename = os.path.basename(filepath).replace('test_manifest_test_noise_manifest', 'noisy_test_manifest')
    new_filepath = os.path.join(final_data_dir, 'manifests', new_filename)
    os.rename(filepath, new_filepath)
print("Manifest for noise-augmented test dataset created at", final_data_dir + "/manifests/noisy_test_manifest_5db.json")

The noise-augmented training manifest and data are created at {working_dir}/noise_data/manifests/noisy_train_manifest.json and {working_dir}/noise_data/train_manifest, respectively.
The noise-augmented test manifest and data are created at {working_dir}/noise_data/manifests/noisy_test_manifest_5db.json and {working_dir}/noise_data/test_manifest, respectively.

Fine-Tuning the ASR Model

To fine-tune the ASR model with the augmented datasets that we just created, you can proceed to this tutorial. In this case, make sure to reset the manifest and dataset file paths appropriately when calling the NeMo tokenization, training, and evaluation scripts.