Evaluate a TTS Pipeline#

In this tutorial, we will use automatic speech recognition (ASR) to generate transcripts from TTS-synthesized audio and compare the generated transcripts against the ground truth using character error rate (CER) and word error rate (WER).

These metrics help uncover inconsistencies in an audio-transcript pair by comparing the ASR-generated transcript with the ground truth transcript.

The tutorial will cover:

  • Downloading a 5-minute subset of Hi-Fi TTS audio-transcript pairs.

  • Generating transcripts for the audio using a pretrained NeMo ASR model.

  • Calculating character error rate and word error rate between the ground truth transcripts and the ASR-generated transcripts.

Download data#

For this tutorial, we will use a small part of the Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS) dataset. You can read more about the dataset here. We will use speaker 6097 as the target speaker, and only a 5-minute subset of audio will be used for this evaluation example.

!wget https://nemo-public.s3.us-east-2.amazonaws.com/6097_5_mins.tar.gz  # Contains 10MB of data
!tar -xzf 6097_5_mins.tar.gz
manifest_file = "6097_5_mins/manifest.json"
asr_pred = "6097_5_mins/asr_pred.json"


## Fix audiopaths in manifest.json
!sed -i 's,audio/,6097_5_mins/audio/,g' {manifest_file}

Looking at manifest.json, we see a standard NeMo JSON manifest in which each line contains the audio filepath, text, and duration. Please make sure that manifest.json contains relative audio paths.

The manifest file should look like this:

{"audio_filepath": "6097_5_mins/audio/presentpictureofnsw_02_mann_0532.wav", "text": "not to stop more than ten minutes by the way", "duration": 2.6, "text_no_preprocessing": "not to stop more than ten minutes by the way,", "text_normalized": "not to stop more than ten minutes by the way,"}
## Print the first line of manifest file.
!head -n 1 {manifest_file}
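Since each line of the manifest is a standalone JSON object, it can also be inspected directly with Python's json module. A minimal sketch, using the sample line shown above:

```python
import json

# One line from the manifest, copied from the sample above.
line = ('{"audio_filepath": "6097_5_mins/audio/presentpictureofnsw_02_mann_0532.wav", '
        '"text": "not to stop more than ten minutes by the way", "duration": 2.6}')

entry = json.loads(line)
print(entry["audio_filepath"])  # path to the wav file
print(entry["text"])            # ground truth transcript
print(entry["duration"])        # duration in seconds
```

Each manifest entry parsed this way is a plain dict, so fields such as text and duration are directly accessible.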

Transcribe audio with ASR#

We will need the NeMo toolkit and transcribe_speech.py to generate transcripts for our audio samples.

Let's install the NeMo toolkit.

## Install the NeMo toolkit.
!pip install nemo_toolkit['all']
!pip install --upgrade protobuf==3.20.0

Now, download transcribe_speech.py:

!wget https://raw.githubusercontent.com/NVIDIA/NeMo/stable/examples/asr/transcribe_speech.py

Transcribe the audio samples using NeMo and transcribe_speech.py. The transcripts will later be used to calculate character error rate and word error rate.

The model used is an English pretrained conformer CTC ASR model.

# Generate transcriptions
!python transcribe_speech.py \
    pretrained_name=stt_en_conformer_ctc_large \
    dataset_manifest={manifest_file} \
    output_filename={asr_pred} \
    batch_size=32 ++compute_langs=False cuda=0 amp=True

Let's take a look at the asr_pred file and make sure each line has a text field and a pred_text field. The asr_pred file should look like this:

{"audio_filepath": "6097_5_mins/audio/presentpictureofnsw_02_mann_0532.wav", "text": "not to stop more than ten minutes by the way", "duration": 2.6, "text_no_preprocessing": "not to stop more than ten minutes by the way,", "text_normalized": "not to stop more than ten minutes by the way,", "pred_text": "not to stop more than ten minutes by the way"}
!head -2 {asr_pred}

Calculate character error rate (CER).#

Edit distance, or Levenshtein distance, is a metric that measures the similarity of two strings. It counts the minimum number of character insertions, deletions, and substitutions needed to transform the ground truth string into the evaluated string.

We use Levenshtein distance to measure the edit distance and character error rate between the generated transcript and the ground truth transcript.

Character error rate is the edit distance per character of the ground truth. It can also be interpreted as a normalized edit distance.

\(CER = \frac{edit\ distance}{number\ of\ characters\ in\ ground\ truth}\)
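As a worked example, CER is the character-level edit distance divided by the length of the ground truth string. A minimal Levenshtein implementation (just a sketch; the tutorial itself uses the editdistance package below):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

distance = levenshtein("kitten", "sitting")
cer = distance / len("kitten")
print(distance, cer)  # 3 edits over 6 reference characters -> CER 0.5
```

The three edits here are two substitutions (k→s, e→i) and one insertion (g), so CER = 3/6 = 0.5.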

## Install the edit distance package
!pip install editdistance
## Install ndjson to read the asr_pred file
!pip install ndjson
import editdistance
import ndjson
import string

Set thresholds for edit distance and error rate. Any utterance that exceeds these thresholds requires investigation. These values can be fine-tuned.

distance_threshold = 5
cer_threshold = 0.5

Since the ASR transcripts do not contain any punctuation, remove punctuation from the original transcripts before calculating edit distance.

## Punctuation translation dictionary.
punct_dict = str.maketrans('', '', string.punctuation)

with open(asr_pred) as f:
    manifest = ndjson.load(f)
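As a quick sanity check, the translation table removes punctuation while leaving the words intact. A minimal illustration with a made-up sentence:

```python
import string

# Translation table that maps every punctuation character to None.
punct_dict = str.maketrans('', '', string.punctuation)

sample = "Not to stop, more than ten minutes by the way!"
cleaned = sample.lower().translate(punct_dict)
print(cleaned)  # not to stop more than ten minutes by the way
```

Lowercasing and stripping punctuation this way makes the ground truth comparable to the unpunctuated, lowercase ASR output.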

Calculate edit distance and print all utterances with:

  • cer > cer_threshold

  • distance > distance_threshold

for line in manifest:
    transcript = line["text"].lower().translate(punct_dict)
    pred_text = line["pred_text"]
    try:
        distance = editdistance.eval(transcript, pred_text)
        cer = distance / len(transcript)
    except Exception as e:
        print(f"Got error: {e} for line: {line}")
        distance = 0
        cer = 0
    if distance > distance_threshold or cer > cer_threshold:
        print(f"Low confidence for {line}")

Calculate word error rate (WER)#

Now that we have listed all the sentences with a high character error rate, we will list all the sentences with a high word error rate.

Word error rate, as the name suggests, measures errors at the word level instead of the character level. This metric accounts for the number of word substitutions, insertions, and deletions relative to the reference text.

The formula for calculation is:

\(WER = \frac{S+I+D}{N}\)

S = number of substitutions
I = number of insertions
D = number of deletions
N = total number of words in the reference text
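The formula above can also be computed as a word-level edit distance. A minimal sketch (the tutorial itself uses the jiwer package below; the example sentences are made up):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + I + D) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # word deletion
                            curr[j - 1] + 1,           # word insertion
                            prev[j - 1] + (r != h)))   # word substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("here" -> "the") and one insertion ("way")
# over a four-word reference: WER = 2/4.
print(word_error_rate("not to stop here", "not to stop the way"))  # 0.5
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions relative to a short reference, which is why the threshold below is a tunable value rather than a fixed percentage.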

We will use python package jiwer.

## Install python package to calculate word error rate.
!pip install jiwer
from jiwer import wer

Set threshold for word error rate. Any utterance with WER greater than this value requires investigation. This value can be finetuned.

wer_threshold = 0.8  # Can be fine-tuned.

Calculate word error rate and print all the utterances with a high word error rate:

for line in manifest:
    transcript = line["text"].lower().translate(punct_dict)
    pred_text = line["pred_text"]
    try:
        error_rate = wer(transcript, pred_text)
    except Exception as e:
        print(f"Got error: {e} for line: {line}")
        error_rate = 0
    if error_rate > wer_threshold:
        print(f"Low confidence for file: {line['audio_filepath']} --- Transcript: {transcript} --- Predicted text: {pred_text} --- Word error rate: {error_rate}")

Conclusion#

In this tutorial, we have learned how to calculate edit distance, character error rate, and word error rate. We also learned how to apply these metrics to evaluate the quality of audio-transcript pairs.

These metrics are useful as smoke tests and for selecting a candidate model. Ultimately, though, the only way to measure the quality of a TTS model is with subjective evaluation methods such as MOS and CMOS.