Text Normalization#

Text normalization converts text from written form into its verbalized form. It is used as a preprocessing step for Automatic Speech Recognition (ASR) training transcripts. For German text normalization, we primarily leverage the NeMo text normalization library.

In this tutorial, we will employ NeMo to normalize the Mozilla Common Voice (MCV), Multilingual LibriSpeech (MLS), and VoxPopuli datasets. The following code takes in a manifest file, normalizes each transcript, and writes the results to a new manifest file.

Note: This tutorial should be run within a NeMo Docker container with the following command:

docker run --gpus=all --net=host --rm -it -v $PWD:/myworkspace nvcr.io/nvidia/nemo:22.08 bash

Then, from within the NeMo container, the JupyterLab environment can be started.
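For example (the exact flags may vary with your setup; --allow-root is needed because the container runs as root by default):

jupyter lab --ip=0.0.0.0 --port=8888 --allow-root --no-browser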

Note: this process can take a long time. On VoxPopuli, every additional 10k samples adds roughly 1 hour on 80 CPU cores.
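As a quick sanity check before processing full manifests, the normalizer can be tried on a single sentence. This is a minimal sketch; the exact verbalization may differ between NeMo versions.

from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(input_case="cased", lang="de")
# Expect something along the lines of "Es kostet fünfundzwanzig Euro".
print(normalizer.normalize("Es kostet 25 €.", verbose=False))

The full pipeline below follows the same pattern, but reads transcripts from a manifest and parallelizes normalization across CPU cores.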

import json
import multiprocessing
import os
from functools import partial

from tqdm import tqdm

from nemo_text_processing.text_normalization.normalize import Normalizer

def load_jsonl(filepath):
    """Read a JSON-lines manifest, skipping comment and blank lines."""
    data = []
    with open(filepath, 'r', encoding='utf-8') as fp:
        for line in fp:
            if line.startswith("//") or line.strip() == '':
                continue
            data.append(json.loads(line))
    return data
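
# Each manifest line is one JSON object; a hypothetical example (only
# 'text_original' is read below, and 'text' is added):
# {"audio_filepath": "clips/sample_0001.wav", "duration": 4.2,
#  "text_original": "Es kostet 25 €."}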


def dump_jsonl(filepath, data):
    """Write a list of dicts to a JSON-lines manifest."""
    with open(filepath, 'w', encoding='utf-8') as fp:
        for datum in data:
            fp.write(json.dumps(datum, ensure_ascii=False))
            fp.write('\n')
            
def normalize_manifest(input_manifest, output_manifest, normalizer):
    """Normalize the 'text_original' field of each utterance and store
    the result in its 'text' field in a new manifest."""
    utterances = load_jsonl(input_manifest)
    transcripts = [utt['text_original'] for utt in utterances]

    # Normalize the transcripts in parallel across all available CPU cores.
    with multiprocessing.Pool(processes=os.cpu_count()) as pool:
        results = pool.imap(partial(normalizer.normalize, verbose=False), transcripts)
        for i, text in enumerate(tqdm(results, total=len(transcripts))):
            utterances[i]['text'] = text
    dump_jsonl(output_manifest, utterances)
# Build the German normalizer; cache the compiled grammars in /tmp so
# they only need to be built once.
normalizer = Normalizer(
    input_case="cased",
    cache_dir="/tmp",
    overwrite_cache=True,
    lang="de",
)
    
# Normalize every split of each dataset; extend the list to
# ['mls', 'voxpopuli', 'mcv'] to process all three datasets.
for dataset in ['mcv']:
    for subset in ['train', 'dev', 'test']:
        input_manifest = os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest.json")
        output_manifest = os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest_normalized.json")
        print("Processing", input_manifest)
        normalize_manifest(input_manifest, output_manifest, normalizer)
            
[NeMo I 2022-05-06 06:46:33 tokenize_and_classify:83] Creating ClassifyFst grammars. This might take some time...
Created /tmp/_cased_de_tn_True_deterministic.far
[NeMo I 2022-05-06 06:46:56 tokenize_and_classify:143] ClassifyFst grammars are saved to /tmp/_cased_de_tn_True_deterministic.far.
Processing  ./data/processed/mcv/mcv_train_manifest.json
9471it [53:20,  2.34it/s]
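
Once normalization completes, it is worth spot-checking a few utterances. A minimal sketch reusing load_jsonl from above (the path matches the MCV training manifest written by the loop):

for utt in load_jsonl('./data/processed/mcv/mcv_train_manifest_normalized.json')[:3]:
    print(utt['text_original'], '->', utt['text'])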