Mozilla Common Voice (MCV) Dataset
Mozilla Common Voice (MCV) is a large collection of datasets for speech research. Each entry in the dataset consists of a unique MP3 file and a corresponding text file. Many of the 20,217 recorded hours in the dataset also include demographic metadata, such as age, sex, and accent, that can help improve the accuracy of speech recognition engines.
We will only make use of the German portion of the dataset, which is ~28 GB.
Download#
First, we install the prerequisite packages and download the data.
!apt-get update && apt-get install -y wget sox libsox-fmt-mp3 parallel
The dataset can be downloaded from the Common Voice website through a web interface. After you register, you receive a download URL, which can be used with wget as follows:
!mkdir -p ./data/raw/mcv
!wget <DOWNLOAD_URL> -O ./data/raw/mcv/de.tar.gz
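Once the download finishes, it can help to confirm which corpus version the archive contains, because the preprocessing script needs that name later. The sketch below assumes the archive unpacks into a single top-level directory named after the version (for example cv-corpus-5.1-2020-06-22). The helper name corpus_version is hypothetical and is not part of any MCV or NeMo tooling.

```python
import tarfile
from pathlib import PurePosixPath


def corpus_version(archive_path):
    """Return the top-level directory name found in a .tar.gz archive.

    Assumption: the archive holds one top-level directory named after the
    corpus version. Members are read lazily and the function returns at the
    first usable name, so it does not scan the whole archive.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            # PurePosixPath drops "." components such as a leading "./".
            parts = PurePosixPath(member.name).parts
            if parts:
                return parts[0]
    raise ValueError(f"no members found in {archive_path}")
```

For the archive downloaded above, this would be called as corpus_version("./data/raw/mcv/de.tar.gz").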
Preprocessing#
Next, we standardize audio data and convert the raw format to NeMo manifest format.
Audio data: Audio data acquired from various sources are inherently heterogeneous (file format, sample rate, bit depth, number of audio channels, and so on). Therefore, as a preprocessing step, we build a separate data ingestion pipeline for each source and convert the audio data to a common format with the following characteristics:
Wav format
Bit depth: 16 bits
Sample rate: 16 kHz
Single audio channel
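The target format listed above can be sketched in code. The example below is an illustration only and is not the conversion logic of process_mcv.py: it builds a sox command line that converts one file to 16 kHz, mono, 16-bit WAV, and uses Python's standard wave module to check that a converted file matches. The helper names are hypothetical.

```python
import wave

# Target characteristics from the list above.
TARGET_RATE = 16000      # samples per second
TARGET_CHANNELS = 1      # mono
TARGET_BITS = 16         # bits per sample


def sox_command(src, dst):
    """Build a sox invocation that converts src to the target WAV format."""
    return [
        "sox", src,
        "-r", str(TARGET_RATE),
        "-c", str(TARGET_CHANNELS),
        "-b", str(TARGET_BITS),
        dst,
    ]


def is_target_format(wav_path):
    """Return True if the WAV file is 16 kHz, mono, 16-bit."""
    with wave.open(wav_path, "rb") as wav:
        return (
            wav.getframerate() == TARGET_RATE
            and wav.getnchannels() == TARGET_CHANNELS
            and wav.getsampwidth() * 8 == TARGET_BITS
        )
```

The command list can be passed to subprocess.run(..., check=True). Reading MP3 input requires sox with MP3 support, which the libsox-fmt-mp3 package installed earlier provides.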
import os
import sys
CUR_DIR = os.getcwd()
sys.path.insert(0, os.path.join(CUR_DIR, "data_ingestion"))
Notes:
You must pass the correct argument --version="cv-corpus-xxx" to process_mcv.py, depending on the version of your downloaded corpus.
The default value is cv-corpus-5.1-2020-06-22, which refers to the 2020 version of the dataset.
The .tsv file containing the metadata of the MCV dataset might use either accents or accent as the column header, so you might need to update the preprocessing script to look for "accents" instead of "accent", depending on the particular version.
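To find out which column name a particular corpus version uses before editing the script, the header row of the metadata file can be inspected directly. This is a minimal sketch using only the standard csv module; the helper name accent_column is hypothetical, and the path of the .tsv file depends on where your corpus was extracted.

```python
import csv


def accent_column(tsv_path):
    """Return "accents" or "accent", whichever appears in the TSV header.

    Returns None if neither column is present or the file is empty.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # Only the first row (the header) is read.
        header = next(csv.reader(handle, delimiter="\t"), [])
    for name in ("accents", "accent"):
        if name in header:
            return name
    return None
```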
!mkdir -p data/processed/mcv
OUT_DIR = os.path.join(CUR_DIR, "data/processed/mcv")
DATA_ROOT = os.path.join(CUR_DIR, "data/raw/mcv")
!python3 ./data_ingestion/process_mcv.py --data_root=$DATA_ROOT --data_temp=/tmp --data_out=$OUT_DIR --manifest_dir=$OUT_DIR --save_meta true
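NeMo ASR manifests are plain text files with one JSON object per line, typically carrying the keys audio_filepath, duration (in seconds), and text. As a quick sanity check on the output of the step above, the sketch below counts the entries in such a file and sums their duration. It assumes the manifests written by process_mcv.py follow this convention; the helper name summarize_manifest is hypothetical, and the manifest file names are not specified here, so you need to supply the path yourself.

```python
import json


def summarize_manifest(manifest_path):
    """Return (number of utterances, total audio hours) for a manifest file."""
    count = 0
    seconds = 0.0
    with open(manifest_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            count += 1
            seconds += float(entry["duration"])
    return count, seconds / 3600.0
```

A total far below what you expect for the German subset would suggest that clips were skipped during conversion.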
# Optionally: to remove the raw dataset and save disk space, uncomment the bash command below.
#! rm -rf data/raw/mcv