Multilingual LibriSpeech (MLS) Dataset
The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers eight languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
In this tutorial, we will only use the German portion of the dataset, which is ~115 GB.
Download
First, we install the necessary packages and download the dataset.
!pip3 install unidecode
!mkdir -p ./data/raw/mls
!wget https://dl.fbaipublicfiles.com/mls/mls_german.tar.gz -O ./data/raw/mls/mls_german.tar.gz
!tar -zxvf ./data/raw/mls/mls_german.tar.gz -C ./data/raw/mls/
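Because the German portion is large, it can help to check free disk space before starting the download. The following is a minimal sketch using only the Python standard library. The helper name `has_enough_space` is hypothetical, and the factor of two in the usage note is an assumption that leaves room for both the archive and its extracted contents.

```python
import shutil


def has_enough_space(path, required_gb):
    """Return True if the filesystem containing `path` has at least
    `required_gb` gigabytes (10**9 bytes) free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb
```

For example, `has_enough_space(".", 2 * 115)` checks the current working directory's filesystem against roughly twice the size quoted above.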
Preprocessing
Next, we standardize the audio data and convert the raw dataset into the NeMo manifest format.
Audio data: Audio data acquired from various sources are inherently heterogeneous (file format, sample rate, bit depth, number of audio channels, and so on). Therefore, as a preprocessing step, we build a separate data ingestion pipeline for each source and convert the audio data to a common format with the following characteristics:
File format: WAV
Bit depth: 16 bits
Sample rate: 16 kHz
Channels: 1 (mono)
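The target characteristics listed above can be checked for a converted file with Python's standard `wave` module. This is a minimal sketch, not part of the ingestion script: the function name `matches_target_format` and the constants are hypothetical, and it only inspects the WAV header rather than performing any conversion.

```python
import wave

TARGET_SAMPLE_RATE = 16000  # 16 kHz
TARGET_SAMPLE_WIDTH = 2     # bytes per sample, i.e. 16 bits
TARGET_CHANNELS = 1         # mono


def matches_target_format(wav_path):
    """Return True if the file is a 16-bit, 16 kHz, single-channel PCM WAV."""
    try:
        with wave.open(wav_path, "rb") as wav_file:
            return (
                wav_file.getframerate() == TARGET_SAMPLE_RATE
                and wav_file.getsampwidth() == TARGET_SAMPLE_WIDTH
                and wav_file.getnchannels() == TARGET_CHANNELS
            )
    except (wave.Error, EOFError):
        # Not a readable PCM WAV file (for example, a FLAC or empty file).
        return False
```

Running such a check over a sample of the processed files is a cheap way to confirm that every source ended up in the same format.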
!mkdir -p ./data/processed/mls
!python3 ./data_ingestion/process_mls.py --dataset_root=./data/raw/mls/mls_german --out_dir=./data/processed/mls
# Optional: to remove the raw dataset and free up disk space, uncomment the command below.
#! rm -rf ./data/raw/mls
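A NeMo manifest is a JSON Lines file in which each line describes one utterance with the keys `audio_filepath`, `duration` (in seconds), and `text`. The sketch below shows how such a file could be written. It is an illustration only and does not reproduce what `process_mls.py` does internally; the helper names and the example path are hypothetical.

```python
import json


def build_manifest_entry(audio_filepath, duration, text):
    """Build one manifest record for a single utterance."""
    return {
        "audio_filepath": audio_filepath,
        "duration": round(duration, 3),  # seconds
        "text": text,
    }


def write_manifest(entries, manifest_path):
    """Write one JSON object per line (JSON Lines)."""
    with open(manifest_path, "w", encoding="utf-8") as manifest_file:
        for entry in entries:
            # ensure_ascii=False keeps characters such as umlauts readable.
            manifest_file.write(json.dumps(entry, ensure_ascii=False) + "\n")


# Hypothetical usage with a made-up file name and transcript.
entries = [build_manifest_entry("example_utterance.wav", 12.3456, "guten morgen")]
write_manifest(entries, "example_manifest.json")
```

Writing one object per line lets the manifest be streamed and filtered line by line, which is convenient for a corpus of this size.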