Multilingual LibriSpeech (MLS) Dataset

The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers eight languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.

In this tutorial, we will only use the German portion of the dataset, which is ~115 GB.

Download

First, we install the necessary packages and download the dataset.

!pip3 install unidecode
!mkdir -p ./data/raw/mls
!wget https://dl.fbaipublicfiles.com/mls/mls_german.tar.gz -O ./data/raw/mls/mls_german.tar.gz
!tar -zxvf ./data/raw/mls/mls_german.tar.gz -C ./data/raw/mls/

Preprocessing

Next, we standardize the audio data and convert the raw dataset to the NeMo manifest format.

Audio data: Audio data acquired from various sources are inherently heterogeneous (file format, sample rate, bit depth, number of audio channels, and so on). Therefore, as a preprocessing step, we build a separate data ingestion pipeline for each source and convert the audio data to a common format with the following characteristics:

  • WAV format

  • Bit depth: 16 bits

  • Sample rate: 16 kHz

  • Single audio channel
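As a quick sanity check of the characteristics listed above, the following minimal sketch uses Python's standard `wave` module to test whether a WAV file is 16-bit, 16 kHz, and single-channel. The function name `matches_target_format` and the constants are hypothetical and are not part of `process_mls.py`.

```python
import wave

# Target format taken from the list above: 16-bit PCM, 16 kHz, mono.
TARGET_SAMPLE_RATE = 16000
TARGET_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample
TARGET_CHANNELS = 1


def matches_target_format(wav_path):
    """Return True if the WAV file is 16-bit, 16 kHz, single-channel."""
    with wave.open(wav_path, "rb") as wav_file:
        return (
            wav_file.getframerate() == TARGET_SAMPLE_RATE
            and wav_file.getsampwidth() == TARGET_SAMPLE_WIDTH_BYTES
            and wav_file.getnchannels() == TARGET_CHANNELS
        )
```

Once the conversion has been run, a check like this could be applied to a handful of files under `./data/processed/mls` to confirm that the output is uniform.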

!mkdir -p ./data/processed/mls
!python3 ./data_ingestion/process_mls.py --dataset_root=./data/raw/mls/mls_german --out_dir=./data/processed/mls
# Optional: to remove the raw dataset and free disk space, uncomment the command below.

#! rm -rf ./data/raw/mls
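For reference, a NeMo manifest is a JSON Lines file in which each line describes one utterance. The sketch below shows how such a file could be written for already converted WAV files. It assumes the standard NeMo ASR fields `audio_filepath`, `duration`, and `text`; the helper names `wav_duration_seconds` and `write_manifest` are hypothetical, and the exact output of `process_mls.py` may differ.

```python
import json
import wave


def wav_duration_seconds(wav_path):
    """Compute the duration of a WAV file from its frame count and sample rate."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def write_manifest(entries, manifest_path):
    """Write one JSON object per line for each (wav_path, transcript) pair.

    The field names follow the usual NeMo ASR manifest layout (an assumption
    about what the ingestion script emits).
    """
    with open(manifest_path, "w", encoding="utf-8") as manifest_file:
        for wav_path, transcript in entries:
            record = {
                "audio_filepath": wav_path,
                "duration": wav_duration_seconds(wav_path),
                "text": transcript,
            }
            # ensure_ascii=False keeps German characters such as umlauts readable.
            manifest_file.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Writing one record per line lets large manifests be streamed or split without loading the whole file into memory.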