VoxPopuli Dataset#

VoxPopuli is a large-scale multilingual speech corpus for representation learning, semi-supervised learning, and interpretation. VoxPopuli provides:

  • 400K hours of unlabeled speech data for 23 languages

  • 1.8K hours of transcribed speech data for 16 languages

  • 17.3K hours of speech-to-speech interpretation data for 15x15 directions

  • 29 hours of transcribed speech data of non-native English intended for research in ASR for accented speech (15 L2 accents)

The raw data was collected from European Parliament event recordings between 2009 and 2020.

In this tutorial, we will use only the German portion of the dataset.

Download#

First, we install the necessary packages and download the dataset.

%%bash 
git clone https://github.com/facebookresearch/voxpopuli.git
cd voxpopuli
pip3 install -r requirements.txt

Next, we prepare a folder to store the raw data.

!mkdir -p ./data/raw/voxpopuli
!cd voxpopuli && python3 -m voxpopuli.download_audios --root ../data/raw/voxpopuli --subset asr
!cd voxpopuli && python3 -m voxpopuli.get_asr_data --root ../data/raw/voxpopuli --lang de
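The download can take a while and occupies considerable disk space. Before moving on, it can be useful to confirm that the audio actually landed where we expect. The helper below is a generic sketch; the path and the `.ogg`/`.wav` extension list are assumptions based on the commands above, so adjust them to wherever the VoxPopuli scripts placed the data on your machine.

```python
import os

def summarize_audio_dir(root, exts=(".ogg", ".wav")):
    """Walk `root` and return (number of audio files, total size in GB)."""
    count, total_bytes = 0, 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(exts):
                count += 1
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return count, total_bytes / 1e9

# Example (the path is an assumption based on the commands above):
# n_files, size_gb = summarize_audio_dir("./data/raw/voxpopuli/transcribed_data/de")
# print(f"{n_files} files, {size_gb:.1f} GB")
```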

Preprocessing#

Next, we standardize the audio data and convert the raw format to a NeMo manifest format.

Audio data: Audio data acquired from various sources is inherently heterogeneous (file format, sample rate, bit depth, number of audio channels, and so on). Therefore, as a preprocessing step, we build a separate data ingestion pipeline for each source and convert the audio data to a common format with the following characteristics:

  • WAV file format

  • Bit depth: 16 bits

  • Sample rate: 16 kHz

  • Single audio channel (mono)
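The processing script below handles this conversion for VoxPopuli. As an illustration of how the target format maps to concrete tool flags, here is a hedged sketch using ffmpeg (assumed to be installed; the helper names are hypothetical, not part of the VoxPopuli or NeMo APIs):

```python
import subprocess

def standard_wav_cmd(src_path, dst_path):
    """Build an ffmpeg command converting any input to the common format."""
    return [
        "ffmpeg", "-y",       # overwrite the output file if it exists
        "-i", src_path,       # input file (any format ffmpeg can decode)
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to a single (mono) channel
        "-c:a", "pcm_s16le",  # 16-bit signed little-endian PCM (WAV)
        dst_path,
    ]

def to_standard_wav(src_path, dst_path):
    """Run the conversion, raising if ffmpeg reports an error."""
    subprocess.run(standard_wav_cmd(src_path, dst_path), check=True)
```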

!mkdir -p ./data/processed/voxpopuli
!python3 ./data_ingestion/process_voxpopuli.py --data_root=./data/raw/voxpopuli/transcribed_data --out_dir=./data/processed/voxpopuli
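The script writes a NeMo manifest: a text file with one JSON object per line, each containing the audio file path, its duration in seconds, and the transcript. A minimal sketch of producing such a manifest for already-converted WAV files (the helper names are illustrative, not the script's actual internals):

```python
import json
import wave

def wav_duration(path):
    """Duration in seconds of a PCM WAV file, via the stdlib wave module."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def write_manifest(entries, manifest_path):
    """Write a NeMo-style manifest from (audio_path, transcript) pairs.

    Each line is a JSON object with the keys audio_filepath, duration,
    and text, which is the layout NeMo's ASR data layer expects.
    """
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_path, text in entries:
            record = {
                "audio_filepath": audio_path,
                "duration": wav_duration(audio_path),
                "text": text,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```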
# Optionally: to remove the raw dataset and free disk space, uncomment the bash command below.

#! rm -rf ./data/raw/voxpopuli