VoxPopuli Dataset#
VoxPopuli is a large-scale multilingual speech corpus for representation learning, semi-supervised learning, and interpretation. VoxPopuli provides:
400K hours of unlabeled speech data for 23 languages
1.8K hours of transcribed speech data for 16 languages
17.3K hours of speech-to-speech interpretation data for 15x15 directions
29 hours of transcribed speech data of non-native English intended for research in ASR for accented speech (15 L2 accents)
The raw data is collected from 2009-2020 European Parliament event recordings.
In this tutorial, we will only use the German portion of the dataset.
Download#
First, we clone the VoxPopuli repository and install its dependencies.
%%bash
git clone https://github.com/facebookresearch/voxpopuli.git
cd voxpopuli
pip3 install -r requirements.txt
Next, we prepare a folder to store the raw data, download the audio files, and extract the German ASR subset.
!mkdir -p ./data/raw/voxpopuli
!cd voxpopuli && python3 -m voxpopuli.download_audios --root ../data/raw/voxpopuli --subset asr
!cd voxpopuli && python3 -m voxpopuli.get_asr_data --root ../data/raw/voxpopuli --lang de
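Before moving on, it is worth sanity-checking that the download produced audio files. A minimal sketch (the helper name is ours, and the exact directory layout under `transcribed_data` is an assumption about the VoxPopuli download scripts):

```python
from pathlib import Path

def count_audio_files(root, extensions=(".wav", ".ogg")):
    """Recursively count audio files under `root` with the given extensions."""
    root = Path(root)
    return sum(1 for p in root.rglob("*") if p.suffix.lower() in extensions)

# Assumed location of the German data after the steps above:
# print(count_audio_files("./data/raw/voxpopuli/transcribed_data/de"))
```

A count of zero would indicate that the download or extraction step failed and should be rerun.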
Preprocessing#
Next, we standardize the audio data and convert the raw format to a NeMo manifest format.
Audio data: Audio data acquired from various sources are inherently heterogeneous (file format, sample rate, bit depth, number of audio channels, and so on). Therefore, as a preprocessing step, we build a separate data ingestion pipeline for each source and convert the audio data to a common format with the following characteristics:
WAV format
Bit depth: 16 bits
Sample rate: 16 kHz
Single audio channel
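The conversion itself is typically delegated to a tool such as ffmpeg. A minimal sketch of the target format as an ffmpeg invocation (the helper name is ours; the actual conversion in this tutorial is performed by `process_voxpopuli.py`):

```python
def ffmpeg_standardize_cmd(src, dst):
    """Build an ffmpeg command converting `src` to 16-bit, 16 kHz, mono WAV."""
    return [
        "ffmpeg", "-y",              # overwrite output if it exists
        "-i", str(src),
        "-acodec", "pcm_s16le",      # 16-bit signed PCM (bit depth)
        "-ar", "16000",              # 16 kHz sample rate
        "-ac", "1",                  # single audio channel
        str(dst),
    ]

# Example usage:
# import subprocess
# subprocess.run(ffmpeg_standardize_cmd("clip.ogg", "clip.wav"), check=True)
```

Running every source file through one such command guarantees a homogeneous corpus regardless of the original container, sample rate, or channel count.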
!mkdir -p ./data/processed/voxpopuli
!python3 ./data_ingestion/process_voxpopuli.py --data_root=./data/raw/voxpopuli/transcribed_data --out_dir=./data/processed/voxpopuli
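The script above also emits the NeMo manifest: a JSON-lines file in which each line describes one utterance via `audio_filepath`, `duration`, and `text` keys. A minimal sketch of writing such a manifest (the helper name and the list-of-tuples input are ours):

```python
import json

def write_nemo_manifest(entries, manifest_path):
    """Write a NeMo-style manifest: one JSON object per line.

    `entries` is an iterable of (audio_filepath, duration_seconds, transcript).
    """
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_filepath, duration, text in entries:
            record = {
                "audio_filepath": audio_filepath,
                "duration": duration,
                "text": text,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping one JSON object per line lets downstream tools stream large manifests without loading the whole file into memory.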
# Optionally: to remove the raw dataset and free disk space, uncomment the bash command below.
#! rm -rf ./data/raw/voxpopuli