Load FLEURS Dataset
The FLEURS dataset is a multilingual speech dataset covering 102 languages, built on top of the FLoRes machine translation benchmark. NeMo Curator provides automated tools to download, extract, and prepare FLEURS data for audio curation pipelines.
How It Works
The CreateInitialManifestFleursStage handles the complete FLEURS data preparation workflow:
- Download: Retrieves audio files and transcription files from Hugging Face
- Extract: Unpacks compressed audio archives
- Manifest Creation: Generates structured manifests with audio file paths and transcriptions
- Manifest References: Produces entries that point to extracted audio files; this stage does not decode or check audio content
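The manifest creation step above can be pictured with a short standard-library sketch. It is not the stage's actual implementation: the function name `build_manifest_entries`, the entry keys `audio_filepath` and `text`, and the assumption that the file name and transcription sit in the second and third tab-separated columns are illustrative only.

```python
import csv
import io
import os


def build_manifest_entries(tsv_text, audio_dir):
    """Turn FLEURS-style TSV rows into manifest entries.

    Like the stage described above, this only builds references to audio
    files; it never opens or decodes the audio itself.
    """
    entries = []
    # QUOTE_NONE keeps quotation marks inside transcriptions untouched.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        # Assumed column layout: index 1 = file name, index 2 = transcription.
        file_name, transcript = row[1], row[2]
        entries.append(
            {
                "audio_filepath": os.path.join(audio_dir, file_name),
                "text": transcript.strip(),
            }
        )
    return entries


# Tiny in-memory sample standing in for a downloaded transcription file.
sample_tsv = "1\tclip_a.wav\tfirst sentence\n2\tclip_b.wav\tsecond sentence\n"
entries = build_manifest_entries(sample_tsv, "/data/fleurs/dev")
for entry in entries:
    print(entry)
```

Because the entries only reference paths, a missing or corrupt audio file would go unnoticed at this point and would have to be caught by a later stage.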
Usage
Basic FLEURS Loading
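A minimal sketch of a loading pipeline might look like the following. The import paths, the `lang`, `split`, and `raw_data_dir` parameter names, and the `Pipeline(name=...)` and `add_stage` calls are assumptions based on the NeMo Curator API and should be checked against the installed version. The helper function names are hypothetical.

```python
def build_fleurs_pipeline(raw_data_dir, lang="hy_am", split="dev"):
    """Assemble a pipeline containing a single FLEURS loading stage."""
    # Imports live inside the function so that this module can be imported
    # even where NeMo Curator is not installed. Module paths are assumptions.
    from nemo_curator.pipeline import Pipeline
    from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import (
        CreateInitialManifestFleursStage,
    )

    pipeline = Pipeline(name="fleurs_loading")
    pipeline.add_stage(
        CreateInitialManifestFleursStage(
            lang=lang,                  # FLEURS language code, here Armenian
            split=split,                # data split to download
            raw_data_dir=raw_data_dir,  # where downloads and extracted audio go
        )
    )
    return pipeline


def run_with_explicit_executor(pipeline):
    """Run the pipeline with an explicitly constructed executor."""
    from nemo_curator.backends.xenna import XennaExecutor

    return pipeline.run(XennaExecutor())
```

Calling `build_fleurs_pipeline("/path/to/audio/data")` and passing the result to `run_with_explicit_executor` would trigger the download, extraction, and manifest creation described earlier.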
Note: The explicit executor is optional. If you call pipeline.run() without arguments, the pipeline uses XennaExecutor by default.
Language Options
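FLEURS supports 102 languages. Each is identified by a lowercase code that joins a language code and a region code with an underscore, typically ISO 639-1 plus ISO 3166-1 alpha-2 (for example, hy_am for Armenian). A few languages use longer codes, such as cmn_hans_cn for Mandarin Chinese: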
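As an illustration of the code format, the sketch below splits a few FLEURS language codes into their language and region parts. The helper `split_fleurs_code` is hypothetical and is not part of NeMo Curator. The sample codes are a small selection, not the full list of 102.

```python
def split_fleurs_code(code):
    """Split a code such as 'hy_am' into (language, region).

    Everything after the first underscore is treated as the region part,
    so a longer code like 'cmn_hans_cn' yields ('cmn', 'hans_cn').
    """
    language, sep, region = code.partition("_")
    if not sep or not language or not region:
        raise ValueError(f"expected '<language>_<region>', got {code!r}")
    return language, region


# A handful of FLEURS codes with human-readable labels.
sample_codes = {
    "hy_am": "Armenian",
    "en_us": "English (United States)",
    "fr_fr": "French (France)",
    "cmn_hans_cn": "Mandarin Chinese (Simplified, China)",
}

for code, label in sample_codes.items():
    print(code, split_fleurs_code(code), label)
```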
Data Splits
FLEURS provides three data splits: train, dev, and test. Pass one of them as the split when configuring the stage.
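To show how a language code and a split together select the files to download, here is a hedged sketch that validates the split name and builds two URLs. The URL layout under the `google/fleurs` repository on Hugging Face is an assumption, and `fleurs_file_urls` is a hypothetical helper rather than a NeMo Curator function.

```python
FLEURS_SPLITS = ("train", "dev", "test")

# Assumed base location of the raw FLEURS files on Hugging Face.
BASE_URL = "https://huggingface.co/datasets/google/fleurs/resolve/main/data"


def fleurs_file_urls(lang, split):
    """Return the assumed transcription and audio archive URLs for one split."""
    if split not in FLEURS_SPLITS:
        raise ValueError(f"split must be one of {FLEURS_SPLITS}, got {split!r}")
    return {
        "transcripts": f"{BASE_URL}/{lang}/{split}.tsv",
        "audio_archive": f"{BASE_URL}/{lang}/audio/{split}.tar.gz",
    }


urls = fleurs_file_urls("hy_am", "dev")
for name, url in urls.items():
    print(name, url)
```

Rejecting an unknown split name before any network call keeps a typo such as `"validation"` from surfacing later as a confusing download error.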
Output Format
The FLEURS loading stage generates AudioBatch objects with the following structure:
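The shape of such a batch can be sketched with a small stand-in class. `AudioBatchSketch` is hypothetical and is not the NeMo Curator `AudioBatch` class. The `data` and `filepath_key` fields and the `audio_filepath` and `text` keys are assumptions chosen to mirror the manifest contents described earlier (an audio file path plus its transcription).

```python
from dataclasses import dataclass, field


@dataclass
class AudioBatchSketch:
    """Illustrative stand-in for a batch of manifest entries."""

    data: list = field(default_factory=list)  # one dict per audio file
    filepath_key: str = "audio_filepath"      # key holding the audio path
    text_key: str = "text"                    # key holding the transcription

    def is_well_formed(self):
        """Check that every entry carries both a file path and a transcription."""
        return all(
            self.filepath_key in entry and self.text_key in entry
            for entry in self.data
        )


batch = AudioBatchSketch(
    data=[
        {"audio_filepath": "/path/to/audio/clip_a.wav", "text": "first sentence"},
        {"audio_filepath": "/path/to/audio/clip_b.wav", "text": "second sentence"},
    ]
)
print(len(batch.data), batch.is_well_formed())
```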
Configuration Options
Stage Parameters
Batch Processing
To persist results to disk, add a writer stage (for example, JsonlWriter). In a full pipeline, a writer stage creates the result/ directory shown below.
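For intuition about what a JSONL writer stage produces, the following standard-library sketch writes manifest entries one JSON object per line and reads them back. It does not use `JsonlWriter`. The helpers `write_jsonl` and `read_jsonl`, the `manifest.jsonl` file name, and the entry keys are illustrative assumptions.

```python
import json
import os
import tempfile


def write_jsonl(entries, path):
    """Write one JSON object per line, creating the parent directory if needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fout:
        for entry in entries:
            # ensure_ascii=False keeps non-Latin transcriptions readable.
            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Load a JSONL file back into a list of dicts."""
    with open(path, encoding="utf-8") as fin:
        return [json.loads(line) for line in fin if line.strip()]


manifest_entries = [
    {"audio_filepath": "/path/to/audio/clip_a.wav", "text": "բարև"},
    {"audio_filepath": "/path/to/audio/clip_b.wav", "text": "second sentence"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = write_jsonl(manifest_entries, os.path.join(tmp_dir, "result", "manifest.jsonl"))
    round_trip = read_jsonl(out_path)

print(round_trip == manifest_entries)
```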
File Organization
After processing, your directory structure will look like: