The FLEURS dataset is a multilingual speech dataset covering 102 languages, built on top of the FLoRes machine translation benchmark. NeMo Curator provides automated tools to download, extract, and prepare FLEURS data for audio curation pipelines.
The CreateInitialManifestFleursStage handles the complete FLEURS data preparation workflow:
Note: You can omit the explicit executor and call pipeline.run() without arguments. By default, the pipeline uses XennaExecutor.
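FLEURS supports 102 languages. Each language is identified by a code that pairs an ISO 639-1 language code with an ISO 3166-1 alpha-2 country code, for example hy_am (Armenian as spoken in Armenia).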
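As an illustration of how these language identifiers are composed, the following minimal sketch splits a code into its parts and rejects malformed input. The helper name parse_fleurs_code and the FleursLocale type are hypothetical and not part of NeMo Curator. The sketch also assumes that some codes carry an optional middle script part (for example, cmn_hans_cn).

```python
from typing import NamedTuple, Optional


class FleursLocale(NamedTuple):
    """Parts of a FLEURS-style language identifier (hypothetical helper type)."""
    language: str
    script: Optional[str]
    region: str


def parse_fleurs_code(code: str) -> FleursLocale:
    """Split a code such as 'hy_am' into language and region parts.

    Assumption: codes have the form <language>_<region> or
    <language>_<script>_<region>, separated by underscores.
    """
    parts = code.strip().lower().split("_")
    if len(parts) == 2:
        language, region = parts
        script = None
    elif len(parts) == 3:
        language, script, region = parts
    else:
        raise ValueError(f"Unexpected language code format: {code!r}")

    # The region is expected to be a two-letter ISO 3166-1 alpha-2 code.
    if not language.isalpha() or not region.isalpha() or len(region) != 2:
        raise ValueError(f"Unexpected language code format: {code!r}")

    return FleursLocale(language=language, script=script, region=region.upper())


locale = parse_fleurs_code("hy_am")
print(locale.language, locale.region)
```

Validating the code before starting a download can surface typos early, since a malformed identifier would otherwise fail only once the stage tries to fetch data.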
Choose from three available data splits: train, dev, and test.
The FLEURS loading stage generates AudioTask objects with the following structure:
To persist results to disk, add a writer stage (for example, JsonlWriter). In a full pipeline, a writer stage creates the result/ directory shown below.
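To make concrete what persisting results as JSON Lines involves, here is a minimal standard-library sketch. It is not NeMo Curator's JsonlWriter and does not reproduce its API. The functions write_jsonl and read_jsonl, the file name manifest.jsonl, and the entry fields audio_filepath and text are all assumptions chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(entries, output_path):
    """Write one JSON object per line, creating parent directories as needed."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as f:
        for entry in entries:
            # ensure_ascii=False keeps non-Latin transcripts human-readable.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return output_path


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical manifest entries; field names are assumptions for this sketch.
entries = [
    {"audio_filepath": "dev/0001.wav", "text": "first example transcript"},
    {"audio_filepath": "dev/0002.wav", "text": "second example transcript"},
]

with tempfile.TemporaryDirectory() as tmp:
    manifest = write_jsonl(entries, Path(tmp) / "result" / "manifest.jsonl")
    loaded = read_jsonl(manifest)
```

One object per line means later stages can stream the manifest without loading the whole file into memory.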
After processing, your directory structure will look like: