Copyright © 2026, NVIDIA Corporation.

Load FLEURS Dataset

The FLEURS dataset is a multilingual speech dataset covering 102 languages, built on top of the FLoRes machine translation benchmark. NeMo Curator provides automated tools to download, extract, and prepare FLEURS data for audio curation pipelines.

How It Works

The CreateInitialManifestFleursStage handles the complete FLEURS data preparation workflow:

  1. Download: Retrieves audio files and transcription files from Hugging Face
  2. Extract: Unpacks compressed audio archives
  3. Manifest Creation: Generates structured manifests with audio file paths and transcriptions
  4. Manifest References: Produces entries that point to extracted audio files; this stage does not decode or check audio content
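
The manifest creation step can be pictured with a small standard-library sketch. It is not the stage's actual code. The helper name build_manifest_entries is hypothetical, and the TSV layout (audio file name in the second column, transcription in the third) is an assumption made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def build_manifest_entries(tsv_path: Path, audio_dir: Path) -> list[dict[str, str]]:
    """Pair each TSV row with the path of its extracted audio file.

    Assumption: column 1 is the audio file name, column 2 the transcription.
    """
    entries = []
    with tsv_path.open(encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 3:
                continue  # skip rows that do not have enough columns
            entries.append(
                {
                    "audio_filepath": str((audio_dir / row[1]).resolve()),
                    "text": row[2],
                }
            )
    return entries


# Demo with throwaway files; the rows below are invented sample data.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    tsv = root / "dev.tsv"
    tsv.write_text(
        "1\taudio_001.wav\thello world\n2\taudio_002.wav\tsecond line\n",
        encoding="utf-8",
    )
    entries = build_manifest_entries(tsv, root / "dev")

for entry in entries:
    print(entry["text"])
```

Like step 4 above, the sketch only records file paths. It never opens or decodes the audio.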

Usage

Basic FLEURS Loading

from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.pipeline import Pipeline

# Create FLEURS loading stage
fleurs_stage = CreateInitialManifestFleursStage(
    lang="hy_am",  # Armenian language
    split="dev",  # Development split
    raw_data_dir="/path/to/audio/data",
)

# Add to pipeline
pipeline = Pipeline(name="fleurs_loading")
pipeline.add_stage(fleurs_stage.with_(batch_size=4))

# Execute
pipeline.run()

Note: The example calls pipeline.run() without an executor argument. In that case the pipeline uses XennaExecutor by default, so you do not need to create an executor explicitly.

Language Options

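FLEURS covers 102 languages. Each is identified by a lowercase language code joined to a region code with an underscore. Most codes combine ISO 639-1 with ISO 3166-1 alpha-2 (for example, en_us), but some use three-letter language codes or a numeric UN M.49 region (es_419 is Spanish for Latin America):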

# Common language examples
languages = [
    "en_us",  # English (US)
    "es_419",  # Spanish (Latin America)
    "fr_fr",  # French (France)
    "de_de",  # German (Germany)
    "cmn_hans_cn",  # Mandarin Chinese (Simplified)
    "ja_jp",  # Japanese
    "ko_kr",  # Korean
    "hy_am",  # Armenian
    "ar_eg",  # Arabic (Egypt)
]
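
If you need the language and region parts separately, for example to name output folders, a small helper along these lines may be enough. The function split_language_code is hypothetical and not part of NeMo Curator. It only assumes that the region is the part after the last underscore.

```python
def split_language_code(code: str) -> tuple[str, str]:
    """Split a code such as "en_us" into (language, region).

    The language part is everything before the last underscore, so a
    numeric region such as "419" is handled the same way as "us".
    """
    language, sep, region = code.rpartition("_")
    if not sep or not language or not region:
        raise ValueError(f"expected '<language>_<region>', got {code!r}")
    return language, region


pairs = [split_language_code(c) for c in ["en_us", "es_419", "hy_am"]]
print(pairs)  # [('en', 'us'), ('es', '419'), ('hy', 'am')]
```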

Data Splits

Choose from three available data splits:

splits = ["train", "dev", "test"]

# Build one training-split stage per language, each with its own data directory
stages = []
for lang in ["en_us", "es_419", "fr_fr"]:
    stage = CreateInitialManifestFleursStage(
        lang=lang,
        split="train",
        raw_data_dir=f"/data/fleurs/{lang}",
    )
    stages.append(stage)
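
To cover every combination of language and split, it can help to plan the stage arguments first. The following is a minimal sketch using only the standard library. The helper plan_jobs and the /data/fleurs base directory are illustrative assumptions, not library features.

```python
from itertools import product
from pathlib import PurePosixPath

LANGS = ["en_us", "es_419", "fr_fr"]
SPLITS = ["train", "dev", "test"]


def plan_jobs(base_dir: str, langs: list[str], splits: list[str]) -> list[dict[str, str]]:
    """Return one keyword-argument dict per (language, split) pair."""
    return [
        {
            "lang": lang,
            "split": split,
            # One directory per language, shared by all of its splits
            "raw_data_dir": str(PurePosixPath(base_dir) / lang),
        }
        for lang, split in product(langs, splits)
    ]


jobs = plan_jobs("/data/fleurs", LANGS, SPLITS)
print(len(jobs))  # 9
print(jobs[0])  # {'lang': 'en_us', 'split': 'train', 'raw_data_dir': '/data/fleurs/en_us'}
```

Each dict uses the parameter names shown on this page, so it could be unpacked as CreateInitialManifestFleursStage(**job) when building a pipeline.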

Output Format

The FLEURS loading stage generates AudioTask objects with the following structure:

{
    "audio_filepath": "/absolute/path/to/audio.wav",
    "text": "ground truth transcription text"
}
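
Entries with this shape map naturally onto JSON Lines, one object per line. The sketch below shows a round trip with the standard json module. The helper names and the sample records are invented for illustration, and this is not the writer stage's implementation.

```python
import json

REQUIRED_KEYS = {"audio_filepath", "text"}

# Invented sample records that follow the structure shown above
records = [
    {"audio_filepath": "/data/fleurs/en_us/dev/audio_001.wav", "text": "first example"},
    {"audio_filepath": "/data/fleurs/hy_am/dev/audio_002.wav", "text": "Բարև"},
]


def to_jsonl(items: list[dict[str, str]]) -> str:
    """Serialize records as one JSON object per line, keeping non-ASCII text readable."""
    return "".join(json.dumps(item, ensure_ascii=False) + "\n" for item in items)


def from_jsonl(text: str) -> list[dict[str, str]]:
    """Parse JSON Lines text and check that every record has the expected keys."""
    parsed = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines, including the trailing one
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no} is missing keys: {sorted(missing)}")
        parsed.append(record)
    return parsed


jsonl_text = to_jsonl(records)
round_tripped = from_jsonl(jsonl_text)
print(round_tripped == records)  # True
```

Passing ensure_ascii=False keeps transcriptions in non-Latin scripts human-readable in the file, which matters for a multilingual dataset.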

Configuration Options

Stage Parameters

CreateInitialManifestFleursStage(
    lang="en_us",  # Language code
    split="train",  # Data split
    raw_data_dir="/data/fleurs",  # Storage directory
    filepath_key="audio_filepath",  # Key for audio file paths
    text_key="text",  # Key for transcription text
)

Batch Processing

# Configure batch size for processing; keep the stage returned by with_()
fleurs_stage = fleurs_stage.with_(batch_size=8)  # Process 8 files per batch
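
As a conceptual illustration of what a batch size of 8 means, the sketch below groups a list of file names into batches. It is plain Python with invented file names and says nothing about how the executor schedules work internally.

```python
from collections.abc import Iterable, Iterator


def chunked(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: list[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


files = [f"audio_{i:03d}.wav" for i in range(1, 21)]  # 20 invented file names
batches = list(chunked(files, 8))
print([len(b) for b in batches])  # [8, 8, 4]
```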

To persist results to disk, add a writer stage (for example, JsonlWriter). In a full pipeline, a writer stage creates the result/ directory shown below.

File Organization

After processing, your directory structure will look like:

/data/fleurs/en_us/
├── dev.tsv        # Transcription metadata (used internally by the stage)
├── dev.tar.gz     # Compressed audio files
├── dev/           # Extracted audio files
│   ├── audio_001.wav
│   ├── audio_002.wav
│   └── ...
└── result/        # Processed JSONL manifests (if using full pipeline)
    └── *.jsonl
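
To confirm that a run left the expected files in place, a quick check with pathlib may help. The helper missing_artifacts is hypothetical. It assumes the layout shown above and ignores the optional result/ directory.

```python
import tempfile
from pathlib import Path


def missing_artifacts(lang_dir: Path, split: str) -> list[str]:
    """Return the names of expected files or folders that are absent."""
    checks = [
        (f"{split}.tsv", (lang_dir / f"{split}.tsv").is_file()),
        (f"{split}.tar.gz", (lang_dir / f"{split}.tar.gz").is_file()),
        (f"{split}/", (lang_dir / split).is_dir()),
    ]
    return [name for name, present in checks if not present]


# Demo on a throwaway directory that deliberately lacks the archive
with tempfile.TemporaryDirectory() as tmp:
    lang_dir = Path(tmp) / "en_us"
    (lang_dir / "dev").mkdir(parents=True)
    (lang_dir / "dev.tsv").write_text("", encoding="utf-8")
    missing = missing_artifacts(lang_dir, "dev")

print(missing)  # ['dev.tar.gz']
```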