nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest
Module Contents
Classes
Functions
API
Dataclass
Bases: ProcessingStage[_EmptyTask, AudioTask]
Create initial manifest for the FLEURS dataset.
Dataset link: https://huggingface.co/datasets/google/fleurs
Downloads all files, extracts them, and emits one AudioTask per
transcript line. Each emitted entry stores the audio file path under
filepath_key and the transcript text under text_key.
Parameters:
lang
Language code (e.g. "hy_am" for Armenian).
split
Dataset split ("test", "train", or "dev").
raw_data_dir
Folder for downloading and extracting the archive.
filepath_key
Key name used for the audio file path in each emitted entry.
text_key
Key name used for the transcript text in each emitted entry.
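As a rough illustration of how these parameters shape each emitted entry, the sketch below uses a hypothetical stand-in dataclass, FleursManifestParams, rather than the library class itself. The key values "audio_filepath" and "text", the helper make_entry, and the paths are assumptions chosen for the example; they are not taken from this page.

```python
from dataclasses import dataclass


@dataclass
class FleursManifestParams:
    """Hypothetical stand-in that mirrors the documented parameters."""

    lang: str            # e.g. "hy_am" for Armenian
    split: str           # "test", "train", or "dev"
    raw_data_dir: str    # folder for downloading and extracting the archive
    filepath_key: str = "audio_filepath"  # assumed value, for illustration only
    text_key: str = "text"                # assumed value, for illustration only

    def __post_init__(self) -> None:
        # The documentation lists exactly these three splits.
        if self.split not in ("test", "train", "dev"):
            raise ValueError(f"unknown split: {self.split!r}")

    def make_entry(self, audio_path: str, transcript: str) -> dict[str, str]:
        # One entry per transcript line, keyed by filepath_key and text_key.
        return {self.filepath_key: audio_path, self.text_key: transcript}


# Placeholder directory and file name, not real FLEURS paths.
params = FleursManifestParams(lang="hy_am", split="dev", raw_data_dir="/tmp/fleurs")
entry = params.make_entry("/tmp/fleurs/dev/0001.wav", "example transcript")
```

Changing filepath_key or text_key only renames the keys in the entry; the values are unaffected.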
batch_size
filepath_key
lang
name
raw_data_dir
split
text_key
Parse transcript TSV file and emit one AudioTask per line.
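The parsing step can be sketched roughly as follows. This is a minimal standalone approximation, not the stage's actual code: the function name parse_transcript_tsv is hypothetical, plain dicts stand in for AudioTask, and the column layout (file name in the second column, transcript in the third) is an assumption about the FLEURS TSV files that should be checked against the real data.

```python
import csv
import io
from pathlib import PurePosixPath


def parse_transcript_tsv(
    tsv_text: str,
    audio_dir: str,
    filepath_key: str,
    text_key: str,
) -> list[dict[str, str]]:
    """Return one entry per TSV line (hypothetical helper).

    Assumes column 1 holds the audio file name and column 2 the transcript.
    """
    entries = []
    # QUOTE_NONE keeps quote characters inside transcripts as literal text.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        file_name, transcript = row[1], row[2]
        entries.append(
            {
                filepath_key: str(PurePosixPath(audio_dir) / file_name),
                text_key: transcript,
            }
        )
    return entries


# Made-up sample rows, not real FLEURS data.
sample = "1\ta.wav\tHello world\thello world\n2\tb.wav\tSecond line\tsecond line\n"
entries = parse_transcript_tsv(sample, "wavs", "audio_filepath", "text")
```

Each element of entries corresponds to one transcript line, matching the one-task-per-line behavior described above.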