nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest

Module Contents

Classes

| Name | Description |
| --- | --- |
| CreateInitialManifestFleursStage | Create initial manifest for the FLEURS dataset. |

Functions

| Name | Description |
| --- | --- |
| get_fleurs_url_list | examples |

API

class nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest.CreateInitialManifestFleursStage(
name: str = 'CreateInitialManifestFleurs',
lang: str = '',
split: str = '',
raw_data_dir: str = '',
filepath_key: str = 'audio_filepath',
text_key: str = 'text',
batch_size: int = 1
)
Dataclass

Bases: ProcessingStage[_EmptyTask, AudioTask]

Create initial manifest for the FLEURS dataset.

Dataset link: https://huggingface.co/datasets/google/fleurs

Downloads the dataset files for the given language and split, extracts them, and emits one AudioTask per transcript line. Each task stores the audio file path under filepath_key and the transcript under text_key.

Parameters:

lang
str. Defaults to ''

Language code (e.g. "hy_am" for Armenian).

split
str. Defaults to ''

Dataset split ("test", "train", or "dev").

raw_data_dir
str. Defaults to ''

Folder for downloading and extracting the archive.

filepath_key
str. Defaults to 'audio_filepath'

Key name used for the audio file path in each emitted entry.

text_key
str. Defaults to 'text'

Key name used for the transcript text in each emitted entry.
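
The parameters above can be combined as in the following minimal sketch. The language code, split, and especially the `/tmp/fleurs_raw` path are illustrative placeholders, not library defaults, and the import is guarded so the snippet still runs where NeMo Curator is not installed. How the stage is then attached to a pipeline is not covered by this page and is left out.

```python
# Illustrative values only; "/tmp/fleurs_raw" is a placeholder path (assumption).
stage_kwargs = {
    "lang": "hy_am",                  # Armenian, as in the parameter description
    "split": "dev",                   # one of "test", "train", "dev"
    "raw_data_dir": "/tmp/fleurs_raw",
    "filepath_key": "audio_filepath",
    "text_key": "text",
}

try:
    from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import (
        CreateInitialManifestFleursStage,
    )
except ImportError:
    # NeMo Curator is not available in this environment.
    CreateInitialManifestFleursStage = None

stage = None
if CreateInitialManifestFleursStage is not None:
    # Keyword names match the documented constructor signature.
    stage = CreateInitialManifestFleursStage(**stage_kwargs)
```

Passing every argument by keyword keeps the call readable and avoids depending on the positional order of the dataclass fields.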

batch_size
int = 1
filepath_key
str = 'audio_filepath'
lang
str = ''
name
str = 'CreateInitialManifestFleurs'
raw_data_dir
str = ''
split
str = ''
text_key
str = 'text'
nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest.CreateInitialManifestFleursStage.__post_init__() -> None
nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest.CreateInitialManifestFleursStage.download_extract_files(
dst_folder: str
) -> None
nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest.CreateInitialManifestFleursStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest.CreateInitialManifestFleursStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest.CreateInitialManifestFleursStage.process(
_: nemo_curator.tasks._EmptyTask
) -> list[nemo_curator.tasks.AudioTask]
nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest.CreateInitialManifestFleursStage.process_transcript(
file_path: str
) -> list[nemo_curator.tasks.AudioTask]

Parse transcript TSV file and emit one AudioTask per line.
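
The per-line parsing described above can be sketched with the standard library alone. This is not the library's implementation: `parse_transcript_sketch` and its `audio_dir` argument are hypothetical, plain dictionaries stand in for AudioTask, and the column layout (file name in the second column, transcript in the third) is an assumption about the FLEURS TSV files.

```python
import os
import tempfile


def parse_transcript_sketch(file_path, audio_dir,
                            filepath_key="audio_filepath", text_key="text"):
    """Return one dict per non-empty TSV line (stand-in for AudioTask data)."""
    entries = []
    with open(file_path, encoding="utf-8") as fin:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                raise ValueError(f"Unexpected TSV line: {line!r}")
            # Assumed layout: id, file name, transcript, ...
            file_name, transcript = parts[1], parts[2]
            entries.append({
                filepath_key: os.path.join(audio_dir, file_name),
                text_key: transcript,
            })
    return entries


# Small demonstration on a temporary two-line transcript file.
with tempfile.TemporaryDirectory() as tmp:
    tsv_path = os.path.join(tmp, "dev.tsv")
    with open(tsv_path, "w", encoding="utf-8") as fout:
        fout.write("1\ta.wav\thello world\thello world\th e l l o\t16000\tFEMALE\n")
        fout.write("2\tb.wav\tgood morning\tgood morning\tg o o d\t24000\tMALE\n")
    entries = parse_transcript_sketch(tsv_path, audio_dir=os.path.join(tmp, "dev"))
```

Changing `filepath_key` or `text_key` only renames the dictionary keys, which mirrors how the two parameters are described for the stage.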

nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest.CreateInitialManifestFleursStage.ray_stage_spec() -> dict[str, typing.Any]
nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest.get_fleurs_url_list(
lang: str,
split: str
) -> list[str]
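
The signature suggests that the function maps a language code and a split to the download URLs for that subset. A rough sketch of such a mapping follows. `fleurs_urls_sketch` is a hypothetical name, and both the repository layout (`data/<lang>/<split>.tsv` and `data/<lang>/audio/<split>.tar.gz` under the dataset link given earlier) and the order of the returned list are assumptions, not confirmed by this page.

```python
# Assumed base path inside the Hugging Face dataset repository.
BASE_URL = "https://huggingface.co/datasets/google/fleurs/resolve/main/data"


def fleurs_urls_sketch(lang: str, split: str) -> list[str]:
    """Build the transcript and audio archive URLs for one language and split."""
    return [
        f"{BASE_URL}/{lang}/{split}.tsv",           # transcript table (assumed path)
        f"{BASE_URL}/{lang}/audio/{split}.tar.gz",  # audio archive (assumed path)
    ]


urls = fleurs_urls_sketch("hy_am", "dev")
```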