stages.audio.datasets.fleurs.create_initial_manifest#
Module Contents#
Classes#
CreateInitialManifestFleursStage | Stage to create initial manifest for the FLEURS dataset.
Functions#
get_fleurs_url_list | Example URLs: “https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/audio/dev.tar.gz”, “https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/dev.tsv”
API#
- class stages.audio.datasets.fleurs.create_initial_manifest.CreateInitialManifestFleursStage#
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.AudioBatch]
Stage to create initial manifest for the FLEURS dataset.
Dataset link: https://huggingface.co/datasets/google/fleurs
Downloads all files, extracts them, and creates a manifest file with the “audio_filepath” and “text” fields.
Args:
lang (str): Language to be processed, identified by a combination of ISO 639-1 and ISO 3166-1 alpha-2 codes. Examples are:
  - ``"hy_am"`` for Armenian
  - ``"ko_kr"`` for Korean
split (str): Which dataset split to process. Options are:
  - ``"test"``
  - ``"train"``
  - ``"dev"``
raw_data_dir (str): Path to the folder where the data archive should be downloaded and extracted.
Returns:
This stage generates an initial SpeechObject with the following fields:
{ "audio_filepath": <path to the audio file>, "text": <transcription>, }
- batch_size: int#
1
- download_extract_files(dst_folder: str) → None#
Download and extract the files.
- filepath_key: str#
‘audio_filepath’
- lang: str#
None
- name: str#
‘CreateInitialManifestFleurs’
- process(
- _: nemo_curator.tasks._EmptyTask,
Process a task and return the result.
Args:
task (X): Input task to process
Returns (Y | list[Y]):
  - Single task: For 1-to-1 transformations
  - List of tasks: For 1-to-many transformations (e.g., readers)
  - None: If the task should be filtered out
- process_transcript(
- file_path: str,
Parse transcript TSV file and put it inside manifest. Assumes the TSV file has two columns: file name and text.
- raw_data_dir: str#
None
- split: str#
None
- text_key: str#
‘text’
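The transcript parsing described for `process_transcript` can be illustrated with a minimal sketch. It is not the library's implementation: the helper name `parse_transcript_tsv` and the `audio_dir` parameter are hypothetical, and it assumes, as the docstring states, that the first TSV column is the file name and the second is the text. The output keys mirror the `filepath_key` and `text_key` defaults listed above.

```python
import csv
import os


def parse_transcript_tsv(tsv_path: str, audio_dir: str) -> list[dict[str, str]]:
    """Hypothetical helper: turn a two-column TSV into manifest entries."""
    entries = []
    with open(tsv_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside transcripts untouched.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            file_name, text = row[0], row[1]
            entries.append(
                {
                    "audio_filepath": os.path.join(audio_dir, file_name),
                    "text": text,
                }
            )
    return entries
```

Each returned dictionary has the same two fields as the manifest record shown in the class description, so the entries could be written out one per line as JSON.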
- stages.audio.datasets.fleurs.create_initial_manifest.get_fleurs_url_list(lang: str, split: str) → list[str]#
Example URLs: “https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/audio/dev.tar.gz”, “https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/dev.tsv”
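The two example URLs suggest a simple pattern: one audio archive and one transcript TSV per language and split. The sketch below reproduces that pattern under this assumption. The function name `build_fleurs_urls` and the constant `FLEURS_BASE_URL` are hypothetical and not part of the documented module, and the real `get_fleurs_url_list` may differ.

```python
# Base path inferred from the example URLs above (an assumption, not a documented constant).
FLEURS_BASE_URL = "https://huggingface.co/datasets/google/fleurs/resolve/main/data"


def build_fleurs_urls(lang: str, split: str) -> list[str]:
    """Hypothetical helper: build the archive and transcript URLs for one split."""
    return [
        f"{FLEURS_BASE_URL}/{lang}/audio/{split}.tar.gz",  # audio archive
        f"{FLEURS_BASE_URL}/{lang}/{split}.tsv",  # transcript table
    ]


urls = build_fleurs_urls("hy_am", "dev")
```

For `lang="hy_am"` and `split="dev"`, this yields the two URLs quoted in the docstring.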