stages.audio.datasets.fleurs.create_initial_manifest#

Module Contents#

Classes#

CreateInitialManifestFleursStage

Stage to create initial manifest for the FLEURS dataset.

Functions#

get_fleurs_url_list

Return the download URLs for the given FLEURS language and split. Examples: “https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/audio/dev.tar.gz”, “https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/dev.tsv”

API#

class stages.audio.datasets.fleurs.create_initial_manifest.CreateInitialManifestFleursStage#

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.AudioBatch]

Stage to create initial manifest for the FLEURS dataset.

Dataset link: https://huggingface.co/datasets/google/fleurs

Will download all files, extract them, and create a manifest file with the “audio_filepath” and “text” fields.

Args:

lang (str): Language to be processed, identified by a combination of ISO 639-1 and ISO 3166-1 alpha-2 codes. Examples are:

    - ``"hy_am"`` for Armenian
    - ``"ko_kr"`` for Korean

split (str): Which dataset splits to process.
    Options are:

    - ``"test"``
    - ``"train"``
    - ``"dev"``

raw_data_dir (str): Path to the folder where the data archive should be downloaded and extracted.

Returns: This stage generates an initial SpeechObject with the following fields:

    {
        "audio_filepath": <path to the audio file>,
        "text": <transcription>,
    }

batch_size: int#

1

download_extract_files(dst_folder: str) → None#

Download the dataset files and extract them into dst_folder.
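The download-and-extract step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the stage's actual implementation: the function name `download_and_extract` and the skip-if-present behaviour are assumptions made for the example.

```python
import tarfile
import urllib.request
from pathlib import Path


def download_and_extract(url: str, dst_folder: str) -> Path:
    """Hypothetical helper: fetch `url` into `dst_folder`, unpacking .tar.gz archives."""
    dst = Path(dst_folder)
    dst.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last path component of the URL.
    local_path = dst / url.rsplit("/", 1)[-1]
    if not local_path.exists():  # assumption: reuse files that are already present
        urllib.request.urlretrieve(url, local_path)
    # Only archives need unpacking; transcript .tsv files are left as downloaded.
    if local_path.name.endswith(".tar.gz"):
        with tarfile.open(local_path, "r:gz") as archive:
            archive.extractall(dst)
    return local_path
```

Calling the helper once per URL returned by `get_fleurs_url_list` would leave both the extracted audio and the transcript file under the same folder.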

filepath_key: str#

‘audio_filepath’

lang: str#

None

name: str#

‘CreateInitialManifestFleurs’

process(_: nemo_curator.tasks._EmptyTask) → list[nemo_curator.tasks.AudioBatch]#

Process a task and return the result.

Args: task (X): Input task to process

Returns (Y | list[Y]):

    - Single task: For 1-to-1 transformations
    - List of tasks: For 1-to-many transformations (e.g., readers)
    - None: If the task should be filtered out

process_transcript(file_path: str) → list[nemo_curator.tasks.AudioBatch]#

Parse transcript TSV file and put it inside manifest. Assumes the TSV file has two columns: file name and text.
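As an illustration of the two-column assumption, the sketch below reads such a TSV and builds entries with the “audio_filepath” and “text” fields. It is not the stage's own code: the function name `read_transcript_tsv` and the `audio_dir` parameter are hypothetical, and plain dictionaries stand in for `AudioBatch` objects.

```python
import csv
from pathlib import Path


def read_transcript_tsv(tsv_path: str, audio_dir: str) -> list[dict[str, str]]:
    """Hypothetical helper: turn a (file name, text) TSV into manifest entries."""
    entries = []
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            file_name, text = row[0], row[1]
            entries.append(
                {
                    "audio_filepath": str(Path(audio_dir) / file_name),
                    "text": text,
                }
            )
    return entries
```

Rows with fewer than two columns are skipped here rather than raising an error; that choice is part of the sketch, not documented behaviour of the stage.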

raw_data_dir: str#

None

split: str#

None

text_key: str#

‘text’

stages.audio.datasets.fleurs.create_initial_manifest.get_fleurs_url_list(lang: str, split: str) → list[str]#

Return the download URLs for the given FLEURS language and split. Examples: “https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/audio/dev.tar.gz”, “https://huggingface.co/datasets/google/fleurs/resolve/main/data/hy_am/dev.tsv”
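The URL pattern implied by these examples can be reproduced with a short sketch. The constant `FLEURS_BASE_URL` and the function name `build_fleurs_urls` are hypothetical; the pattern is inferred from the two example URLs only and may not match the library function in every detail.

```python
# Assumed base path, taken from the example URLs in this page.
FLEURS_BASE_URL = "https://huggingface.co/datasets/google/fleurs/resolve/main/data"


def build_fleurs_urls(lang: str, split: str) -> list[str]:
    """Hypothetical helper: audio archive URL and transcript URL for one split."""
    return [
        f"{FLEURS_BASE_URL}/{lang}/audio/{split}.tar.gz",  # audio archive
        f"{FLEURS_BASE_URL}/{lang}/{split}.tsv",  # transcript table
    ]


urls = build_fleurs_urls("hy_am", "dev")
```

For `"hy_am"` and `"dev"` this yields the two example URLs listed in the description.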