nemo_curator.stages.audio.alm.alm_manifest_reader


ALM Manifest Reader: a CompositeStage that combines FilePartitioningStage with line-by-line JSONL reading.

The reader avoids Pandas so that it can handle large manifests with deeply nested audio metadata (word timestamps, segments, metrics), which would cause a 3-5x memory blow-up if loaded with pd.read_json.

Module Contents

Classes

| Name | Description |
| --- | --- |
| ALMManifestReader | Composite stage for reading ALM JSONL manifests. |
| ALMManifestReaderStage | Read JSONL manifest files from a FileGroupTask and emit one AudioTask per line. |

API

class nemo_curator.stages.audio.alm.alm_manifest_reader.ALMManifestReader(
name: str = 'alm_manifest_reader',
manifest_path: str | list[str] = '',
files_per_partition: int | None = 1,
blocksize: int | str | None = None,
file_extensions: list[str] = (lambda: ['.jsonl', '.json'...,
storage_options: dict[str, typing.Any] | None = None
)
Dataclass

Bases: CompositeStage[_EmptyTask, AudioTask]

Composite stage for reading ALM JSONL manifests.

Decomposes into:

  1. FilePartitioningStage — discovers and partitions manifest files
  2. ALMManifestReaderStage — reads each partition line-by-line (no Pandas)
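
The two-step decomposition above can be pictured with a minimal, self-contained sketch. It does not use nemo_curator: `CompositeSketch`, `partition_stage`, `reader_stage`, and `run_stages` are hypothetical stand-ins, and the manifests are an in-memory dictionary rather than real files. The only point is the shape of the flow, where each stage maps one task to a list of tasks and the outputs of one stage feed the next.

```python
from dataclasses import dataclass
from typing import Callable

# In this sketch a "stage" is just a function: one task in, a list of tasks out.
Stage = Callable[[object], list]


def partition_stage(files_per_partition: int) -> Stage:
    """Stand-in for step 1: split a list of manifest paths into groups."""
    def run(paths: list) -> list:
        return [
            paths[i:i + files_per_partition]
            for i in range(0, len(paths), files_per_partition)
        ]
    return run


def reader_stage(fake_files: dict) -> Stage:
    """Stand-in for step 2: emit one record per manifest line in a group."""
    def run(group: list) -> list:
        return [record for path in group for record in fake_files[path]]
    return run


@dataclass
class CompositeSketch:
    """Hypothetical composite: holds settings and expands into two stages."""
    files_per_partition: int
    fake_files: dict

    def decompose(self) -> list:
        return [
            partition_stage(self.files_per_partition),
            reader_stage(self.fake_files),
        ]


def run_stages(stages: list, initial: object) -> list:
    """Chain fan-out stages: every output of one stage is an input to the next."""
    tasks = [initial]
    for stage in stages:
        tasks = [out for task in tasks for out in stage(task)]
    return tasks


# In-memory stand-in for two manifest files (already parsed lines).
fake_files = {"a.jsonl": [{"id": 1}, {"id": 2}], "b.jsonl": [{"id": 3}]}
composite = CompositeSketch(files_per_partition=1, fake_files=fake_files)
records = run_stages(composite.decompose(), ["a.jsonl", "b.jsonl"])
```

With `files_per_partition=1`, the first stand-in stage yields two groups of one file each, and the second turns them into three records in file order.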

Parameters:

manifest_path
str | list[str]. Defaults to ''

Path or list of paths to JSONL manifests (local or cloud).

files_per_partition
int | None. Defaults to 1

Number of manifest files per partition. Defaults to 1.

blocksize
int | str | None. Defaults to None

Target size per partition (e.g., "100MB"). Ignored if files_per_partition is set.
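
The precedence between files_per_partition and blocksize can be illustrated with a small sketch. This is not the FilePartitioningStage implementation: `parse_size` and `partition_files` are hypothetical helpers, the size units are assumed to be decimal (1 MB = 10**6 bytes), and the greedy size-based grouping is only one plausible strategy. Since files_per_partition defaults to 1, the sketch only falls back to blocksize when files_per_partition is explicitly None, which appears to be what the description implies.

```python
import re
from typing import Optional, Union

# Assumption of this sketch: decimal units.
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(size: Union[int, str]) -> int:
    """Turn 100_000_000 or "100MB" into a byte count."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", size)
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"unrecognized size: {size!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2).upper()])


def partition_files(
    files: list,
    files_per_partition: Optional[int] = 1,
    blocksize: Union[int, str, None] = None,
) -> list:
    """Group (path, size_in_bytes) pairs into partitions of paths."""
    if files_per_partition is not None:
        # Count-based grouping wins; blocksize is ignored in this branch.
        n = files_per_partition
        return [[path for path, _ in files[i:i + n]] for i in range(0, len(files), n)]
    if blocksize is None:
        # Arbitrary choice for this sketch: everything in one partition.
        return [[path for path, _ in files]]
    limit = parse_size(blocksize)
    groups, current, current_size = [], [], 0
    for path, size in files:
        # Close the current group when adding this file would exceed the target.
        if current and current_size + size > limit:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups


files = [("a.jsonl", 60_000_000), ("b.jsonl", 50_000_000), ("c.jsonl", 30_000_000)]
by_count = partition_files(files, files_per_partition=2)
by_size = partition_files(files, files_per_partition=None, blocksize="100MB")
both_set = partition_files(files, files_per_partition=1, blocksize="100MB")
```

Here `both_set` is grouped by count even though a blocksize was passed, mirroring the "ignored if files_per_partition is set" rule.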

file_extensions
list[str]. Defaults to (lambda: ['.jsonl', '.json'])()

File extensions to filter. Defaults to [".jsonl", ".json"].
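
As a rough illustration of extension filtering, the sketch below keeps only paths whose names end with one of the configured extensions. `filter_by_extension` is a hypothetical helper, and the exact, case-sensitive suffix match is an assumption; the actual stage may match differently.

```python
def filter_by_extension(paths: list, file_extensions=(".jsonl", ".json")) -> list:
    """Keep paths ending with one of the given extensions (case-sensitive)."""
    allowed = tuple(file_extensions)
    return [path for path in paths if path.endswith(allowed)]


candidates = ["a.jsonl", "b.json", "notes.txt", "s3://bucket/d.jsonl"]
manifests = filter_by_extension(candidates)
```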

storage_options
dict[str, Any] | None. Defaults to None

Storage options for cloud paths (S3, GCS credentials, endpoints).

blocksize
int | str | None = None
file_extensions
list[str]
files_per_partition
int | None = 1
manifest_path
str | list[str] = ''
name
str = 'alm_manifest_reader'
storage_options
dict[str, Any] | None = None
nemo_curator.stages.audio.alm.alm_manifest_reader.ALMManifestReader.__post_init__() -> None
nemo_curator.stages.audio.alm.alm_manifest_reader.ALMManifestReader.decompose() -> list[nemo_curator.stages.base.ProcessingStage]
nemo_curator.stages.audio.alm.alm_manifest_reader.ALMManifestReader.get_description() -> str
class nemo_curator.stages.audio.alm.alm_manifest_reader.ALMManifestReaderStage(
name: str = 'alm_manifest_reader_stage'
)
Dataclass

Bases: ProcessingStage[FileGroupTask, AudioTask]

Read JSONL manifest files from a FileGroupTask and emit one AudioTask per line.

Uses line-by-line streaming via fsspec (no Pandas) to keep memory at ~1x file size. Supports local and cloud paths (S3, GCS).
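
The streaming read described above might look roughly like the following sketch. It is not the library's code: `FileGroupSketch` and `AudioRecordSketch` are hypothetical stand-ins for FileGroupTask and AudioTask, `read_group` is a hypothetical helper, the `audio_filepath` and `duration` fields are illustrative only, and skipping blank lines is a choice made here. The built-in `open` is used so the example runs without extra dependencies; for remote paths, `fsspec.open` accepts the same `(path, mode, encoding=...)` arguments and could be passed as `opener`.

```python
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass
class FileGroupSketch:
    """Stand-in for FileGroupTask: only the list of file paths is modeled."""
    files: list


@dataclass
class AudioRecordSketch:
    """Stand-in for AudioTask: wraps one parsed manifest entry."""
    data: dict


def read_group(group: FileGroupSketch, opener=open) -> list:
    """Parse each manifest line on its own and emit one record per line."""
    tasks = []
    for path in group.files:
        with opener(path, "r", encoding="utf-8") as handle:
            for line in handle:  # iterating the handle reads one line at a time
                line = line.strip()
                if not line:
                    continue  # choice of this sketch: ignore blank lines
                tasks.append(AudioRecordSketch(data=json.loads(line)))
    return tasks


# Usage: write a tiny manifest to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    manifest = os.path.join(tmp, "manifest.jsonl")
    with open(manifest, "w", encoding="utf-8") as out:
        out.write('{"audio_filepath": "a.wav", "duration": 1.5}\n\n')
        out.write('{"audio_filepath": "b.wav", "duration": 2.0}\n')
    tasks = read_group(FileGroupSketch(files=[manifest]))
```

Because the result is a list, every parsed record in the group is held in memory at once, which is consistent with the stated footprint of roughly one file size rather than a constant one.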

name
str = 'alm_manifest_reader_stage'
nemo_curator.stages.audio.alm.alm_manifest_reader.ALMManifestReaderStage.process(
task: nemo_curator.tasks.FileGroupTask
) -> list[nemo_curator.tasks.AudioTask]
nemo_curator.stages.audio.alm.alm_manifest_reader.ALMManifestReaderStage.ray_stage_spec() -> dict[str, typing.Any]