nemo_curator.stages.audio.alm.alm_manifest_reader
ALM Manifest Reader: a CompositeStage that combines FilePartitioningStage with line-by-line JSONL reading.
It avoids Pandas so that large manifests with deeply nested audio metadata (word timestamps, segments, metrics) can be read without the 3-5x memory blow-up that pd.read_json would cause.
Module Contents
Classes
API
Bases: CompositeStage[_EmptyTask, AudioTask]
Composite stage for reading ALM JSONL manifests.
Decomposes into:
- FilePartitioningStage — discovers and partitions manifest files
- ALMManifestReaderStage — reads each partition line-by-line (no Pandas)
Parameters:
Path or list of paths to JSONL manifests (local or cloud).
Number of manifest files per partition. Defaults to 1.
Target size per partition (e.g., "100MB"). Ignored if files_per_partition is set.
File extensions to filter. Defaults to [".jsonl", ".json"].
Storage options for cloud paths (S3, GCS credentials, endpoints).
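The interaction between the extension filter and files_per_partition can be pictured with a small sketch. This is not the FilePartitioningStage implementation; `group_files` and its argument names are hypothetical and only illustrate the filtering and grouping described above, under the assumption that files are sorted before being grouped.

```python
from pathlib import PurePosixPath


def group_files(paths, files_per_partition=1, extensions=(".jsonl", ".json")):
    """Hypothetical helper: filter paths by extension, then chunk them into partitions."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be >= 1")
    # Keep only manifest-like files; sorting is an assumption made here
    # so that the partition layout is deterministic.
    kept = sorted(p for p in paths if PurePosixPath(p).suffix in extensions)
    # Slice the filtered list into consecutive groups of the requested size.
    return [
        kept[i:i + files_per_partition]
        for i in range(0, len(kept), files_per_partition)
    ]
```

With the default of one file per partition, each manifest becomes its own group, which a reader stage can then process independently.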
Bases: ProcessingStage[FileGroupTask, AudioTask]
Read JSONL manifest files from a FileGroupTask and emit one AudioTask per line.
Uses line-by-line streaming via fsspec (no Pandas) to keep memory at ~1x file size. Supports local and cloud paths (S3, GCS).
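A minimal sketch of the line-by-line approach, for illustration only. It is not the ALMManifestReaderStage code: `iter_manifest_entries` and its `opener` parameter are hypothetical names, plain dictionaries stand in for AudioTask, and the built-in `open` is swapped in for fsspec so the example runs without extra dependencies.

```python
import json
from typing import Any, Callable, Dict, IO, Iterable, Iterator


def _open_text(path: str) -> IO[str]:
    # Stand-in for an fsspec-based opener; handles local paths only.
    return open(path, "r", encoding="utf-8")


def iter_manifest_entries(
    paths: Iterable[str],
    opener: Callable[[str], IO[str]] = _open_text,
) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line across all files."""
    for path in paths:
        with opener(path) as handle:
            # Iterating the handle reads one line at a time, so the raw
            # file is never loaded into memory as a whole.
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                yield json.loads(line)
```

For cloud paths, an opener built on `fsspec.open(path, "r", **storage_options)` could presumably be passed instead, since the object it returns is also used as a context manager. Each yielded dictionary corresponds to one manifest line, matching the one-task-per-line behavior described above.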