Copyright © 2026, NVIDIA Corporation.


nemo_curator.stages.audio.tagging.split


Audio Splitting and Joining Stages.

Module Contents

Classes

| Name | Description |
| --- | --- |
| JoinSplitAudioMetadataStage | Stage for joining metadata of previously split audio files. |
| SplitASRAlignJoinStage | Composite stage: Split long audio -> ASR align -> Join results. |
| SplitLongAudioStage | Stage that splits long audio files into smaller segments. |

API

class nemo_curator.stages.audio.tagging.split.JoinSplitAudioMetadataStage(
name: str = 'JoinSplitAudioMetadata'
)
Dataclass

Bases: ProcessingStage[AudioTask, AudioTask]

Stage for joining metadata of previously split audio files.

Combines the metadata (transcripts and alignments) of audio files that were previously split by SplitLongAudioStage. Adjusts timestamps and concatenates transcripts to recreate the original audio’s metadata.

name: str = 'JoinSplitAudioMetadata'

nemo_curator.stages.audio.tagging.split.JoinSplitAudioMetadataStage._join_split_metadata(
meta_entry: dict
) -> None

Join metadata from split audio files.

nemo_curator.stages.audio.tagging.split.JoinSplitAudioMetadataStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.audio.tagging.split.JoinSplitAudioMetadataStage.outputs() -> tuple[list[str], list[str]]

nemo_curator.stages.audio.tagging.split.JoinSplitAudioMetadataStage.process(
task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask

Process entries and join split audio metadata.

This stage collects all entries and processes meta-entries to join split audio files back together.
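
The joining step described above can be sketched in plain Python. This is only an illustration of the idea (shift each split's timestamps by the total duration of the splits before it, then concatenate the transcripts); it is not the stage's actual implementation. The function name `join_split_metadata` and the per-split `duration` key are assumptions made for the example, while `text` and `words` mirror the default output keys documented on this page.

```python
def join_split_metadata(splits: list[dict]) -> dict:
    """Merge per-split transcripts into one entry (illustrative sketch only)."""
    texts = []
    words = []
    offset = 0.0  # total duration of all splits seen so far, in seconds
    for split in splits:
        if split["text"]:
            texts.append(split["text"])
        for w in split["words"]:
            # Shift split-relative timestamps into the original audio's timeline.
            words.append(
                {
                    "word": w["word"],
                    "start": round(w["start"] + offset, 3),
                    "end": round(w["end"] + offset, 3),
                }
            )
        offset += split["duration"]
    return {"text": " ".join(texts), "words": words, "duration": round(offset, 3)}


# Two hypothetical splits of one recording.
splits = [
    {
        "duration": 2.0,
        "text": "hello world",
        "words": [
            {"word": "hello", "start": 0.1, "end": 0.5},
            {"word": "world", "start": 0.6, "end": 1.0},
        ],
    },
    {
        "duration": 1.5,
        "text": "again",
        "words": [{"word": "again", "start": 0.2, "end": 0.7}],
    },
]
joined = join_split_metadata(splits)
```

In this sketch the word "again" starts 0.2 s into the second split, so it lands 2.2 s into the joined entry, because the first split is 2.0 s long.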

class nemo_curator.stages.audio.tagging.split.SplitASRAlignJoinStage(
suggested_max_len: float = 3600.0,
min_len: float = 1.0,
model_name: str = 'nvidia/parakeet-tdt_ctc-1.1b',
model_path: str | None = None,
is_fastconformer: bool = True,
decoder_type: str = 'rnnt',
max_len: float = 40.0,
batch_size: int = 100,
transcribe_batch_size: int = 32,
split_batch_size: int = 5000,
num_workers: int = 10,
infer_segment_only: bool = False,
compute_timestamps: bool = True,
timestamp_type: str = 'word',
text_key: str = 'text',
words_key: str = 'words',
disable_word_confidence: bool = False,
segments_key: str = 'segments',
name: str = 'SplitASRAlignJoin'
)
Dataclass

Bases: CompositeStage[AudioTask, AudioTask]

Composite stage: Split long audio -> ASR align -> Join results.

Decomposes into three sequential stages that always run together:

  1. SplitLongAudioStage — splits audio exceeding suggested_max_len
  2. NeMoASRAlignerStage — transcribes and aligns each chunk
  3. JoinSplitAudioMetadataStage — merges transcripts back into original entries
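
The split, align, join sequence above can be pictured with a toy composite in plain Python. Everything in this sketch is hypothetical: `ToySplitAlignJoin`, its helper methods, and `run` are stand-ins invented for illustration, the split uses fixed-size chunks rather than pause detection, and the "alignment" only writes a placeholder string. It does not use or reproduce the NeMo Curator classes.

```python
from dataclasses import dataclass
from typing import Callable

Entry = dict
Stage = Callable[[list[Entry]], list[Entry]]


@dataclass
class ToySplitAlignJoin:
    """Hypothetical stand-in for a composite stage; not the NeMo Curator class."""

    suggested_max_len: float = 3600.0

    def decompose(self) -> list[Stage]:
        # The composite does no work itself; it only lists its sub-stages in order.
        return [self._split, self._align, self._join]

    def _split(self, entries: list[Entry]) -> list[Entry]:
        # Fixed-size chunking, a simplification of pause-based splitting.
        chunks = []
        for e in entries:
            start = 0.0
            while start < e["duration"]:
                end = min(start + self.suggested_max_len, e["duration"])
                chunks.append({"id": e["id"], "start": start, "end": end})
                start = end
        return chunks

    def _align(self, chunks: list[Entry]) -> list[Entry]:
        # Placeholder for ASR: tag each chunk with its time range.
        return [{**c, "text": f"[{c['start']:.0f}-{c['end']:.0f}s]"} for c in chunks]

    def _join(self, chunks: list[Entry]) -> list[Entry]:
        # Group chunk texts back under the entry they came from.
        joined: dict[str, list[str]] = {}
        for c in chunks:
            joined.setdefault(c["id"], []).append(c["text"])
        return [{"id": k, "text": " ".join(v)} for k, v in joined.items()]


def run(stages: list[Stage], entries: list[Entry]) -> list[Entry]:
    """Apply the stages one after another, feeding each the previous output."""
    for stage in stages:
        entries = stage(entries)
    return entries


result = run(
    ToySplitAlignJoin(suggested_max_len=60.0).decompose(),
    [{"id": "a", "duration": 150.0}],
)
```

Here the 150-second entry is cut into three chunks (0-60, 60-120, 120-150), each chunk is tagged, and the join step returns a single entry for `"a"` again.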

Parameters:

suggested_max_len
float, defaults to 3600.0

Target max length for audio segments (seconds).

min_len
float, defaults to 1.0

Minimum length for any split segment (also used by ASR).

max_len
float, defaults to 40.0

Maximum length of audio segments for ASR processing (seconds).

model_name
str, defaults to 'nvidia/parakeet-tdt_ctc-1.1b'

Pretrained NeMo ASR model name.

model_path
str | None, defaults to None

Local model file path (overrides model_name if set).

is_fastconformer
bool, defaults to True

Whether the model encoder is FastConformer.

decoder_type
str, defaults to 'rnnt'

Decoder type: "ctc" or "rnnt".

batch_size
int, defaults to 100

Entries per processing chunk in ASR.

transcribe_batch_size
int, defaults to 32

Batch size passed to the ASR model's transcribe call.

split_batch_size
int, defaults to 5000

Max entries/paths per batch when chunking segments.

num_workers
int, defaults to 10

Data-loading workers for ASR inference.

infer_segment_only
bool, defaults to False

If True, run ASR only on individual segments rather than full audio / meta-entries.

compute_timestamps
bool, defaults to True

Whether to compute word-level timestamps.

timestamp_type
str, defaults to 'word'

Timestamp granularity ("word" or "char").

text_key
str, defaults to 'text'

Output key for predicted text.

words_key
str, defaults to 'words'

Output key for word-level alignments.

disable_word_confidence
bool, defaults to False

Whether to disable word confidence scores.

segments_key
str, defaults to 'segments'

Key for the segments list in each manifest entry.

batch_size: int = 100
compute_timestamps: bool = True
decoder_type: str = 'rnnt'
disable_word_confidence: bool = False
infer_segment_only: bool = False
is_fastconformer: bool = True
max_len: float = 40.0
min_len: float = 1.0
model_name: str = 'nvidia/parakeet-tdt_ctc-1.1b'
model_path: str | None = None
name: str = 'SplitASRAlignJoin'
num_workers: int = 10
segments_key: str = 'segments'
split_batch_size: int = 5000
suggested_max_len: float = 3600.0
text_key: str = 'text'
timestamp_type: str = 'word'
transcribe_batch_size: int = 32
words_key: str = 'words'

nemo_curator.stages.audio.tagging.split.SplitASRAlignJoinStage.__post_init__() -> None

nemo_curator.stages.audio.tagging.split.SplitASRAlignJoinStage.decompose() -> list[nemo_curator.stages.base.ProcessingStage]

class nemo_curator.stages.audio.tagging.split.SplitLongAudioStage(
suggested_max_len: float = 3600.0,
min_len: float = 1.0,
name: str = 'SplitLongAudio'
)
Dataclass

Bases: ProcessingStage[AudioTask, AudioTask]

Stage that splits long audio files into smaller segments.

Processes audio files that exceed a specified maximum length by splitting them at natural pauses to maintain speech coherence.

Parameters:

suggested_max_len
float, defaults to 3600.0

Target maximum length for audio segments in seconds.

min_len
float, defaults to 1.0

Minimum length for any split segment.

min_len: float = 1.0
name: str = 'SplitLongAudio'
suggested_max_len: float = 3600.0

nemo_curator.stages.audio.tagging.split.SplitLongAudioStage._build_split_metadata(
audio_item_id: str,
split_filepaths: list[str],
split_durations: list[float],
fallback: bool = False
) -> list[dict]
staticmethod

Build per-split metadata dicts from filepaths and durations.
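
As a rough picture of what per-split metadata might look like, the sketch below pairs each split file with its duration and a running offset into the original audio. The output keys (`audio_item_id`, `split_index`, `audio_filepath`, `duration`, `offset`) are assumptions chosen for the example, not the fields the library emits, and the `fallback` flag is left out because its meaning is not documented here.

```python
from itertools import accumulate


def build_split_metadata(
    audio_item_id: str,
    split_filepaths: list[str],
    split_durations: list[float],
) -> list[dict]:
    """Build one metadata dict per split file (illustrative sketch only)."""
    if len(split_filepaths) != len(split_durations):
        raise ValueError("split_filepaths and split_durations must have equal length")
    # Offset of each split = sum of the durations of all splits before it.
    offsets = [0.0, *accumulate(split_durations)][:-1]
    return [
        {
            "audio_item_id": audio_item_id,
            "split_index": i,
            "audio_filepath": path,
            "duration": duration,
            "offset": offset,
        }
        for i, (path, duration, offset) in enumerate(
            zip(split_filepaths, split_durations, offsets)
        )
    ]


split_meta = build_split_metadata(
    "rec_001",
    ["rec_001_0.wav", "rec_001_1.wav", "rec_001_2.wav"],
    [10.0, 12.5, 3.0],
)
```

Storing an offset alongside each split is one way to make the later join step a simple addition; the real stage may track this differently.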

nemo_curator.stages.audio.tagging.split.SplitLongAudioStage._do_split(
task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask

Core splitting logic, separated to keep statement count within limits.

nemo_curator.stages.audio.tagging.split.SplitLongAudioStage.get_split_points(
metadata: dict
) -> list[float]

Get the split points for the audio file based on segments.
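
One plausible way to derive split points from segments is to treat the gaps between consecutive segments as candidate cut positions and greedily keep the latest candidate that fits within the target length. The sketch below assumes each segment is a dict with `start` and `end` in seconds and takes the total `duration` as a separate argument; the function name, that segment layout, and the greedy strategy are all assumptions for illustration, not the documented behavior of `get_split_points`.

```python
def pick_split_points(
    segments: list[dict],
    duration: float,
    suggested_max_len: float,
    min_len: float = 1.0,
) -> list[float]:
    """Choose cut times at pauses between segments (illustrative sketch only)."""
    # Candidate cuts: midpoints of the silence between consecutive segments.
    pauses = [
        (a["end"] + b["start"]) / 2
        for a, b in zip(segments, segments[1:])
        if b["start"] > a["end"]
    ]
    points: list[float] = []
    chunk_start = 0.0
    best = None  # most recent usable pause since the last cut
    # The total duration acts as a sentinel so the final chunk is also checked.
    for t in pauses + [duration]:
        if t - chunk_start > suggested_max_len and best is not None:
            points.append(best)
            chunk_start = best
            best = None
        # Only keep pauses that leave the current chunk at least min_len long.
        if t - chunk_start >= min_len:
            best = t
    return points


segments = [
    {"start": 0.0, "end": 4.0},
    {"start": 5.0, "end": 9.0},
    {"start": 10.0, "end": 14.0},
    {"start": 16.0, "end": 20.0},
]
points = pick_split_points(segments, duration=20.0, suggested_max_len=10.0)
```

In this sketch a chunk can still exceed the target when no pause falls inside it, which is consistent with the length being "suggested" rather than a hard limit, and the trailing chunk is not checked against `min_len`.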

nemo_curator.stages.audio.tagging.split.SplitLongAudioStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.audio.tagging.split.SplitLongAudioStage.outputs() -> tuple[list[str], list[str]]

nemo_curator.stages.audio.tagging.split.SplitLongAudioStage.process(
task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask

Process entry to split long audio files.