
Preprocessing Stages

Three lightweight stages handle the common audio plumbing tasks: collapsing channels, joining segments after filtering, and projecting filtered timestamps back to the original input file. Together they form the scaffolding around the heavier filtering stages — mono conversion runs first, segment concatenation re-merges surviving segments after filtering, and timestamp mapping closes the loop by projecting final boundaries back to source-file positions.

Stage Roles

| Stage | When | Job |
|---|---|---|
| MonoConversionStage | First | Normalize multi-channel input to mono and verify (or resample to) the target sample rate. |
| SegmentConcatenationStage | After filters | Concatenate surviving filtered segments back into one waveform with configurable silence between them. |
| TimestampMapperStage | Last | Resolve final segment positions in the concatenated waveform back to positions in the original source file. |
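
As a minimal sketch of that ordering (the VAD and quality-filter stages in the middle are elided here; the complete example at the end of this page fills them in):

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.preprocessing.concatenation import SegmentConcatenationStage
from nemo_curator.stages.audio.postprocessing.timestamp_mapper import TimestampMapperStage

pipeline = Pipeline(name="audio_scaffolding")

# First: normalize channels and sample rate
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))

# ... VAD and quality-filter stages go here ...

# After filters: re-merge the surviving segments
pipeline.add_stage(SegmentConcatenationStage(silence_duration_sec=0.5))

# Last: resolve boundaries back to the original source file
pipeline.add_stage(TimestampMapperStage())
```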

MonoConversionStage

Converts multi-channel audio to mono and verifies that the input sample rate matches output_sample_rate. Place it as the first stage in any quality-filtering pipeline so downstream stages can assume a consistent waveform shape.

Usage

```python
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage

mono = MonoConversionStage(
    output_sample_rate=48000,
    audio_filepath_key="audio_filepath",
    strict_sample_rate=True,
)

pipeline.add_stage(mono)
```

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| output_sample_rate | int | 48000 | Required input sample rate. When strict_sample_rate=True, mismatched inputs raise; otherwise they are auto-resampled. |
| audio_filepath_key | str | "audio_filepath" | Manifest field containing the audio file path. |
| strict_sample_rate | bool | True | If True, raise on rate mismatch instead of resampling. |

Choosing strict_sample_rate

| Mode | Behavior | Use Case |
|---|---|---|
| strict_sample_rate=True (default) | Raise on rate mismatch | Production pipelines with known-good input. Surfaces unexpected data formats early. |
| strict_sample_rate=False | Auto-resample to output_sample_rate | Heterogeneous web crawls or mixed datasets where rate variation is expected. |

Set output_sample_rate=48000 for full-band audio, 16000 for narrow-band / telephony, or match your downstream model’s training rate.
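
For example, a mixed web-crawl corpus feeding a narrow-band model could relax the check and standardize on 16 kHz. The values below are illustrative; the stage parameters are the same as in the usage example above:

```python
# Heterogeneous sources: auto-resample everything to 16 kHz instead of raising
mono_narrowband = MonoConversionStage(
    output_sample_rate=16000,
    strict_sample_rate=False,
)
pipeline.add_stage(mono_narrowband)
```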

SegmentConcatenationStage

Concatenates a list of speech segments produced by an earlier VAD/filter stage back into a single waveform with configurable silence between segments. Emits a mappings field that records the original-file boundaries of each segment so TimestampMapperStage can resolve final timestamps later.

Usage

```python
from nemo_curator.stages.audio.preprocessing.concatenation import SegmentConcatenationStage

concat = SegmentConcatenationStage(silence_duration_sec=0.5)
pipeline.add_stage(concat)
```

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| silence_duration_sec | float | 0.5 | Silence inserted between concatenated segments, in seconds. |

Output Mappings

After concatenation, each output AudioTask carries a mappings field — a list of dicts with one entry per concatenated segment:

```python
{
    "original_file": "audio.wav",
    "original_start_ms": 1500,  # boundaries in the source file
    "original_end_ms": 4500,
    "concat_start_ms": 0,       # position in the concatenated waveform
    "concat_end_ms": 3000,
    "segment_index": 0,
}
```

The mappings list is what TimestampMapperStage uses to project final filtered boundaries back to the original source file.
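
Conceptually, the projection is a per-segment offset: a position t inside a concatenated segment maps back to original_start_ms + (t - concat_start_ms). A hypothetical helper (not part of the library) to illustrate the arithmetic:

```python
def to_original_ms(t_concat_ms: int, mappings: list[dict]) -> tuple[str, int]:
    """Illustrative only: map a position in the concatenated waveform back to the source file."""
    for m in mappings:
        if m["concat_start_ms"] <= t_concat_ms <= m["concat_end_ms"]:
            offset = t_concat_ms - m["concat_start_ms"]
            return m["original_file"], m["original_start_ms"] + offset
    raise ValueError("position falls in the inserted silence between segments")

# With the mapping entry above, 1200 ms into the concatenated waveform
# is 1200 ms into segment 0, i.e. 1500 + 1200 = 2700 ms in audio.wav.
```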

Choosing silence_duration_sec

| Value | Use Case |
|---|---|
| 0.0 | Tightest packing; useful when downstream consumes a contiguous waveform without segment markers. |
| 0.5 (default) | Balanced: enough silence to separate segments cleanly without bloating the waveform. |
| 1.0–2.0 | Useful for downstream diarization or model training where natural inter-segment silence helps the model. |
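
The value you choose also shifts each segment's concat_* boundaries, since the silence sits between consecutive segments. A quick sketch of the offset arithmetic, assuming silence is inserted only between segments (durations illustrative):

```python
silence_ms = 500  # silence_duration_sec=0.5
segment_durations_ms = [3000, 2000, 4000]  # illustrative surviving segments

start_ms = 0
for i, dur in enumerate(segment_durations_ms):
    print(f"segment {i}: concat_start_ms={start_ms}, concat_end_ms={start_ms + dur}")
    start_ms += dur + silence_ms
# segment 0: 0-3000, segment 1: 3500-5500, segment 2: 6000-10000
```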

TimestampMapperStage

Resolves segment positions in the concatenated waveform back to positions in the original source file. Place it at the end of the pipeline so downstream consumers see timestamps relative to the input audio, not the intermediate concatenation.

Usage

```python
from nemo_curator.stages.audio.postprocessing.timestamp_mapper import TimestampMapperStage

mapper = TimestampMapperStage(passthrough_keys=["speaker_id", "duration_sec"])
pipeline.add_stage(mapper)
```

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| passthrough_keys | list[str] \| None | None | Manifest keys to copy from input to output unchanged. Useful when later stages add fields (speaker_id, scores) that should travel with the mapped timestamps. |

Why Pass-Through Keys Matter

After a chain like Concat → SpeakerSep → VAD → UTMOS, each segment carries fields added by intermediate stages (speaker_id from speaker separation, utmos_mos from UTMOS, etc.). Without passthrough_keys, TimestampMapperStage only writes the resolved timestamps and drops everything else. List the fields you need preserved:

```python
TimestampMapperStage(
    passthrough_keys=[
        "speaker_id",
        "num_speakers",
        "utmos_mos",
        "sigmos_noise",
        "sigmos_ovrl",
    ]
)
```

Complete Preprocessing Example

A pipeline that uses all three stages together with VAD + UTMOS in between:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage
from nemo_curator.stages.audio.preprocessing.concatenation import SegmentConcatenationStage
from nemo_curator.stages.audio.postprocessing.timestamp_mapper import TimestampMapperStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="audio_preprocessing")

# 1. Normalize channels and sample rate
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))

# 2. Segment into speech chunks
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0))

# 3. Quality filter
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))

# 4. Concatenate surviving segments
pipeline.add_stage(SegmentConcatenationStage(silence_duration_sec=0.5))

# 5. Resolve final boundaries back to source-file timestamps
pipeline.add_stage(
    TimestampMapperStage(passthrough_keys=["utmos_mos"])
)

# 6. Export
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./preprocessed_audio"))

executor = XennaExecutor()
pipeline.run(executor)
```

Best Practices

  • Mono first, always: every downstream stage assumes a consistent waveform shape. MonoConversionStage is mandatory at the start of any pipeline that uses VAD, UTMOS, SIGMOS, or speaker separation.
  • Use strict_sample_rate=True until you have evidence it’s wrong: catching unexpected rates early is better than silently resampling and getting subtly worse results downstream.
  • Pass through fields explicitly: TimestampMapperStage is the closing stage — list everything you want preserved in passthrough_keys. It’s easier than adding a downstream stage to merge them back.
  • Skip concatenation if you want individual-segment manifests: if your downstream training pipeline reads one segment at a time, you don’t need to concatenate. Run VAD → quality filters → writer, and skip both SegmentConcatenationStage and TimestampMapperStage (see the sketch below).
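
A sketch of that segment-level variant, reusing the stages and export steps from the complete example above (parameters illustrative):

```python
pipeline = Pipeline(name="audio_segment_manifests")

pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0))
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))

# One manifest entry per surviving segment; no SegmentConcatenationStage
# or TimestampMapperStage needed.
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./segment_manifests"))

pipeline.run(XennaExecutor())
```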