Split long videos into shorter clips for downstream processing.
NeMo Curator provides two clipping stages: Fixed Stride and TransNetV2 scene-change detection.
Ensure inputs contain video bytes and basic metadata. The clipping stages require video.source_bytes to be present, along with metadata that includes framerate and num_frames.
Use either the pipeline stages or the example script flags to create clips.
The FixedStrideExtractorStage steps through the video duration in increments of clip_stride_s, creating spans of length clip_len_s and truncating the final span at the end of the video when needed. It discards spans shorter than min_clip_length_s and appends a Clip object, identified by source and frame indices, for each remaining span.
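The stepping and filtering described above can be sketched in plain Python. This is a minimal illustration, not the stage's actual implementation: the Span class and the functions fixed_stride_spans and duration_from_metadata are hypothetical names, and deriving the duration as num_frames / framerate is an assumption. Only clip_len_s, clip_stride_s, and min_clip_length_s come from the stage's parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a clip span, expressed in seconds."""
    start_s: float
    end_s: float


def duration_from_metadata(num_frames: int, framerate: float) -> float:
    """Assumed way to derive the video duration from the required metadata."""
    return num_frames / framerate


def fixed_stride_spans(
    duration_s: float,
    clip_len_s: float,
    clip_stride_s: float,
    min_clip_length_s: float,
) -> list[Span]:
    """Step through the video by clip_stride_s and emit clip_len_s spans."""
    if clip_len_s <= 0 or clip_stride_s <= 0:
        raise ValueError("clip_len_s and clip_stride_s must be positive")
    spans = []
    index = 0
    while index * clip_stride_s < duration_s:
        start = index * clip_stride_s
        # Truncate the final span at the end of the video.
        end = min(start + clip_len_s, duration_s)
        # Drop spans that are shorter than the minimum clip length.
        if end - start >= min_clip_length_s:
            spans.append(Span(start, end))
        index += 1
    return spans


if __name__ == "__main__":
    duration = duration_from_metadata(num_frames=750, framerate=30.0)
    for span in fixed_stride_spans(duration, 10.0, 10.0, 2.0):
        print(span)
```

With a 25-second video, a 10-second clip length, and a 10-second stride, the last span covers only 20 s to 25 s. It is kept when min_clip_length_s is 2 and dropped when min_clip_length_s is 6.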
If limit_clips is greater than 0 and the Video already has clips, the stage skips processing. It does not cap the number of clips generated within the same run.
TransNetV2 is a shot-boundary detection model that identifies transitions between shots. The stage converts those transitions into scenes, applies length/crop rules, and emits clips aligned to scene boundaries.
The model runs on extracted frames of size 27×48×3 and predicts shot transitions, which the stage converts into scenes. The stage then filters the scenes by min_length_s and max_length_s (with max_length_mode set to “truncate” or “stride”) and optionally crops crop_s from both ends. It creates Clip objects for the resulting spans, stops once it reaches limit_clips (when limit_clips is greater than 0), and releases the frames from memory after processing.
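The scene post-processing can be illustrated with a short sketch. It is not the stage's code: scenes_from_cuts and filter_scenes are hypothetical helpers, times are in seconds rather than frame indices, and the order of operations (crop, then split or truncate, then apply the minimum length) is an assumption. The sketch also assumes that “truncate” keeps only the first max_length_s of a long scene, while “stride” splits it into consecutive pieces of at most max_length_s.

```python
def scenes_from_cuts(cut_times_s, duration_s):
    """Turn cut timestamps into consecutive (start, end) scenes."""
    inner = [t for t in sorted(cut_times_s) if 0.0 < t < duration_s]
    bounds = [0.0] + inner + [duration_s]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


def filter_scenes(scenes, min_length_s, max_length_s,
                  max_length_mode="stride", crop_s=0.0, limit_clips=0):
    """Apply crop, max-length, min-length, and clip-limit rules (assumed order)."""
    if max_length_mode not in ("truncate", "stride"):
        raise ValueError("max_length_mode must be 'truncate' or 'stride'")
    if max_length_s <= 0:
        raise ValueError("max_length_s must be positive")
    clips = []
    for start, end in scenes:
        # Crop crop_s from both ends of the scene.
        start, end = start + crop_s, end - crop_s
        if end <= start:
            continue
        if max_length_mode == "truncate":
            # Keep only the first max_length_s of a long scene.
            pieces = [(start, min(end, start + max_length_s))]
        else:
            # Split a long scene into consecutive pieces.
            pieces = []
            cursor = start
            while cursor < end:
                pieces.append((cursor, min(end, cursor + max_length_s)))
                cursor += max_length_s
        for piece_start, piece_end in pieces:
            if piece_end - piece_start < min_length_s:
                continue
            clips.append((piece_start, piece_end))
            # Stop early once limit_clips is reached (0 means no limit).
            if limit_clips > 0 and len(clips) >= limit_clips:
                return clips
    return clips


if __name__ == "__main__":
    scenes = scenes_from_cuts([4.0, 30.0], duration_s=35.0)
    print(filter_scenes(scenes, 2.0, 10.0, "stride", crop_s=0.5))
    print(filter_scenes(scenes, 2.0, 10.0, "truncate", crop_s=0.5))
```

In this example the 26-second middle scene yields three clips in “stride” mode and one clip in “truncate” mode, which shows how the choice of max_length_mode affects the number of clips produced.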
Run VideoFrameExtractionStage first to populate video.frame_array.
Each frame must have shape (27, 48, 3). The stage accepts arrays shaped (num_frames, 27, 48, 3) and automatically transposes frames supplied as (48, 27, 3).
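To check a frame array before passing it to the stage, a shape validation along these lines may help. This is a sketch using NumPy only: normalize_frames is a hypothetical helper, not part of NeMo Curator, and it assumes the automatic transpose amounts to swapping the height and width axes.

```python
import numpy as np

EXPECTED_FRAME_SHAPE = (27, 48, 3)  # height, width, channels


def normalize_frames(frames):
    """Return frames shaped (num_frames, 27, 48, 3), or raise ValueError."""
    frames = np.asarray(frames)
    if frames.ndim != 4:
        raise ValueError(f"expected a 4-D array, got shape {frames.shape}")
    if frames.shape[1:] == EXPECTED_FRAME_SHAPE:
        return frames
    if frames.shape[1:] == (48, 27, 3):
        # Assumed fix-up: swap the height and width axes.
        return frames.transpose(0, 2, 1, 3)
    raise ValueError(f"unsupported frame shape {frames.shape[1:]}")


if __name__ == "__main__":
    swapped = np.zeros((5, 48, 27, 3), dtype=np.uint8)
    print(normalize_frames(swapped).shape)
```

Running the check right after frame extraction surfaces a mismatched extraction size before the model runs.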
Configure TransNetV2 and run the stage in your pipeline to generate clips from the detected scenes.