Video Clipping

Split long videos into shorter clips for downstream processing.

How it Works

NeMo Curator provides two clipping stages: Fixed Stride and TransNetV2 scene-change detection.

  • Use Fixed Stride to create uniform segments.
  • Use TransNetV2 to cut at visual shot boundaries.

Before You Start

Ensure inputs contain video bytes and basic metadata. The clipping stages require video.source_bytes to be present and metadata with framerate and num_frames.


Quickstart

Use either the pipeline stages or the example script flags to create clips.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.clipping.clip_extraction_stages import (
    FixedStrideExtractorStage,
)
from nemo_curator.stages.video.clipping.video_frame_extraction import (
    VideoFrameExtractionStage,
)
from nemo_curator.stages.video.clipping.transnetv2_extraction import (
    TransNetV2ClipExtractionStage,
)

pipe = Pipeline(name="clipping_examples")

# Fixed Stride
pipe.add_stage(
    FixedStrideExtractorStage(
        clip_len_s=10.0,
        clip_stride_s=10.0,
        min_clip_length_s=2.0,
        limit_clips=0,
    )
)

# TransNetV2 (requires full-video frame extraction first)
pipe.add_stage(VideoFrameExtractionStage(decoder_mode="pynvc", verbose=True))
pipe.add_stage(
    TransNetV2ClipExtractionStage(
        model_dir="/models",
        threshold=0.4,
        min_length_s=2.0,
        max_length_s=10.0,
        max_length_mode="stride",
        crop_s=0.5,
        gpu_memory_gb=10,
        limit_clips=-1,
        verbose=True,
    )
)

pipe.run()
```

Clipping Options

Fixed Stride

The FixedStrideExtractorStage steps through the video duration by clip_stride_s, creating spans of length clip_len_s (it truncates the final span at the video end when needed). It filters spans shorter than min_clip_length_s and appends Clip objects identified by source and frame indices.

```python
from nemo_curator.stages.video.clipping.clip_extraction_stages import FixedStrideExtractorStage

stage = FixedStrideExtractorStage(
    clip_len_s=10.0,
    clip_stride_s=10.0,
    min_clip_length_s=2.0,
    limit_clips=0,
)
```
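The stepping logic described above can be sketched as follows. This is an illustrative reimplementation of the described behavior, not the stage's actual code:

```python
def fixed_stride_spans(
    duration_s: float,
    clip_len_s: float,
    clip_stride_s: float,
    min_clip_length_s: float,
) -> list[tuple[float, float]]:
    """Sketch of fixed-stride clipping: step through the duration by
    clip_stride_s, truncate the final span at the video end, and drop
    spans shorter than min_clip_length_s."""
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        if end - start >= min_clip_length_s:
            spans.append((start, end))
        start += clip_stride_s
    return spans
```

For a 25-second video with a 10-second length and stride, this yields spans of 10 s, 10 s, and a truncated 5 s; a trailing span shorter than `min_clip_length_s` would be dropped instead.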

If limit_clips is greater than 0 and the Video already has clips, the stage skips processing entirely; the parameter does not cap the number of clips generated within a single run.

TransNetV2 Scene-Change Detection

TransNetV2 is a shot-boundary detection model that identifies transitions between shots. The stage converts those transitions into scenes, applies length/crop rules, and emits clips aligned to scene boundaries.

Using the extracted frames (each 27×48×3), the model predicts shot transitions, which the stage converts into scenes and filters: scenes shorter than min_length_s are dropped, scenes longer than max_length_s are handled according to max_length_mode ("truncate" or "stride"), and crop_s optionally trims both ends. The stage creates Clip objects for the resulting spans, stops once limit_clips is reached (when greater than 0), and releases the frames from memory after processing.
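The per-scene length and crop rules can be sketched as follows. This is an illustrative reimplementation of the described filtering, not the stage's actual code:

```python
def filter_scene(
    start_s: float,
    end_s: float,
    min_length_s: float,
    max_length_s: float,
    max_length_mode: str,
    crop_s: float,
) -> list[tuple[float, float]]:
    """Sketch of per-scene filtering: crop both ends, drop scenes
    shorter than min_length_s, and handle over-long scenes by
    truncating or striding."""
    start, end = start_s + crop_s, end_s - crop_s
    if end - start < min_length_s:
        return []
    if end - start <= max_length_s:
        return [(start, end)]
    if max_length_mode == "truncate":
        # Keep only the first max_length_s of the scene
        return [(start, start + max_length_s)]
    # "stride": split the scene into consecutive max_length_s windows,
    # keeping a trailing remainder only if it meets min_length_s
    spans = []
    s = start
    while s < end:
        e = min(s + max_length_s, end)
        if e - s >= min_length_s:
            spans.append((s, e))
        s += max_length_s
    return spans
```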

  1. Run VideoFrameExtractionStage first to populate video.frame_array.

    ```python
    from nemo_curator.stages.video.clipping.video_frame_extraction import VideoFrameExtractionStage
    from nemo_curator.stages.video.clipping.transnetv2_extraction import TransNetV2ClipExtractionStage

    frame_extractor = VideoFrameExtractionStage(
        decoder_mode="pynvc",  # or "ffmpeg_gpu", "ffmpeg_cpu"
        verbose=True,
    )
    ```

    Each frame must be shaped (27, 48, 3); the stage accepts arrays shaped (num_frames, 27, 48, 3) and automatically transposes frames shaped (48, 27, 3).

  2. Configure TransNetV2 and run the stage in your pipeline to generate clips from the detected scenes.

    ```python
    transnet = TransNetV2ClipExtractionStage(
        model_dir="/models",
        threshold=0.4,
        min_length_s=2.0,
        max_length_s=10.0,
        max_length_mode="stride",  # or "truncate"
        crop_s=0.5,
        gpu_memory_gb=10,
        limit_clips=-1,
        verbose=True,
    )
    ```