Adding Custom Code

Learn how to extend NeMo Curator by adding custom code to a new or existing stage.

The NeMo Curator container includes a robust set of default pipelines with commonly used stages. If they do not meet your requirements, extend them with your own modules.

Before You Start

Before you begin adding custom code, make sure that you have:


How to Add Custom Code

Define New Functionality

  1. Create a custom_code directory anywhere on your system to organize your custom pipeline code.

  2. Create a new folder for your environment, for example: new_stage/.

  3. Create a new file, for example my_file.py. This file must define a class, for example MyClass, that other modules can import.

    # your code here
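  4. Import the class in your stage or pipeline code to use it. The module path in the import statement must match where the file lives on your Python path; the example here assumes the file is importable as my_code.my_file.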

    from my_code.my_file import MyClass

    ...
  5. Save the files.
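
The steps above can be sketched end to end in Python. This is a minimal illustration, not the layout NeMo Curator requires: a temporary directory stands in for the directory that holds your code, the package is named my_code to match the import statement in step 4, and the body of MyClass (including its keep method and min_seconds parameter) is hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical helper source; replace the body with your own logic.
MY_FILE_SOURCE = '''
class MyClass:
    """Hypothetical helper that decides whether a clip is long enough to keep."""

    def __init__(self, min_seconds: float = 2.0) -> None:
        self.min_seconds = min_seconds

    def keep(self, start: float, end: float) -> bool:
        return (end - start) >= self.min_seconds
'''


def create_package(root: Path) -> Path:
    """Create root/my_code/ containing __init__.py and my_file.py."""
    package_dir = root / "my_code"
    package_dir.mkdir(parents=True, exist_ok=True)
    (package_dir / "__init__.py").write_text("")
    (package_dir / "my_file.py").write_text(MY_FILE_SOURCE)
    return package_dir


# A temporary directory stands in for the directory that holds your code.
root = Path(tempfile.mkdtemp())
create_package(root)

# The parent directory of my_code/ must be on the import path.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

from my_code.my_file import MyClass  # noqa: E402

helper = MyClass(min_seconds=2.0)
print(helper.keep(0.0, 5.0))  # True: the clip lasts 5 seconds
```

Instead of modifying sys.path in code, you can add the parent directory to the PYTHONPATH environment variable or install your code as a package; either way, the import only works if the parent of my_code/ is discoverable.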

Use Your Code in a Pipeline

Create or edit a stage to use your code, then assemble a pipeline and run it in Python:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.base import ProcessingStage
from nemo_curator.tasks.video import VideoTask
from nemo_curator.stages.video.io.video_reader import VideoReader
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

from my_code.my_file import MyClass

class MyStage(ProcessingStage[VideoTask, VideoTask]):
    def process(self, task: VideoTask) -> VideoTask | list[VideoTask]:
        helper = MyClass()
        # use helper with task.data (Video/Clips)
        return task

pipeline = (
    Pipeline(name="my-pipeline")
    .add_stage(VideoReader(input_video_path="/path/to/videos", video_limit=10))
    .add_stage(MyStage())
    .add_stage(
        ClipWriterStage(
            output_path="/path/to/output",
            input_path="/path/to/videos",
            upload_clips=True,
            dry_run=False,
            generate_embeddings=False,
            generate_previews=False,
            generate_captions=False,
            embedding_algorithm="cosmos-embed1",
            caption_models=["qwen"],
            enhanced_caption_models=["qwen_lm"],
        )
    )
)

pipeline.run()
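
One way to keep a stage easy to test is to put the real logic in your helper class and have the stage's process method only delegate to it. The sketch below illustrates that idea without importing NeMo Curator. FakeClip and FakeVideo are hypothetical stand-ins, not NeMo Curator types, and the attribute names on the real task.data object may differ; filter_clips and min_seconds are likewise made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeClip:
    """Hypothetical stand-in for a clip: start and end times in seconds."""
    start: float
    end: float


@dataclass
class FakeVideo:
    """Hypothetical stand-in for the object a stage would find in task.data."""
    clips: list[FakeClip] = field(default_factory=list)


class MyClass:
    """Hypothetical helper that drops clips shorter than min_seconds."""

    def __init__(self, min_seconds: float = 2.0) -> None:
        self.min_seconds = min_seconds

    def keep(self, clip: FakeClip) -> bool:
        return (clip.end - clip.start) >= self.min_seconds

    def filter_clips(self, video: FakeVideo) -> FakeVideo:
        # Rebuild the clip list in place, keeping only clips that pass the check.
        video.clips = [clip for clip in video.clips if self.keep(clip)]
        return video


# Exercise the helper on its own, before wiring it into a stage.
video = FakeVideo(clips=[FakeClip(0.0, 1.0), FakeClip(1.0, 6.0)])
result = MyClass(min_seconds=2.0).filter_clips(video)
print(len(result.clips))  # 1: only the 5-second clip remains
```

Inside a stage, process would then call the helper on task.data and return the task, so the same helper can be covered by plain unit tests that never start a pipeline.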

To run your code in a container, write a Dockerfile that copies your code into the image and installs its dependencies, then build and run the image with your preferred tooling. Where possible, align your package versions with the optional extras declared in pyproject.toml.

Next Steps

Now that you have created custom code, you can create a custom stage that uses your code.