Load Data | NeMo Curator

Load video data for curation using NeMo Curator.

How it Works

NeMo Curator loads videos with a composite stage that discovers files and extracts metadata:

VideoReader is a composite stage that breaks down into two stages:

Partitioning (list files) stage

- Local paths use FilePartitioningStage to list files.
- Remote URLs (for example, s3://, gcs://) use ClientPartitioningStage, backed by fsspec.
- The optional input_list_json_path allows an explicit file list under a root prefix.

Reader stage (VideoReaderStage)

- Downloads the bytes (local or via FSPath) for each listed file.
- Calls video.populate_metadata() to extract resolution, fps, duration, encoding format, and other fields.
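The choice between the two partitioning stages can be pictured as a dispatch on the input path. The sketch below is illustrative only: `choose_partitioning_stage` is a hypothetical helper, not part of NeMo Curator, and the rule it applies (any path containing `://` counts as remote) is an assumption about how paths are classified.

```python
def choose_partitioning_stage(input_video_path: str) -> str:
    """Return the name of the partitioning stage that would list the files.

    Hypothetical helper for illustration; the "://" test is an assumption.
    """
    if "://" in input_video_path:
        # Remote URL such as s3:// or gcs://, listed through fsspec.
        return "ClientPartitioningStage"
    # Plain filesystem path.
    return "FilePartitioningStage"


print(choose_partitioning_stage("s3://my-bucket/videos/"))
print(choose_partitioning_stage("/data/videos"))
```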

You can set:

- video_limit to cap the number of files processed; use None for no limit.
- verbose=True to log detailed per-video information.
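To make the video_limit semantics concrete, here is a minimal sketch of a cap where None means "no limit". The helper `apply_video_limit` is hypothetical and only mirrors the behavior described above; it is not how NeMo Curator necessarily implements the option.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def apply_video_limit(paths: Iterable[str], video_limit: Optional[int]) -> Iterator[str]:
    """Yield at most video_limit paths; None yields every path."""
    # islice treats a stop value of None as "no upper bound".
    return islice(paths, video_limit)


files = ["a.mp4", "b.mp4", "c.mp4"]
print(list(apply_video_limit(files, 2)))
print(list(apply_video_limit(files, None)))
```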

Local and Cloud

Use VideoReader to load videos from local paths or remote URLs.

Example

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.io.video_reader import VideoReader

pipe = Pipeline(name="video_read", description="Read videos and extract metadata")
# input_video_path can be a local directory or a remote URL such as s3://.
pipe.add_stage(VideoReader(input_video_path="s3://my-bucket/videos/", video_limit=None, verbose=True))
pipe.run()

Explicit File List (JSON)

For remote datasets, ClientPartitioningStage can take an explicit file list supplied as a JSON array of paths. Each entry must be an absolute path under the specified root.

JSON Format

[
  "s3://my-bucket/datasets/videos/video1.mp4",
  "s3://my-bucket/datasets/videos/video2.mkv",
  "s3://my-bucket/datasets/more_videos/video3.webm"
]

If any entry is outside the root, the stage raises an error.
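It can be useful to check a list before submitting a pipeline. The following is a minimal sketch of such a pre-check, assuming the list is a flat JSON array of strings as shown above. `validate_file_list` is a hypothetical helper, and raising ValueError is an assumption; the stage's actual error type is not stated here.

```python
import json


def validate_file_list(json_text: str, root: str) -> list[str]:
    """Parse a JSON array of paths and reject entries outside root."""
    entries = json.loads(json_text)
    # Normalize the root so "s3://b/data" does not match "s3://b/database".
    prefix = root if root.endswith("/") else root + "/"
    outside = [p for p in entries if not p.startswith(prefix)]
    if outside:
        raise ValueError(f"Entries outside root {root!r}: {outside}")
    return entries


listing = '["s3://my-bucket/datasets/videos/video1.mp4"]'
print(validate_file_list(listing, "s3://my-bucket/datasets/"))
```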

Example

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.client_partitioning import ClientPartitioningStage
from nemo_curator.stages.video.io.video_reader import VideoReaderStage

# Every entry in the JSON list must be under ROOT.
ROOT = "s3://my-bucket/datasets/"
JSON_LIST = "s3://my-bucket/lists/videos.json"

pipe = Pipeline(name="video_read_json_list", description="Read specific videos via JSON list")
# Partitioning stage: lists the files named in JSON_LIST.
pipe.add_stage(
    ClientPartitioningStage(
        file_paths=ROOT,
        input_list_json_path=JSON_LIST,
        files_per_partition=1,
        file_extensions=[".mp4", ".mov", ".avi", ".mkv", ".webm"],
    )
)
# Reader stage: downloads each listed file and extracts its metadata.
pipe.add_stage(VideoReaderStage(verbose=True))
pipe.run()

Supported File Types

By default, the loader keeps only files with these video extensions:

.mp4
.mov
.avi
.mkv
.webm
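As a rough illustration of extension filtering, the sketch below keeps only paths whose suffix is in the default set. `is_supported_video` is a hypothetical helper, and lowercasing the suffix before comparison is an assumption of this sketch; the library's own matching may be case-sensitive.

```python
from pathlib import PurePosixPath

VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv", ".webm"}


def is_supported_video(path: str) -> bool:
    """True when the path ends in one of the default video extensions."""
    # Lowercasing the suffix is an assumption of this sketch.
    return PurePosixPath(path).suffix.lower() in VIDEO_EXTENSIONS


candidates = ["s3://my-bucket/videos/clip.mp4", "s3://my-bucket/videos/notes.txt"]
print([p for p in candidates if is_supported_video(p)])
```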

Metadata on Load

After a successful read, the loader populates the following metadata fields for each video:

size (bytes)
width, height
framerate
num_frames
duration (seconds)
video_codec, pixel_format, audio_codec
bit_rate_k
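For downstream checks, the fields above can be mirrored in a small record. The sketch below is illustrative: `VideoMetadataSketch` is a hypothetical class, the field types are assumptions, the sample values are made up, and the consistency check assumes a constant frame rate.

```python
from dataclasses import dataclass


@dataclass
class VideoMetadataSketch:
    """Illustrative container; field names follow the list above, types are assumed."""

    size: int  # bytes
    width: int
    height: int
    framerate: float
    num_frames: int
    duration: float  # seconds
    video_codec: str
    pixel_format: str
    audio_codec: str
    bit_rate_k: int

    def duration_is_consistent(self, tolerance: float = 0.5) -> bool:
        """Compare duration with num_frames / framerate (constant frame rate assumed)."""
        if self.framerate <= 0:
            return False
        return abs(self.num_frames / self.framerate - self.duration) <= tolerance


# Made-up sample values, for illustration only.
meta = VideoMetadataSketch(
    size=1_048_576, width=1920, height=1080, framerate=30.0, num_frames=300,
    duration=10.0, video_codec="h264", pixel_format="yuv420p",
    audio_codec="aac", bit_rate_k=800,
)
print(meta.duration_is_consistent())
```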

With verbose=True, the loader logs size, resolution, fps, duration, weight, and bit rate for each processed video.