Video Data Loading

Load video data for curation using NeMo Curator.

How it Works

NeMo Curator loads videos with a composite stage that discovers files and extracts metadata:

VideoReader decomposes into a partitioning stage plus a reader stage.
Local paths use FilePartitioningStage to list files; remote URLs (for example, s3://, gcs://, http(s)://) use ClientPartitioningStage backed by fsspec.
For remote datasets, you can optionally supply an explicit file list using ClientPartitioningStage.input_list_json_path.
VideoReaderStage reads each video's bytes (from local storage or through FSPath) and calls video.populate_metadata() to extract resolution, frame rate, duration, encoding format, and other fields.
Set video_limit to cap discovery; use None for unlimited. Set verbose=True to log detailed per-video information.
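The routing rule described above can be sketched in a few lines. This is an illustration only, not NeMo Curator's implementation: the helper name pick_partitioning_stage and the scheme check are assumptions, and only the two stage names come from the text.

```python
from urllib.parse import urlparse


def pick_partitioning_stage(input_video_path: str) -> str:
    """Return the name of the partitioning stage the text associates with a path.

    Hypothetical helper: paths with a URL scheme (s3://, gcs://, https://, ...)
    are treated as remote; everything else is treated as a local path.
    """
    scheme = urlparse(input_video_path).scheme
    # A one-letter "scheme" is a Windows drive letter (C:\...), not a protocol,
    # and file:// still points at local storage.
    is_remote = len(scheme) > 1 and scheme != "file"
    return "ClientPartitioningStage" if is_remote else "FilePartitioningStage"


print(pick_partitioning_stage("/data/videos/"))
print(pick_partitioning_stage("s3://my-bucket/videos/"))
```

The actual VideoReader may detect remote paths differently; the sketch only shows that the choice depends on the form of the input path.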

Local and Cloud

Use VideoReader to load videos from local paths or remote URLs.

Local Paths

Examples: /data/videos/, /mnt/datasets/av/
Uses FilePartitioningStage to recursively discover files.
Filters by extensions: .mp4, .mov, .avi, .mkv, .webm.
Set video_limit to cap discovery during testing (None means unlimited).
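To make the local discovery behavior concrete, here is a minimal standard-library sketch of recursive listing with an extension filter and an optional cap. It is not the FilePartitioningStage code: the function name discover_videos, the sorted ordering, and the case-insensitive extension match are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

# Extensions listed in this document.
VIDEO_EXTENSIONS = (".mp4", ".mov", ".avi", ".mkv", ".webm")


def discover_videos(root: str, video_limit: Optional[int] = None) -> list[str]:
    """Recursively list video files under root, optionally capped at video_limit."""
    matches = sorted(
        str(path)
        for path in Path(root).rglob("*")
        # Assumption: extensions are compared case-insensitively.
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS
    )
    # None means unlimited; otherwise keep only the first video_limit entries.
    return matches if video_limit is None else matches[:video_limit]
```

Calling discover_videos("/data/videos/", video_limit=10) would return at most ten paths, which mirrors how a small video_limit keeps test runs short.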

Remote Paths

Examples: s3://bucket/path/, gcs://bucket/path/, https://host/path/, and other fsspec-supported protocols such as s3a:// and abfs://.
Uses ClientPartitioningStage backed by fsspec to list files.
Optional input_list_json_path allows explicit file lists under a root prefix.
Wraps entries as FSPath for efficient byte access during reading.

Use an object storage prefix (for example, s3://my-bucket/videos/) to stream from cloud storage. Configure credentials in your environment or client configuration.

Example

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.io.video_reader import VideoReader

pipe = Pipeline(name="video_read", description="Read videos and extract metadata")
# video_limit=None discovers every matching file; verbose=True logs per-video details.
pipe.add_stage(VideoReader(input_video_path="s3://my-bucket/videos/", video_limit=None, verbose=True))
pipe.run()

Explicit File List (JSON)

For remote datasets, ClientPartitioningStage can read an explicit list of files from a JSON file. The file holds a JSON array of strings, and each entry must be an absolute path under the specified root.

JSON Format

[
  "s3://my-bucket/datasets/videos/video1.mp4",
  "s3://my-bucket/datasets/videos/video2.mkv",
  "s3://my-bucket/datasets/more_videos/video3.webm"
]

If any entry is outside the root, the stage raises an error.
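It can help to check a file list before launching a pipeline. The sketch below approximates the root check with a plain prefix comparison. The function name validate_file_list and the use of ValueError are assumptions: the exact error type and matching rules used by ClientPartitioningStage are not stated here.

```python
import json


def validate_file_list(json_text: str, root: str) -> list[str]:
    """Parse a JSON array of paths and check that each one sits under root."""
    entries = json.loads(json_text)
    if not isinstance(entries, list):
        raise ValueError("file list must be a JSON array of path strings")
    # Normalize the root so "s3://bucket/data" does not match "s3://bucket/data2/...".
    prefix = root if root.endswith("/") else root + "/"
    outside = [e for e in entries if not (isinstance(e, str) and e.startswith(prefix))]
    if outside:
        raise ValueError(f"entries outside root {root!r}: {outside}")
    return entries


file_list = '["s3://my-bucket/datasets/videos/video1.mp4"]'
print(validate_file_list(file_list, "s3://my-bucket/datasets/"))
```

Running such a check locally surfaces misplaced entries before any remote listing or download starts.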

Example

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.client_partitioning import ClientPartitioningStage
from nemo_curator.stages.video.io.video_reader import VideoReaderStage

ROOT = "s3://my-bucket/datasets/"
JSON_LIST = "s3://my-bucket/lists/videos.json"

pipe = Pipeline(name="video_read_json_list", description="Read specific videos via JSON list")
# Partition only the files named in the JSON list; every entry must sit under ROOT.
pipe.add_stage(
    ClientPartitioningStage(
        file_paths=ROOT,
        input_list_json_path=JSON_LIST,
        files_per_partition=1,
        file_extensions=[".mp4", ".mov", ".avi", ".mkv", ".webm"],
    )
)
# Read each file's bytes and populate its metadata.
pipe.add_stage(VideoReaderStage(verbose=True))
pipe.run()

Supported File Types

By default, the loader keeps only files with these video extensions:

.mp4
.mov
.avi
.mkv
.webm

Metadata on Load

After a successful read, the loader populates the following metadata fields for each video:

size (bytes)
width, height
framerate
num_frames
duration (seconds)
video_codec, pixel_format, audio_codec
bit_rate_k

With verbose=True, the loader logs size, resolution, fps, duration, weight, and bit rate for each processed video.
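As an illustration of how these fields might be used downstream, the sketch below mirrors them in a small dataclass and applies a simple quality check. Everything here is hypothetical: the class name VideoMetadataSketch, the field types, the is_usable helper, and its thresholds are assumptions, not part of NeMo Curator. Only the field names come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VideoMetadataSketch:
    """Hypothetical container mirroring the fields listed above; types are assumed."""

    size: int  # bytes
    width: int
    height: int
    framerate: float
    num_frames: int
    duration: float  # seconds
    video_codec: str
    pixel_format: str
    audio_codec: Optional[str]  # assumed to be None when there is no audio track
    bit_rate_k: int


def is_usable(meta: VideoMetadataSketch, min_height: int = 360, min_duration: float = 1.0) -> bool:
    """Example predicate: keep videos that are tall enough and long enough."""
    return meta.height >= min_height and meta.duration >= min_duration
```

The thresholds are placeholders; a real curation pipeline would choose limits that suit its dataset.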
