Video Data Loading#
Load video data for curation using NeMo Curator.
How it Works#
NeMo Curator loads videos with a composite stage that discovers files and extracts metadata:
VideoReader is a composite stage that consists of two stages:
1. Partitioning (list files) stage
   - Local paths use FilePartitioningStage to list files.
   - Remote URLs (for example, s3://, gcs://) use ClientPartitioningStage backed by fsspec.
   - Optional input_list_json_path allows explicit file lists under a root prefix.
2. Reader stage (VideoReaderStage)
   - Reads the bytes (locally or via FSPath) for each listed file.
   - Calls video.populate_metadata() to extract resolution, fps, duration, encoding format, and other fields.
You can set:
- video_limit to limit the number of files to be processed; use None for unlimited.
- verbose=True to log detailed per-video information.
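The local-versus-remote split described above can be pictured as a check on the path's URL scheme. The following is a minimal sketch of that idea, not NeMo Curator's actual dispatch logic: `choose_partitioning_stage` is a hypothetical helper that returns only the name of the stage the text associates with each kind of path, and the rule "a scheme longer than one character means remote" is an assumption made for illustration.

```python
from urllib.parse import urlparse


def choose_partitioning_stage(input_video_path: str) -> str:
    """Return the name of the partitioning stage associated with a path.

    Hypothetical helper for illustration only; it does not call NeMo Curator.
    """
    scheme = urlparse(input_video_path).scheme
    # Assumption: a multi-character scheme (s3, gcs, ...) marks a remote URL.
    # A single letter is treated as a Windows drive letter, hence local.
    if len(scheme) > 1:
        return "ClientPartitioningStage"
    return "FilePartitioningStage"


if __name__ == "__main__":
    for path in ("s3://my-bucket/videos/", "/data/videos"):
        print(path, "->", choose_partitioning_stage(path))
```

In practice you pass the path to VideoReader and let it pick the stage; the sketch only makes the decision rule explicit.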
Local and Cloud#
Use VideoReader to load videos from local paths or remote URLs.
Example#
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.io.video_reader import VideoReader

pipe = Pipeline(name="video_read", description="Read videos and extract metadata")

# input_video_path accepts a local directory or a remote URL.
# video_limit=None processes every discovered file.
pipe.add_stage(VideoReader(input_video_path="s3://my-bucket/videos/", video_limit=None, verbose=True))
pipe.run()
Explicit File List (JSON)#
For remote datasets, ClientPartitioningStage can use an explicit file list JSON. Each entry must be an absolute path under the specified root.
JSON Format#
[
  "s3://my-bucket/datasets/videos/video1.mp4",
  "s3://my-bucket/datasets/videos/video2.mkv",
  "s3://my-bucket/datasets/more_videos/video3.webm"
]
If any entry is outside the root, the stage raises an error.
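Because an out-of-root entry fails the stage at run time, it can help to check a list before submitting the pipeline. The sketch below is one way to do that with the standard library only; `validate_file_list` is a hypothetical helper, and its prefix check is an assumption about what "under the root" means, not a copy of the check ClientPartitioningStage performs.

```python
import json


def validate_file_list(json_text: str, root: str) -> list:
    """Parse a JSON array of paths and confirm each one sits under root.

    Hypothetical pre-flight check; the stage's own validation may differ.
    """
    entries = json.loads(json_text)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of path strings")
    # Make the root end with "/" so "s3://b/data" does not match "s3://b/database/".
    prefix = root if root.endswith("/") else root + "/"
    outside = [p for p in entries if not (isinstance(p, str) and p.startswith(prefix))]
    if outside:
        raise ValueError(f"entries outside root {root!r}: {outside}")
    return entries


if __name__ == "__main__":
    text = '["s3://my-bucket/datasets/videos/video1.mp4"]'
    print(validate_file_list(text, "s3://my-bucket/datasets/"))
```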
Example#
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.client_partitioning import ClientPartitioningStage
from nemo_curator.stages.video.io.video_reader import VideoReaderStage

ROOT = "s3://my-bucket/datasets/"
JSON_LIST = "s3://my-bucket/lists/videos.json"

pipe = Pipeline(name="video_read_json_list", description="Read specific videos via JSON list")

# List only the files named in JSON_LIST; every entry must sit under ROOT.
pipe.add_stage(
    ClientPartitioningStage(
        file_paths=ROOT,
        input_list_json_path=JSON_LIST,
        files_per_partition=1,
        file_extensions=[".mp4", ".mov", ".avi", ".mkv", ".webm"],
    )
)

# Read each listed file and populate its metadata.
pipe.add_stage(VideoReaderStage(verbose=True))
pipe.run()
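The files_per_partition argument in the example above suggests that listed files are grouped into fixed-size batches before they reach the reader. The sketch below illustrates that grouping under that assumption; `partition_files` is a hypothetical function written for this page and is not taken from NeMo Curator.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def partition_files(paths: Iterable[str], files_per_partition: int = 1) -> Iterator[List[str]]:
    """Yield consecutive groups of at most files_per_partition paths.

    Hypothetical illustration of batching; not the library's implementation.
    """
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    iterator = iter(paths)
    while True:
        chunk = list(islice(iterator, files_per_partition))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    for group in partition_files(["a.mp4", "b.mp4", "c.mp4"], files_per_partition=2):
        print(group)
```

Under this reading, files_per_partition=1 gives each video its own partition.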
Supported File Types#
The loader filters these video extensions by default:
- .mp4
- .mov
- .avi
- .mkv
- .webm
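To preview which files in a listing would pass the default extension filter, the check can be approximated as follows. This is a sketch, not the loader's code: `filter_video_paths` is a hypothetical helper, and case-insensitive matching is an assumption that the loader may not share.

```python
import posixpath

# The default extensions listed above.
DEFAULT_VIDEO_EXTENSIONS = (".mp4", ".mov", ".avi", ".mkv", ".webm")


def filter_video_paths(paths, extensions=DEFAULT_VIDEO_EXTENSIONS):
    """Keep only the paths whose extension is in the allowed set.

    Hypothetical helper. Assumption: matching ignores case.
    """
    allowed = {ext.lower() for ext in extensions}
    return [p for p in paths if posixpath.splitext(p)[1].lower() in allowed]


if __name__ == "__main__":
    listing = ["s3://my-bucket/videos/a.mp4", "s3://my-bucket/videos/notes.txt"]
    print(filter_video_paths(listing))
```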
Metadata on Load#
After a successful read, the loader populates the following metadata fields for each video:
- size (bytes)
- width, height
- framerate
- num_frames
- duration (seconds)
- video_codec, pixel_format, audio_codec
- bit_rate_k
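These fields are what later curation steps typically inspect. As a rough illustration, the sketch below mirrors the listed fields in a hypothetical dataclass and applies a simple resolution and duration check. `VideoMetadataSketch`, `meets_minimum`, and the thresholds are all assumptions made for this example; the actual metadata object in NeMo Curator may use different types or names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VideoMetadataSketch:
    """Hypothetical container mirroring the metadata fields listed above."""

    size: int                    # bytes
    width: int
    height: int
    framerate: float
    num_frames: int
    duration: float              # seconds
    video_codec: str
    pixel_format: str
    audio_codec: Optional[str]   # assumption: None when there is no audio stream
    bit_rate_k: int


def meets_minimum(meta: VideoMetadataSketch, min_height: int = 720, min_duration: float = 2.0) -> bool:
    """Return True if the video is tall enough and long enough (example thresholds)."""
    return meta.height >= min_height and meta.duration >= min_duration


if __name__ == "__main__":
    sample = VideoMetadataSketch(
        size=1_000_000, width=1920, height=1080, framerate=30.0, num_frames=300,
        duration=10.0, video_codec="h264", pixel_format="yuv420p",
        audio_codec="aac", bit_rate_k=800,
    )
    print(meets_minimum(sample))
```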
Note
With verbose=True, the loader logs size, resolution, fps, duration, weight, and bit rate for each processed video.