Video Data Loading#
Load video data for curation using NeMo Curator.
How it Works#
NeMo Curator loads videos with a composite stage that discovers files and extracts metadata:
- VideoReader decomposes into a partitioning stage plus a reader stage.
- Local paths use FilePartitioningStage to list files; remote URLs (for example, s3://, gcs://, http(s)://) use ClientPartitioningStage backed by fsspec.
- For remote datasets, you can optionally supply an explicit file list using ClientPartitioningStage.input_list_json_path.
- VideoReaderStage downloads bytes (local or via FSPath) and calls video.populate_metadata() to extract resolution, fps, duration, encoding format, and other fields.
- Set video_limit to cap discovery; use None for unlimited. Set verbose=True to log detailed per-video information.
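The local-versus-remote split described above can be pictured with a small standalone sketch. This is not NeMo Curator's own dispatch code: the function name pick_partitioning_stage is hypothetical, and treating "no URL scheme" as "local" is an assumption made for illustration.

```python
from urllib.parse import urlsplit


def pick_partitioning_stage(input_video_path: str) -> str:
    """Return the name of the partitioning stage the text associates with a path.

    Hypothetical helper for illustration only; it returns a stage *name*,
    it does not construct a NeMo Curator stage.
    """
    scheme = urlsplit(input_video_path).scheme
    # Assumption: a path without a URL scheme (e.g. "/data/videos/") is local.
    if scheme == "":
        return "FilePartitioningStage"
    # Anything with a scheme (s3://, gcs://, https://, ...) is treated as remote.
    return "ClientPartitioningStage"


print(pick_partitioning_stage("/data/videos/"))
print(pick_partitioning_stage("s3://my-bucket/videos/"))
```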
Local and Cloud#
Use VideoReader to load videos from local paths or remote URLs.
Local Paths#
- Examples: /data/videos/, /mnt/datasets/av/
- Uses FilePartitioningStage to recursively discover files.
- Filters by extensions: .mp4, .mov, .avi, .mkv, .webm.
- Set video_limit to cap discovery during testing (None means unlimited).
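To make the discovery behavior concrete, here is a minimal standard-library sketch of recursive listing with an extension filter and an optional limit. It is a stand-in for the idea, not the FilePartitioningStage implementation; the function name discover_videos, the sorted ordering, and the case-insensitive extension match are all assumptions.

```python
from pathlib import Path
from typing import List, Optional

# Extensions listed in this guide as the default filter.
VIDEO_EXTENSIONS = (".mp4", ".mov", ".avi", ".mkv", ".webm")


def discover_videos(root: str, video_limit: Optional[int] = None) -> List[str]:
    """Recursively list video files under root (hypothetical helper)."""
    matches = sorted(
        str(p)
        for p in Path(root).rglob("*")
        # Assumption: extensions are compared case-insensitively.
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
    )
    # None means unlimited; otherwise keep only the first video_limit paths.
    return matches if video_limit is None else matches[:video_limit]
```

Calling discover_videos("/data/videos/", video_limit=10) during testing would return at most ten paths, mirroring how a small video_limit keeps trial runs short.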
Remote Paths#
- Examples: s3://bucket/path/, gcs://bucket/path/, https://host/path/, and other fsspec-supported protocols such as s3a:// and abfs://.
- Uses ClientPartitioningStage backed by fsspec to list files.
- Optional input_list_json_path allows explicit file lists under a root prefix.
- Wraps entries as FSPath for efficient byte access during reading.
Tip
Use an object storage prefix (for example, s3://my-bucket/videos/) to stream from cloud storage. Configure credentials in your environment or client configuration.
Example#
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.io.video_reader import VideoReader

pipe = Pipeline(name="video_read", description="Read videos and extract metadata")

# VideoReader expands into a partitioning stage plus a reader stage.
# video_limit=None means no cap on discovery; verbose=True logs per-video details.
pipe.add_stage(VideoReader(input_video_path="s3://my-bucket/videos/", video_limit=None, verbose=True))

pipe.run()
Explicit File List (JSON)#
For remote datasets, ClientPartitioningStage can use an explicit file list JSON. Each entry must be an absolute path under the specified root.
JSON Format#
[
  "s3://my-bucket/datasets/videos/video1.mp4",
  "s3://my-bucket/datasets/videos/video2.mkv",
  "s3://my-bucket/datasets/more_videos/video3.webm"
]
If any entry is outside the root, the stage raises an error.
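Because an out-of-root entry fails the stage at run time, it can be useful to check a list before uploading it. The sketch below is one way to do that with the standard library. The helper name load_file_list is hypothetical, and raising ValueError is an assumption; the exception type ClientPartitioningStage raises is not specified here.

```python
import json
from typing import List


def load_file_list(json_text: str, root: str) -> List[str]:
    """Parse a JSON array of paths and check that each one sits under root."""
    entries = json.loads(json_text)
    # Normalize the root so "s3://b/data" does not match "s3://b/data2/...".
    prefix = root if root.endswith("/") else root + "/"
    outside = [entry for entry in entries if not entry.startswith(prefix)]
    if outside:
        # Assumption: ValueError is used for illustration only.
        raise ValueError(f"Entries outside root {root!r}: {outside}")
    return entries


sample = json.dumps([
    "s3://my-bucket/datasets/videos/video1.mp4",
    "s3://my-bucket/datasets/more_videos/video3.webm",
])
print(load_file_list(sample, "s3://my-bucket/datasets/"))
```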
Example#
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.client_partitioning import ClientPartitioningStage
from nemo_curator.stages.video.io.video_reader import VideoReaderStage

ROOT = "s3://my-bucket/datasets/"
JSON_LIST = "s3://my-bucket/lists/videos.json"

pipe = Pipeline(name="video_read_json_list", description="Read specific videos via JSON list")

# Partition only the files named in the JSON list; every entry must sit under ROOT.
pipe.add_stage(
    ClientPartitioningStage(
        file_paths=ROOT,
        input_list_json_path=JSON_LIST,
        files_per_partition=1,
        file_extensions=[".mp4", ".mov", ".avi", ".mkv", ".webm"],
    )
)

# Read the bytes for each listed file and populate its metadata.
pipe.add_stage(VideoReaderStage(verbose=True))

pipe.run()
Supported File Types#
The loader filters these video extensions by default:
- .mp4
- .mov
- .avi
- .mkv
- .webm
Metadata on Load#
After a successful read, the loader populates the following metadata fields for each video:
- size (bytes)
- width, height
- framerate
- num_frames
- duration (seconds)
- video_codec, pixel_format, audio_codec
- bit_rate_k
Note
With verbose=True, the loader logs size, resolution, fps, duration, weight, and bit rate for each processed video.
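As a way to picture these fields when writing downstream checks, the sketch below mirrors them in a plain dataclass. It is not the library's metadata class: the name VideoMetadataSketch, the field types, the optional audio_codec, and the helper meets_min_resolution are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VideoMetadataSketch:
    """Illustrative container for the fields listed above (types are assumed)."""
    size: int                   # bytes
    width: int
    height: int
    framerate: float
    num_frames: int
    duration: float             # seconds
    video_codec: str
    pixel_format: str
    audio_codec: Optional[str]  # assumption: may be absent for silent videos
    bit_rate_k: int


def meets_min_resolution(meta: VideoMetadataSketch, min_width: int, min_height: int) -> bool:
    """Hypothetical downstream check built on the loaded metadata."""
    return meta.width >= min_width and meta.height >= min_height


example = VideoMetadataSketch(
    size=1_048_576, width=1920, height=1080, framerate=30.0, num_frames=300,
    duration=10.0, video_codec="h264", pixel_format="yuv420p",
    audio_codec=None, bit_rate_k=800,
)
print(meets_min_resolution(example, 1280, 720))
```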