---
description: >-
  Load video data into NeMo Curator from local paths or fsspec-supported
  storage, including explicit file list support
categories:
  - video-curation
tags:
  - video
  - load
  - s3
  - local
  - file-list
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: howto
modality: video-only
---

# Video Data Loading

Load video data for curation using NeMo Curator.

## How it Works

NeMo Curator loads videos with `VideoReader`, a composite stage that discovers video files and extracts their metadata. It breaks down into two stages:

1. Partitioning stage (lists files)

   * Local paths use `FilePartitioningStage` to list files.
   * Remote URLs (for example, `s3://`, `gcs://`) use `ClientPartitioningStage`, backed by `fsspec`.
   * With `ClientPartitioningStage`, the optional `input_list_json_path` allows an explicit file list under a root prefix.

2. Reader stage (`VideoReaderStage`)

   * Reads the bytes for each listed file (from local disk or via `FSPath`).
   * Calls `video.populate_metadata()` to extract resolution, fps, duration, encoding format, and other fields.

You can set:

* `video_limit` to cap the number of files processed; use `None` for no limit.
* `verbose=True` to log detailed per-video information.
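
The routing described above can be pictured with a small standalone sketch. It is not NeMo Curator's implementation: `is_remote_path`, `pick_partitioning_stage`, and `apply_video_limit` are hypothetical helpers, and the scheme check is an assumption about how a local path might be told apart from a remote URL.

```python
from urllib.parse import urlparse


def is_remote_path(path: str) -> bool:
    """Return True when the path carries a URL scheme such as s3:// or gcs://."""
    scheme = urlparse(path).scheme
    # A one-letter "scheme" is a Windows drive letter (C:\...), not a URL.
    return len(scheme) > 1


def pick_partitioning_stage(path: str) -> str:
    """Name the partitioning stage this page associates with the path type."""
    if is_remote_path(path):
        return "ClientPartitioningStage"
    return "FilePartitioningStage"


def apply_video_limit(files: list, video_limit=None) -> list:
    """Keep at most video_limit files; None means keep all of them."""
    if video_limit is None:
        return list(files)
    return list(files[:video_limit])


if __name__ == "__main__":
    for candidate in ("s3://my-bucket/videos/", "/data/videos"):
        print(candidate, "->", pick_partitioning_stage(candidate))
```

In practice `VideoReader` makes this choice for you; the sketch only illustrates why the same `input_video_path` argument accepts both kinds of input.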

***

## Local and Cloud

Use `VideoReader` to load videos from local paths or remote URLs.

### Example

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.io.video_reader import VideoReader

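# One composite stage lists the files and reads each video.
pipe = Pipeline(name="video_read", description="Read videos and extract metadata")
# video_limit=None processes every discovered file; verbose=True logs per-video details.
pipe.add_stage(VideoReader(input_video_path="s3://my-bucket/videos/", video_limit=None, verbose=True))
pipe.run()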
```

## Explicit File List (JSON)

For remote datasets, `ClientPartitioningStage` can read an explicit list of files from a JSON file instead of listing everything under a prefix. Each entry must be an absolute path under the specified root.

### JSON Format

```json
[
  "s3://my-bucket/datasets/videos/video1.mp4",
  "s3://my-bucket/datasets/videos/video2.mkv",
  "s3://my-bucket/datasets/more_videos/video3.webm"
]
```

If any entry is outside the root, the stage raises an error.
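
Because an out-of-root entry causes an error at run time, it can help to check a list before uploading it. The following is a minimal sketch of such a pre-check, assuming a plain prefix comparison; `find_entries_outside_root` is a hypothetical helper, and the stage's own validation may differ.

```python
import json


def find_entries_outside_root(entries: list, root: str) -> list:
    """Return the entries that do not start with the root prefix."""
    # Normalize the root so "s3://bucket/data" does not match "s3://bucket/data2/...".
    prefix = root if root.endswith("/") else root + "/"
    return [entry for entry in entries if not entry.startswith(prefix)]


ROOT = "s3://my-bucket/datasets/"

# Same content as the JSON example above, parsed from a string for illustration.
file_list = json.loads(
    """
    [
      "s3://my-bucket/datasets/videos/video1.mp4",
      "s3://my-bucket/datasets/videos/video2.mkv",
      "s3://my-bucket/datasets/more_videos/video3.webm"
    ]
    """
)

outside = find_entries_outside_root(file_list, ROOT)
if outside:
    raise ValueError(f"Entries outside {ROOT}: {outside}")
```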

### Example

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.client_partitioning import ClientPartitioningStage
from nemo_curator.stages.video.io.video_reader import VideoReaderStage

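# Root prefix that every entry in the JSON list must fall under.
ROOT = "s3://my-bucket/datasets/"
# JSON file holding the explicit list of video paths.
JSON_LIST = "s3://my-bucket/lists/videos.json"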

pipe = Pipeline(name="video_read_json_list", description="Read specific videos via JSON list")
pipe.add_stage(
    ClientPartitioningStage(
        file_paths=ROOT,
        input_list_json_path=JSON_LIST,
        files_per_partition=1,
        file_extensions=[".mp4", ".mov", ".avi", ".mkv", ".webm"],
    )
)
# Read the bytes and populate metadata for each listed file.
pipe.add_stage(VideoReaderStage(verbose=True))
pipe.run()
```

## Supported File Types

By default, the loader keeps only files with these video extensions:

* `.mp4`
* `.mov`
* `.avi`
* `.mkv`
* `.webm`
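
To preview which files such a filter would keep, a rough standalone equivalent looks like this. `keep_video_files` is a hypothetical helper, and lowercasing the extension before comparing is an assumption of this sketch that may not match the loader's behavior.

```python
import posixpath

# Default extensions listed above.
VIDEO_EXTENSIONS = (".mp4", ".mov", ".avi", ".mkv", ".webm")


def keep_video_files(paths: list) -> list:
    """Keep paths whose final extension is one of the video extensions."""
    kept = []
    for path in paths:
        # splitext returns only the last suffix, so "clip.mkv.part" yields ".part".
        extension = posixpath.splitext(path)[1].lower()
        if extension in VIDEO_EXTENSIONS:
            kept.append(path)
    return kept


if __name__ == "__main__":
    print(keep_video_files(["a.mp4", "notes.txt", "s3://my-bucket/videos/b.webm"]))
```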

## Metadata on Load

After a successful read, the loader populates the following metadata fields for each video:

* `size` (bytes)
* `width`, `height`
* `framerate`
* `num_frames`
* `duration` (seconds)
* `video_codec`, `pixel_format`, `audio_codec`
* `bit_rate_k`

<Note>
  With `verbose=True`, the loader logs size, resolution, fps, duration, weight, and bit rate for each processed video.
</Note>
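
As a way to picture the fields listed above, the sketch below mirrors them in a small dataclass and formats a one-line summary. `VideoMetadataSketch` and `summarize` are hypothetical names for illustration only; the field types are assumptions, and this is not the class NeMo Curator uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VideoMetadataSketch:
    """Hypothetical container mirroring the documented fields (types assumed)."""

    size: int  # bytes
    width: int
    height: int
    framerate: float
    num_frames: int
    duration: float  # seconds
    video_codec: str
    pixel_format: str
    audio_codec: Optional[str]
    bit_rate_k: int


def summarize(meta: VideoMetadataSketch) -> str:
    """Format resolution, frame rate, duration, codec, and size on one line."""
    return (
        f"{meta.width}x{meta.height} @ {meta.framerate:g} fps, "
        f"{meta.duration:.1f}s, {meta.video_codec}, {meta.size / 1e6:.1f} MB"
    )
```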
