Get Started with Video Curation
This guide shows how to install Curator and run your first video curation pipeline.
The example pipeline processes a list of videos, splitting each into 10‑second clips using a fixed stride. It then generates clip‑level embeddings for downstream tasks such as duplicate removal and similarity search.
Overview
This quickstart guide demonstrates how to:
- Install NeMo Curator with video processing support
- Set up FFmpeg with GPU-accelerated encoding
- Configure embedding models (Cosmos-Embed1)
- Process videos through a complete splitting and embedding pipeline
- Generate outputs ready for duplicate removal, captioning, and model training
What you build: A video processing pipeline that:
- Splits videos into 10-second clips using fixed stride or scene detection
- Generates clip-level embeddings for similarity search and deduplication
- Optionally creates captions and preview images
- Outputs results in formats compatible with multimodal training workflows
Prerequisites
System Requirements
To use NeMo Curator’s video curation capabilities, ensure your system meets these requirements:
Operating System
- Ubuntu 24.04, 22.04, or 20.04 (required for GPU-accelerated video processing)
- Other Linux distributions may work but are not officially supported
Python Environment
- Python 3.10, 3.11, or 3.12
- uv package manager for dependency management
- Git for model and repository dependencies
GPU Requirements
- NVIDIA GPU required (CPU-only mode not supported for video processing)
- Architecture: Volta™ or newer (compute capability 7.0+)
- Examples: V100, T4, RTX 2080+, A100, H100
- CUDA: Version 12.0 or above
- VRAM: Minimum requirements by configuration:
- Basic splitting + embedding: ~16GB VRAM
- Full pipeline (splitting + embedding + captioning): ~38GB VRAM
- Reduced configuration (lower batch sizes, FP8): ~21GB VRAM
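A quick way to check your GPU model and total VRAM against these requirements (assumes the NVIDIA driver is installed; prints a note otherwise):

```shell
# Query GPU name and total memory; fall back to a note if nvidia-smi is unavailable
gpu_info=$(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null || echo "nvidia-smi not found")
echo "$gpu_info"
```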
Software Dependencies
- FFmpeg 8.0+ with H.264 encoding support
- GPU encoder: `h264_nvenc` (recommended for performance)
- CPU encoders: `libopenh264` or `libx264` (fallback options)
If uv is not installed, refer to the Installation Guide for setup instructions, or install it quickly with `curl -LsSf https://astral.sh/uv/install.sh | sh`.
Install
Create and activate a virtual environment, then choose an install option: PyPI, source, or the NeMo Curator container. Refer to the Installation Guide for the exact commands for each option.
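Creating and activating a virtual environment might look like this (shown with the standard `venv` module; `uv venv` works similarly):

```shell
# Create and activate a virtual environment for Curator
python3 -m venv curator-venv
. curator-venv/bin/activate
python -V   # confirm the interpreter is 3.10, 3.11, or 3.12
```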
Install FFmpeg and Encoders
Curator’s video pipelines rely on FFmpeg for decoding and encoding. If you plan to encode clips (for example, using --transcode-encoder libopenh264 or h264_nvenc), install FFmpeg with the corresponding encoders.
Debian/Ubuntu (Script)
Use the maintained script in the repository to build and install FFmpeg with libopenh264 and NVIDIA NVENC support. The script enables --enable-libopenh264, --enable-cuda-nvcc, and --enable-libnpp.
- Script source: docker/common/install_ffmpeg.sh
Verify Installation
Refer to Clip Encoding to choose encoders and verify NVENC support on your system.
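To see which H.264 encoders your FFmpeg build provides (a quick local check; `h264_nvenc` in the output means NVENC is available):

```shell
# List H.264 encoders known to this FFmpeg build; note if ffmpeg is missing
h264_encoders=$(ffmpeg -hide_banner -encoders 2>/dev/null | grep -i h264 || echo "ffmpeg not found or no H.264 encoders")
echo "$h264_encoders"
```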
Available Models
Embeddings convert each video clip into a numeric vector that captures visual and semantic content. Curator uses these vectors to:
- Remove near-duplicate clips during duplicate removal
- Enable similarity search and clustering
- Support downstream analysis such as caption verification
NeMo Curator supports two embedding model families:
Cosmos-Embed1 (Recommended)
Cosmos-Embed1 (default): Available in three variants—cosmos-embed1-224p, cosmos-embed1-336p, and cosmos-embed1-448p—which differ in input resolution and accuracy/VRAM tradeoff. All variants are automatically downloaded to MODEL_DIR on first run.
Model links:
- cosmos-embed1-224p on Hugging Face
- cosmos-embed1-336p on Hugging Face
- cosmos-embed1-448p on Hugging Face
For this quickstart, the following steps set up support for Cosmos-Embed1-224p.
Prepare Model Weights
For most use cases, you only need to create a model directory. The required model files will be downloaded automatically on first run.
- Create a model directory: you can reuse the same `<MODEL_DIR>` across runs.
- No additional setup is required; the model is downloaded automatically when first used.
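For example (the directory name is illustrative; any writable local path works):

```shell
# Create a local directory where Curator will download model weights on first run
export MODEL_DIR="$HOME/curator_models"
mkdir -p "$MODEL_DIR"
```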
Set Up Data Directories
Organize input videos and output locations before running the pipeline.
- Local: for local file processing. Define local input and output paths on your filesystem.
- S3: for cloud storage (AWS S3, MinIO, and similar). Configure credentials in `~/.aws/credentials` and use `s3://` paths for `--video-dir` and `--output-clip-path`.
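For the local option, the paths might look like this (directory names are illustrative; adjust to your filesystem):

```shell
# Define input and output locations for the pipeline
export DATA_DIR="$HOME/video_data/input"    # source videos
export OUT_DIR="$HOME/video_data/output"    # clips, embeddings, metadata
mkdir -p "$DATA_DIR" "$OUT_DIR"
```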
S3 usage notes:
- Input videos can be read from S3 paths
- Output clips can be written to S3 paths
- Model directory should remain local for performance
- Ensure IAM permissions allow read/write access to specified buckets
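A minimal `~/.aws/credentials` profile might look like this (placeholder values; MinIO and other S3-compatible stores may additionally require an endpoint configuration):

```ini
# ~/.aws/credentials -- replace the placeholder values with your own keys
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```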
Run the Splitting Pipeline Example
Use the example script from https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/video/getting-started to read videos, split them into clips, and write outputs. This runs a Ray pipeline with XennaExecutor under the hood.
What the example command does:
- Reads all video files from `$DATA_DIR`
- Splits each video into 10-second clips using a fixed stride
- Generates embeddings using the Cosmos-Embed1-224p model
- Encodes clips using the libopenh264 codec
- Writes output clips and metadata to `$OUT_DIR`
Using a config file: The example script accepts many command-line arguments. For complex configurations, you can store arguments in a file and pass them with the @ prefix:
echo '--video-dir /data/videos --output-clip-path /data/output --splitting-algorithm fixed_stride --fixed-stride-split-duration 10.0 --embedding-algorithm cosmos-embed1-224p --transcode-encoder libopenh264' > my_config.txt
python tutorials/video/getting-started/video_split_clip_example.py @my_config.txt
Configuration Options Reference
Understanding Pipeline Output
After successful execution, the output directory will contain:
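A sketch of the expected layout (directory names follow the file descriptions below; exact contents vary with configuration):

```text
$OUT_DIR/
├── clips/                      # encoded MP4 clips
├── embeddings/                 # NumPy arrays of clip embeddings
├── metadata/
│   └── manifest.jsonl          # clip-level metadata
└── previews/                   # optional thumbnail images
```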
File descriptions:
- clips/: Encoded video clips (MP4 format)
- embeddings/: Numpy arrays containing clip embeddings (for similarity search)
- metadata/manifest.jsonl: JSONL file with clip metadata (paths, timestamps, embeddings)
- previews/: Thumbnail images for each clip (optional)
Example manifest entry:
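An illustrative entry (field names here are hypothetical; inspect your own `manifest.jsonl` for the exact schema):

```json
{"clip_path": "clips/video_0001_000.mp4", "source_video": "video_0001.mp4", "start_s": 0.0, "end_s": 10.0, "embedding_path": "embeddings/video_0001_000.npy"}
```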
Best Practices
Data Preparation
- Validate input videos: Ensure videos are not corrupted before processing
- Consistent formats: Convert videos to a standard format (MP4 with H.264) for consistent results
- Organize by content: Group similar videos together for efficient processing
Model Selection
- Start with Cosmos-Embed1-224p: Best balance of speed and quality for initial experiments
- Upgrade resolution as needed: Use 336p or 448p only when higher precision is required
- Monitor VRAM usage: Check GPU memory with `nvidia-smi` during processing
Pipeline Configuration
- Enable verbose logging: Use the `--verbose` flag for debugging and monitoring
- Test on a small subset: Run the pipeline on 5-10 videos before processing large datasets
- Use GPU encoding: Enable NVENC for significant performance improvements
- Save intermediate results: Keep embeddings and metadata for downstream tasks
Infrastructure
- Use shared storage: Mount shared filesystem for multi-node processing
- Allocate sufficient VRAM: Plan for peak usage (captioning + embedding)
- Monitor GPU utilization: Use `nvidia-smi dmon` to track GPU usage during processing
- Schedule long-running jobs: Process large video datasets in batch jobs overnight
Next Steps
Explore the Video Curation documentation. For encoding guidance, refer to Clip Encoding.