Split and Remove Duplicates Workflow
Learn how to run the splitting pipeline to generate clips and embeddings, then remove near-duplicate clips using semantic duplicate removal.
Before You Start
- Complete the Get Started guide.
1. Generate Clips and Embeddings
Set VIDEO_DIR, OUT_DIR, and MODEL_DIR, then run the splitting example.
Writer-related flags you can add:
The pipeline writes embeddings under $OUT_DIR/ce1_embd_parquet/ when using Cosmos-Embed1.
Embedding Format Example
The pipeline writes embeddings to Parquet with two columns:
- id: String UUID for the clip
- embedding: List of float values with length equal to the model’s embedding dimension (for Cosmos-Embed1, 768)
Directory layout
Schema
Sample row
Read example
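A minimal sketch of reading the embeddings back and checking them against the two-column layout described above. It assumes the Parquet files sit directly under the embeddings directory and that `pyarrow` is installed; the helper names `load_embedding_rows` and `check_row` are hypothetical and not part of the pipeline.

```python
import uuid
from pathlib import Path

EMBEDDING_DIM = 768  # Cosmos-Embed1 dimension stated above


def load_embedding_rows(parquet_dir):
    """Read every *.parquet file under parquet_dir into a list of dicts.

    Assumption: files are stored flat under parquet_dir.
    """
    import pyarrow.parquet as pq  # lazy import; requires `pip install pyarrow`

    rows = []
    for path in sorted(Path(parquet_dir).glob("*.parquet")):
        table = pq.read_table(path, columns=["id", "embedding"])
        rows.extend(table.to_pylist())
    return rows


def check_row(row, dim=EMBEDDING_DIM):
    """Return True if a row has a UUID string id and an embedding of length dim."""
    try:
        uuid.UUID(row["id"])
    except (KeyError, TypeError, ValueError, AttributeError):
        return False
    emb = row.get("embedding")
    return isinstance(emb, (list, tuple)) and len(emb) == dim
```

For example, `all(check_row(r) for r in load_embedding_rows(f"{out_dir}/ce1_embd_parquet"))` would confirm that every row matches the expected shape, where `out_dir` holds the same value as OUT_DIR.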
2. Run Semantic Duplicate Removal
Use K-means clustering followed by pairwise similarity on the Parquet embeddings.
which_to_keep selects the representative within each cluster: “hard” keeps outliers far from the centroid, “easy” keeps the nearest to the centroid, and “random” ignores distance and picks randomly.
sim_metric sets the distance used for similarity: “cosine” uses cosine distance (1 − cosine similarity), while “l2” uses Euclidean distance.
pairwise_batch_size controls how many items are processed per GPU batch during pairwise similarity; larger values can be faster but require more GPU memory.
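The effect of which_to_keep and sim_metric can be illustrated with a small pure-Python sketch. This is not the library's implementation: it assumes a single cluster with a known centroid, the function names are hypothetical, and it ignores GPU batching (so pairwise_batch_size has no counterpart here). Items are first ordered by preference, then each item is scored against the items ranked before it.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def distance(a, b, sim_metric="cosine"):
    """Distance under the chosen metric: 1 - cosine similarity, or Euclidean."""
    if sim_metric == "cosine":
        return 1.0 - cosine_similarity(a, b)
    if sim_metric == "l2":
        return math.dist(a, b)
    raise ValueError(f"unknown sim_metric: {sim_metric!r}")


def order_cluster(ids, embeddings, centroid, which_to_keep="hard",
                  sim_metric="cosine", seed=0):
    """Order cluster members so preferred representatives come first."""
    items = list(zip(ids, embeddings))
    if which_to_keep == "random":
        random.Random(seed).shuffle(items)  # distance to centroid is ignored
        return items
    if which_to_keep not in ("hard", "easy"):
        raise ValueError(f"unknown which_to_keep: {which_to_keep!r}")
    # "hard": farthest from the centroid first; "easy": nearest first.
    items.sort(key=lambda it: distance(it[1], centroid, sim_metric),
               reverse=(which_to_keep == "hard"))
    return items


def pairwise_max_similarity(ordered):
    """For each item, record the most similar item ranked before it."""
    rows = []
    for i, (item_id, emb) in enumerate(ordered):
        best_id, best_sim = None, 0.0
        for prev_id, prev_emb in ordered[:i]:
            sim = cosine_similarity(emb, prev_emb)
            if best_id is None or sim > best_sim:
                best_id, best_sim = prev_id, sim
        rows.append({"id": item_id, "max_id": best_id,
                     "cosine_sim_score": best_sim})
    return rows
```

Because each item is only compared with higher-ranked items, the ordering decides which member of a near-duplicate pair ends up with the high score and is therefore the candidate for removal.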
3. Inspect Results
- K-means outputs per-cluster partitions under ${OUTPUT_DIR}/kmeans/.
- Pairwise outputs per-cluster similarity files under ${OUTPUT_DIR}/pairwise/ with columns including id, max_id, and cosine_sim_score.
- Use these to decide keep/remove policies or downstream sampling.
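One possible keep/remove policy is a simple threshold on cosine_sim_score. The sketch below assumes the similarity rows have already been loaded as dictionaries with the id and cosine_sim_score columns; the `eps` parameter and the function name are hypothetical, chosen for illustration only.

```python
def split_keep_remove(rows, eps=0.1):
    """Split ids into (keep, remove) by thresholding cosine_sim_score.

    A row is treated as a near-duplicate when its score exceeds 1 - eps.
    """
    threshold = 1.0 - eps
    keep, remove = [], []
    for row in rows:
        if row["cosine_sim_score"] > threshold:
            remove.append(row["id"])
        else:
            keep.append(row["id"])
    return keep, remove
```

A smaller `eps` removes only very close matches; a larger value prunes more aggressively, so it can help to compare the size of the remove list at a few values before committing to one.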
4. Export for Training
After duplicate removal, export curated clips and metadata for training. Common video exports:
- Parquet index + media files (mp4/webp) under ${OUT_DIR}
- Tar archives (WebDataset-style) containing per-clip payloads and JSON/Parquet metadata
Video-specific pointers:
- Use ClipWriterStage path helpers to locate outputs: nemo_curator/stages/video/io/clip_writer.py.
  - Processed videos: get_output_path_processed_videos(OUT_DIR)
  - Clip chunks and previews: get_output_path_processed_clip_chunks(OUT_DIR), get_output_path_previews(OUT_DIR)
  - Embeddings parquet: ${OUT_DIR}/ce1_embd_parquet
Example Export
The following example packages clips and minimal JSON metadata into tar files:
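A minimal sketch using only the Python standard library. It assumes the kept clips are available as (clip_id, clip_path) pairs; the function name, the shard naming pattern, and the metadata fields are illustrative choices rather than a format the pipeline prescribes.

```python
import io
import json
import tarfile
from pathlib import Path


def write_tar_shards(clips, out_dir, clips_per_shard=1000):
    """Pack (clip_id, clip_path) pairs into numbered tar shards.

    Each clip contributes two members that share a basename:
    <clip_id><original suffix> (payload) and <clip_id>.json (metadata).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    clips = list(clips)
    shard_paths = []
    for start in range(0, len(clips), clips_per_shard):
        shard_path = out_dir / f"clips-{start // clips_per_shard:06d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for clip_id, clip_path in clips[start:start + clips_per_shard]:
                clip_path = Path(clip_path)
                # Payload: the clip file, renamed to its clip id.
                tar.add(clip_path, arcname=f"{clip_id}{clip_path.suffix}")
                # Metadata: a small JSON document written from memory.
                meta = json.dumps(
                    {"id": clip_id, "source": clip_path.name}
                ).encode("utf-8")
                info = tarfile.TarInfo(name=f"{clip_id}.json")
                info.size = len(meta)
                tar.addfile(info, io.BytesIO(meta))
        shard_paths.append(shard_path)
    return shard_paths
```

Keeping the payload and its JSON under the same basename follows the WebDataset-style convention of grouping files that belong to one sample.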