Duplicate Removal
Use clip-level embeddings to identify and remove near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.
How it Works
Duplicate removal operates on clip-level embeddings produced during processing:
- Inputs
  - Parquet batches from ClipWriterStage under iv2_embd_parquet/ or ce1_embd_parquet/
  - Columns: id, embedding
- Outputs
  - Cluster: KMeansStage partitions embeddings and writes centroid distances (for example, cosine_dist_to_cent).
  - Pairwise: PairwiseStage computes within-cluster similarity on GPU and, for each clip, emits max_id and cosine_sim_score. Ranking controls whether to prefer outliers ("hard") or representatives ("easy").
  - Identify: IdentifyDuplicatesStage filters pairs with cosine_sim_score >= 1.0 - eps and writes Parquet files of duplicate ids for removal during export.
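The pairwise and identify steps can be illustrated with a small NumPy sketch. This is not the library's implementation: the function name find_duplicates, the CPU-only computation, and the convention of comparing each clip only against clips ranked before it are assumptions made for illustration. Only the column names max_id and cosine_sim_score and the threshold cosine_sim_score >= 1.0 - eps come from the description above.

```python
import numpy as np


def find_duplicates(ids, embeddings, eps=0.01):
    """Sketch: flag clips that are near-identical to an earlier-ranked clip.

    `ids` and `embeddings` are assumed to belong to ONE cluster and to be
    already sorted in ranking order (hypothetical convention for this sketch).
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    # L2-normalize rows so that a dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    # Keep only pairs (i, j) with i < j, so clip j is compared against the
    # clips ranked before it. Zeroed entries mean the first clip scores 0.
    sim = np.triu(sim, k=1)
    max_idx = sim.argmax(axis=0)
    max_sim = sim.max(axis=0)

    results = [
        {
            "id": clip_id,
            "max_id": ids[max_idx[j]],
            "cosine_sim_score": float(max_sim[j]),
        }
        for j, clip_id in enumerate(ids)
    ]
    # A clip is a duplicate when its best match is within eps of identical.
    duplicates = [r["id"] for r in results if r["cosine_sim_score"] >= 1.0 - eps]
    return results, duplicates


if __name__ == "__main__":
    ids = ["clip_a", "clip_b", "clip_c"]
    embeddings = [[1.0, 0.0], [1.0, 0.001], [0.0, 1.0]]
    _, dups = find_duplicates(ids, embeddings, eps=0.01)
    print(dups)
```

A smaller eps keeps only near-exact matches; a larger eps removes more aggressively, at the risk of discarding clips that are merely similar.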
Before You Start
- Verify local paths or configure S3-compatible credentials. Provide storage_options in read/write keyword arguments when reading or writing cloud paths.
- Create output directories for KMeansStage, PairwiseStage, and IdentifyDuplicatesStage.
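For local runs, the output directories can be prepared with a few lines of standard-library Python. The helper name make_stage_dirs and the subdirectory names are hypothetical, chosen only for this sketch. The example storage_options dictionary assumes an s3fs-style backend with placeholder values; the keys your filesystem accepts may differ.

```python
from pathlib import Path

# Illustrative storage_options for an S3-compatible store. The keys follow
# s3fs conventions (an assumption); all values are placeholders.
EXAMPLE_STORAGE_OPTIONS = {
    "key": "<access-key>",
    "secret": "<secret-key>",
    "client_kwargs": {"endpoint_url": "https://s3.example.com"},
}

# Hypothetical subdirectory names, one per stage.
STAGE_SUBDIRS = ("kmeans_out", "pairwise_out", "duplicates_out")


def make_stage_dirs(base_dir):
    """Create one local output directory per stage and return their paths."""
    base = Path(base_dir)
    stage_dirs = {name: base / name for name in STAGE_SUBDIRS}
    for path in stage_dirs.values():
        # parents=True builds missing ancestors; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
    return stage_dirs
```

Calling make_stage_dirs("/path/to/output") twice is harmless, since existing directories are left in place.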
Quickstart
Use the generic semantic duplicate-removal stages with clip embeddings written to Parquet.
Input format: Parquet with columns id and embedding (produced by the video pipeline’s embedding stages and writer). Duplicate removal operates at the clip level using these embeddings. The IdentifyDuplicatesStage writes Parquet files containing duplicate ids; perform removal by filtering out rows whose id appears in those files during export.
Embeddings are written by the ClipWriterStage under iv2_embd_parquet/ or ce1_embd_parquet/. For a runnable workflow, refer to the Split and Dedup tutorial.
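The removal step described above, filtering out rows whose id appears in the duplicate files, can be sketched with pandas. The helper name drop_duplicates_by_id is hypothetical, and the sketch assumes both tables fit in memory and that the duplicate files expose an id column as stated above. This is not the pipeline's export code.

```python
import pandas as pd


def drop_duplicates_by_id(clips, duplicates):
    """Return the rows of `clips` whose id is not listed in `duplicates`.

    Both arguments are DataFrames with an `id` column. In practice they could
    be loaded with pd.read_parquet(path, storage_options=...) for cloud paths.
    """
    dup_ids = set(duplicates["id"])
    kept = clips[~clips["id"].isin(dup_ids)]
    # Reset the index so the result is numbered contiguously from zero.
    return kept.reset_index(drop=True)


if __name__ == "__main__":
    clips = pd.DataFrame(
        {"id": ["clip_a", "clip_b", "clip_c"], "embedding": [[1.0, 0.0], [1.0, 0.001], [0.0, 1.0]]}
    )
    duplicates = pd.DataFrame({"id": ["clip_b"]})
    print(drop_duplicates_by_id(clips, duplicates)["id"].tolist())
```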
Parameters
KMeansStage
PairwiseStage
IdentifyDuplicatesStage
KMeansStage (semantic clustering)