Use clip-level embeddings to identify and remove near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.
Duplicate removal operates on clip-level embeddings produced during processing:
Inputs
Embeddings written by ClipWriterStage under iv2_embd_parquet/ or ce1_embd_parquet/
Columns: id, embedding
Outputs
KMeansStage partitions embeddings and writes centroid distances (for example, cosine_dist_to_cent).
PairwiseStage computes within-cluster similarity on GPU and, for each clip, emits max_id and cosine_sim_score. Ranking controls whether to prefer outliers (“hard”) or representatives (“easy”).
IdentifyDuplicatesStage filters pairs with cosine_sim_score >= 1.0 - eps and writes Parquet files of duplicate ids for removal during export.
Pass storage_options in read/write keyword arguments when reading or writing cloud paths.
Stages involved: KMeansStage, PairwiseStage, and IdentifyDuplicatesStage.
Use the generic semantic duplicate-removal stages with clip embeddings written to Parquet.
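To make the threshold rule concrete, here is a minimal CPU-only sketch of the within-cluster comparison and the cosine_sim_score >= 1.0 - eps filter. It is an illustration, not the stages' actual GPU implementation. The function names (cosine_sim, find_duplicates) are hypothetical, and the sketch assumes each clip is compared only against clips ranked ahead of it in the cluster, so the earliest-ranked clip of a near-duplicate group is kept.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def find_duplicates(cluster, eps):
    """Return ids flagged as duplicates within one cluster.

    cluster: list of (id, embedding) pairs, already in ranked order.
    Assumption: a clip is compared only with clips ranked ahead of it.
    """
    duplicate_ids = []
    for i, (clip_id, emb) in enumerate(cluster):
        max_id, cosine_sim_score = None, 0.0
        for other_id, other_emb in cluster[:i]:
            score = cosine_sim(emb, other_emb)
            if max_id is None or score > cosine_sim_score:
                max_id, cosine_sim_score = other_id, score
        # Same rule as in the text: flag when the best match is within eps of 1.0.
        if max_id is not None and cosine_sim_score >= 1.0 - eps:
            duplicate_ids.append(clip_id)
    return duplicate_ids

# Toy cluster: "b" points almost the same way as "a"; "c" is orthogonal to "a".
cluster = [("a", [1.0, 0.0]), ("b", [0.999, 0.01]), ("c", [0.0, 1.0])]
dups = find_duplicates(cluster, eps=0.01)
print(dups)  # ['b']
```

A smaller eps makes the filter stricter: with eps=0.0 only clips with a similarity of 1.0 or higher would be flagged, so "b" would be kept in this toy example.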
Input format: Parquet with columns id and embedding (produced by the video pipeline’s embedding stages and writer). Duplicate removal operates at the clip level using these embeddings. The IdentifyDuplicatesStage writes Parquet files containing duplicate ids; perform removal by filtering out rows whose id appears in those files during export.
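The removal step described above amounts to an anti-join on id. The sketch below shows that logic on in-memory rows so it runs without any Parquet dependency. The function name remove_duplicates and the sample rows are hypothetical. In a real export, the rows and the duplicate ids would be loaded from the Parquet files with whichever Parquet reader you already use.

```python
def remove_duplicates(rows, duplicate_id_batches):
    """Drop every row whose id appears in any batch of duplicate ids.

    rows: list of dicts, each with an "id" key (one dict per clip).
    duplicate_id_batches: iterable of id lists, one per duplicates file.
    """
    duplicate_ids = set()
    for batch in duplicate_id_batches:
        duplicate_ids.update(batch)
    return [row for row in rows if row["id"] not in duplicate_ids]

# Hypothetical sample data standing in for the clip table and two duplicates files.
rows = [
    {"id": "clip-0", "embedding": [0.1, 0.9]},
    {"id": "clip-1", "embedding": [0.1, 0.9]},
    {"id": "clip-2", "embedding": [0.8, 0.2]},
]
duplicate_id_batches = [["clip-1"], []]

kept = remove_duplicates(rows, duplicate_id_batches)
print([row["id"] for row in kept])  # ['clip-0', 'clip-2']
```

Collecting the ids into a set first keeps each membership check constant-time, which matters when the duplicate lists span many files.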
Embeddings are written by the ClipWriterStage under iv2_embd_parquet/ or ce1_embd_parquet/. For a runnable workflow, refer to the Split and Dedup tutorial.
KMeansStage (semantic clustering)