Use clip-level embeddings to identify near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.
You need clip embeddings written by ClipWriterStage under ce1_embd_parquet/. For a runnable workflow, refer to the Split and Remove Duplicates Workflow. The embeddings must be in Parquet files containing the columns id and embedding. Include storage_options in the read/write keyword arguments when reading or writing cloud paths.
Duplicate identification operates on clip-level embeddings produced during processing:
Inputs
- Parquet files written by ClipWriterStage under ce1_embd_parquet/
- Columns: id, embedding
Outputs
- KMeansStage partitions embeddings and writes centroid distances (for example, cosine_dist_to_cent).
- PairwiseStage computes within-cluster similarity on GPU and, for each clip, emits max_id and cosine_sim_score. Ranking controls whether to prefer outliers (“hard”) or representatives (“easy”).
- IdentifyDuplicatesStage filters pairs with cosine_sim_score >= 1.0 - eps and writes Parquet files of duplicate ids for removal during export.
Use the semantic duplicate workflow with clip embeddings written to Parquet.
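The ranking, pairwise, and filtering logic described above can be sketched in plain Python for a single cluster. This is an illustration only, not the library's GPU implementation: the helper names are hypothetical, and it assumes that each clip is compared only against clips ranked ahead of it, that "hard" ranks the largest centroid distance first, and that "easy" ranks the smallest first.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_cluster(clips, ranking="hard"):
    """Order clips so the ones to keep come first.

    Assumption: "hard" prefers outliers (far from the centroid),
    "easy" prefers representatives (close to the centroid).
    """
    return sorted(
        clips,
        key=lambda clip: clip["cosine_dist_to_cent"],
        reverse=(ranking == "hard"),
    )


def pairwise_max_similarity(ranked_clips):
    """For each clip, find its most similar higher-ranked clip."""
    rows = []
    for i, clip in enumerate(ranked_clips):
        best_id, best_score = None, 0.0
        for other in ranked_clips[:i]:
            score = cosine_sim(clip["embedding"], other["embedding"])
            if best_id is None or score > best_score:
                best_id, best_score = other["id"], score
        rows.append(
            {"id": clip["id"], "max_id": best_id, "cosine_sim_score": best_score}
        )
    return rows


def identify_duplicates(rows, eps):
    """Keep the ids whose best match satisfies cosine_sim_score >= 1.0 - eps."""
    threshold = 1.0 - eps
    return [
        row["id"]
        for row in rows
        if row["max_id"] is not None and row["cosine_sim_score"] >= threshold
    ]


# Toy cluster: clip_a and clip_b are near-duplicates, clip_c is unrelated.
cluster = [
    {"id": "clip_a", "embedding": [1.0, 0.0], "cosine_dist_to_cent": 0.05},
    {"id": "clip_b", "embedding": [0.99, 0.05], "cosine_dist_to_cent": 0.02},
    {"id": "clip_c", "embedding": [0.0, 1.0], "cosine_dist_to_cent": 0.40},
]

easy_rows = pairwise_max_similarity(rank_cluster(cluster, ranking="easy"))
hard_rows = pairwise_max_similarity(rank_cluster(cluster, ranking="hard"))
easy_duplicates = identify_duplicates(easy_rows, eps=0.1)
hard_duplicates = identify_duplicates(hard_rows, eps=0.1)
print("easy:", easy_duplicates)
print("hard:", hard_duplicates)
```

In this toy cluster the ranking changes which member of the near-duplicate pair is flagged, while the unrelated clip is kept either way.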
The SemanticDeduplicationWorkflow provides an end-to-end interface that orchestrates K-means clustering, pairwise similarity computation, and duplicate identification.
Determine eps first: Before running the full workflow, we recommend first running K-means and pairwise steps (set eps=None) to inspect similarity distributions and determine an appropriate eps threshold. See the tip below for details.
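The workflow automatically:
- Runs K-means clustering to partition the embeddings
- Computes pairwise similarity within each cluster
- Identifies duplicates based on the eps threshold
- Writes the duplicate IDs to output_path/duplicates/
For detailed information about how semantic deduplication works, see Semantic Deduplication. The algorithm and concepts are the same for video clips as for text documents.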
If you need fine-grained control, you can run the stages individually: KMeansStage, then PairwiseStage, then IdentifyDuplicatesStage.
No example script flags are available for duplicate identification in the split pipeline. Run these stages as a separate job against Parquet embeddings written by the example pipeline’s writer.
Recommended Workflow: Determine eps First
The eps parameter is highly data-dependent and affects how many duplicates are identified. We recommend a two-step approach:
Step 1: Run K-means and pairwise without duplicate identification
- Run SemanticDeduplicationWorkflow with eps=None (or run K-means and pairwise stages individually)
Step 2: Inspect the similarity distribution
- Examine the cosine_sim_score values in the pairwise results
- Choose an eps threshold based on your data characteristics
- For example, to flag clips whose cosine similarity is at least 0.9, use eps=0.1 (since cosine_sim >= 1.0 - eps)
Step 3: Run the full workflow with your chosen eps
- Run SemanticDeduplicationWorkflow with the determined eps value
- Or run IdentifyDuplicatesStage separately on the pairwise results
For a detailed example of this workflow with similarity analysis, see the Step-by-Step Semantic Deduplication tutorial (demonstrated on text data, but the approach applies to video clips as well).
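To make Step 2 concrete, the sketch below counts how many clips each candidate eps would flag, using the rule cosine_sim_score >= 1.0 - eps. It is a minimal illustration in plain Python: the function name and the sample scores are hypothetical, and in practice the scores would come from the cosine_sim_score column of the pairwise results.

```python
def duplicates_per_eps(scores, candidate_eps):
    """Map each candidate eps to (flagged count, flagged fraction)."""
    if not scores:
        raise ValueError("scores must not be empty")
    summary = {}
    for eps in candidate_eps:
        threshold = 1.0 - eps
        flagged = sum(1 for score in scores if score >= threshold)
        summary[eps] = (flagged, flagged / len(scores))
    return summary


# Hypothetical cosine_sim_score values, one per clip.
scores = [0.995, 0.97, 0.93, 0.88, 0.71, 0.52, 0.40, 0.35]
summary = duplicates_per_eps(scores, [0.01, 0.05, 0.1, 0.2])

for eps, (count, fraction) in summary.items():
    print(f"eps={eps}: {count} clips flagged ({fraction:.1%})")
```

A larger eps lowers the similarity threshold and therefore flags more clips, so a table like this helps you see where the flagged fraction starts to grow faster than you are willing to accept.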
Custom Ranking with Metadata Columns
If your embedding Parquet files contain additional metadata columns (such as video quality scores, duration, resolution, or other clip attributes), you can use RankingStrategy.metadata_based() to create custom ranking methods. This allows you to prioritize which clips to keep within duplicate groups based on your specific criteria.
For example, you can rank clips so that higher-quality or longer-duration clips are preferred.
The metadata columns must be present in your embedding Parquet files and will be preserved through the K-means stage. Specify these columns using the metadata_fields parameter in KMeansStage or SemanticDeduplicationWorkflow.
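The ordering that a metadata-based ranking implies can be sketched in plain Python. This is not the implementation of RankingStrategy.metadata_based(), and it does not reproduce its signature; the function name and the quality_score and duration columns are hypothetical. It only shows a multi-column sort in which the first clip in the result is the one preferred within a duplicate group.

```python
def rank_by_metadata(clips, metadata_fields, ascending):
    """Sort clips by several metadata columns, most significant first."""
    if len(metadata_fields) != len(ascending):
        raise ValueError("metadata_fields and ascending must have equal length")
    ranked = list(clips)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a multi-key ordering.
    for field, is_ascending in reversed(list(zip(metadata_fields, ascending))):
        ranked.sort(key=lambda clip: clip[field], reverse=not is_ascending)
    return ranked


# Hypothetical metadata columns carried alongside the embeddings.
clips = [
    {"id": "clip_a", "quality_score": 0.7, "duration": 12.0},
    {"id": "clip_b", "quality_score": 0.9, "duration": 8.0},
    {"id": "clip_c", "quality_score": 0.9, "duration": 15.0},
]

# Prefer higher quality first, then longer duration as a tie-breaker.
ranked = rank_by_metadata(
    clips,
    metadata_fields=["quality_score", "duration"],
    ascending=[False, False],
)
ranked_ids = [clip["id"] for clip in ranked]
print(ranked_ids)
```

Here clip_b and clip_c tie on quality, so duration decides between them, and clip_a ranks last because of its lower quality score.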
The SemanticDeduplicationWorkflow accepts parameters from all three stages (KMeansStage, PairwiseStage, and IdentifyDuplicatesStage). See the tabs above for parameter descriptions.
For parameters shared with individual stages, refer to:
- KMeansStage: input_path, output_path, n_clusters, id_field, embedding_field, embedding_dim
- PairwiseStage: ranking_strategy, pairwise_batch_size
- IdentifyDuplicatesStage: eps
- Common: read_kwargs, write_kwargs, verbose
The duplicate identification stages (IdentifyDuplicatesStage or SemanticDeduplicationWorkflow with eps specified) write Parquet files containing duplicate clip IDs to the output directory (typically output_path/duplicates/). These files contain a single column id with the IDs of clips that should be removed.
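Excluding those IDs amounts to an anti-join on id. The sketch below shows the idea on in-memory records, assuming the duplicate IDs have already been loaded from the Parquet files (for instance with pandas.read_parquet and columns=["id"]); the function name, record layout, and paths are hypothetical.

```python
def iter_kept_clips(clips, duplicate_ids):
    """Yield only the clips whose id is not in the duplicate list."""
    to_remove = set(duplicate_ids)  # set lookup keeps the filter O(1) per clip
    for clip in clips:
        if clip["id"] not in to_remove:
            yield clip


# Hypothetical IDs read from the duplicates output directory.
duplicate_ids = ["clip_b", "clip_d"]

# Hypothetical clip records that are about to be exported.
clips = [
    {"id": "clip_a", "path": "clips/clip_a.mp4"},
    {"id": "clip_b", "path": "clips/clip_b.mp4"},
    {"id": "clip_c", "path": "clips/clip_c.mp4"},
    {"id": "clip_d", "path": "clips/clip_d.mp4"},
]

kept = list(iter_kept_clips(clips, duplicate_ids))
kept_ids = [clip["id"] for clip in kept]
print(kept_ids)
```

Because the filter is a generator, the same function can be applied shard by shard without holding the full dataset in memory.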
It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset. The removal process depends on how you want to persist and shard your data: