
Duplicate Identification


Use clip-level embeddings to identify near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.

Before You Start

  • Make sure you have embeddings written by the ClipWriterStage under ce1_embd_parquet/. For a runnable workflow, refer to the Split and Remove Duplicates Workflow. The embeddings must be Parquet files containing the columns id and embedding.
  • Verify local paths or configure S3-compatible credentials. Provide storage_options in read/write keyword arguments when reading or writing cloud paths.
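When reading or writing S3 paths, storage_options typically follows the fsspec/s3fs convention. A minimal sketch, assuming an s3fs-compatible backend (all credential values and the endpoint URL are placeholders):

```python
# fsspec/s3fs-style credentials passed through read/write keyword arguments.
# The exact keys depend on your storage backend; these are s3fs options.
storage_options = {
    "key": "YOUR_ACCESS_KEY_ID",
    "secret": "YOUR_SECRET_ACCESS_KEY",
    "client_kwargs": {"endpoint_url": "https://s3.example.com"},
}

read_kwargs = {"storage_options": storage_options}
write_kwargs = {"storage_options": storage_options}
```

For local paths, leave storage_options as None.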

How it Works

Duplicate identification operates on clip-level embeddings produced during processing:

  1. Inputs

    • Parquet batches from ClipWriterStage under ce1_embd_parquet/
    • Columns: id, embedding
  2. Outputs

    • Cluster: KMeansStage partitions embeddings and writes centroid distances (for example, cosine_dist_to_cent).
    • Pairwise: PairwiseStage computes within-cluster similarity on GPU and, for each clip, emits max_id and cosine_sim_score. Ranking controls whether to prefer outliers (“hard”) or representatives (“easy”).
    • Identify: IdentifyDuplicatesStage filters pairs with cosine_sim_score >= 1.0 - eps and writes Parquet files of duplicate ids for removal during export.
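The Identify step's decision rule can be sketched in a few lines (is_duplicate is an illustrative helper, not part of the library):

```python
def is_duplicate(cosine_sim_score: float, eps: float) -> bool:
    """A pair is flagged as duplicate when similarity meets or exceeds 1.0 - eps."""
    return cosine_sim_score >= 1.0 - eps

# With eps=0.1, pairs with similarity >= 0.9 are flagged.
print(is_duplicate(0.95, 0.1))  # True
print(is_duplicate(0.85, 0.1))  # False
```

A smaller eps is stricter: it flags only near-identical clips, while a larger eps also catches looser semantic duplicates.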

Quickstart

Use the semantic duplicate workflow with clip embeddings written to Parquet.

The SemanticDeduplicationWorkflow provides an end-to-end interface that orchestrates K-means clustering, pairwise similarity computation, and duplicate identification:

from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow
from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
from nemo_curator.backends.xenna import XennaExecutor

workflow = SemanticDeduplicationWorkflow(
    input_path="/path/to/embeddings/",  # e.g., ce1_embd_parquet/
    output_path="/path/to/duplicates/",
    cache_path="/path/to/cache/",  # Optional: defaults to output_path
    n_clusters=1000,
    id_field="id",
    embedding_field="embedding",
    embedding_dim=768,  # Embedding dimension (768 for Cosmos-Embed1, varies by model)
    input_filetype="parquet",
    eps=0.1,  # Similarity threshold: cosine_sim >= 1.0 - eps identifies duplicates
    ranking_strategy=RankingStrategy.metadata_based(
        metadata_cols=["cosine_dist_to_cent", "id"],
        ascending=[True, True],
    ),
    pairwise_batch_size=1024,
    read_kwargs={"storage_options": None},  # Add S3 credentials here if needed
    write_kwargs={"storage_options": None},
    verbose=True,
)

# Run with XennaExecutor (GPU-accelerated)
executor = XennaExecutor()
results = workflow.run(executor)

Determine eps first: The eps threshold is highly data-dependent. Before running the full workflow, we recommend running only the K-means and pairwise steps (set eps=None) to inspect similarity distributions and choose an appropriate eps value. See Recommended Workflow: Determine eps First below for details.

The workflow automatically:

  1. Runs K-means clustering to partition embeddings into clusters
  2. Computes pairwise similarity within each cluster
  3. Identifies duplicates based on the eps threshold
  4. Writes duplicate IDs to output_path/duplicates/
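Downstream, the written duplicate IDs can be used to filter your clip table with an anti-join. A minimal pandas sketch, with hypothetical in-memory frames standing in for your clip table and the duplicates Parquet under output_path/duplicates/:

```python
import pandas as pd

# Hypothetical stand-ins: your clip metadata table and the duplicate
# IDs written by the workflow (a single "id" column).
clips = pd.DataFrame(
    {"id": ["a", "b", "c", "d"], "path": ["a.mp4", "b.mp4", "c.mp4", "d.mp4"]}
)
duplicates = pd.DataFrame({"id": ["b", "d"]})

# Anti-join: keep only clips whose id is not flagged as a duplicate.
kept = clips[~clips["id"].isin(duplicates["id"])]
print(sorted(kept["id"]))  # ['a', 'c']
```

In practice you would load both sides with pd.read_parquet (passing storage_options for cloud paths) before filtering.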

For detailed information about how semantic deduplication works, see Semantic Deduplication. The algorithm and concepts are the same for video clips as for text documents.

For advanced users who need fine-grained control, you can run the stages individually:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

pipe = Pipeline(name="semantic_dedup")

pipe.add_stage(
    KMeansStage(
        n_clusters=1000,
        id_field="id",
        embedding_field="embedding",
        input_path="/path/to/embeddings/",
        output_path="/path/to/kmeans_out/",
        input_filetype="parquet",
        embedding_dim=512,
    )
)

pipe.add_stage(
    PairwiseStage(
        id_field="id",
        embedding_field="embedding",
        input_path="/path/to/kmeans_out/",
        output_path="/path/to/pairwise_out/",
        ranking_strategy=RankingStrategy.metadata_based(
            metadata_cols=["cosine_dist_to_cent", "id"],
            ascending=[True, True],
        ),
    )
)

pipe.add_stage(
    IdentifyDuplicatesStage(
        output_path="/path/to/duplicates/",
        eps=0.1,
    )
)

pipe.run()

No example script flags are available for duplicate identification in the split pipeline. Run these stages as a separate job against Parquet embeddings written by the example pipeline’s writer.

Recommended Workflow: Determine eps First

The eps parameter is highly data-dependent and affects how many duplicates are identified. We recommend a two-step approach:

  1. Run K-means and pairwise without duplicate identification

    • Use SemanticDeduplicationWorkflow with eps=None (or run the K-means and pairwise stages individually)
    • This generates pairwise similarity scores without identifying duplicates
  2. Inspect the similarity distribution

    • Analyze the cosine_sim_score values in the pairwise results
    • Determine an appropriate eps threshold based on your data characteristics
    • For example, if 20% of pairs have similarity ≥ 0.9, eps=0.1 flags that 20% (since cosine_sim >= 1.0 - eps)
  3. Run the full workflow with your chosen eps

    • Use SemanticDeduplicationWorkflow with the determined eps value
    • Or run IdentifyDuplicatesStage separately on the pairwise results
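Step 2 can be done with a quick NumPy pass over the collected cosine_sim_score values (the scores below are made-up illustrative data):

```python
import numpy as np

# Hypothetical cosine_sim_score values gathered from the pairwise outputs.
scores = np.array([0.99, 0.95, 0.91, 0.80, 0.72, 0.60, 0.93, 0.88])

# Fraction of pairs at or above each candidate similarity cutoff;
# the corresponding threshold parameter is eps = 1.0 - cutoff.
for cutoff in (0.95, 0.90, 0.85):
    frac = float((scores >= cutoff).mean())
    print(f"similarity >= {cutoff}: {frac:.1%} of pairs (eps = {1.0 - cutoff:.2f})")
```

Pick the cutoff where the flagged fraction matches how aggressively you want to deduplicate, then set eps accordingly.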

For a detailed example of this workflow with similarity analysis, see the Step-by-Step Semantic Deduplication tutorial (demonstrated on text data, but the approach applies to video clips as well).

Custom Ranking with Metadata Columns

If your embedding Parquet files contain additional metadata columns (such as video quality scores, duration, resolution, or other clip attributes), you can use RankingStrategy.metadata_based() to create custom ranking methods. This allows you to prioritize which clips to keep within duplicate groups based on your specific criteria.

For example, to prefer higher quality or longer duration clips:

from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy

# Prefer clips with higher quality scores, then longer duration
ranking_strategy = RankingStrategy.metadata_based(
    metadata_cols=["quality_score", "duration"],
    ascending=[False, False],  # False = descending (higher is better)
)

# Or prefer clips closer to cluster centroid, then by quality
ranking_strategy = RankingStrategy.metadata_based(
    metadata_cols=["cosine_dist_to_cent", "quality_score"],
    ascending=[True, False],  # Closer to centroid first, then higher quality
)

The metadata columns must be present in your embedding Parquet files and will be preserved through the K-means stage. Specify these columns using the metadata_fields parameter in KMeansStage or SemanticDeduplicationWorkflow.

Parameters

KMeansStage parameters:

  • n_clusters: Number of clusters for K-means (for example, 1,000+ for multi-million clip sets).
  • id_field: Column name containing clip IDs (for example, "id").
  • embedding_field: Column with vector data (for example, "embedding").
  • input_path: Path to the Parquet embeddings directory from the writer.
  • output_path: Directory for K-means outputs (sharded by cluster).
  • input_filetype: Use "parquet" for video embeddings.
  • embedding_dim: Embedding dimension (varies by Cosmos-Embed1 variant; 768 for most).

PairwiseStage parameters:

  • ranking_strategy: Ranking strategy for selecting which clips to keep within clusters. Use RankingStrategy.metadata_based(metadata_cols=[...], ascending=[...]) to sort by metadata columns (for example, metadata_cols=["cosine_dist_to_cent", "id"]). Use RankingStrategy.random() for random selection.
  • pairwise_batch_size: Batch size for GPU pairwise computation (default 1024). Increase with available memory.
  • embedding_dim: Embedding dimension, used for memory estimates and batching.
  • id_field: Column name containing clip IDs (for example, "id").
  • embedding_field: Column with vector data (for example, "embedding").
  • input_path: Path to the K-means output directory (sharded by cluster).
  • output_path: Directory for pairwise similarity outputs.

IdentifyDuplicatesStage parameters:

  • output_path: Directory to write Parquet files containing duplicate ids.
  • eps: Similarity threshold: pairs with cosine_sim_score >= 1.0 - eps are identified as duplicates (for example, 0.1 means similarity >= 0.9).
  • read_kwargs: Optional keyword arguments for reading files (including storage_options for cloud storage).
  • write_kwargs: Optional keyword arguments for writing files (including storage_options for cloud storage).
  • verbose: Enable verbose logging (default False).

The SemanticDeduplicationWorkflow accepts parameters from all three stages (KMeansStage, PairwiseStage, and IdentifyDuplicatesStage). See the stage parameter lists above for descriptions.

SemanticDeduplicationWorkflow parameters:

  • cache_path: Directory for intermediate results (K-means and pairwise outputs). Defaults to output_path if not specified.
  • cache_kwargs: Optional keyword arguments for writing cache files (including storage_options for cloud storage). Defaults to write_kwargs if not specified.
  • clear_output: Clear the output directory before running (default True).
  • metadata_fields: List of metadata field names to preserve in output (optional).

For parameters shared with the individual stages, refer to:

  • KMeansStage: input_path, output_path, n_clusters, id_field, embedding_field, embedding_dim
  • PairwiseStage: ranking_strategy, pairwise_batch_size
  • IdentifyDuplicatesStage: eps
  • Common parameters: read_kwargs, write_kwargs, verbose

Removing Duplicates

The duplicate identification stages (IdentifyDuplicatesStage or SemanticDeduplicationWorkflow with eps specified) write Parquet files containing duplicate clip IDs to the output directory (typically output_path/duplicates/). These files contain a single column id with the IDs of clips that should be removed.

It is your responsibility to exclude these duplicate IDs when exporting or persisting your final dataset. The removal process depends on how you want to persist and shard your data: