Copyright © 2026, NVIDIA Corporation.

Duplicate Removal

Use clip-level embeddings to identify and remove near-duplicate video clips so your dataset remains compact, diverse, and efficient to train on.

How it Works

Duplicate removal operates on clip-level embeddings produced during processing:

  1. Inputs

    • Parquet batches from ClipWriterStage under iv2_embd_parquet/ or ce1_embd_parquet/
    • Columns: id, embedding
  2. Outputs

    • Cluster: KMeansStage partitions embeddings and writes centroid distances (for example, cosine_dist_to_cent).
    • Pairwise: PairwiseStage computes within-cluster similarity on GPU and, for each clip, emits max_id and cosine_sim_score. The ranking strategy controls which clips are preferred when deciding what to keep: outliers (“hard”) or representatives (“easy”).
    • Identify: IdentifyDuplicatesStage filters pairs with cosine_sim_score >= 1.0 - eps and writes Parquet files of duplicate ids for removal during export.
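
The pairwise and identify steps above can be sketched on toy data in plain Python. This is an illustration only, not the library's GPU implementation: the helper names `cosine_sim` and `find_duplicates` are hypothetical, and the sketch assumes each clip is compared only against clips ranked ahead of it within the same cluster, so the top-ranked clip is never flagged.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def find_duplicates(cluster, eps):
    """Return ids of clips that are near-duplicates of a higher-ranked clip.

    cluster: list of (clip_id, embedding) tuples, already sorted by rank
             (assumption: clips earlier in the list are preferred for keeping).
    """
    rows = []
    for i, (clip_id, emb) in enumerate(cluster):
        max_id, best_sim = None, -1.0
        # Compare only against clips ranked ahead of this one.
        for other_id, other_emb in cluster[:i]:
            sim = cosine_sim(emb, other_emb)
            if sim > best_sim:
                max_id, best_sim = other_id, sim
        rows.append({"id": clip_id, "max_id": max_id, "cosine_sim_score": best_sim})

    # Same threshold as described above: cosine_sim_score >= 1.0 - eps.
    return [
        r["id"]
        for r in rows
        if r["max_id"] is not None and r["cosine_sim_score"] >= 1.0 - eps
    ]


cluster = [
    ("clip_a", [1.0, 0.0]),
    ("clip_b", [0.99, 0.05]),  # nearly identical to clip_a
    ("clip_c", [0.0, 1.0]),    # orthogonal to clip_a
]
duplicates = find_duplicates(cluster, eps=0.1)
print(duplicates)
```

With eps=0.1 the threshold is 0.9, so only clip_b is flagged. A smaller eps flags only closer matches; a larger eps removes more clips.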

Before You Start

  • Verify local paths or configure S3-compatible credentials. Provide storage_options in read/write keyword arguments when reading or writing cloud paths.
  • Create output directories for KMeansStage, PairwiseStage, and IdentifyDuplicatesStage.
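
A minimal sketch of both preparation steps, using only the standard library. The directory names mirror the Quickstart paths; the helper names, the demo base directory, and the environment variable names are illustrative. The `read_kwargs` and `write_kwargs` variable names are an assumption about how the stages accept read/write keyword arguments, so check the stage signatures before relying on them. The `key`, `secret`, and `client_kwargs` entries follow the s3fs convention for S3-compatible storage.

```python
import os
import tempfile
from pathlib import Path


def make_output_dirs(base_dir):
    """Create one output directory per stage and return their paths."""
    dirs = {}
    for name in ("kmeans_out", "pairwise_out", "duplicates"):
        path = Path(base_dir) / name
        path.mkdir(parents=True, exist_ok=True)
        dirs[name] = path
    return dirs


def s3_kwargs():
    """Build keyword arguments carrying storage_options for cloud paths."""
    storage_options = {
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        # Only needed for non-AWS, S3-compatible endpoints.
        "client_kwargs": {"endpoint_url": os.environ.get("AWS_ENDPOINT_URL")},
    }
    return {"storage_options": storage_options}


# Demo base directory; replace with your own local path.
output_dirs = make_output_dirs(Path(tempfile.gettempdir()) / "curator_dedup_demo")

# Assumed parameter names; pass these only when paths point at cloud storage.
read_kwargs = s3_kwargs()
write_kwargs = s3_kwargs()
```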

Quickstart

Use the generic semantic duplicate-removal stages with clip embeddings written to Parquet.

from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

# Cluster: partition clip embeddings and write centroid distances.
kmeans = KMeansStage(
    n_clusters=1000,
    id_field="id",
    embedding_field="embedding",
    input_path="/path/to/embeddings/",
    output_path="/path/to/kmeans_out/",
    input_filetype="parquet",
)

# Pairwise: compute within-cluster similarity, ranking clips by
# distance to centroid and then by id.
pairwise = PairwiseStage(
    id_field="id",
    embedding_field="embedding",
    input_path="/path/to/kmeans_out/",
    output_path="/path/to/pairwise_out/",
    ranking_strategy=RankingStrategy.metadata_based(
        metadata_cols=["cosine_dist_to_cent", "id"],
        ascending=[True, True],
    ),
)

# Identify: flag clips with cosine_sim_score >= 1.0 - eps as duplicates.
identify = IdentifyDuplicatesStage(
    output_path="/path/to/duplicates/",
    eps=0.1,
)
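
To make the ranking configuration above more concrete, here is a plain-Python sketch of what sorting by metadata columns with per-column ascending flags does. The `rank_clips` helper is hypothetical and is not part of NeMo Curator. The mapping to “easy” and “hard” is an assumption: ascending distance to centroid puts clips nearest the centroid first (representatives), and descending distance puts outliers first.

```python
def rank_clips(rows, metadata_cols, ascending):
    """Sort rows by several columns, each with its own sort direction."""
    ranked = list(rows)
    # Sort by the least significant column first; Python's sort is stable,
    # so earlier passes act as tie-breakers for later ones.
    for col, asc in reversed(list(zip(metadata_cols, ascending))):
        ranked.sort(key=lambda row: row[col], reverse=not asc)
    return ranked


rows = [
    {"id": "clip_c", "cosine_dist_to_cent": 0.30},
    {"id": "clip_a", "cosine_dist_to_cent": 0.05},
    {"id": "clip_b", "cosine_dist_to_cent": 0.05},
]

# Same settings as the Quickstart: closest to the centroid first, ties by id.
easy = rank_clips(rows, ["cosine_dist_to_cent", "id"], [True, True])

# Flipping the first flag ranks the farthest clips (outliers) first.
hard = rank_clips(rows, ["cosine_dist_to_cent", "id"], [False, True])

print([r["id"] for r in easy])
print([r["id"] for r in hard])
```

Including id as the second column gives a deterministic order when two clips sit at the same distance from the centroid.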

Input format: Parquet with columns id and embedding (produced by the video pipeline’s embedding stages and writer). Duplicate removal operates at the clip level using these embeddings. The IdentifyDuplicatesStage writes Parquet files containing duplicate ids; perform removal by filtering out rows whose id appears in those files during export.
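
The removal step described above amounts to an anti-join on id. The following is a sketch under stated assumptions: the helper names are hypothetical, the duplicate files are assumed to expose the ids in a column named "id", and `load_duplicate_ids` additionally assumes pandas with a Parquet engine such as pyarrow is installed. The filtering itself uses only the standard library.

```python
def load_duplicate_ids(duplicates_dir):
    """Read duplicate ids from the Parquet files in duplicates_dir.

    Assumption: the files contain an "id" column; requires pandas + pyarrow.
    """
    import pandas as pd  # imported lazily so the filter below has no dependencies

    return set(pd.read_parquet(duplicates_dir, columns=["id"])["id"])


def drop_duplicates(clips, duplicate_ids):
    """Keep only clips whose id is not listed as a duplicate."""
    return [clip for clip in clips if clip["id"] not in duplicate_ids]


clips = [
    {"id": "clip_a", "path": "a.mp4"},
    {"id": "clip_b", "path": "b.mp4"},
    {"id": "clip_c", "path": "c.mp4"},
]

# In a real run: duplicate_ids = load_duplicate_ids("/path/to/duplicates/")
duplicate_ids = {"clip_b"}
kept = drop_duplicates(clips, duplicate_ids)
print([clip["id"] for clip in kept])
```

Holding the duplicate ids in a set keeps each membership check constant time, which matters when the clip list is large.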

Embeddings are written by the ClipWriterStage under iv2_embd_parquet/ or ce1_embd_parquet/. For a runnable workflow, refer to the Split and Dedup tutorial.

Parameters

KMeansStage (semantic clustering)

| Parameter | Description |
| --- | --- |
| n_clusters | Number of clusters for K‑means (for example, 1,000+ for multi‑million clip sets). |
| id_field | Column name containing clip IDs (for example, "id"). |
| embedding_field | Column with vector data (for example, "embedding"). |
| input_path | Path to the Parquet embeddings directory from the writer. |
| output_path | Directory for K‑means outputs (sharded by cluster). |
| input_filetype | Use "parquet" for video embeddings. |
| embedding_dim | Embedding dimension (InternVideo2: 512; Cosmos‑Embed1 varies by variant). |