Learn how to run the splitting pipeline to generate clips and embeddings, then remove near-duplicate clips using semantic duplicate removal.
Run the splitting example. Set VIDEO_DIR, OUT_DIR, and MODEL_DIR first.
Writer-related flags you can add:
The pipeline writes embeddings under $OUT_DIR/ce1_embd_parquet/ when using Cosmos-Embed1.
The pipeline writes embeddings to Parquet with two columns:
Use K-means clustering followed by pairwise similarity on the Parquet embeddings.
which_to_keep selects the representative within each cluster: “hard” keeps outliers far from the centroid, “easy” keeps the nearest to the centroid, and “random” ignores distance and picks randomly.
sim_metric sets the distance used for similarity: “cosine” uses cosine distance (1 − cosine similarity), while “l2” uses Euclidean distance.
pairwise_batch_size controls how many items are processed per GPU batch during pairwise similarity; larger values can be faster but require more GPU memory.
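To make the three parameters above concrete, here is a minimal pure-Python sketch of what they control within a single cluster. It is an illustration, not the library's implementation: the function names (rank_cluster, max_similarity_to_earlier) are hypothetical, embeddings are assumed to be nonzero vectors, and the reading of max_id and cosine_sim_score as "the most similar earlier-ranked item and its similarity" is an assumption based on the column names.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity (assumes both vectors are nonzero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


METRICS = {"cosine": cosine_distance, "l2": l2_distance}


def rank_cluster(ids, embeddings, centroid, which_to_keep="hard",
                 sim_metric="cosine", seed=0):
    """Order one cluster so that preferred representatives come first."""
    dist = METRICS[sim_metric]
    scored = [(dist(e, centroid), i) for i, e in zip(ids, embeddings)]
    if which_to_keep == "hard":      # farthest from the centroid first
        scored.sort(key=lambda t: t[0], reverse=True)
    elif which_to_keep == "easy":    # nearest to the centroid first
        scored.sort(key=lambda t: t[0])
    elif which_to_keep == "random":  # distance is ignored
        random.Random(seed).shuffle(scored)
    else:
        raise ValueError(f"unknown which_to_keep: {which_to_keep!r}")
    return [i for _, i in scored]


def max_similarity_to_earlier(ids, embeddings, pairwise_batch_size=2):
    """For each item, find the most similar item ranked before it.

    Rows are visited in chunks of pairwise_batch_size. On a GPU each chunk
    would be one matrix multiply; here the loop only shows the chunking,
    so the batch size changes the work per step but not the result.
    """
    rows = []
    for start in range(0, len(ids), pairwise_batch_size):
        stop = min(start + pairwise_batch_size, len(ids))
        for r in range(start, stop):
            best_id, best_sim = None, 0.0
            for c in range(r):
                sim = 1.0 - cosine_distance(embeddings[r], embeddings[c])
                if best_id is None or sim > best_sim:
                    best_id, best_sim = ids[c], sim
            rows.append({"id": ids[r], "max_id": best_id,
                         "cosine_sim_score": best_sim})
    return rows


if __name__ == "__main__":
    ids = ["a", "b", "c"]
    embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    order = rank_cluster(ids, embeddings, centroid=[1.0, 0.0], which_to_keep="easy")
    print(order)
    print(max_similarity_to_earlier(ids, embeddings))
```

In this sketch the first-ranked item has no earlier neighbour, so it is never flagged as a duplicate. That is why the ranking chosen by which_to_keep decides which member of a near-duplicate group survives.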
K-means outputs are written under ${OUTPUT_DIR}/kmeans/.
Pairwise outputs are written under ${OUTPUT_DIR}/pairwise/ with columns including id, max_id, and cosine_sim_score.
After duplicate removal, export curated clips and metadata for training. Common video exports:
${OUT_DIR}
Video-specific pointers:
Use the ClipWriterStage path helpers to locate outputs: nemo_curator/stages/video/io/clip_writer.py.
get_output_path_processed_videos(OUT_DIR)
get_output_path_processed_clip_chunks(OUT_DIR), get_output_path_previews(OUT_DIR)
${OUT_DIR}/ce1_embd_parquet
The following example packages clips and minimal JSON metadata into tar files:
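A minimal sketch of this kind of packaging, using only the Python standard library. It assumes the curated clips are .mp4 files in a single directory. The function name package_clips, the shard naming scheme, and the JSON fields are illustrative choices, not NeMo Curator APIs.

```python
import io
import json
import tarfile
from pathlib import Path


def package_clips(clip_dir, out_dir, clips_per_shard=2):
    """Write each clip plus a small JSON sidecar into numbered tar shards."""
    clip_paths = sorted(Path(clip_dir).glob("*.mp4"))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    shard_paths = []
    for shard_idx, start in enumerate(range(0, len(clip_paths), clips_per_shard)):
        shard_path = out_dir / f"clips-{shard_idx:05d}.tar"
        with tarfile.open(str(shard_path), "w") as tar:
            for clip in clip_paths[start:start + clips_per_shard]:
                # Store the clip under its bare file name.
                tar.add(str(clip), arcname=clip.name)
                # Store minimal metadata next to it as <clip_id>.json.
                meta = json.dumps({"clip_id": clip.stem, "file": clip.name}).encode("utf-8")
                info = tarfile.TarInfo(name=f"{clip.stem}.json")
                info.size = len(meta)
                tar.addfile(info, io.BytesIO(meta))
        shard_paths.append(shard_path)
    return shard_paths
```

Keeping each clip and its JSON under the same base name lets a downstream loader pair them without a separate index.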