Embeddings#
Generate clip-level embeddings for search, question answering, filtering, and duplicate removal.
Use Cases#
Prepare semantic vectors for search, clustering, and near-duplicate detection.
Score optional text prompts against clip content.
Enable downstream filtering or retrieval tasks that need clip-level vectors.
Before You Start#
Create clips upstream. Refer to Clipping.
Provide frames for embeddings or sample at the required rate. Refer to Frame Extraction.
Ensure model weights are accessible on each node (the stages download weights if they're missing).
Quickstart#
Use the pipeline stages or the example script flags to generate clip-level embeddings.
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.clipping.clip_frame_extraction import ClipFrameExtractionStage
from nemo_curator.utils.decoder_utils import FrameExtractionPolicy, FramePurpose
from nemo_curator.stages.video.embedding.internvideo2 import (
    InternVideo2FrameCreationStage,
    InternVideo2EmbeddingStage,
)

pipe = Pipeline(name="video_embeddings_example")
pipe.add_stage(
    ClipFrameExtractionStage(
        extraction_policies=(FrameExtractionPolicy.sequence,),
        extract_purposes=(FramePurpose.EMBEDDINGS,),
        target_res=(-1, -1),
        verbose=True,
    )
)
pipe.add_stage(InternVideo2FrameCreationStage(model_dir="/models", target_fps=2.0, verbose=True))
pipe.add_stage(InternVideo2EmbeddingStage(model_dir="/models", gpu_memory_gb=20.0, verbose=True))
pipe.run()
```
```shell
# InternVideo2
python -m nemo_curator.examples.video.video_split_clip_example \
    ... \
    --generate-embeddings \
    --embedding-algorithm internvideo2 \
    --embedding-gpu-memory-gb 20.0

# Cosmos-Embed1 (224p)
python -m nemo_curator.examples.video.video_split_clip_example \
    ... \
    --generate-embeddings \
    --embedding-algorithm cosmos-embed1-224p \
    --embedding-gpu-memory-gb 20.0
```
Embedding Options#
Cosmos-Embed1#
Add `CosmosEmbed1FrameCreationStage` to transform extracted frames into model-ready tensors.

```python
from nemo_curator.stages.video.embedding.cosmos_embed1 import (
    CosmosEmbed1FrameCreationStage,
    CosmosEmbed1EmbeddingStage,
)

frames = CosmosEmbed1FrameCreationStage(
    model_dir="/models",
    variant="224p",  # or "336p", "448p"
    target_fps=2.0,
    verbose=True,
)
```
Add `CosmosEmbed1EmbeddingStage` to generate `clip.cosmos_embed1_embedding` and, optionally, `clip.cosmos_embed1_text_match`.

```python
embed = CosmosEmbed1EmbeddingStage(
    model_dir="/models",
    variant="224p",
    gpu_memory_gb=20.0,
    verbose=True,
)
```
Parameters#

`CosmosEmbed1FrameCreationStage`:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_dir` | str |  | Directory for model utilities and configs used to format input frames. |
| `variant` | {"224p", "336p", "448p"} |  | Resolution preset that controls the model's expected input size. |
| `target_fps` | float | 2.0 | Source sampling rate used to select frames; may re-extract at a higher FPS if needed. |
|  | int | 3 | CPU cores used when on-the-fly re-extraction is required. |
| `verbose` | bool |  | Log per-clip decisions and re-extraction messages. |

`CosmosEmbed1EmbeddingStage`:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_dir` | str |  | Directory for model weights; downloaded on each node if missing. |
| `variant` | {"224p", "336p", "448p"} |  | Resolution preset used by the model weights. |
| `gpu_memory_gb` | int | 20 | Approximate GPU memory reservation per worker. |
|  | list[str] \| None |  | Optional text prompts to score against the clip embedding. |
| `verbose` | bool |  | Log setup and per-clip outcomes. |
Outputs#
- `clip.cosmos_embed1_frames` → temporary tensors used by the embedding stage
- `clip.cosmos_embed1_embedding` → final clip-level vector (NumPy array)
- Optional: `clip.cosmos_embed1_text_match`
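Because the final embedding is a plain NumPy vector, the near-duplicate detection mentioned under Use Cases reduces to cosine similarity between clip vectors. A minimal sketch, using toy vectors in place of real `clip.cosmos_embed1_embedding` values (the 0.98 threshold is illustrative, not a library default):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real clip-level embeddings.
emb_a = np.array([0.1, 0.9, 0.2])
emb_b = np.array([0.1, 0.9, 0.2])   # identical clip -> similarity 1.0
emb_c = np.array([0.9, -0.1, 0.4])  # unrelated clip -> low similarity

# Clips whose similarity exceeds a chosen threshold are near-duplicate
# candidates; the right threshold is workload-dependent.
is_duplicate = cosine_similarity(emb_a, emb_b) > 0.98
```

The same comparison works for any of the embedding variants, since all produce clip-level NumPy vectors.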
InternVideo2#
Add `InternVideo2FrameCreationStage` to transform extracted frames into model-ready tensors.

```python
from nemo_curator.stages.video.embedding.internvideo2 import (
    InternVideo2FrameCreationStage,
    InternVideo2EmbeddingStage,
)

frames = InternVideo2FrameCreationStage(
    model_dir="/models",
    target_fps=2.0,
    verbose=True,
)
```
Add `InternVideo2EmbeddingStage` to generate `clip.intern_video_2_embedding` and, optionally, `clip.intern_video_2_text_match`.

```python
embed = InternVideo2EmbeddingStage(
    model_dir="/models",
    gpu_memory_gb=20.0,
    verbose=True,
)
```
Parameters#

`InternVideo2FrameCreationStage`:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_dir` | str |  | Directory for model utilities used to format input frames. |
| `target_fps` | float | 2.0 | Source sampling rate used to select frames; may re-extract at a higher FPS if needed. |
| `verbose` | bool |  | Log re-extraction and per-clip messages. |

`InternVideo2EmbeddingStage`:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_dir` | str |  | Directory for model weights; downloaded on each node if missing. |
| `gpu_memory_gb` | float | 10.0 | Approximate GPU memory reservation per worker. |
|  | float | 1.0 | GPUs reserved per worker for embedding. |
|  | list[str] \| None |  | Optional text prompts to score against the clip embedding. |
| `verbose` | bool |  | Log setup and per-clip outcomes. |
Outputs#
- `clip.intern_video_2_frames` → temporary tensors used by the embedding stage
- `clip.intern_video_2_embedding` → final clip-level vector (NumPy array)
- Optional: `clip.intern_video_2_text_match`
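For the search and retrieval use cases, the per-clip vectors can be stacked into a matrix and L2-normalized so that ranking against a query is a single matrix product. A sketch with toy data standing in for real `clip.intern_video_2_embedding` arrays:

```python
import numpy as np

# Stack per-clip vectors into a matrix and L2-normalize each row so that
# dot products equal cosine similarities.
embeddings = np.array([
    [0.2, 0.8, 0.1],
    [0.9, 0.1, 0.3],
    [0.1, 0.7, 0.2],
], dtype=np.float32)
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

query = normed[0]              # use clip 0 itself as the query vector
scores = normed @ query        # cosine similarity of every clip to the query
ranking = np.argsort(-scores)  # clip indices, best match first
```

For large collections the same normalized matrix can be handed to an approximate nearest-neighbor index instead of a brute-force product.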
Troubleshooting#
- Not enough frames for embeddings: increase `target_fps` during frame extraction, or adjust clip length so that the model receives the required number of frames.
- Out of memory during embedding: lower `gpu_memory_gb`, reduce the batch size if exposed, or use a smaller resolution variant.
- Weights not found on node: confirm `model_dir` and network access. The stages download weights if they're missing.
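For the first item, the required `target_fps` follows from simple arithmetic: a clip of duration `d` seconds sampled at `target_fps` yields roughly `d * target_fps` frames. A quick back-of-the-envelope check (the helper name is illustrative, not part of the library):

```python
def sampled_frames(duration_s: float, target_fps: float) -> int:
    """Approximate frame count a clip yields at a given sampling rate."""
    return int(duration_s * target_fps)

# A 4 s clip at the default 2.0 fps yields 8 frames.
eight = sampled_frames(4.0, 2.0)

# A 1.5 s clip at 2.0 fps yields only 3 frames; doubling target_fps
# to 4.0 recovers 6 frames for the same clip.
six = sampled_frames(1.5, 4.0)
```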
Next Steps#
Use embeddings for duplicate removal. Refer to Duplicate Removal.
Generate captions and previews for review workflows. Refer to Captions & Preview.