nemo_curator.stages.deduplication.semantic.kmeans
Module Contents
Classes
Data
API
Bases: ProcessingStage[FileGroupTask, _EmptyTask], DeduplicationIO
KMeans clustering stage that requires RAFT for distributed processing.
Computes the L2 distance from each embedding in the DataFrame to its nearest centroid. The embeddings are normalized. For cosine distance, the centroids need to be normalized as well.
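The distance computation described above can be illustrated with a minimal, pure-Python sketch. It is not the library's implementation: the helper names (`normalize`, `nearest_centroid`, `cosine_distance`) and the sample vectors are hypothetical, and plain lists stand in for the DataFrame. It shows why the centroids also need normalizing for cosine distance: `1 - dot(a, b)` equals the cosine distance only when both vectors have unit norm.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(embedding, centroids):
    """Return (index, L2 distance) of the centroid closest to the embedding."""
    distances = [l2_distance(embedding, c) for c in centroids]
    idx = min(range(len(centroids)), key=distances.__getitem__)
    return idx, distances[idx]


def cosine_distance(a, b):
    """1 - cosine similarity; both inputs are normalized before the dot product."""
    a, b = normalize(a), normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


# Hypothetical data: embeddings are normalized, centroids are not.
embeddings = [normalize(v) for v in [[3.0, 4.0], [0.0, 2.0]]]
centroids = [[2.0, 0.0], [0.0, 0.5]]

for emb in embeddings:
    idx, l2 = nearest_centroid(emb, centroids)
    cos = cosine_distance(emb, centroids[idx])
    print(f"nearest={idx} l2={l2:.4f} cosine={cos:.4f}")
```

When both vectors are unit norm, the two measures are related by `l2 ** 2 == 2 * cosine_distance`, so either can be derived from the other after normalization.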
Process a batch of FileGroupTasks using distributed RAFT KMeans.
In RAFT mode, each actor processes its assigned tasks, but the KMeans model is trained cooperatively across all actors using RAFT communication.
This method:
- Reads data from this actor’s assigned tasks
- Breaks data into subgroups to avoid cudf row limits
- Fits distributed KMeans model (coordinates with other actors via RAFT)
- Assigns cluster centroids back to each subgroup
- Writes the results for each subgroup
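The steps above can be sketched in a single process. This is only an illustration of the control flow under strong assumptions: it uses plain Lloyd's algorithm in pure Python in place of the distributed RAFT fit, lists in place of cuDF DataFrames, and an in-memory return value in place of file writes. Every name here (`process_batch`, `split_into_subgroups`, `fit_kmeans`, `max_rows`) is hypothetical and not part of the module's API.

```python
def split_into_subgroups(rows, max_rows):
    """Break rows into chunks of at most max_rows (stand-in for the row-limit split)."""
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the closest centroid; ties go to the lowest index."""
    return min(range(len(centroids)), key=lambda k: squared_l2(point, centroids[k]))


def fit_kmeans(subgroups, n_clusters, n_iter=10):
    """Plain Lloyd's algorithm over all subgroups; a local stand-in for the distributed fit."""
    points = [p for group in subgroups for p in group]
    centroids = [list(p) for p in points[:n_clusters]]  # naive init: first k points
    for _ in range(n_iter):
        members = [[] for _ in range(n_clusters)]
        for p in points:
            members[nearest(p, centroids)].append(p)
        for k, group in enumerate(members):
            if group:  # keep the previous centroid if a cluster is empty
                centroids[k] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def process_batch(tasks, n_clusters, max_rows):
    rows = [emb for task in tasks for emb in task]      # read this actor's tasks
    subgroups = split_into_subgroups(rows, max_rows)    # split to respect row limits
    centroids = fit_kmeans(subgroups, n_clusters)       # fit (distributed in the real stage)
    results = [                                         # assign, then "write" per subgroup
        [(emb, nearest(emb, centroids)) for emb in group] for group in subgroups
    ]
    return centroids, results


# Hypothetical input: two tasks holding 2-D embeddings.
tasks = [[[0.0, 0.0], [0.1, 0.0]], [[5.0, 5.0], [5.1, 5.0], [0.0, 0.1]]]
centroids, results = process_batch(tasks, n_clusters=2, max_rows=2)
for group in results:
    print([label for _, label in group])
```

In the real stage the fit step is the only one that requires coordination between actors; reading, splitting, assignment, and writing operate on each actor's own data.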
Bases: CompositeStage[_EmptyTask, _EmptyTask]
KMeans clustering stage that requires RAFT for distributed processing.
Initialize parent class after dataclass initialization.