nemo_curator.stages.deduplication.fuzzy.connected_components
Module Contents
Classes
API
Bases: ProcessingStage[FileGroupTask, FileGroupTask], DeduplicationIO
Set up the sub-communicator for cuGraph communications.
This method is specific to cuGraph comms and initializes the sub-communicator.
Process a batch of input files containing edges between documents. Compute the weakly connected components of the graph and write a mapping of document ids to their connected component id.
Parameters
tasks: list[FileGroupTask] A list of FileGroupTasks containing the input files.
Returns
list[FileGroupTask] A list of FileGroupTasks containing the output doc_id to connected component id mapping.
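To illustrate the kind of mapping described above, here is a minimal CPU-only sketch of weakly connected components using union-find. It is not the stage's cuGraph implementation; the function name weakly_connected_components, the sample edges, and the choice of the smallest document id as the component id are assumptions made for illustration only.

```python
def weakly_connected_components(edges):
    """Map each document id to a component id, ignoring edge direction.

    Hypothetical helper: the component id is the smallest document id
    in the component, which is an illustrative labeling choice.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for src, dst in edges:
        root_a, root_b = find(src), find(dst)
        if root_a != root_b:
            # Keep the smaller root so labels are deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return {doc_id: find(doc_id) for doc_id in parent}


# Made-up edges between document ids; (3, 2) joins 3 to the {1, 2} group
# even though the edge points "into" 2, because direction is ignored.
edges = [(1, 2), (3, 2), (4, 5)]
mapping = weakly_connected_components(edges)
```

Documents that share a component id are transitively linked by at least one chain of edges, so in this sketch documents 1, 2, and 3 form one group and documents 4 and 5 form another.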
Compute the weakly connected components of a graph.
This method loads a chunk of the graph, creates a cuGraph object, and computes the weakly connected components using the MGGraph library.
Parameters
start: int The start index of the chunk.
stop: int The stop index of the chunk.