nemo_curator.stages.deduplication.fuzzy.connected_components

Module Contents

Classes

Name | Description
ConnectedComponentsStage | -

API

class nemo_curator.stages.deduplication.fuzzy.connected_components.ConnectedComponentsStage(
output_path: str,
source_field: str | None = None,
destination_field: str | None = None,
read_kwargs: dict | None = None,
write_kwargs: dict | None = None
)

Bases: ProcessingStage[FileGroupTask, FileGroupTask], DeduplicationIO

destination_field = destination_field or f'{CURATOR_DEDUP_ID_STR}_y'
name = self.__class__.__name__
output_fs
output_path = self.output_fs.sep.join([output_path, self.name])
read_kwargs = read_kwargs if read_kwargs is not None else {}
resources = Resources(cpus=1.0, gpus=1.0)
source_field = source_field or f'{CURATOR_DEDUP_ID_STR}_x'
write_kwargs = write_kwargs if write_kwargs is not None else {}
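
The default expressions above use two different fallback idioms: the field names fall back with `or`, while the keyword-argument dicts fall back with `is not None`. The following is a minimal plain-Python sketch of that behaviour, not the library's code. The helper `resolve_defaults`, the `"/"` separator, and the placeholder value given to `CURATOR_DEDUP_ID_STR` are assumptions made for illustration only.

```python
# Illustrative only: mirrors the default expressions in the attribute list.
# The real CURATOR_DEDUP_ID_STR is defined by nemo_curator; this value is a placeholder.
CURATOR_DEDUP_ID_STR = "dedup_id"  # assumed placeholder, not the library's constant


def resolve_defaults(output_path, source_field=None, destination_field=None,
                     read_kwargs=None, write_kwargs=None,
                     name="ConnectedComponentsStage", sep="/"):
    """Hypothetical helper applying the same fallbacks as the attribute list."""
    return {
        # `or` falls back for None and for any falsy value such as "".
        "source_field": source_field or f"{CURATOR_DEDUP_ID_STR}_x",
        "destination_field": destination_field or f"{CURATOR_DEDUP_ID_STR}_y",
        # The stage name is appended to the caller's output path.
        "output_path": sep.join([output_path, name]),
        # `is not None` keeps an explicitly passed empty dict as-is.
        "read_kwargs": read_kwargs if read_kwargs is not None else {},
        "write_kwargs": write_kwargs if write_kwargs is not None else {},
    }


resolved = resolve_defaults("/data/dedup", source_field="")
print(resolved["source_field"])
print(resolved["output_path"])
```

Under these assumptions, passing an empty string for `source_field` still selects the `_x` default, and the effective output location is the given path with the stage name appended.
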
nemo_curator.stages.deduplication.fuzzy.connected_components.ConnectedComponentsStage._setup_post() -> None

Set up the sub-communicator for cuGraph communications.

This method is specific to cuGraph comms and initializes the sub-communicator.

nemo_curator.stages.deduplication.fuzzy.connected_components.ConnectedComponentsStage.process(
task: nemo_curator.tasks.file_group.FileGroupTask
) -> nemo_curator.tasks.file_group.FileGroupTask

nemo_curator.stages.deduplication.fuzzy.connected_components.ConnectedComponentsStage.process_batch(
tasks: list[nemo_curator.tasks.file_group.FileGroupTask]
) -> list[nemo_curator.tasks.file_group.FileGroupTask]

Process a batch of input files containing edges between documents. Compute the weakly connected components of the graph and write a mapping of document ids to their connected component id.

Parameters

tasks: list[FileGroupTask] A list of FileGroupTasks containing the input files.

Returns

list[FileGroupTask] A list of FileGroupTasks containing the output doc_id to connected component id mapping.
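
To make the edges-to-mapping step concrete, here is a small CPU-only sketch that computes weakly connected components with a union-find structure. It is not the stage's GPU implementation, which relies on cuGraph. The function name `wcc_labels` and the choice of the smallest document id as the component id are assumptions for illustration; the real component ids may be arbitrary labels.

```python
def wcc_labels(edges):
    """Map each document id to a component id using union-find (CPU sketch)."""
    parent = {}

    def find(v):
        # Register unseen vertices as their own root.
        parent.setdefault(v, v)
        root = v
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every vertex on the path directly at the root.
        while parent[v] != root:
            parent[v], v = root, parent[v]
        return root

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Edge direction is ignored, which is what "weakly" connected means.
            # Keeping the smaller root makes the component id the smallest member.
            parent[max(root_src, root_dst)] = min(root_src, root_dst)

    return {v: find(v) for v in parent}


# Documents 1, 2, 3 are linked through 2; documents 4 and 5 form a second group.
edges = [(1, 2), (3, 2), (4, 5)]
components = wcc_labels(edges)
print(components)
```

Documents that share a component id in the resulting mapping belong to the same group of connected documents.
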

nemo_curator.stages.deduplication.fuzzy.connected_components.ConnectedComponentsStage.ray_stage_spec() -> dict[str, typing.Any]

nemo_curator.stages.deduplication.fuzzy.connected_components.ConnectedComponentsStage.setup(
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

nemo_curator.stages.deduplication.fuzzy.connected_components.ConnectedComponentsStage.weakly_connected_components(
df: cudf.DataFrame,
src_col: str,
dst_col: str
) -> None

Compute the weakly connected components of a graph.

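This method creates a cuGraph graph object from the edges in df and computes the weakly connected components using cuGraph's MGGraph.

Parameters

df: cudf.DataFrame The edge list of the graph, one row per edge.
src_col: str The name of the column in df that holds each edge's source vertex id.
dst_col: str The name of the column in df that holds each edge's destination vertex id.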