nemo_curator.stages.deduplication.fuzzy.minhash

View as Markdown

Module Contents

Classes

NameDescription
GPUMinHash-
MinHashBase class for computing minhash signatures of a document corpus
MinHashStageProcessingStage for computing MinHash signatures on documents for fuzzy deduplication.

API

class nemo_curator.stages.deduplication.fuzzy.minhash.GPUMinHash(
seed: int = 42,
num_hashes: int = 260,
char_ngrams: int = 24,
use_64bit_hash: bool = False,
pool: bool = False
)

Bases: MinHash

seeds
nemo_curator.stages.deduplication.fuzzy.minhash.GPUMinHash.compute_minhashes(
text_series: cudf.Series
) -> cudf.Series

Compute minhash signatures for the given text series.

Parameters

text_series: cudf.Series Series containing text data to compute minhashes for

Returns

cudf.Series containing minhash signatures

nemo_curator.stages.deduplication.fuzzy.minhash.GPUMinHash.generate_seeds(
n_permutations: int = 260,
seed: int = 0,
bit_width: int = 32
) -> numpy.ndarray

Generate seeds for all minhash permutations based on the given seed.

nemo_curator.stages.deduplication.fuzzy.minhash.GPUMinHash.minhash32(
ser: cudf.Series
) -> cudf.Series

Compute 32bit minhashes based on the MurmurHash3 algorithm

nemo_curator.stages.deduplication.fuzzy.minhash.GPUMinHash.minhash64(
ser: cudf.Series
) -> cudf.Series

Compute 64bit minhashes based on the MurmurHash3 algorithm

class nemo_curator.stages.deduplication.fuzzy.minhash.MinHash(
seed: int = 42,
num_hashes: int = 260,
char_ngrams: int = 24,
use_64bit_hash: bool = False
)
Abstract

Base class for computing minhash signatures of a document corpus

nemo_curator.stages.deduplication.fuzzy.minhash.MinHash.compute_minhashes(
text_series: typing.Any
) -> typing.Any
abstract

Compute minhash signatures for the given dataframe text column.

nemo_curator.stages.deduplication.fuzzy.minhash.MinHash.generate_seeds(
n_permutations: int = 260,
seed: int = 0,
bit_width: int = 32
) -> numpy.ndarray

Generate seeds for all minhash permutations based on the given seed. This is a placeholder that child classes should implement if needed.

class nemo_curator.stages.deduplication.fuzzy.minhash.MinHashStage(
output_path: str,
text_field: str = 'text',
minhash_field: str = CURATOR_DEFAULT_MINHASH_FIELD,
char_ngrams: int = 24,
num_hashes: int = 260,
seed: int = 42,
use_64bit_hash: bool = False,
read_format: typing.Literal['jsonl', 'parquet'] = 'jsonl',
read_kwargs: dict[str, typing.Any] | None = None,
write_kwargs: dict[str, typing.Any] | None = None,
pool: bool = True
)

Bases: ProcessingStage[FileGroupTask, FileGroupTask], DeduplicationIO

ProcessingStage for computing MinHash signatures on documents for fuzzy deduplication.

This stage takes FileGroupTask containing paths to input documents and produces FileGroupTask containing paths to computed minhash signature files. It uses GPU-accelerated MinHash computation to generate locality-sensitive hash signatures that can be used for approximate duplicate detection.

The stage automatically handles:

  • Reading input files (JSONL or Parquet format)
  • Assigning unique Integer IDs to documents using the IdGenerator actor
  • Computing MinHash signatures using GPU acceleration
  • Writing results to Parquet files

Parameters

output_path : str Base path where minhash output files will be written text_field : str, default=“text” Name of the field containing text to compute minhashes from minhash_field : str, default=“_minhash_signature” Name of the field where minhash signatures will be stored char_ngrams : int, default=24 Width of character n-grams for minhashing num_hashes : int, default=260 Number of hash functions (length of minhash signature) seed : int, default=42 Random seed for reproducible minhash generation use_64bit_hash : bool, default=False Whether to use 64-bit hash functions (vs 32-bit) read_format : Literal[“jsonl”, “parquet”], default=“jsonl” Format of input files read_kwargs : dict[str, Any] | None, default=None Additional keyword arguments for reading input files write_kwargs : dict[str, Any] | None, default=None Additional keyword arguments for writing output files

Examples

>>> stage = MinHashStage( … output_path=“/path/to/minhash/output”, … text_field=“content”, … num_hashes=128, … char_ngrams=5 … ) >>> # Use in a pipeline to process document batches

name
= self.__class__.__name__
output_fs
output_path
= self.output_fs.sep.join([output_path, self.name])
read_kwargs
= read_kwargs or {}
resources
= Resources(gpus=1.0)
write_kwargs
= write_kwargs or {}
nemo_curator.stages.deduplication.fuzzy.minhash.MinHashStage.inputs() -> tuple[list[str], list[str]]

Define input requirements.

nemo_curator.stages.deduplication.fuzzy.minhash.MinHashStage.outputs() -> tuple[list[str], list[str]]

Define outputs - produces FileGroupTask with minhash files.

nemo_curator.stages.deduplication.fuzzy.minhash.MinHashStage.process(
task: nemo_curator.tasks.FileGroupTask
) -> nemo_curator.tasks.FileGroupTask

Process a group of files to compute minhashes.

Parameters:

task
FileGroupTask

FileGroupTask containing file paths to process

Returns: FileGroupTask

FileGroupTask containing paths to minhash output files

nemo_curator.stages.deduplication.fuzzy.minhash.MinHashStage.setup(
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Initialize the GPU MinHash processor and ID generator.