stages.deduplication.fuzzy.minhash
Module Contents
Classes
GPUMinHash | Base class for computing minhash signatures of a document corpus |
MinHash | Base class for computing minhash signatures of a document corpus |
MinHashStage | ProcessingStage for computing MinHash signatures on documents for fuzzy deduplication. |
API
- class stages.deduplication.fuzzy.minhash.GPUMinHash(
- seed: int = 42,
- num_hashes: int = 260,
- char_ngrams: int = 24,
- use_64bit_hash: bool = False,
- pool: bool = False,
Bases:
stages.deduplication.fuzzy.minhash.MinHash
Base class for computing minhash signatures of a document corpus
Initialization
Parameters
seed: Seed for minhash permutations.
num_hashes: Length of minhash signature (number of minhash permutations).
char_ngrams: Width of text window (in characters) while computing minhashes.
use_64bit_hash: Whether to use a 64-bit hash function.
- compute_minhashes(text_series: cudf.Series) → cudf.Series
Compute minhash signatures for the given text series.
Parameters
text_series: cudf.Series Series containing text data to compute minhashes for
Returns
cudf.Series containing minhash signatures
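For intuition, here is a minimal CPU-only sketch of what a MinHash signature over character n-grams looks like. It is not the GPU implementation: it assumes BLAKE2b from the Python standard library as a stand-in for MurmurHash3 and plain Python lists in place of cudf.Series, and the helper names (char_ngrams_of, minhash_signature, estimate_jaccard) are hypothetical.

```python
import hashlib


def char_ngrams_of(text: str, width: int) -> set[str]:
    """Return the set of character n-grams (sliding windows) of the given width."""
    if len(text) <= width:
        # Assumption: a text shorter than the window is treated as one n-gram.
        return {text}
    return {text[i : i + width] for i in range(len(text) - width + 1)}


def _hash32(ngram: str, seed: int) -> int:
    """Seeded 32-bit hash; BLAKE2b stands in for MurmurHash3 here."""
    digest = hashlib.blake2b(
        ngram.encode("utf-8"), digest_size=4, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, seeds: list[int], width: int = 24) -> list[int]:
    """For each seed, keep the minimum hash value over all n-grams of the text."""
    ngrams = char_ngrams_of(text, width)
    return [min(_hash32(g, s) for g in ngrams) for s in seeds]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


seeds = list(range(260))  # placeholder seeds, one per permutation
doc_a = "The quick brown fox jumps over the lazy dog near the river bank."
doc_b = "The quick brown fox jumps over the lazy dog near the river bend."
doc_c = "An entirely unrelated sentence about GPU kernels and memory."

# A narrow window (5) is used so these short demo strings yield many n-grams.
sig_a = minhash_signature(doc_a, seeds, width=5)
sig_b = minhash_signature(doc_b, seeds, width=5)
sig_c = minhash_signature(doc_c, seeds, width=5)

print(estimate_jaccard(sig_a, sig_b))  # near-duplicates share most positions
print(estimate_jaccard(sig_a, sig_c))  # unrelated texts share few or none
```

Two documents that share most of their n-grams agree in most signature positions, which is what makes the signatures usable for approximate duplicate detection.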
- generate_seeds(
- n_permutations: int = 260,
- seed: int = 0,
- bit_width: int = 32,
Generate seeds for all minhash permutations based on the given seed.
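The idea of deriving one seed per permutation from a single base seed can be sketched as follows. This is an illustration only: the function name generate_seeds_sketch, the use of Python's random module, and the value range are assumptions, not the library's actual seed-generation logic.

```python
import random


def generate_seeds_sketch(
    n_permutations: int = 260, seed: int = 0, bit_width: int = 32
) -> list[int]:
    """Derive a reproducible list of per-permutation seeds from one base seed."""
    if bit_width not in (32, 64):
        raise ValueError("bit_width must be 32 or 64")
    rng = random.Random(seed)  # same base seed -> same sequence of seeds
    upper = 2**bit_width - 1   # assumed range: any unsigned value of that width
    return [rng.randint(0, upper) for _ in range(n_permutations)]


seeds_a = generate_seeds_sketch(n_permutations=260, seed=42)
seeds_b = generate_seeds_sketch(n_permutations=260, seed=42)
seeds_c = generate_seeds_sketch(n_permutations=260, seed=7)
```

Because the seeds depend only on the base seed, two runs configured with the same seed produce comparable signatures, while a different base seed yields a different set of permutations.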
- minhash32(ser: cudf.Series) → cudf.Series
Compute 32-bit minhashes based on the MurmurHash3 algorithm
- minhash64(ser: cudf.Series) → cudf.Series
Compute 64-bit minhashes based on the MurmurHash3 algorithm
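The 32-bit and 64-bit variants differ in the width of each hash value, which directly affects how much space a signature occupies. A rough back-of-the-envelope sketch, assuming raw values only (no container, metadata, or compression overhead) and a hypothetical helper name:

```python
def signature_bytes(num_hashes: int = 260, use_64bit_hash: bool = False) -> int:
    """Raw size of one signature: one fixed-width integer per hash function."""
    bytes_per_hash = 8 if use_64bit_hash else 4
    return num_hashes * bytes_per_hash


size_32 = signature_bytes(260, use_64bit_hash=False)  # 260 * 4 = 1040 bytes
size_64 = signature_bytes(260, use_64bit_hash=True)   # 260 * 8 = 2080 bytes
```

With the default num_hashes of 260, switching to 64-bit hashes doubles the raw signature size per document.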
- class stages.deduplication.fuzzy.minhash.MinHash(
- seed: int = 42,
- num_hashes: int = 260,
- char_ngrams: int = 24,
- use_64bit_hash: bool = False,
Bases:
abc.ABC
Base class for computing minhash signatures of a document corpus
Initialization
Parameters
seed: Seed for minhash permutations.
num_hashes: Length of minhash signature (number of minhash permutations).
char_ngrams: Width of text window (in characters) while computing minhashes.
use_64bit_hash: Whether to use a 64-bit hash function.
- abstractmethod compute_minhashes(text_series: Any) → Any
Compute minhash signatures for the given dataframe text column.
- generate_seeds(
- n_permutations: int = 260,
- seed: int = 0,
- bit_width: int = 32,
Generate seeds for all minhash permutations based on the given seed. This is a placeholder that child classes should implement if needed.
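The abstract-base pattern described here can be sketched without the library. The classes below are hypothetical stand-ins (MinHashBaseSketch and ToyListMinHash are not part of this module); they mirror the documented constructor parameters and use CRC32 from the standard library as a weak placeholder hash, purely to show how a subclass supplies compute_minhashes.

```python
import zlib
from abc import ABC, abstractmethod
from typing import Any


class MinHashBaseSketch(ABC):
    """Hypothetical stand-in that mirrors the documented constructor parameters."""

    def __init__(
        self,
        seed: int = 42,
        num_hashes: int = 260,
        char_ngrams: int = 24,
        use_64bit_hash: bool = False,
    ) -> None:
        self.seed = seed
        self.num_hashes = num_hashes
        self.char_ngrams = char_ngrams
        self.use_64bit_hash = use_64bit_hash  # ignored by the toy subclass below

    @abstractmethod
    def compute_minhashes(self, text_series: Any) -> Any:
        """Subclasses choose the container type and the hashing backend."""


class ToyListMinHash(MinHashBaseSketch):
    """Toy CPU subclass operating on a plain list of strings."""

    def compute_minhashes(self, text_series: list[str]) -> list[list[int]]:
        signatures = []
        for text in text_series:
            last_start = max(len(text) - self.char_ngrams, 0)
            ngrams = {
                text[i : i + self.char_ngrams].encode("utf-8")
                for i in range(last_start + 1)
            }
            # One minimum per hash function; CRC32 with a varying start value
            # is only a placeholder for a properly seeded hash family.
            signatures.append(
                [
                    min(zlib.crc32(g, (self.seed + k) & 0xFFFFFFFF) for g in ngrams)
                    for k in range(self.num_hashes)
                ]
            )
        return signatures


# The base class cannot be instantiated because compute_minhashes is abstract.
try:
    MinHashBaseSketch()
    base_is_abstract = False
except TypeError:
    base_is_abstract = True

toy = ToyListMinHash(num_hashes=16, char_ngrams=4)
toy_signatures = toy.compute_minhashes(["hello world", "hello there"])
```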
- class stages.deduplication.fuzzy.minhash.MinHashStage(
- output_path: str,
- text_field: str = 'text',
- minhash_field: str = CURATOR_DEFAULT_MINHASH_FIELD,
- char_ngrams: int = 24,
- num_hashes: int = 260,
- seed: int = 42,
- use_64bit_hash: bool = False,
- read_format: Literal['jsonl', 'parquet'] = 'jsonl',
- read_kwargs: dict[str, Any] | None = None,
- write_kwargs: dict[str, Any] | None = None,
- pool: bool = True,
Bases:
nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.FileGroupTask], nemo_curator.stages.deduplication.io_utils.DeduplicationIO
ProcessingStage for computing MinHash signatures on documents for fuzzy deduplication.
This stage takes a FileGroupTask containing paths to input documents and produces a FileGroupTask containing paths to the computed minhash signature files. It uses GPU-accelerated MinHash computation to generate locality-sensitive hash signatures that can be used for approximate duplicate detection.
The stage automatically handles:
Reading input files (JSONL or Parquet format)
Assigning unique integer IDs to documents using the IdGenerator actor
Computing MinHash signatures using GPU acceleration
Writing results to Parquet files
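The four steps above can be sketched end to end with the standard library alone. This is an illustration under stated assumptions, not the stage's implementation: a running counter stands in for the IdGenerator actor, BLAKE2b stands in for GPU MurmurHash3, the output is written as JSONL rather than Parquet (the standard library has no Parquet writer), and the names process_file_sketch, toy_signature, and the "id" column are hypothetical.

```python
import hashlib
import json
import tempfile
from itertools import count
from pathlib import Path


def toy_signature(text: str, num_hashes: int = 8, width: int = 5) -> list[int]:
    """Tiny CPU MinHash over character n-grams (placeholder for the GPU path)."""
    ngrams = {text[i : i + width] for i in range(max(len(text) - width + 1, 1))}
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(
                    g.encode("utf-8"), digest_size=4, salt=k.to_bytes(8, "little")
                ).digest(),
                "little",
            )
            for g in ngrams
        )
        for k in range(num_hashes)
    ]


def process_file_sketch(
    input_path: Path,
    output_dir: Path,
    id_counter,
    text_field: str = "text",
    minhash_field: str = "_minhash_signature",
) -> Path:
    """Read JSONL, assign integer IDs, compute signatures, write one output file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / f"{input_path.stem}_minhash.jsonl"
    with input_path.open() as src, output_path.open("w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            row = {
                "id": next(id_counter),  # stand-in for IdGenerator-assigned IDs
                minhash_field: toy_signature(record[text_field]),
            }
            dst.write(json.dumps(row) + "\n")
    return output_path


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    input_path = tmp_dir / "docs.jsonl"
    input_path.write_text(
        json.dumps({"text": "first example document"}) + "\n"
        + json.dumps({"text": "second example document"}) + "\n"
    )
    out_path = process_file_sketch(input_path, tmp_dir / "out", count())
    rows = [json.loads(line) for line in out_path.read_text().splitlines()]
```

Each output row carries only the document ID and its signature, so downstream deduplication steps can work on the signature files without rereading the original text.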
Parameters
output_path : str
    Base path where minhash output files will be written
text_field : str, default="text"
    Name of the field containing text to compute minhashes from
minhash_field : str, default="_minhash_signature"
    Name of the field where minhash signatures will be stored
char_ngrams : int, default=24
    Width of character n-grams for minhashing
num_hashes : int, default=260
    Number of hash functions (length of minhash signature)
seed : int, default=42
    Random seed for reproducible minhash generation
use_64bit_hash : bool, default=False
    Whether to use 64-bit hash functions (vs 32-bit)
read_format : Literal["jsonl", "parquet"], default="jsonl"
    Format of input files
read_kwargs : dict[str, Any] | None, default=None
    Additional keyword arguments for reading input files
write_kwargs : dict[str, Any] | None, default=None
    Additional keyword arguments for writing output files
Examples
>>> stage = MinHashStage(
...     output_path="/path/to/minhash/output",
...     text_field="content",
...     num_hashes=128,
...     char_ngrams=5
... )
>>> # Use in a pipeline to process document batches
Initialization
- inputs() → tuple[list[str], list[str]]
Define input requirements.
- outputs() → tuple[list[str], list[str]]
Define outputs - produces FileGroupTask with minhash files.
- process(
- task: nemo_curator.tasks.FileGroupTask,
Process a group of files to compute minhashes.
Args: task: FileGroupTask containing file paths to process
Returns: FileGroupTask containing paths to minhash output files
- setup(_worker_metadata: WorkerMetadata | None = None) → None
Initialize the GPU MinHash processor and ID generator.