nemo_curator.stages.deduplication.fuzzy.minhash
nemo_curator.stages.deduplication.fuzzy.minhash
Module Contents
Classes
API
Bases: MinHash
Compute minhash signatures for the given text series.
Parameters
text_series: cudf.Series Series containing text data to compute minhashes for
Returns
cudf.Series containing minhash signatures
Generate seeds for all minhash permutations based on the given seed.
Compute 32bit minhashes based on the MurmurHash3 algorithm
Compute 64bit minhashes based on the MurmurHash3 algorithm
Base class for computing minhash signatures of a document corpus
Compute minhash signatures for the given dataframe text column.
Generate seeds for all minhash permutations based on the given seed. This is a placeholder that child classes should implement if needed.
Bases: ProcessingStage[FileGroupTask, FileGroupTask], DeduplicationIO
ProcessingStage for computing MinHash signatures on documents for fuzzy deduplication.
This stage takes FileGroupTask containing paths to input documents and produces FileGroupTask containing paths to computed minhash signature files. It uses GPU-accelerated MinHash computation to generate locality-sensitive hash signatures that can be used for approximate duplicate detection.
The stage automatically handles:
- Reading input files (JSONL or Parquet format)
- Assigning unique Integer IDs to documents using the IdGenerator actor
- Computing MinHash signatures using GPU acceleration
- Writing results to Parquet files
Parameters
output_path : str Base path where minhash output files will be written text_field : str, default=“text” Name of the field containing text to compute minhashes from minhash_field : str, default=“_minhash_signature” Name of the field where minhash signatures will be stored char_ngrams : int, default=24 Width of character n-grams for minhashing num_hashes : int, default=260 Number of hash functions (length of minhash signature) seed : int, default=42 Random seed for reproducible minhash generation use_64bit_hash : bool, default=False Whether to use 64-bit hash functions (vs 32-bit) read_format : Literal[“jsonl”, “parquet”], default=“jsonl” Format of input files read_kwargs : dict[str, Any] | None, default=None Additional keyword arguments for reading input files write_kwargs : dict[str, Any] | None, default=None Additional keyword arguments for writing output files
Examples
>>> stage = MinHashStage( … output_path=“/path/to/minhash/output”, … text_field=“content”, … num_hashes=128, … char_ngrams=5 … ) >>> # Use in a pipeline to process document batches
Define input requirements.
Define outputs - produces FileGroupTask with minhash files.
Process a group of files to compute minhashes.
Parameters:
FileGroupTask containing file paths to process
Returns: FileGroupTask
FileGroupTask containing paths to minhash output files
Initialize the GPU MinHash processor and ID generator.