# nemo_curator.stages.deduplication.fuzzy.minhash

## Module Contents

### Classes

| Name                                                                            | Description                                                                            |
| ------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| [`GPUMinHash`](#nemo_curator-stages-deduplication-fuzzy-minhash-GPUMinHash)     | GPU-accelerated `MinHash` implementation that operates on `cudf.Series` text data.     |
| [`MinHash`](#nemo_curator-stages-deduplication-fuzzy-minhash-MinHash)           | Base class for computing minhash signatures of a document corpus.                      |
| [`MinHashStage`](#nemo_curator-stages-deduplication-fuzzy-minhash-MinHashStage) | ProcessingStage for computing MinHash signatures on documents for fuzzy deduplication. |

### API

<Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-GPUMinHash">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.deduplication.fuzzy.minhash.GPUMinHash(
        seed: int = 42,
        num_hashes: int = 260,
        char_ngrams: int = 24,
        use_64bit_hash: bool = False,
        pool: bool = False
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [MinHash](#nemo_curator-stages-deduplication-fuzzy-minhash-MinHash)

  <ParamField path="seeds" />

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-GPUMinHash-compute_minhashes">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.minhash.GPUMinHash.compute_minhashes(
          text_series: cudf.Series
      ) -> cudf.Series
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Compute minhash signatures for the given text series.

    **Parameters:**

    <ParamField path="text_series" type="cudf.Series">
      Series containing the text data to compute minhashes for
    </ParamField>

    **Returns:** `cudf.Series`

    Series containing the minhash signatures
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-GPUMinHash-generate_seeds">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.minhash.GPUMinHash.generate_seeds(
          n_permutations: int = 260,
          seed: int = 0,
          bit_width: int = 32
      ) -> numpy.ndarray
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Generate seeds for all minhash permutations based on the given seed.
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-GPUMinHash-minhash32">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.minhash.GPUMinHash.minhash32(
          ser: cudf.Series
      ) -> cudf.Series
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Compute 32-bit minhashes based on the MurmurHash3 algorithm.
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-GPUMinHash-minhash64">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.minhash.GPUMinHash.minhash64(
          ser: cudf.Series
      ) -> cudf.Series
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Compute 64-bit minhashes based on the MurmurHash3 algorithm.
  </Indent>
</Indent>
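
As a rough illustration of what `compute_minhashes` produces, the sketch below computes a character n-gram MinHash signature on the CPU in pure Python. It is not the library's implementation: `char_ngrams_of`, `minhash_signature`, and the seed values are hypothetical names chosen for this example, and seeded CRC32 from the standard library stands in for MurmurHash3. How the GPU implementation treats text shorter than the n-gram width is not covered here; the sketch simply hashes the whole string in that case.

```python
import zlib


def char_ngrams_of(text: str, width: int) -> set[str]:
    """Return the set of character n-grams of the given width."""
    if len(text) <= width:
        # Assumption for this sketch: short text is treated as one n-gram.
        return {text}
    return {text[i:i + width] for i in range(len(text) - width + 1)}


def minhash_signature(text: str, seeds: list[int], width: int = 24) -> list[int]:
    """Compute one minimum hash per seed over all n-grams of the text."""
    grams = [gram.encode("utf-8") for gram in char_ngrams_of(text, width)]
    # zlib.crc32 accepts a starting value, used here as a per-permutation seed.
    # CRC32 is only a stand-in for MurmurHash3 in this illustration.
    return [min(zlib.crc32(gram, seed) for gram in grams) for seed in seeds]


seeds = list(range(1, 9))  # hypothetical seeds, one per hash function
sig_a = minhash_signature("the quick brown fox jumps over the lazy dog", seeds, width=5)
sig_b = minhash_signature("the quick brown fox jumps over the lazy dog", seeds, width=5)
print(len(sig_a), sig_a == sig_b)
```

Each position in the signature is the minimum hash over all n-grams under one seed, so the signature length equals the number of seeds, which corresponds to `num_hashes` in the classes on this page.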

<Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-MinHash">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.deduplication.fuzzy.minhash.MinHash(
        seed: int = 42,
        num_hashes: int = 260,
        char_ngrams: int = 24,
        use_64bit_hash: bool = False
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  <Badge>
    Abstract
  </Badge>

  Base class for computing minhash signatures of a document corpus.

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-MinHash-compute_minhashes">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.minhash.MinHash.compute_minhashes(
          text_series: typing.Any
      ) -> typing.Any
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    <Badge>
      abstract
    </Badge>

    Compute minhash signatures for the given dataframe text column.
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-MinHash-generate_seeds">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.minhash.MinHash.generate_seeds(
          n_permutations: int = 260,
          seed: int = 0,
          bit_width: int = 32
      ) -> numpy.ndarray
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Generate seeds for all minhash permutations based on the given seed.
    This is a placeholder that child classes should implement if needed.
  </Indent>
</Indent>
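
The signature of `generate_seeds` suggests one seed per permutation, derived deterministically from a base seed. A minimal sketch of that idea follows, assuming a pseudo-random generator seeded with `seed`. `generate_seeds_sketch` is a hypothetical helper, it returns a plain list rather than a `numpy.ndarray` so that it runs with the standard library alone, and the actual derivation used by `GPUMinHash` may differ.

```python
import random


def generate_seeds_sketch(
    n_permutations: int = 260, seed: int = 0, bit_width: int = 32
) -> list[int]:
    """Derive n_permutations seeds from a single base seed (illustrative only)."""
    if bit_width not in (32, 64):
        raise ValueError("bit_width must be 32 or 64")
    rng = random.Random(seed)  # same base seed -> same sequence of seeds
    return [rng.getrandbits(bit_width) for _ in range(n_permutations)]


seeds_first = generate_seeds_sketch(seed=42)
seeds_again = generate_seeds_sketch(seed=42)
seeds_other = generate_seeds_sketch(seed=7)
print(len(seeds_first), seeds_first == seeds_again, seeds_first == seeds_other)
```

Fixing the base seed makes the derived seeds reproducible, so signatures computed in separate runs stay comparable.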

<Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-MinHashStage">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.deduplication.fuzzy.minhash.MinHashStage(
        output_path: str,
        text_field: str = 'text',
        minhash_field: str = CURATOR_DEFAULT_MINHASH_FIELD,
        char_ngrams: int = 24,
        num_hashes: int = 260,
        seed: int = 42,
        use_64bit_hash: bool = False,
        read_format: typing.Literal['jsonl', 'parquet'] = 'jsonl',
        read_kwargs: dict[str, typing.Any] | None = None,
        write_kwargs: dict[str, typing.Any] | None = None,
        pool: bool = True
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [ProcessingStage\[FileGroupTask, FileGroupTask\]](/nemo-curator/nemo_curator/stages/base#nemo_curator-stages-base-ProcessingStage), [DeduplicationIO](/nemo-curator/nemo_curator/stages/deduplication/io_utils#nemo_curator-stages-deduplication-io_utils-DeduplicationIO)

  ProcessingStage for computing MinHash signatures on documents for fuzzy deduplication.

  This stage takes a `FileGroupTask` containing paths to input documents and produces
  a `FileGroupTask` containing paths to the computed minhash signature files. It uses GPU-accelerated
  MinHash computation to generate locality-sensitive hash signatures that can be used
  for approximate duplicate detection.

  The stage automatically handles:

  * Reading input files (JSONL or Parquet format)
  * Assigning unique integer IDs to documents using the IdGenerator actor
  * Computing MinHash signatures using GPU acceleration
  * Writing results to Parquet files

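  **Parameters:**

  <ParamField path="output_path" type="str">
    Base path where minhash output files will be written.
  </ParamField>

  <ParamField path="text_field" type="str">
    Name of the field containing the text to compute minhashes from. Defaults to `"text"`.
  </ParamField>

  <ParamField path="minhash_field" type="str">
    Name of the field where minhash signatures will be stored. Defaults to `CURATOR_DEFAULT_MINHASH_FIELD` (`"_minhash_signature"`).
  </ParamField>

  <ParamField path="char_ngrams" type="int">
    Width of the character n-grams used for minhashing. Defaults to `24`.
  </ParamField>

  <ParamField path="num_hashes" type="int">
    Number of hash functions, which is also the length of the minhash signature. Defaults to `260`.
  </ParamField>

  <ParamField path="seed" type="int">
    Random seed for reproducible minhash generation. Defaults to `42`.
  </ParamField>

  <ParamField path="use_64bit_hash" type="bool">
    Whether to use 64-bit hash functions instead of 32-bit ones. Defaults to `False`.
  </ParamField>

  <ParamField path="read_format" type="Literal['jsonl', 'parquet']">
    Format of the input files. Defaults to `"jsonl"`.
  </ParamField>

  <ParamField path="read_kwargs" type="dict[str, Any] | None">
    Additional keyword arguments for reading input files. Defaults to `None`.
  </ParamField>

  <ParamField path="write_kwargs" type="dict[str, Any] | None">
    Additional keyword arguments for writing output files. Defaults to `None`.
  </ParamField>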

  **Examples:**

  \>>> stage = MinHashStage(
  ...     output\_path="/path/to/minhash/output",
  ...     text\_field="content",
  ...     num\_hashes=128,
  ...     char\_ngrams=5
  ... )
  \>>> # Use in a pipeline to process document batches

  <ParamField path="name" type="= self.__class__.__name__" />

  <ParamField path="output_fs" />

  <ParamField path="output_path" type="= self.output_fs.sep.join([output_path, self.name])" />

  <ParamField path="read_kwargs" type="= read_kwargs or {}" />

  <ParamField path="resources" type="= Resources(gpus=1.0)" />

  <ParamField path="write_kwargs" type="= write_kwargs or {}" />

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-MinHashStage-inputs">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.minhash.MinHashStage.inputs() -> tuple[list[str], list[str]]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Define input requirements.
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-MinHashStage-outputs">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.minhash.MinHashStage.outputs() -> tuple[list[str], list[str]]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Define outputs: produces a `FileGroupTask` with minhash files.
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-MinHashStage-process">
    <CodeBlock links={{"nemo_curator.tasks.FileGroupTask":"/nemo-curator/nemo_curator/tasks/file_group#nemo_curator-tasks-file_group-FileGroupTask"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.minhash.MinHashStage.process(
          task: nemo_curator.tasks.FileGroupTask
      ) -> nemo_curator.tasks.FileGroupTask
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Process a group of files to compute minhashes.

    **Parameters:**

    <ParamField path="task" type="FileGroupTask">
      FileGroupTask containing file paths to process
    </ParamField>

    **Returns:** `FileGroupTask`

    FileGroupTask containing paths to minhash output files
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-minhash-MinHashStage-setup">
    <CodeBlock links={{"nemo_curator.backends.base.WorkerMetadata":"/nemo-curator/nemo_curator/backends/base#nemo_curator-backends-base-WorkerMetadata"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.minhash.MinHashStage.setup(
          _worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Initialize the GPU MinHash processor and ID generator.
  </Indent>
</Indent>
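
The signatures written by `MinHashStage` are meant for approximate duplicate detection. One common way to use MinHash signatures, sketched here under the assumption that both signatures were produced with the same `seed`, `num_hashes`, and `char_ngrams`, is to estimate Jaccard similarity as the fraction of matching positions. `estimate_jaccard` is a hypothetical helper, not part of `nemo_curator`, and the example signatures are made-up values.

```python
from collections.abc import Sequence


def estimate_jaccard(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Estimate Jaccard similarity as the share of equal signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    if not sig_a:
        raise ValueError("signatures must not be empty")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Made-up 4-value signatures; real ones have num_hashes entries.
similarity = estimate_jaccard([7, 3, 9, 4], [7, 3, 1, 4])
print(similarity)  # 0.75, since three of the four positions match
```

Longer signatures give a lower-variance estimate, which is the trade-off behind the `num_hashes` parameter.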