
# nemo_curator.stages.deduplication.fuzzy.lsh.lsh

## Module Contents

### Classes

| Name                                                                    | Description                                                 |
| ----------------------------------------------------------------------- | ----------------------------------------------------------- |
| [`LSHActor`](#nemo_curator-stages-deduplication-fuzzy-lsh-lsh-LSHActor) | Actor that performs LSH operations and shuffling using Ray. |

### API

<Anchor id="nemo_curator-stages-deduplication-fuzzy-lsh-lsh-LSHActor">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.deduplication.fuzzy.lsh.lsh.LSHActor(
        nranks: int,
        total_nparts: int,
        num_bands: int,
        minhashes_per_band: int,
        id_field: str = CURATOR_DEDUP_ID_STR,
        minhash_field: str = CURATOR_DEFAULT_MINHASH_FIELD,
        output_path: str = './',
        rmm_pool_size: int | typing.Literal['auto'] | None = 'auto',
        spill_memory_limit: int | typing.Literal['auto'] | None = 'auto',
        enable_statistics: bool = False,
        read_kwargs: dict[str, typing.Any] | None = None,
        write_kwargs: dict[str, typing.Any] | None = None
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [BulkRapidsMPFShuffler](/nemo-curator/nemo_curator/stages/deduplication/shuffle_utils/rapidsmpf_shuffler#nemo_curator-stages-deduplication-shuffle_utils-rapidsmpf_shuffler-BulkRapidsMPFShuffler)

  Actor that performs LSH operations and shuffling using Ray.

  ## Parameters

  * `nranks`: Number of ranks in the communication group.
  * `total_nparts`: Total number of output partitions.
  * `num_bands`: Number of LSH bands.
  * `minhashes_per_band`: Number of minhashes per band.
  * `id_field`: Name of the ID field in the input data.
  * `minhash_field`: Name of the minhash field in the input data.
  * `output_path`: Path to write output files.
  * `rmm_pool_size`: Size of the RMM GPU memory pool in bytes.
    If "auto", the memory pool is set to 90% of the free GPU memory.
    If None, the memory pool starts at 50% of the free GPU memory and can expand if needed.
  * `spill_memory_limit`: Device memory limit in bytes for spilling to host.
    If "auto", the limit is set to 80% of the RMM pool size.
    If None, spilling is disabled.
  * `enable_statistics`: Whether to collect statistics.
  * `read_kwargs`: Keyword arguments for the read method.
  * `write_kwargs`: Keyword arguments for the write method.
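
  The "auto" and None rules for `rmm_pool_size` and `spill_memory_limit` come down to simple arithmetic on the free GPU memory. The sketch below works through it in plain Python. `resolve_memory_settings` is a hypothetical helper written for illustration, not part of NeMo Curator, and it assumes that an "auto" spill limit combined with a None pool is computed from the initial 50% pool size.

  ```python
  def resolve_memory_settings(free_gpu_bytes, rmm_pool_size="auto", spill_memory_limit="auto"):
      """Hypothetical helper mirroring the documented defaults (illustration only)."""
      if rmm_pool_size == "auto":
          pool_bytes = free_gpu_bytes * 90 // 100  # 90% of free GPU memory
      elif rmm_pool_size is None:
          pool_bytes = free_gpu_bytes * 50 // 100  # initial size; the pool may grow
      else:
          pool_bytes = int(rmm_pool_size)  # explicit size in bytes

      if spill_memory_limit == "auto":
          spill_bytes = pool_bytes * 80 // 100  # 80% of the pool size (assumption for a None pool)
      elif spill_memory_limit is None:
          spill_bytes = None  # spilling disabled
      else:
          spill_bytes = int(spill_memory_limit)  # explicit limit in bytes
      return pool_bytes, spill_bytes


  GIB = 1024**3
  pool_bytes, spill_bytes = resolve_memory_settings(free_gpu_bytes=40 * GIB)
  print(f"pool: {pool_bytes / GIB:.1f} GiB, spill limit: {spill_bytes / GIB:.1f} GiB")
  ```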

  ## Notes

  Architecture and Processing Flow:

  This implementation follows a clean separation of responsibilities with distinct methods
  for each part of the pipeline:

  Input Phase:

  * `read_minhash`: Reads minhash files and returns a DataFrame

  Processing Phase:

  * `minhash_to_bands`: Transforms a single minhash DataFrame into LSH bands
  * `read_and_insert`: Orchestrates reading, band creation, and insertion

  Output Phase:

  * `extract_and_group`: Extracts and groups shuffled data, yielding results as a generator
  * `extract_and_write`: Processes each yielded result and writes to output files immediately

  Processing flow:

  1. Files are read using `read_minhash`
  2. Each DataFrame is processed with `minhash_to_bands` to extract LSH bucket IDs
  3. The processed data is immediately inserted into the shuffler
  4. Results are extracted and grouped one partition at a time using generators
  5. Each partition is written to disk as soon as it is processed, without accumulating in memory
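
  The flow above can be illustrated with a small CPU-only toy in plain Python. This is a sketch of the streaming structure only, not the actual implementation: the real actor works on `cudf` DataFrames, shuffles across ranks, and writes parquet. The in-memory batches, the `crc32`-based partitioning, the JSON output, and the way `num_docs` is counted are all assumptions made for illustration.

  ```python
  import json
  import tempfile
  import zlib
  from collections import defaultdict
  from pathlib import Path

  NUM_BANDS, MINHASHES_PER_BAND, TOTAL_NPARTS = 2, 2, 2


  def read_minhash(batch):
      # Stand-in for reading a minhash file: the batch is already in memory.
      return batch


  def minhash_to_bands(batch):
      # Emit one (doc_id, bucket_id) row per document and band.
      for doc_id, minhashes in batch:
          for band in range(NUM_BANDS):
              chunk = minhashes[band * MINHASHES_PER_BAND:(band + 1) * MINHASHES_PER_BAND]
              yield doc_id, f"{band}:" + "-".join(map(str, chunk))


  def read_and_insert(batches, shuffler):
      # Steps 1-3: read, band, and insert each batch right away.
      for batch in batches:
          for doc_id, bucket_id in minhash_to_bands(read_minhash(batch)):
              partition_id = zlib.crc32(bucket_id.encode()) % TOTAL_NPARTS
              shuffler[partition_id].append((doc_id, bucket_id))


  def extract_and_group(shuffler):
      # Step 4: yield one grouped partition at a time.
      for partition_id in sorted(shuffler):
          groups = defaultdict(list)
          for doc_id, bucket_id in shuffler[partition_id]:
              groups[bucket_id].append(doc_id)
          # Drop single-document buckets: they cannot contain duplicates.
          yield partition_id, {b: ids for b, ids in groups.items() if len(ids) > 1}


  def extract_and_write(shuffler, output_dir):
      # Step 5: write each partition as soon as the generator produces it.
      written = []
      for partition_id, grouped in extract_and_group(shuffler):
          path = Path(output_dir) / f"part-{partition_id}.json"
          path.write_text(json.dumps(grouped))
          num_docs = sum(len(ids) for ids in grouped.values())  # toy definition
          written.append({"partition_id": partition_id, "path": str(path), "num_docs": num_docs})
      return written


  batches = [
      [("doc-a", [1, 2, 3, 4]), ("doc-b", [1, 2, 9, 9])],
      [("doc-c", [5, 6, 7, 8])],
  ]
  shuffler = defaultdict(list)
  read_and_insert(batches, shuffler)
  written = extract_and_write(shuffler, tempfile.mkdtemp())
  ```

  Because `extract_and_group` is a generator, only one grouped partition is held at a time while `extract_and_write` consumes it.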

  <ParamField path="read_kwargs" type="= read_kwargs if read_kwargs is not None else {}" />

  <ParamField path="write_kwargs" type="= write_kwargs if write_kwargs is not None else {}" />

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-lsh-lsh-LSHActor-_generate_band_ranges">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.lsh.lsh.LSHActor._generate_band_ranges(
          num_bands: int,
          minhashes_per_band: int
      ) -> list[list[int]]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    <Badge>
      staticmethod
    </Badge>

    Generates the list of minhash indices covered by each band, given `num_bands` and
    `minhashes_per_band`.

    For example, `num_bands=3` and `minhashes_per_band=2` produce `[[0, 1], [2, 3], [4, 5]]`.
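
    A minimal sketch of this computation, assuming bands are contiguous and non-overlapping as in the example. `generate_band_ranges` is a standalone illustration, not the library's source.

    ```python
    def generate_band_ranges(num_bands: int, minhashes_per_band: int) -> list[list[int]]:
        # Band b covers minhash indices [b * minhashes_per_band, (b + 1) * minhashes_per_band).
        return [
            list(range(band * minhashes_per_band, (band + 1) * minhashes_per_band))
            for band in range(num_bands)
        ]


    band_ranges = generate_band_ranges(num_bands=3, minhashes_per_band=2)
    print(band_ranges)  # [[0, 1], [2, 3], [4, 5]]
    ```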
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-lsh-lsh-LSHActor-extract_and_group">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.lsh.lsh.LSHActor.extract_and_group() -> collections.abc.Iterator[tuple[int, cudf.DataFrame]]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Extract shuffled partitions and group by bucket ID, yielding results one by one.

    This generator approach allows processing each partition immediately after it's ready,
    which is more memory-efficient than collecting all partitions first.

    ## Yields

    * `tuple`: A tuple of `(partition_id, grouped_df)`, where `grouped_df` contains bucket IDs
      and their corresponding document ID lists.
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-lsh-lsh-LSHActor-extract_and_write">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.lsh.lsh.LSHActor.extract_and_write() -> list[dict[str, typing.Any]]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Extract shuffled partitions, group by bucket ID, and write results to files.

    This method orchestrates the post-processing pipeline:

    1. Extracts partitioned data from the shuffler using extract\_and\_group
    2. Writes each grouped partition to a parquet file as soon as it's available

    This generator-based approach is more memory-efficient since it processes
    one partition at a time rather than collecting all partitions in memory.

    ## Returns

    `list[dict[str, Any]]`: A list of dictionaries containing partition information.
    Each dictionary contains:

    * `partition_id`: The ID of the partition
    * `path`: The path to the partition file
    * `num_docs`: The number of documents in the partition
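
    As an illustration of how the returned metadata might be consumed, the snippet below summarizes a made-up result. The sample values, including the file paths, are hypothetical and only follow the key names listed above.

    ```python
    # Hypothetical return value shaped like the dictionaries described above.
    written = [
        {"partition_id": 0, "path": "./part-0.parquet", "num_docs": 120},
        {"partition_id": 1, "path": "./part-1.parquet", "num_docs": 0},
        {"partition_id": 2, "path": "./part-2.parquet", "num_docs": 45},
    ]

    # Total number of documents across all written partitions.
    total_docs = sum(info["num_docs"] for info in written)

    # Look up output files by partition ID.
    paths_by_partition = {info["partition_id"]: info["path"] for info in written}

    # Partitions that actually contain documents.
    non_empty = [info["partition_id"] for info in written if info["num_docs"] > 0]

    print(total_docs, non_empty)
    ```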
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-lsh-lsh-LSHActor-group_by_bucket">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.lsh.lsh.LSHActor.group_by_bucket(
          df: cudf.DataFrame,
          include_singles: bool = False
      ) -> cudf.DataFrame
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Group items by bucket ID and aggregate IDs into lists.

    ## Parameters

    * `df`: DataFrame containing bucket IDs and document IDs.
    * `include_singles`: If True, include buckets with only one document. The default is False,
      which excludes single-document buckets because they cannot form duplicates. Set to True
      when building an LSH index that needs to retain all documents.

    ## Returns

    DataFrame with bucket IDs and lists of document IDs.
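
    The effect of `include_singles` can be shown with a plain-Python stand-in that uses tuples and dictionaries in place of a `cudf.DataFrame`. This is a sketch of the grouping logic only, and the row layout `(doc_id, bucket_id)` is an assumption.

    ```python
    from collections import defaultdict


    def group_by_bucket(rows, include_singles=False):
        # Collect the document IDs that landed in each bucket.
        groups = defaultdict(list)
        for doc_id, bucket_id in rows:
            groups[bucket_id].append(doc_id)
        # By default, keep only buckets with at least two documents.
        return {
            bucket_id: doc_ids
            for bucket_id, doc_ids in groups.items()
            if include_singles or len(doc_ids) > 1
        }


    rows = [("doc-a", "b1"), ("doc-b", "b1"), ("doc-c", "b2")]
    candidates = group_by_bucket(rows)  # only buckets that can hold duplicates
    full_index = group_by_bucket(rows, include_singles=True)  # every bucket
    ```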
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-lsh-lsh-LSHActor-minhash_to_bands">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.lsh.lsh.LSHActor.minhash_to_bands(
          minhash_df: cudf.DataFrame,
          band_range: tuple[int, int]
      ) -> cudf.DataFrame
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Process a single minhash DataFrame to extract LSH band data.

    ## Parameters

    * `minhash_df`: DataFrame containing minhash data.
    * `band_range`: Tuple of `(start_band, end_band)` to process.

    ## Returns

    DataFrame with document IDs and their corresponding bucket IDs.
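
    The idea behind banding can be sketched in plain Python: each band's slice of the minhash signature is hashed into a bucket ID, so documents that agree on every minhash in a band share a bucket. The SHA-256 hashing, the dictionary input, and the half-open reading of `band_range` are assumptions for illustration. The actual method operates on `cudf` DataFrames and may compute bucket IDs differently.

    ```python
    import hashlib


    def minhash_to_bands(signatures, band_ranges, band_range):
        start_band, end_band = band_range  # assumed half-open: [start_band, end_band)
        rows = []
        for doc_id, minhashes in signatures.items():
            for band_idx in range(start_band, end_band):
                values = [minhashes[i] for i in band_ranges[band_idx]]
                # Prefix with the band index so equal values in different bands do not collide.
                key = f"{band_idx}:" + ",".join(map(str, values))
                bucket_id = hashlib.sha256(key.encode()).hexdigest()[:16]
                rows.append((doc_id, bucket_id))
        return rows


    signatures = {
        "doc-a": [11, 22, 33, 44],
        "doc-b": [11, 22, 99, 98],  # agrees with doc-a on band 0 only
        "doc-c": [55, 66, 77, 88],
    }
    band_ranges = [[0, 1], [2, 3]]  # 2 bands, 2 minhashes per band
    rows = minhash_to_bands(signatures, band_ranges, band_range=(0, 2))
    ```

    Here `doc-a` and `doc-b` receive the same bucket ID for band 0, which makes them a candidate duplicate pair, while `doc-c` shares no bucket with either.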
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-lsh-lsh-LSHActor-read_and_insert">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.lsh.lsh.LSHActor.read_and_insert(
          filepaths: list[str],
          band_range: tuple[int, int]
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Read minhashes from files, create LSH bands, and insert into the shuffler.

    This method orchestrates the full processing pipeline:

    1. Reads minhash data from parquet files in batches
    2. Processes each batch to extract LSH bands
    3. Inserts the bands into the shuffler for distribution

    ## Parameters

    * `filepaths`: List of paths to minhash files.
    * `band_range`: Tuple of `(start_band, end_band)` to process.

    ## Returns

    None
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-fuzzy-lsh-lsh-LSHActor-read_minhash">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.fuzzy.lsh.lsh.LSHActor.read_minhash(
          filepaths: list[str]
      ) -> cudf.DataFrame
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Read minhash data from parquet files.

    ## Parameters

    * `filepaths`: List of paths to minhash files.

    ## Returns

    DataFrame containing minhash data from all input files.
  </Indent>
</Indent>