# nemo_curator.stages.deduplication.semantic.identify_duplicates

## Module Contents

### Classes

| Name                                                                                                                 | Description                                                                       |
| -------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- |
| [`IdentifyDuplicatesStage`](#nemo_curator-stages-deduplication-semantic-identify_duplicates-IdentifyDuplicatesStage) | Stage for batch removal of similar documents with optional ID-based partitioning. |

### API

<Anchor id="nemo_curator-stages-deduplication-semantic-identify_duplicates-IdentifyDuplicatesStage">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage(
        output_path: str,
        eps: float,
        _num_row_groups_hint: int | None = None,
        verbose: bool = False,
        read_kwargs: dict[str, typing.Any] | None = None,
        write_kwargs: dict[str, typing.Any] | None = None
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  <Badge>
    Dataclass
  </Badge>

  **Bases:** [ProcessingStage\[FileGroupTask, FileGroupTask\]](/nemo-curator/nemo_curator/stages/base#nemo_curator-stages-base-ProcessingStage)

  Stage for batch removal of similar documents, with optional ID-based partitioning.
  This stage runs on CPU only.

  <ParamField path="_num_row_groups_hint" type="int | None = None" />

  <ParamField path="eps" type="float" />

  <ParamField path="output_path" type="str" />

  <ParamField path="read_kwargs" type="dict[str, Any] | None = None" />

  <ParamField path="verbose" type="bool = False" />

  <ParamField path="write_kwargs" type="dict[str, Any] | None = None" />

  <Anchor id="nemo_curator-stages-deduplication-semantic-identify_duplicates-IdentifyDuplicatesStage-__post_init__">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage.__post_init__()
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Initialize parent class after dataclass initialization.
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-semantic-identify_duplicates-IdentifyDuplicatesStage-process">
    <CodeBlock links={{"nemo_curator.tasks.FileGroupTask":"/nemo-curator/nemo_curator/tasks/file_group#nemo_curator-tasks-file_group-FileGroupTask"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage.process(
          task: nemo_curator.tasks.FileGroupTask
      ) -> nemo_curator.tasks.FileGroupTask
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-deduplication-semantic-identify_duplicates-IdentifyDuplicatesStage-process_batch">
    <CodeBlock links={{"nemo_curator.tasks.FileGroupTask":"/nemo-curator/nemo_curator/tasks/file_group#nemo_curator-tasks-file_group-FileGroupTask"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage.process_batch(
          tasks: list[nemo_curator.tasks.FileGroupTask]
      ) -> list[nemo_curator.tasks.FileGroupTask]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Process a batch of tasks and combine results into fewer output files.

    This allows processing multiple clusters together and optionally partitioning
    by ID ranges for more efficient reading.

    **Parameters:**

    <ParamField path="tasks" type="list[FileGroupTask]">
      List of `FileGroupTask` objects containing pairwise similarity results.
    </ParamField>

    **Returns:** `list[FileGroupTask]`

    List of `FileGroupTask` objects with the combined, filtered results.
  </Indent>
</Indent>
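
The batch behavior described for `process_batch` (combining results from several clusters, filtering them, and optionally partitioning by ID range) can be illustrated with a small, library-free sketch. This is not the stage's implementation: the `(document_id, similarity)` row shape, the helper names `identify_duplicates` and `partition_by_id_range`, and the rule that a document counts as a duplicate when its similarity is at least `1 - eps` are all assumptions made for illustration.

```python
from collections.abc import Iterable

# Hypothetical row shape: (document_id, max cosine similarity to another document).
Row = tuple[int, float]


def identify_duplicates(batches: Iterable[list[Row]], eps: float) -> list[int]:
    """Combine rows from several clusters and return the sorted duplicate IDs.

    Assumption: a document is treated as a duplicate when its similarity
    score is at least 1 - eps.
    """
    threshold = 1.0 - eps
    duplicate_ids = [
        doc_id
        for rows in batches
        for doc_id, similarity in rows
        if similarity >= threshold
    ]
    return sorted(duplicate_ids)


def partition_by_id_range(sorted_ids: list[int], num_partitions: int) -> list[list[int]]:
    """Split sorted IDs into at most num_partitions contiguous ID ranges."""
    if not sorted_ids:
        return []
    size = -(-len(sorted_ids) // num_partitions)  # ceiling division
    return [sorted_ids[i : i + size] for i in range(0, len(sorted_ids), size)]


# Two clusters' worth of (hypothetical) pairwise similarity results.
cluster_a = [(7, 0.95), (3, 0.50)]
cluster_b = [(1, 0.99), (9, 0.20), (4, 0.97)]

duplicates = identify_duplicates([cluster_a, cluster_b], eps=0.1)
partitions = partition_by_id_range(duplicates, num_partitions=2)
```

Sorting the combined IDs before splitting keeps each partition a contiguous ID range, so a downstream reader looking for a given ID needs to open only the partition whose range covers it.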