# nemo_curator.stages.deduplication.exact.identification

## Module Contents

### Classes

| Name                                                                                                                   | Description                                          |
| ---------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------- |
| [`ExactDuplicateIdentification`](#nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification) | Stage that finds exact duplicates in a given column. |

### Data

[`EXACT_DUPLICATE_GROUP_FIELD`](#nemo_curator-stages-deduplication-exact-identification-EXACT_DUPLICATE_GROUP_FIELD)

### API

<Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification(
        text_field: str,
        output_path: str,
        input_filetype: typing.Literal['jsonl', 'parquet'] = 'parquet',
        read_kwargs: dict[str, typing.Any] | None = None,
        write_kwargs: dict[str, typing.Any] | None = None,
        assign_id: bool = True,
        id_field: str | None = None,
        total_nparts: int | None = None,
        rmm_pool_size: int | typing.Literal['auto'] | None = 'auto',
        spill_memory_limit: int | typing.Literal['auto'] | None = 'auto',
        enable_statistics: bool = False
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [DeduplicationIO](/nemo-curator/nemo_curator/stages/deduplication/io_utils#nemo_curator-stages-deduplication-io_utils-DeduplicationIO), [ShuffleStage](/nemo-curator/nemo_curator/stages/deduplication/shuffle_utils/stage#nemo_curator-stages-deduplication-shuffle_utils-stage-ShuffleStage)

  Stage that finds exact duplicates in a given column.
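
  As a rough illustration of the signature above, the constructor arguments can be collected in a plain dictionary. Only the parameter names come from the signature; the field name `"text"` and the output path are placeholder values, not library defaults.

  ```python
  # Placeholder values for illustration; only the parameter names come from the signature.
  stage_kwargs = {
      "text_field": "text",               # column to compare for exact matches (assumed name)
      "output_path": "/tmp/exact_dedup",  # hypothetical output directory
      "input_filetype": "jsonl",          # must be "jsonl" or "parquet"
      "assign_id": True,                  # let the stage assign a unique id per document
      "rmm_pool_size": "auto",
      "spill_memory_limit": "auto",
  }

  # With NeMo Curator installed, these would be passed to the constructor as:
  #   from nemo_curator.stages.deduplication.exact.identification import (
  #       ExactDuplicateIdentification,
  #   )
  #   stage = ExactDuplicateIdentification(**stage_kwargs)
  ```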

  ## Parameters

  text\_field
  Name of the field in which to find duplicates.
  output\_path
  Path to write output files.
  input\_filetype
  Type of the input files.
  Must be one of "jsonl" or "parquet". Default is "parquet".
  read\_kwargs
  Keyword arguments for the cudf.read\_parquet method.
  write\_kwargs
  Keyword arguments for the cudf.to\_parquet method.
  assign\_id
  Whether to assign a unique id to each document.
  id\_field
  Name of the existing id field, used when assign\_id is False.
  total\_nparts
  Total number of output partitions. If None, will be set automatically by the executor.
  rmm\_pool\_size
  Size of the RMM GPU memory pool in bytes.
  If "auto", the memory pool is set to 90% of the free GPU memory.
  If None, the memory pool starts at 50% of the free GPU memory and can expand if needed.
  spill\_memory\_limit
  Device memory limit in bytes for spilling to host.
  If "auto", the limit is set to 80% of the RMM pool size.
  If None, spilling is disabled.
  enable\_statistics
  Whether the underlying rapidsmpf shuffler should collect shuffle statistics.
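
  The `rmm_pool_size` and `spill_memory_limit` rules above amount to simple arithmetic. The sketch below restates them in plain Python. The helper names and the 16 GiB figure are hypothetical, and this is not the library's own resolution code.

  ```python
  def resolve_rmm_pool_size(setting, free_bytes):
      """Return the initial pool size in bytes for an rmm_pool_size setting."""
      if setting == "auto":
          return free_bytes * 9 // 10   # 90% of free GPU memory
      if setting is None:
          return free_bytes // 2        # 50% of free GPU memory; pool may grow later
      return int(setting)               # explicit size in bytes


  def resolve_spill_limit(setting, pool_bytes):
      """Return the spill threshold in bytes, or None if spilling is disabled."""
      if setting == "auto":
          return pool_bytes * 8 // 10   # 80% of the RMM pool size
      if setting is None:
          return None                   # spilling disabled
      return int(setting)               # explicit limit in bytes


  free = 16 * 1024**3  # hypothetical: 16 GiB of free GPU memory
  pool = resolve_rmm_pool_size("auto", free)
  spill = resolve_spill_limit("auto", pool)
  ```

  Integer arithmetic is used so the byte counts stay exact.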

  <ParamField path="id_field" />

  <ParamField path="name" type="= 'ExactDuplicateIds'" />

  <ParamField path="output_fs" />

  <ParamField path="output_path" type="= self.output_fs.sep.join([output_path, self.name])" />
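
  The `output_path` default above nests results under the stage name. A minimal sketch, assuming a filesystem whose separator is `/` (the real value comes from `output_fs.sep`) and a hypothetical base path:

  ```python
  name = "ExactDuplicateIds"        # stage name from the attribute list above
  sep = "/"                         # assumed value of output_fs.sep
  user_output_path = "/data/dedup"  # hypothetical path passed to the constructor

  # Same join as in the attribute default: results land in a
  # subdirectory named after the stage.
  output_path = sep.join([user_output_path, name])
  print(output_path)  # /data/dedup/ExactDuplicateIds
  ```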

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-_get_removal_ids">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification._get_removal_ids(
          df: cudf.DataFrame
      ) -> cudf.DataFrame
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Get the ids of documents to remove from the given DataFrame.
  </Indent>
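
  One way to picture removal-id selection: within each group of documents that share a hash, one document is kept and the rest are marked for removal. The sketch below uses plain Python rather than cudf, and keeping the first document seen in each group is an assumption, not confirmed behavior of `_get_removal_ids`.

  ```python
  def removal_ids(rows):
      """rows: iterable of (doc_id, group_hash). Return ids of all but one doc per group."""
      seen = set()
      to_remove = []
      for doc_id, group_hash in rows:
          if group_hash in seen:
              to_remove.append(doc_id)  # later member of an existing group
          else:
              seen.add(group_hash)      # first member of the group is kept (assumption)
      return to_remove


  rows = [(0, "a1"), (1, "b2"), (2, "a1"), (3, "a1")]
  ids = removal_ids(rows)
  print(ids)  # [2, 3]
  ```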

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-_hash_and_insert">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification._hash_and_insert(
          df: cudf.DataFrame
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Hash the text field and insert into the shuffle actor.

    ## Parameters

    df
    DataFrame containing the id\_field and text\_field columns.
  </Indent>
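
  The hashing step can be illustrated with the standard library. This is a minimal sketch, not the library's GPU implementation: it computes an MD5 digest of each document's text, stores it under the `_exact_duplicate_group` field name, and derives a partition from the digest so that identical texts always land in the same partition. The record layout, the `id` and `text` names, and the modulo partitioning rule are assumptions.

  ```python
  import hashlib

  EXACT_DUPLICATE_GROUP_FIELD = "_exact_duplicate_group"


  def hash_records(records, text_field="text", nparts=4):
      """Attach an MD5 group hash and a target partition to each record."""
      out = []
      for rec in records:
          digest = hashlib.md5(rec[text_field].encode("utf-8")).hexdigest()
          out.append({
              "id": rec["id"],
              EXACT_DUPLICATE_GROUP_FIELD: digest,
              "partition": int(digest, 16) % nparts,  # same text -> same partition
          })
      return out


  records = [
      {"id": 0, "text": "hello world"},
      {"id": 1, "text": "hello world"},
      {"id": 2, "text": "Hello world"},  # differs by case, so not an exact duplicate
  ]
  hashed = hash_records(records)
  ```

  Because the hash is computed over the raw text, documents that differ only in case or whitespace fall into different groups in this sketch.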

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-_read_files">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification._read_files(
          filepaths: list[str]
      ) -> cudf.DataFrame
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Read files and return a DataFrame.

    ## Parameters

    filepaths
    List of file paths to read.

    ## Returns

    cudf.DataFrame
    DataFrame containing the id\_field and text\_field columns.
  </Indent>

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-extract_and_write">
    <CodeBlock links={{"nemo_curator.tasks.FileGroupTask":"/nemo-curator/nemo_curator/tasks/file_group#nemo_curator-tasks-file_group-FileGroupTask"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification.extract_and_write() -> list[nemo_curator.tasks.FileGroupTask]
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-inputs">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification.inputs() -> tuple[list[str], list[str]]
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-insert_finished">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification.insert_finished() -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-outputs">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification.outputs() -> tuple[list[str], list[str]]
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-process">
    <CodeBlock links={{"nemo_curator.tasks.FileGroupTask":"/nemo-curator/nemo_curator/tasks/file_group#nemo_curator-tasks-file_group-FileGroupTask"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification.process(
          task: nemo_curator.tasks.FileGroupTask
      ) -> nemo_curator.tasks.FileGroupTask
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-ray_stage_spec">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification.ray_stage_spec() -> dict[str, typing.Any]
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-read_and_insert">
    <CodeBlock links={{"nemo_curator.tasks.FileGroupTask":"/nemo-curator/nemo_curator/tasks/file_group#nemo_curator-tasks-file_group-FileGroupTask"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification.read_and_insert(
          task: nemo_curator.tasks.FileGroupTask
      ) -> nemo_curator.tasks.FileGroupTask
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Single task processing is not supported.

    ## Raises

    NotImplementedError
    Always raised as this stage only supports batch processing.
  </Indent>
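
  The split between the single-task and batch entry points follows a common pattern, sketched here with a hypothetical class. `BatchOnlyStage` and its method bodies are illustrative, not the library's code: the single-task method raises `NotImplementedError`, and callers are expected to use the batch method instead.

  ```python
  class BatchOnlyStage:
      """Hypothetical stand-in for a stage that only supports batch processing."""

      def read_and_insert(self, task):
          raise NotImplementedError(
              "Only batch processing is supported; use read_and_insert_batch."
          )

      def read_and_insert_batch(self, tasks):
          # Real work (reading, hashing, inserting) would happen here.
          return tasks


  stage = BatchOnlyStage()
  try:
      stage.read_and_insert("task-0")
      single_task_supported = True
  except NotImplementedError:
      single_task_supported = False

  returned = stage.read_and_insert_batch(["task-0", "task-1"])
  ```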

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-read_and_insert_batch">
    <CodeBlock links={{"nemo_curator.tasks.FileGroupTask":"/nemo-curator/nemo_curator/tasks/file_group#nemo_curator-tasks-file_group-FileGroupTask"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification.read_and_insert_batch(
          tasks: list[nemo_curator.tasks.FileGroupTask]
      ) -> list[nemo_curator.tasks.FileGroupTask]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Batch process multiple file group tasks for exact deduplication.

    This method reads all files from all tasks, concatenates them (if needed),
    hashes the text field using MD5, and inserts into the shuffle actor for
    deduplication. Processing tasks in batches significantly improves
    throughput by increasing the size of batches inserted during shuffle.

    ## Parameters

    tasks
    List of FileGroupTask objects containing files to process.
    Must contain at least one task.

    ## Returns

    list\[FileGroupTask]
    The input tasks unchanged. The actual deduplication results are
    written through the shuffle actor as a side effect.

    ## Raises

    RuntimeError
    If the ID generator is not initialized when assign\_id is True.
  </Indent>
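
  The batch flow described above (read every file from every task, concatenate, insert once) can be sketched without GPU libraries. Everything here is a stand-in: tasks are plain lists of file names, `read_file` is a hypothetical reader over an in-memory dict, and the "shuffle actor" is a list that records each inserted batch.

  ```python
  FAKE_FILES = {
      "a.jsonl": ["doc one", "doc two"],
      "b.jsonl": ["doc one"],
      "c.jsonl": ["doc three", "doc two"],
  }


  def read_file(path):
      """Hypothetical reader: return the documents stored under a file name."""
      return FAKE_FILES[path]


  def read_and_insert_batch(tasks, inserted_batches):
      """Concatenate documents from all tasks and insert them as one batch."""
      docs = []
      for task in tasks:
          for path in task:
              docs.extend(read_file(path))
      inserted_batches.append(docs)  # one large insert instead of one per task
      return tasks                   # tasks are returned unchanged


  inserted = []
  tasks = [["a.jsonl"], ["b.jsonl", "c.jsonl"]]
  result = read_and_insert_batch(tasks, inserted)
  ```

  Two tasks covering three files produce a single insert of five documents, which is the effect the batch method relies on to keep shuffle inserts large.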

  <Anchor id="nemo_curator-stages-deduplication-exact-identification-ExactDuplicateIdentification-setup">
    <CodeBlock links={{"nemo_curator.backends.base.WorkerMetadata":"/nemo-curator/nemo_curator/backends/base#nemo_curator-backends-base-WorkerMetadata"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.deduplication.exact.identification.ExactDuplicateIdentification.setup(
          _worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>

<Anchor id="nemo_curator-stages-deduplication-exact-identification-EXACT_DUPLICATE_GROUP_FIELD">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.stages.deduplication.exact.identification.EXACT_DUPLICATE_GROUP_FIELD = '_exact_duplicate_group'
    ```
  </CodeBlock>
</Anchor>