nemo_curator.stages.deduplication.exact.identification
Module Contents
Classes
Data
API
Bases: DeduplicationIO, ShuffleStage
Stage that finds exact duplicates in a given column.
Parameters
text_field Field name representing the field to find duplicates in.
output_path Path to write output files.
input_filetype Type of the input files. Must be one of “jsonl” or “parquet”. Default is “parquet”.
read_kwargs Keyword arguments for cudf.read_parquet method.
write_kwargs Keyword arguments for cudf.to_parquet method.
assign_id Whether to assign a unique id to each document.
id_field Existing id field name if not assigning a new id.
total_nparts Total number of output partitions. If None, will be set automatically by the executor.
rmm_pool_size Size of the RMM GPU memory pool in bytes. If “auto”, the memory pool is set to 90% of the free GPU memory. If None, the memory pool is set to 50% of the free GPU memory and can expand if needed.
spill_memory_limit Device memory limit in bytes for spilling to host. If “auto”, the limit is set to 80% of the RMM pool size. If None, spilling is disabled.
enable_statistics Whether the underlying rapidsmpf shuffler should collect shuffle statistics.
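The interaction between rmm_pool_size and spill_memory_limit is easier to see with numbers. The following is a minimal sketch of the sizing rules described above, not the library's implementation: the helper names resolve_rmm_pool_size and resolve_spill_limit are hypothetical, and the free-memory figure is passed in by hand rather than queried from a GPU.

```python
def resolve_rmm_pool_size(rmm_pool_size, free_gpu_bytes):
    """Hypothetical helper: turn the rmm_pool_size setting into a byte count."""
    if rmm_pool_size == "auto":
        # "auto": 90% of the free GPU memory.
        return free_gpu_bytes * 9 // 10
    if rmm_pool_size is None:
        # None: start at 50% of the free GPU memory (the real pool may grow).
        return free_gpu_bytes * 5 // 10
    # Otherwise the caller supplied an explicit size in bytes.
    return int(rmm_pool_size)


def resolve_spill_limit(spill_memory_limit, pool_bytes):
    """Hypothetical helper: turn the spill_memory_limit setting into a byte count."""
    if spill_memory_limit == "auto":
        # "auto": 80% of the RMM pool size.
        return pool_bytes * 8 // 10
    if spill_memory_limit is None:
        # None: spilling to host is disabled.
        return None
    return int(spill_memory_limit)


# Illustrative figure only; a real value would come from the GPU runtime.
free = 1000
pool = resolve_rmm_pool_size("auto", free)
limit = resolve_spill_limit("auto", pool)
print(pool, limit)
```

With both settings on “auto”, the spill limit works out to 72% of the free GPU memory (80% of 90%).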
Get the removal ids for the given dataframe.
Hash the text field and insert into the shuffle actor.
Parameters
df DataFrame containing the id_field and text_field columns.
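The idea behind hashing the text field and deriving removal ids can be illustrated on the CPU with the standard library. This is a sketch under stated assumptions, not the stage's code: the stage operates on cudf DataFrames and a shuffle actor, whereas the hypothetical find_removal_ids below uses plain dictionaries, and the "keep the first id in each group" policy is an assumption made for illustration.

```python
import hashlib
from collections import defaultdict


def md5_hex(text):
    """Return the MD5 hex digest of a UTF-8 encoded string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_removal_ids(records, id_field="id", text_field="text"):
    """Hypothetical helper: ids of documents whose text duplicates an earlier one."""
    groups = defaultdict(list)
    for record in records:
        # Documents with identical text share the same digest.
        groups[md5_hex(record[text_field])].append(record[id_field])
    removal_ids = []
    for ids in groups.values():
        # Assumption: keep the first document in each group, remove the rest.
        removal_ids.extend(ids[1:])
    return removal_ids


records = [
    {"id": 0, "text": "the quick brown fox"},
    {"id": 1, "text": "a different document"},
    {"id": 2, "text": "the quick brown fox"},
]
removal_ids = find_removal_ids(records)
print(removal_ids)
```

Only exact byte-for-byte matches collide here; texts that differ by a single space hash to different digests and are not treated as duplicates.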
Read files and return a DataFrame.
Parameters
filepaths List of file paths to read.
Returns
cudf.DataFrame DataFrame containing the id_field and text_field columns.
Single task processing is not supported.
Raises
NotImplementedError Always raised as this stage only supports batch processing.
Batch process multiple file group tasks for exact deduplication.
This method reads all files from all tasks, concatenates them (if needed), hashes the text field using MD5, and inserts the result into the shuffle actor for deduplication. Processing tasks in batches improves throughput because each insertion into the shuffle carries a larger chunk of data.
Parameters
tasks List of FileGroupTask objects containing files to process. Must contain at least one task.
Returns
list[FileGroupTask] The input tasks unchanged. The actual deduplication results are written through the shuffle actor as a side effect.
Raises
RuntimeError If the ID generator is not initialized when assign_id is True.
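The batch-only contract described above can be sketched with toy classes. Everything here is hypothetical and simplified: ToyShuffler stands in for the shuffle actor, ToyExactDedupStage for the stage, and each task is a plain list of (id, text) pairs instead of a FileGroupTask. The sketch shows three points from the description: single-task processing raises NotImplementedError, a whole batch is hashed and inserted in one call, and the input tasks are returned unchanged.

```python
import hashlib


class ToyShuffler:
    """Hypothetical stand-in for the shuffle actor: routes rows to partitions by hash."""

    def __init__(self, nparts):
        self.partitions = [[] for _ in range(nparts)]
        self.insert_calls = 0

    def insert_chunk(self, rows):
        self.insert_calls += 1
        for doc_id, digest in rows:
            # Equal digests always map to the same partition.
            index = int(digest, 16) % len(self.partitions)
            self.partitions[index].append((doc_id, digest))


class ToyExactDedupStage:
    """Hypothetical batch-only stage operating on lists of (id, text) pairs."""

    def __init__(self, shuffler):
        self.shuffler = shuffler

    def process(self, task):
        raise NotImplementedError("only batch processing is supported")

    def process_batch(self, tasks):
        if not tasks:
            raise ValueError("tasks must contain at least one task")
        # Concatenate the rows of every task, then hash and insert them once.
        rows = [row for task in tasks for row in task]
        hashed = [
            (doc_id, hashlib.md5(text.encode("utf-8")).hexdigest())
            for doc_id, text in rows
        ]
        self.shuffler.insert_chunk(hashed)
        # The tasks pass through unchanged; results live in the shuffler.
        return tasks


tasks = [[(0, "hello"), (1, "world")], [(2, "hello")]]
shuffler = ToyShuffler(nparts=4)
stage = ToyExactDedupStage(shuffler)
result = stage.process_batch(tasks)
```

Because rows with equal digests land in the same partition, each partition can later be checked for duplicates independently of the others.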