`modifiers.pii_modifier`#

Module Contents#

Classes#

PiiModifier

Entry point for running the PII de-identification module on documents stored as CSV, JSONL, or other formats. It is used together with the Modify module.

Data#

DEFAULT_BATCH_SIZE

API#

modifiers.pii_modifier.DEFAULT_BATCH_SIZE#: 2000

class modifiers.pii_modifier.PiiModifier(
    language: str = DEFAULT_LANGUAGE,
    supported_entities: list[str] | None = None,
    anonymize_action: str = 'redact',
    batch_size: int = DEFAULT_BATCH_SIZE,
    device: str = 'gpu',
    **kwargs,
)#

Bases: nemo_curator.modifiers.DocumentModifier

This class is the entry point for running the PII de-identification module on documents stored as CSV, JSONL, or other formats. It is used with the Modify module, as in the following example:

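    import dask.dataframe
    import pandas as pd

    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers.pii_modifier import PiiModifier
    from nemo_curator.modules.modify import Modify

    # Build a small single-partition dataset with a "text" column.
    dataframe = pd.DataFrame({'text': ['Sarah and Ryan went out to play', 'Jensen is the CEO of NVIDIA']})
    dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
    dataset = DocumentDataset(dd)

    # Configure which entities to detect and how to anonymize them.
    modifier = PiiModifier(
        batch_size=2000,
        language='en',
        supported_entities=['PERSON', 'EMAIL_ADDRESS'],
        anonymize_action='replace',
    )

    # Apply the modifier and write one JSONL file per partition.
    modify = Modify(modifier)
    modified_dataset = modify(dataset)
    modified_dataset.df.to_json('output_files/*.jsonl', lines=True, orient='records')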


load_deidentifier() → nemo_curator.pii.algorithm.PiiDeidentifier#

Helper function to load the de-identifier.

modify_document( text: pandas.Series, partition_info: dict | None = None, ) → pandas.Series#
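The signature above takes a whole series of texts and returns a series of the same shape, and the constructor accepts a `batch_size`. The page does not describe the internals, so the following is only a minimal sketch of how such a batch-wise modifier might work, under stated assumptions: plain Python lists stand in for `pandas.Series`, `modify_in_batches` is a hypothetical helper, and `redact_emails` is a toy regex stand-in for the real `PiiDeidentifier`. None of these names come from the library.

```python
import re
from typing import Callable, List

# Hypothetical stand-in for a real de-identifier: it only handles
# e-mail addresses and uses a deliberately simple pattern.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(texts: List[str]) -> List[str]:
    """Replace every e-mail address in each text with a placeholder."""
    return [EMAIL_RE.sub("<EMAIL_ADDRESS>", t) for t in texts]


def modify_in_batches(
    texts: List[str],
    batch_size: int,
    deidentify: Callable[[List[str]], List[str]],
) -> List[str]:
    """Apply `deidentify` to `texts` in chunks of at most `batch_size`.

    Assumption: the output keeps the input order and length, mirroring
    a Series-in, Series-out method.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    out: List[str] = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        out.extend(deidentify(chunk))
    return out


docs = [
    "Contact sarah@example.com today",
    "No PII here",
    "a@b.org and c@d.net",
]
result = modify_in_batches(docs, batch_size=2, deidentify=redact_emails)
```

Processing in fixed-size chunks bounds how many documents are handed to the de-identifier at once. Whether `PiiModifier` chunks its input exactly this way is an assumption, not something this page states.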
