modifiers.pii_modifier
Module Contents#
Classes#
This class is the entry point to the PII de-identification module for documents stored as CSV, JSONL, or other formats. It works with the Modify functionality.
Data#
API#
- modifiers.pii_modifier.DEFAULT_BATCH_SIZE#
2000
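As a rough illustration of what a batch size of 2000 means in practice, the sketch below splits a list of texts into consecutive batches. It assumes that batch_size caps the number of documents handled per call; iter_batches is a hypothetical helper written for this example and is not part of the module.

```python
from typing import Iterator

DEFAULT_BATCH_SIZE = 2000  # value documented above


def iter_batches(
    texts: list[str], batch_size: int = DEFAULT_BATCH_SIZE
) -> Iterator[list[str]]:
    """Hypothetical helper: yield consecutive slices of at most batch_size texts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# 4500 toy documents split with the default batch size.
documents = [f"document {i}" for i in range(4500)]
batch_lengths = [len(batch) for batch in iter_batches(documents)]
```

With 4500 documents and the default size, the last batch is smaller than the others; a smaller batch_size trades throughput for a lower peak memory footprint.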
- class modifiers.pii_modifier.PiiModifier(
- language: str = DEFAULT_LANGUAGE,
- supported_entities: list[str] | None = None,
- anonymize_action: str = 'redact',
- batch_size: int = DEFAULT_BATCH_SIZE,
- device: str = 'gpu',
- **kwargs,
)
Bases: nemo_curator.modifiers.DocumentModifier
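This class is the entry point to the PII de-identification module for documents stored as CSV, JSONL, or other formats. It works with the Modify functionality as shown below:

    # Import paths assume the nemo_curator package layout.
    import dask.dataframe
    import pandas as pd
    from nemo_curator import Modify
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers.pii_modifier import PiiModifier

    # Build a small single-partition dataset with a 'text' column.
    dataframe = pd.DataFrame({'text': ['Sarah and Ryan went out to play', 'Jensen is the CEO of NVIDIA']})
    dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
    dataset = DocumentDataset(dd)

    # Configure which entities to detect and how to anonymize them.
    modifier = PiiModifier(
        batch_size=2000,
        language='en',
        supported_entities=['PERSON', 'EMAIL_ADDRESS'],
        anonymize_action='replace')

    # Apply the modifier and write the result as JSONL.
    modify = Modify(modifier)
    modified_dataset = modify(dataset)
    modified_dataset.df.to_json('output_files/*.jsonl', lines=True, orient='records')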
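The anonymize_action parameter selects how detected entities are rewritten. The toy sketch below contrasts two actions on a single string. It is not the module's implementation: a simple e-mail regex stands in for the real entity recognizers, anonymize_emails is a hypothetical helper, and the behavior shown (redact removes the span, replace substitutes an entity-type placeholder) is an assumption based on how Presidio-style anonymizers usually work.

```python
import re

# Toy detector: a simple e-mail regex stands in for real entity recognition.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymize_emails(text: str, action: str = "redact") -> str:
    """Hypothetical helper: rewrite e-mail addresses according to `action`."""
    if action == "redact":
        # Assumed behavior: drop the detected span entirely.
        return EMAIL_PATTERN.sub("", text)
    if action == "replace":
        # Assumed behavior: substitute an entity-type placeholder.
        return EMAIL_PATTERN.sub("<EMAIL_ADDRESS>", text)
    raise ValueError(f"unsupported action: {action!r}")


sample = "Contact sarah@example.com for details"
redacted = anonymize_emails(sample, "redact")
replaced = anonymize_emails(sample, "replace")
```

Under these assumptions, replace keeps the sentence readable and records which entity type was found, while redact leaves no trace of the entity beyond the surrounding whitespace.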
Initialization
- load_deidentifier() → nemo_curator.pii.algorithm.PiiDeidentifier#
Helper function to load the de-identifier.
- modify_document(
- text: pandas.Series,
- partition_info: dict | None = None,