Modifiers
Base Class
- class nemo_curator.modifiers.DocumentModifier
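A custom modifier would typically be written by subclassing the base class. The sketch below is self-contained and does not import nemo_curator: `DocumentModifierSketch` is a hypothetical stand-in for `DocumentModifier`, and the `modify_document(text)` hook is an assumption that this page does not document, so check the library source before relying on it.

```python
from abc import ABC, abstractmethod


class DocumentModifierSketch(ABC):
    """Hypothetical stand-in for nemo_curator.modifiers.DocumentModifier."""

    @abstractmethod
    def modify_document(self, text: str) -> str:
        """Return the modified text of a single document (assumed hook name)."""


class CollapseWhitespace(DocumentModifierSketch):
    """Illustrative modifier: collapses runs of whitespace into single spaces."""

    def modify_document(self, text: str) -> str:
        return " ".join(text.split())


cleaned = CollapseWhitespace().modify_document("Hello   world\n\nagain ")
print(cleaned)
```

Keeping the per-document logic in one method means the same modifier can be reused unchanged by whatever component iterates over the dataset.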
Module
- class nemo_curator.Modify(modifier: nemo_curator.modifiers.doc_modifier.DocumentModifier, text_field='text')
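Conceptually, Modify applies a modifier to the `text_field` of every document in a dataset. The following is a minimal sketch of that idea over a plain list of dicts rather than a real dataset object; `apply_modifier` and `UppercaseModifier` are hypothetical names, and the `modify_document` method is an assumption about the modifier interface.

```python
from typing import Dict, Iterable, List


class UppercaseModifier:
    """Hypothetical modifier exposing an assumed modify_document(text) method."""

    def modify_document(self, text: str) -> str:
        return text.upper()


def apply_modifier(
    records: Iterable[Dict[str, str]], modifier, text_field: str = "text"
) -> List[Dict[str, str]]:
    """Return copies of the records with the modifier applied to text_field only."""
    return [
        {**rec, text_field: modifier.modify_document(rec[text_field])}
        for rec in records
    ]


docs = [{"id": "a", "text": "first document"}, {"id": "b", "text": "second document"}]
modified = apply_modifier(docs, UppercaseModifier())
print(modified)
```

Only the chosen text field is rewritten; other fields pass through untouched, and the input records are left unmodified.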
Modifiers
- class nemo_curator.modifiers.BoilerPlateStringModifier(remove_if_at_top_or_bottom=True)
Discards a sentence if it contains any of the boilerplate strings, such as “terms of use” or “privacy policy”. Source: adapted significantly from Google C4 processing.
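The filtering idea can be sketched in plain Python. This is not the library's implementation: the phrase list is a hypothetical two-item sample, the unit of text is assumed to be a blank-line-separated paragraph, and the `only_top_or_bottom` flag is an illustrative guess at what an option like `remove_if_at_top_or_bottom` might control.

```python
from typing import List, Sequence

# Hypothetical phrase list; the library's actual list is not shown in this reference.
BOILERPLATE_STRINGS = ("terms of use", "privacy policy")


def is_boilerplate(paragraph: str, phrases: Sequence[str] = BOILERPLATE_STRINGS) -> bool:
    """True if the paragraph contains any boilerplate phrase, ignoring case."""
    lowered = paragraph.lower()
    return any(phrase in lowered for phrase in phrases)


def strip_boilerplate(text: str, only_top_or_bottom: bool = True) -> str:
    """Drop boilerplate paragraphs, either only at the edges or everywhere."""
    paragraphs: List[str] = [p for p in text.split("\n\n") if p.strip()]
    if only_top_or_bottom:
        # Trim matching paragraphs from the start and the end; keep the middle intact.
        while paragraphs and is_boilerplate(paragraphs[0]):
            paragraphs.pop(0)
        while paragraphs and is_boilerplate(paragraphs[-1]):
            paragraphs.pop()
    else:
        paragraphs = [p for p in paragraphs if not is_boilerplate(p)]
    return "\n\n".join(paragraphs)


page = (
    "Read our Privacy Policy before continuing.\n\n"
    "The model was trained on web text.\n\n"
    "A character in the story ignores the terms of use.\n\n"
    "Results are reported below.\n\n"
    "By browsing you accept the Terms of Use."
)
edges_only = strip_boilerplate(page)
everywhere = strip_boilerplate(page, only_top_or_bottom=False)
```

Restricting removal to the top and bottom keeps body text that merely mentions such a phrase, at the cost of missing boilerplate embedded mid-document.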
- class nemo_curator.modifiers.FastTextLabelModifier(label)
- class nemo_curator.modifiers.UnicodeReformatter
- class nemo_curator.modifiers.PiiModifier(language: str = 'en', supported_entities: Optional[List[str]] = None, anonymize_action: str = 'redact', batch_size: int = 2000, device: str = 'gpu', **kwargs)
This class is the entry point to using the PII de-identification module on documents stored as CSV, JSONL or other formats. It works with the Modify functionality as shown below:
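    import dask.dataframe
    import pandas as pd

    from nemo_curator import Modify
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers import PiiModifier

    # Build a small single-partition dataset with a 'text' column.
    dataframe = pd.DataFrame({'text': ['Sarah and Ryan went out to play', 'Jensen is the CEO of NVIDIA']})
    dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
    dataset = DocumentDataset(dd)

    # Replace detected person names and e-mail addresses.
    modifier = PiiModifier(
        batch_size=2000,
        language='en',
        supported_entities=['PERSON', 'EMAIL_ADDRESS'],
        anonymize_action='replace')

    # Apply the modifier and write the result as JSON Lines.
    modify = Modify(modifier)
    modified_dataset = modify(dataset)
    modified_dataset.df.to_json('output_files/*.jsonl', lines=True, orient='records')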
- load_deidentifier()
Helper function that loads the de-identifier.
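To make the effect of the `anonymize_action` parameter of PiiModifier more concrete, here is a small standard-library sketch that treats e-mail addresses as the only entity type. It does not use PiiModifier or its de-identifier: the regex, the `anonymize_emails` function, and the exact result of each action (`'redact'` deleting the match, `'replace'` substituting an `<EMAIL_ADDRESS>` placeholder) are assumptions for illustration and may differ from the library's behavior.

```python
import re

# Deliberately simple pattern for illustration; real e-mail detection is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def anonymize_emails(text: str, anonymize_action: str = "redact") -> str:
    """Apply a hypothetical 'redact' or 'replace' action to e-mail addresses."""
    if anonymize_action == "redact":
        # Remove the matched text entirely.
        return EMAIL_RE.sub("", text)
    if anonymize_action == "replace":
        # Substitute a placeholder naming the entity type.
        return EMAIL_RE.sub("<EMAIL_ADDRESS>", text)
    raise ValueError(f"unsupported action: {anonymize_action!r}")


sample = "Contact sarah@example.com for details"
redacted = anonymize_emails(sample, "redact")
replaced = anonymize_emails(sample, "replace")
print(replaced)
```

A placeholder keeps the sentence readable and records what kind of entity was removed, whereas outright deletion leaves no trace of the entity type.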