Important
You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.
Modifiers
Base Class
- class nemo_curator.modifiers.DocumentModifier
Module
- class nemo_curator.Modify(modifier: DocumentModifier, text_field='text')
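The relationship between the base class and the module can be sketched in plain Python. The block below is a minimal stand-in under stated assumptions, not NeMo Curator's implementation: `ToyDocumentModifier`, `ToyModify`, and `UppercaseModifier` are hypothetical names, the method name `modify_document` is assumed for illustration, and records are plain dictionaries rather than a `DocumentDataset`. It only illustrates the idea that a modifier transforms the text of one document and the module applies it to the configured `text_field` of every record.

```python
from abc import ABC, abstractmethod


class ToyDocumentModifier(ABC):
    """Hypothetical stand-in for a document modifier base class."""

    @abstractmethod
    def modify_document(self, text: str) -> str:
        """Return the modified text of a single document."""


class UppercaseModifier(ToyDocumentModifier):
    """Illustrative modifier that upper-cases a document."""

    def modify_document(self, text: str) -> str:
        return text.upper()


class ToyModify:
    """Hypothetical stand-in for a module that applies a modifier to one field."""

    def __init__(self, modifier: ToyDocumentModifier, text_field: str = "text"):
        self.modifier = modifier
        self.text_field = text_field

    def __call__(self, records: list[dict]) -> list[dict]:
        # Copy each record and rewrite only the configured text field.
        return [
            {**record, self.text_field: self.modifier.modify_document(record[self.text_field])}
            for record in records
        ]


records = [{"id": 1, "text": "hello world"}, {"id": 2, "text": "NeMo Curator"}]
modified = ToyModify(UppercaseModifier())(records)
```

Keeping the per-document logic in the modifier and the field selection in the module means the same modifier can be reused on datasets that store their text under a different column name.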
Modifiers
- class nemo_curator.modifiers.BoilerPlateStringModifier(remove_if_at_top_or_bottom=True)
Discards a sentence if it contains any of the boilerplate strings, such as “terms of use” or “privacy policy”. Source: adapted significantly from Google C4 processing.
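As a rough illustration of this kind of filtering, the sketch below drops text units that contain a boilerplate string. It is not the library's code and makes several assumptions: the function name `strip_boilerplate` is hypothetical, the two-entry string list is an illustrative subset, the unit of text is a blank-line-separated paragraph, and `remove_if_at_top_or_bottom=True` is read as "only strip flagged paragraphs that sit at the start or end of the document".

```python
# Illustrative subset; the library's actual list of boilerplate strings is not reproduced here.
BOILERPLATE_STRINGS = ("terms of use", "privacy policy")


def strip_boilerplate(text: str, remove_if_at_top_or_bottom: bool = True) -> str:
    """Drop paragraphs containing boilerplate strings (simplified sketch)."""
    paragraphs = text.split("\n\n")
    flagged = [any(s in p.lower() for s in BOILERPLATE_STRINGS) for p in paragraphs]

    if remove_if_at_top_or_bottom:
        # Assumed reading of the flag: strip only the contiguous runs of
        # flagged paragraphs at the very start and very end of the document.
        start, end = 0, len(paragraphs)
        while start < end and flagged[start]:
            start += 1
        while end > start and flagged[end - 1]:
            end -= 1
        kept = paragraphs[start:end]
    else:
        # Assumed alternative: strip every flagged paragraph wherever it occurs.
        kept = [p for p, bad in zip(paragraphs, flagged) if not bad]
    return "\n\n".join(kept)


doc = (
    "Privacy Policy | Terms of Use\n\n"
    "The actual article body.\n\n"
    "See our privacy policy for details."
)
cleaned = strip_boilerplate(doc)
```

Under these assumptions, a flagged paragraph in the middle of a document is left in place when the flag is `True` and removed when it is `False`.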
- class nemo_curator.modifiers.FastTextLabelModifier(label)
- class nemo_curator.modifiers.UnicodeReformatter
- class nemo_curator.modifiers.PiiModifier(language: str = 'en', supported_entities: List[str] | None = None, anonymize_action: str = 'redact', batch_size: int = 2000, device: str = 'gpu', **kwargs)
This class is the entry point to using the PII de-identification module on documents stored as CSV, JSONL or other formats. It works with the Modify functionality as shown below:
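    import dask.dataframe
    import pandas as pd

    from nemo_curator import Modify
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers import PiiModifier

    # Build a small Dask-backed dataset with a single 'text' column.
    dataframe = pd.DataFrame({'text': ['Sarah and Ryan went out to play', 'Jensen is the CEO of NVIDIA']})
    dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
    dataset = DocumentDataset(dd)

    # Configure which entities to detect and how to anonymize them.
    modifier = PiiModifier(
        batch_size=2000,
        language='en',
        supported_entities=['PERSON', 'EMAIL_ADDRESS'],
        anonymize_action='replace',
    )

    # Apply the modifier to every document and write the result as JSONL.
    modify = Modify(modifier)
    modified_dataset = modify(dataset)
    modified_dataset.df.to_json('output_files/*.jsonl', lines=True, orient='records')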
- load_deidentifier()
Helper function to load the de-identifier.