Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.

Modifiers

Base Class

class nemo_curator.modifiers.DocumentModifier

Module

class nemo_curator.Modify(modifier: nemo_curator.modifiers.doc_modifier.DocumentModifier, text_field='text')
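The signature suggests that Modify takes a modifier object and applies it to one field of every document, leaving the other fields alone. The following is a minimal, library-free sketch of that pattern, not the NeMo Curator implementation. The hook name modify_document, the UppercaseModifier class, and the apply_modifier helper are all assumptions for illustration, since this page does not list the base class's methods.

```python
from typing import Dict, List


class UppercaseModifier:
    """Hypothetical modifier; stands in for a DocumentModifier subclass."""

    def modify_document(self, text: str) -> str:
        # Assumed hook name: receive one document's text, return the new text.
        return text.upper()


def apply_modifier(records: List[Dict], modifier, text_field: str = "text") -> List[Dict]:
    """Return copies of the records with the modifier applied to text_field only."""
    return [
        {**record, text_field: modifier.modify_document(record[text_field])}
        for record in records
    ]


records = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]
modified = apply_modifier(records, UppercaseModifier())
```

Only the named text field changes; the input records are copied rather than edited in place, which mirrors how the documented example assigns the result of modify(dataset) to a new variable.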

Modifiers

class nemo_curator.modifiers.BoilerPlateStringModifier(remove_if_at_top_or_bottom=True)

Discards any sentence that contains one of the boilerplate strings, such as “terms of use” or “privacy policy”. Source: adapted significantly from Google C4 processing.
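As a rough illustration of boilerplate filtering, here is a sketch in plain Python. It is not the library's code. The phrase list is illustrative, the text is split into blank-line-separated paragraphs rather than sentences to keep the splitting simple, and the only_top_or_bottom flag is one plausible reading of remove_if_at_top_or_bottom (removal limited to the start and end of the document). All of these are assumptions.

```python
import re

# Illustrative phrase list; the library's actual list is not shown on this page.
BOILERPLATE_STRINGS = ("terms of use", "privacy policy")


def has_boilerplate(unit: str) -> bool:
    """Return True if the text unit contains any boilerplate phrase (case-insensitive)."""
    lowered = unit.lower()
    return any(phrase in lowered for phrase in BOILERPLATE_STRINGS)


def strip_boilerplate(text: str, only_top_or_bottom: bool = True) -> str:
    """Drop paragraphs that contain boilerplate phrases.

    Assumption: paragraphs are separated by blank lines, and
    only_top_or_bottom=True restricts removal to the leading and
    trailing paragraphs of the document.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    if only_top_or_bottom:
        # Peel boilerplate paragraphs off the top, then off the bottom.
        while paragraphs and has_boilerplate(paragraphs[0]):
            paragraphs.pop(0)
        while paragraphs and has_boilerplate(paragraphs[-1]):
            paragraphs.pop()
    else:
        paragraphs = [p for p in paragraphs if not has_boilerplate(p)]
    return "\n\n".join(paragraphs)


doc = (
    "Read our Privacy Policy.\n\n"
    "The model was trained on text.\n\n"
    "Our terms of use changed last year.\n\n"
    "More content.\n\n"
    "See the Terms of Use."
)
edges_only = strip_boilerplate(doc)
everywhere = strip_boilerplate(doc, only_top_or_bottom=False)
```

With the flag set, the mention of "terms of use" in the middle of the document survives, on the reasoning that boilerplate tends to cluster in headers and footers while a mid-document mention is more likely real content.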

class nemo_curator.modifiers.FastTextLabelModifier(label)
class nemo_curator.modifiers.UnicodeReformatter
class nemo_curator.modifiers.PiiModifier(language: str = 'en', supported_entities: Optional[List[str]] = None, anonymize_action: str = 'redact', batch_size: int = 2000, device: str = 'gpu', **kwargs)

This class is the entry point to using the PII de-identification module on documents stored as CSV, JSONL or other formats. It works with the Modify functionality as shown below:

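    import dask.dataframe
    import pandas as pd

    from nemo_curator import Modify
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers import PiiModifier

    # Build a small Dask-backed dataset with a 'text' column.
    dataframe = pd.DataFrame({'text': ['Sarah and Ryan went out to play', 'Jensen is the CEO of NVIDIA']})
    dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
    dataset = DocumentDataset(dd)

    # Configure which entities to detect and how to anonymize them.
    modifier = PiiModifier(
        batch_size=2000,
        language='en',
        supported_entities=['PERSON', 'EMAIL_ADDRESS'],
        anonymize_action='replace',
    )

    # Apply the modifier to the 'text' field and write one JSONL file per partition.
    modify = Modify(modifier)
    modified_dataset = modify(dataset)
    modified_dataset.df.to_json('output_files/*.jsonl', lines=True, orient='records')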

load_deidentifier()

Helper function to load the de-identifier