Modifiers

Base Class

class nemo_curator.modifiers.DocumentModifier

Module

class nemo_curator.Modify(
    modifier: DocumentModifier,
    text_field='text',
)
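
The pattern these two classes describe can be sketched without the library: a modifier holds a per-document text transformation, and a module applies it to one field of every record. The stand-in below is not NeMo Curator code. The names `SketchModifier`, `LowercaseModifier`, and `apply_modifier`, as well as the `modify_document` hook, are assumptions made for illustration, and plain dictionaries take the place of a distributed dataset.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchModifier(ABC):
    """Hypothetical stand-in for a document modifier base class."""

    @abstractmethod
    def modify_document(self, text: str) -> str:
        """Return the transformed text of a single document."""


class LowercaseModifier(SketchModifier):
    """Toy modifier that lowercases a document."""

    def modify_document(self, text: str) -> str:
        return text.lower()


def apply_modifier(
    records: List[Dict[str, str]],
    modifier: SketchModifier,
    text_field: str = "text",
) -> List[Dict[str, str]]:
    """Apply the modifier to `text_field` of every record; other fields are copied as-is."""
    return [
        {**record, text_field: modifier.modify_document(record[text_field])}
        for record in records
    ]


records = [{"id": "a", "text": "Hello World"}, {"id": "b", "text": "NeMo Curator"}]
modified = apply_modifier(records, LowercaseModifier())
```

Keeping the transformation in its own class means the same apply step can be reused with any modifier, and the `text_field` argument decides which column is rewritten.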

Modifiers

class nemo_curator.modifiers.BoilerPlateStringModifier(remove_if_at_top_or_bottom=True)

Discards any sentence that contains one of the boilerplate strings, such as “terms of use” or “privacy policy”. Source: adapted significantly from Google C4 processing.
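
The idea of dropping sentences that contain boilerplate strings can be illustrated with a small standalone sketch. This is not the library's implementation: the phrase list, the naive sentence splitter, and the function name `drop_boilerplate_sentences` are assumptions for illustration, and the sketch does not model the `remove_if_at_top_or_bottom` option.

```python
import re
from typing import Iterable

# Hypothetical phrase list for illustration only.
BOILERPLATE_PHRASES = ("terms of use", "privacy policy")


def drop_boilerplate_sentences(
    text: str, phrases: Iterable[str] = BOILERPLATE_PHRASES
) -> str:
    """Remove every sentence that contains one of the phrases (case-insensitive)."""
    lowered_phrases = [p.lower() for p in phrases]
    # Naive split on whitespace that follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [
        s for s in sentences
        if s and not any(p in s.lower() for p in lowered_phrases)
    ]
    return " ".join(kept)


sample = "Welcome to our site. Please read our Privacy Policy. Enjoy the articles!"
cleaned = drop_boilerplate_sentences(sample)
```

Here `cleaned` keeps the first and last sentences and drops the one that mentions the privacy policy.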

class nemo_curator.modifiers.FastTextLabelModifier(label)

class nemo_curator.modifiers.UnicodeReformatter

class nemo_curator.modifiers.PiiModifier(
    language: str = 'en',
    supported_entities: List[str] | None = None,
    anonymize_action: str = 'redact',
    batch_size: int = 2000,
    device: str = 'gpu',
    **kwargs,
)

This class is the entry point to using the PII de-identification module on documents stored as CSV, JSONL or other formats. It works with the Modify functionality as shown below:

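    import dask.dataframe
    import pandas as pd

    from nemo_curator import Modify
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers import PiiModifier

    # Build a small Dask-backed dataset with a single text column
    dataframe = pd.DataFrame({'text': ['Sarah and Ryan went out to play', 'Jensen is the CEO of NVIDIA']})
    dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
    dataset = DocumentDataset(dd)

    # Choose which entities to detect and how to anonymize them
    modifier = PiiModifier(
        batch_size=2000,
        language='en',
        supported_entities=['PERSON', 'EMAIL_ADDRESS'],
        anonymize_action='replace',
    )

    # Apply the modifier and write the result as JSON Lines
    modify = Modify(modifier)
    modified_dataset = modify(dataset)
    modified_dataset.df.to_json('output_files/*.jsonl', lines=True, orient='records')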
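
To make the effect of different `anonymize_action` values more concrete, here is a toy sketch for a single entity type. It is a stand-in under stated assumptions, not the library's detector: the regular expression, the `<EMAIL_ADDRESS>` placeholder format, and the function name `anonymize_emails` are hypothetical, and the real module's output for 'redact' and 'replace' may differ.

```python
import re

# Deliberately simple pattern; real PII detectors are far more thorough.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymize_emails(text: str, action: str = "redact") -> str:
    """Toy anonymizer: 'redact' deletes the match, 'replace' inserts a placeholder."""
    if action == "redact":
        return EMAIL_PATTERN.sub("", text)
    if action == "replace":
        return EMAIL_PATTERN.sub("<EMAIL_ADDRESS>", text)
    raise ValueError(f"unsupported action: {action!r}")


line = "Contact jane.doe@example.com for details"
redacted = anonymize_emails(line, "redact")
replaced = anonymize_emails(line, "replace")
```

In this sketch, redaction leaves a gap where the address was, while replacement keeps a marker that records which kind of entity was removed.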

load_deidentifier()

Helper function that loads the de-identifier.