Modifiers

Base Class

class nemo_curator.modifiers.DocumentModifier

property backend: Literal['pandas', 'cudf', 'any']

The dataframe backend the modifier operates on. Can be 'pandas', 'cudf', or 'any'. Defaults to 'pandas'.

Returns: A string representing the dataframe backend the modifier needs as input.

Return type: str
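As a rough illustration of the base-class contract, the sketch below uses a hypothetical stand-in class rather than importing the library. The `backend` default of 'pandas' comes from the description above. The per-document hook name `modify_document` and the `Lowercaser` subclass are assumptions made for illustration only.

```python
from abc import ABC, abstractmethod
from typing import Literal


class DocumentModifierSketch(ABC):
    """Hypothetical stand-in for the documented base class (not the library's code)."""

    @property
    def backend(self) -> Literal["pandas", "cudf", "any"]:
        # Mirrors the documented default backend.
        return "pandas"

    @abstractmethod
    def modify_document(self, text: str) -> str:
        """Assumed per-document hook; the method name is not confirmed by this page."""


class Lowercaser(DocumentModifierSketch):
    """Illustrative modifier that lowercases each document."""

    def modify_document(self, text: str) -> str:
        return text.lower()


modifier = Lowercaser()
print(modifier.backend)
print(modifier.modify_document("Hello World"))
```

A subclass that needs GPU dataframes would presumably override `backend` to return 'cudf'. That is an inference from the property description, not documented behavior.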

Module

class nemo_curator.Modify(modifier: DocumentModifier, text_field: str = 'text')

call(dataset: DocumentDataset) → DocumentDataset

Performs an arbitrary operation on a dataset.

Parameters: dataset (DocumentDataset) – The dataset to operate on.
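To make the role of `text_field` concrete, here is a minimal sketch of the idea using plain dictionaries instead of a DocumentDataset. The helper `modify_records` is hypothetical and is not part of the library. It assumes that the module rewrites only the named text field and leaves other fields untouched.

```python
from typing import Callable, Dict, List


def modify_records(
    records: List[Dict[str, str]],
    modify_document: Callable[[str], str],
    text_field: str = "text",
) -> List[Dict[str, str]]:
    """Return copies of the records with only `text_field` rewritten."""
    return [
        {**record, text_field: modify_document(record[text_field])}
        for record in records
    ]


records = [{"id": "doc-0", "text": "  padded text  "}]
cleaned = modify_records(records, str.strip)  # str.strip plays the modifier here
print(cleaned)
```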

Modifiers

class nemo_curator.modifiers.BoilerPlateStringModifier(remove_if_at_top_or_bottom: bool = True)

Discards a sentence if it contains any of the boilerplate strings, such as “terms of use” or “privacy policy”. Source: adapted significantly from Google C4 processing.

class nemo_curator.modifiers.FastTextLabelModifier(label: str)

class nemo_curator.modifiers.UnicodeReformatter(config: ftfy.TextFixerConfig | None = None, unescape_html: str | bool = 'auto', remove_terminal_escapes: bool = True, fix_encoding: bool = True, restore_byte_a0: bool = True, replace_lossy_sequences: bool = True, decode_inconsistent_utf8: bool = True, fix_c1_controls: bool = True, fix_latin_ligatures: bool = False, fix_character_width: bool = False, uncurl_quotes: bool = False, fix_line_breaks: bool = False, fix_surrogates: bool = True, remove_control_chars: bool = True, normalization: Literal['NFC', 'NFD', 'NFKC', 'NFKD'] | None = None, max_decode_length: int = 1000000, explain: bool = True)

class nemo_curator.modifiers.PiiModifier(language: str = 'en', supported_entities: list[str] | None = None, anonymize_action: str = 'redact', batch_size: int = 2000, device: str = 'gpu', **kwargs)

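This class is the entry point for running the PII de-identification module on documents stored as CSV, JSONL, or other formats. It works with the Modify module as shown below:

    import dask.dataframe
    import pandas as pd

    # Import paths follow the class paths listed on this page;
    # the DocumentDataset path is an assumption.
    from nemo_curator import Modify
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers import PiiModifier

    # Wrap a small pandas dataframe in a single-partition Dask dataframe.
    dataframe = pd.DataFrame({'text': ['Sarah and Ryan went out to play', 'Jensen is the CEO of NVIDIA']})
    dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
    dataset = DocumentDataset(dd)

    modifier = PiiModifier(
        batch_size=2000,
        language='en',
        supported_entities=['PERSON', 'EMAIL_ADDRESS'],
        anonymize_action='replace',
    )

    # Apply the modifier and write the result as JSON Lines.
    modify = Modify(modifier)
    modified_dataset = modify(dataset)
    modified_dataset.df.to_json('output_files/*.jsonl', lines=True, orient='records')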

load_deidentifier() → PiiDeidentifier

Helper function to load the de-identifier.

class nemo_curator.modifiers.LineRemover(patterns: list[str])

Removes a line from a document if the content of the line matches one of the given strings.
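The behavior described for LineRemover can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: the function name `remove_lines` is hypothetical, and it assumes "matches" means the whole line equals one of the patterns exactly.

```python
def remove_lines(text: str, patterns: list[str]) -> str:
    """Drop every line whose full content equals one of the patterns."""
    blocked = set(patterns)
    # Assumption: exact, whole-line comparison (no substring or regex matching).
    return "\n".join(line for line in text.split("\n") if line not in blocked)


document = "First paragraph\nAdvertisement\nAn Advertisement inside a sentence"
print(remove_lines(document, ["Advertisement"]))
```

Under that exact-match assumption, a line that merely contains a pattern as a substring is kept.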

class nemo_curator.modifiers.MarkdownRemover

Removes Markdown formatting in a document including bold, italic, underline, and URL text.

class nemo_curator.modifiers.NewlineNormalizer

Replaces 3 or more consecutive newline characters with only 2 newline characters.
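The normalization described for NewlineNormalizer amounts to a single regular-expression substitution. The sketch below shows that idea with the standard `re` module. The function name `normalize_newlines` is hypothetical, and the library's own pattern may differ.

```python
import re

# Three or more consecutive newline characters.
THREE_OR_MORE_NEWLINES = re.compile(r"\n{3,}")


def normalize_newlines(text: str) -> str:
    """Collapse runs of 3+ newlines to exactly 2; shorter runs are untouched."""
    return THREE_OR_MORE_NEWLINES.sub("\n\n", text)


print(repr(normalize_newlines("title\n\n\n\nbody\n\nfooter")))
```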

class nemo_curator.modifiers.UrlRemover

Removes all URLs in a document.
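As a rough sketch of what URL removal looks like, the snippet below strips `http` and `https` URLs with a deliberately simple pattern. Both the pattern and the function name `remove_urls` are assumptions for illustration. The library's actual URL pattern is not given on this page and is likely more thorough.

```python
import re

# Simplified assumption: a URL starts with http:// or https:// and runs
# until the next whitespace character.
SIMPLE_URL = re.compile(r"https?://\S+")


def remove_urls(text: str) -> str:
    """Delete every substring matched by the simplified URL pattern."""
    return SIMPLE_URL.sub("", text)


print(remove_urls("Docs at https://example.com/a?b=1 and more"))
```

Note that this simple pattern also consumes punctuation attached to the end of a URL and leaves the surrounding whitespace in place.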

class nemo_curator.modifiers.Slicer(left: int | str | None = 0, right: str | int | None = None, include_left: bool = True, include_right: bool = True, strip: bool = True)

Slices a document based on indices or strings.

class nemo_curator.modifiers.QuotationRemover

Removes quotations from a document following a few rules:

- If the document is shorter than 2 characters, it is returned unchanged.
- If the document starts and ends with a quotation mark and there are no newlines in the document, the quotation marks are removed.
- If the document starts and ends with a quotation mark and there are newlines in the document, the quotation marks are removed only if the first line does not end with a quotation mark.
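The three QuotationRemover rules can be sketched as a small function. This is an illustration of the rules as written, not the library's code: the name `remove_quotation` is hypothetical, and it assumes the quotation mark is the ASCII double quote.

```python
def remove_quotation(text: str, quote: str = '"') -> str:
    """Strip one pair of enclosing quotation marks according to the listed rules."""
    # Rule 1: documents shorter than 2 characters are returned unchanged.
    if len(text) < 2:
        return text
    # Rules 2 and 3 only apply when the document starts and ends with a quote.
    if not (text.startswith(quote) and text.endswith(quote)):
        return text
    # Rule 2: single-line document, so the enclosing quotes are removed.
    if "\n" not in text:
        return text[1:-1]
    # Rule 3: multi-line document; keep the quotes if the first line
    # already closes its own quotation.
    first_line = text.split("\n", 1)[0]
    if first_line.endswith(quote):
        return text
    return text[1:-1]


print(remove_quotation('"hello"'))
print(repr(remove_quotation('"line one\nline two"')))
print(repr(remove_quotation('"title"\nbody text"')))
```

The first-line check guards the case where the opening quote belongs to a short quoted phrase rather than to a quotation spanning the whole document.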