Modifiers#

Base Class#

class nemo_curator.modifiers.DocumentModifier#
property backend: Literal['pandas', 'cudf', 'any']#

The dataframe backend the modifier operates on. Can be ‘pandas’, ‘cudf’, or ‘any’. Defaults to ‘pandas’.

Returns:

A string representing the dataframe backend the modifier needs as input

Return type:

str

Module#

class nemo_curator.Modify(
modifier: DocumentModifier,
text_field: str = 'text',
)#
call(
dataset: DocumentDataset,
) → DocumentDataset#

Applies the modifier to the text_field of each document in the dataset.

Parameters:

dataset (DocumentDataset) – The dataset to operate on
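
As a rough illustration of what applying a modifier to a text field means, the sketch below uses plain Python only. It is a stand-in and not the library's implementation: `UppercaseModifier`, `apply_modifier`, and the `modify_document` method name are assumptions made for this example, and a list of dictionaries stands in for a DocumentDataset.

```python
class UppercaseModifier:
    """Hypothetical modifier that upper-cases a document's text."""

    def modify_document(self, text: str) -> str:
        return text.upper()


def apply_modifier(records, modifier, text_field="text"):
    """Return new records with `modifier` applied to `text_field` only."""
    return [
        {**record, text_field: modifier.modify_document(record[text_field])}
        for record in records
    ]


# A list of dicts stands in for the dataset; other fields pass through untouched.
docs = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]
modified = apply_modifier(docs, UppercaseModifier())
```

Only the chosen text field changes. The other fields and the input records are left as they were.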

Modifiers#

class nemo_curator.modifiers.BoilerPlateStringModifier(remove_if_at_top_or_bottom: bool = True)#

Discards a sentence if it contains any of the boilerplate strings, such as “terms of use” or “privacy policy”. Source: adapted significantly from Google C4 processing.

class nemo_curator.modifiers.FastTextLabelModifier(label: str)#
class nemo_curator.modifiers.UnicodeReformatter(
config: ftfy.TextFixerConfig | None = None,
unescape_html: str | bool = 'auto',
remove_terminal_escapes: bool = True,
fix_encoding: bool = True,
restore_byte_a0: bool = True,
replace_lossy_sequences: bool = True,
decode_inconsistent_utf8: bool = True,
fix_c1_controls: bool = True,
fix_latin_ligatures: bool = False,
fix_character_width: bool = False,
uncurl_quotes: bool = False,
fix_line_breaks: bool = False,
fix_surrogates: bool = True,
remove_control_chars: bool = True,
normalization: Literal['NFC', 'NFD', 'NFKC', 'NFKD'] | None = None,
max_decode_length: int = 1000000,
explain: bool = True,
)#
class nemo_curator.modifiers.PiiModifier(
language: str = 'en',
supported_entities: list[str] | None = None,
anonymize_action: str = 'redact',
batch_size: int = 2000,
device: str = 'gpu',
**kwargs,
)#

This class is the entry point to using the PII de-identification module on documents stored as CSV, JSONL or other formats. It works with the Modify functionality as shown below:

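    import dask.dataframe
    import pandas as pd

    from nemo_curator import Modify
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers import PiiModifier

    # Build a small single-partition dataset with a "text" column.
    dataframe = pd.DataFrame({"text": ["Sarah and Ryan went out to play", "Jensen is the CEO of NVIDIA"]})
    dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
    dataset = DocumentDataset(dd)

    # Replace detected PERSON and EMAIL_ADDRESS entities.
    modifier = PiiModifier(
        batch_size=2000,
        language="en",
        supported_entities=["PERSON", "EMAIL_ADDRESS"],
        anonymize_action="replace",
    )

    modify = Modify(modifier)
    modified_dataset = modify(dataset)
    modified_dataset.df.to_json("output_files/*.jsonl", lines=True, orient="records")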

load_deidentifier() → PiiDeidentifier#

Helper function to load the de-identifier

class nemo_curator.modifiers.LineRemover(patterns: list[str])#

Removes lines from a document if the content of the line matches a given string.
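
The behaviour described above can be sketched in plain Python. This is one reading of the description, not the library's code: the function name `remove_matching_lines` is hypothetical, and the sketch assumes a line is dropped only when its whole content equals one of the given strings.

```python
def remove_matching_lines(text: str, patterns: list[str]) -> str:
    """Drop every line whose full content equals one of `patterns`."""
    blocked = set(patterns)
    kept = [line for line in text.split("\n") if line not in blocked]
    return "\n".join(kept)


cleaned = remove_matching_lines(
    "keep\nAdvertisement\nalso keep",
    ["Advertisement"],
)
```

Under this assumption, a line that merely contains one of the strings, such as "An Advertisement here", is kept.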

class nemo_curator.modifiers.MarkdownRemover#

Removes Markdown formatting in a document including bold, italic, underline, and URL text.

class nemo_curator.modifiers.NewlineNormalizer#

Replaces 3 or more consecutive newline characters with only 2 newline characters.
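
A minimal sketch of this normalization using a regular expression. The function name `normalize_newlines` is hypothetical and the code is an illustration of the rule, not the library's implementation.

```python
import re

# Three or more consecutive newline characters.
_EXCESS_NEWLINES = re.compile(r"\n{3,}")


def normalize_newlines(text: str) -> str:
    """Collapse runs of 3+ newlines down to exactly 2."""
    return _EXCESS_NEWLINES.sub("\n\n", text)


normalized = normalize_newlines("first\n\n\n\nsecond")
```

Runs of one or two newlines are left as they are, so existing paragraph breaks survive.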

class nemo_curator.modifiers.UrlRemover#

Removes all URLs in a document.
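
For illustration, URL removal can be sketched with a regular expression. The pattern below is a simplified assumption (anything starting with http://, https://, or www. up to the next whitespace) and may differ from what the library matches. The function name `remove_urls` is hypothetical.

```python
import re

# Assumed pattern: scheme-prefixed or www-prefixed tokens up to whitespace.
_URL = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def remove_urls(text: str) -> str:
    """Delete every substring that matches the assumed URL pattern."""
    return _URL.sub("", text)


without_urls = remove_urls("see https://example.com for details")
```

The surrounding whitespace is not touched, so removing a URL from the middle of a sentence leaves two adjacent spaces.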

class nemo_curator.modifiers.Slicer(
left: int | str | None = 0,
right: str | int | None = None,
include_left: bool = True,
include_right: bool = True,
strip: bool = True,
)#

Slices a document based on indices or strings.
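
The sketch below shows one plausible reading of how the parameters interact. Several details are assumptions not stated in this reference: a string `left` is located by its first occurrence, a string `right` by its last occurrence, and a missing marker yields an empty string. The function name `slice_document` is hypothetical.

```python
def slice_document(text, left=0, right=None,
                   include_left=True, include_right=True, strip=True):
    """Slice `text` between integer indices or marker strings."""
    if isinstance(left, str):
        found = text.find(left)  # assumption: first occurrence
        if found == -1:
            return ""  # assumption: missing marker gives an empty result
        start = found if include_left else found + len(left)
    else:
        start = 0 if left is None else left

    if isinstance(right, str):
        found = text.rfind(right)  # assumption: last occurrence
        if found == -1:
            return ""
        end = found + len(right) if include_right else found
    else:
        end = len(text) if right is None else right

    result = text[start:end]
    return result.strip() if strip else result


body = slice_document(
    "Intro. BEGIN body text END. Footer",
    left="BEGIN", right="END",
    include_left=False, include_right=False,
)
```

With `include_left` and `include_right` left at their defaults, the same call would keep the BEGIN and END markers in the result.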

class nemo_curator.modifiers.QuotationRemover#

Removes quotations from a document following a few rules:

  • If the document is less than 2 characters, it is returned unchanged.

  • If the document starts and ends with a quotation mark and there are no newlines in the document, the quotation marks are removed.

  • If the document starts and ends with a quotation mark and there are newlines in the document, the quotation marks are removed only if the first line does not end with a quotation mark.
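
The rules above can be sketched in plain Python. The sketch assumes that "quotation mark" means the ASCII double quote, which this reference does not specify, and the function name `remove_quotations` is hypothetical.

```python
QUOTE = '"'  # assumption: only the ASCII double quote is considered


def remove_quotations(text: str) -> str:
    """Strip one pair of enclosing quotes when the rules allow it."""
    if len(text) < 2:
        return text
    if text.startswith(QUOTE) and text.endswith(QUOTE):
        first_line = text.split("\n", 1)[0]
        # Single-line documents always qualify; multi-line ones qualify
        # only if the first line does not itself end with a quote.
        if "\n" not in text or not first_line.endswith(QUOTE):
            return text[1:-1]
    return text


single = remove_quotations('"hello"')
multi = remove_quotations('"line one\nline two"')
kept = remove_quotations('"first"\n"second"')
```

In the last call the first line ends with a quotation mark, so the document most likely holds separate quoted lines and is returned unchanged.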