Modifiers#
Base Class#
Module#
- class nemo_curator.Modify(modifier: DocumentModifier, text_field: str = 'text')
- call(dataset: DocumentDataset)
Performs an arbitrary operation on a dataset.
- Parameters:
dataset (DocumentDataset) – The dataset to operate on
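As a rough illustration of the pattern, the sketch below applies a modifier to one field of every record. It is plain Python and not the NeMo Curator implementation: `UppercaseModifier`, `apply_modifier`, and the `modify_document` method name are assumptions made for this example, and a list of dicts stands in for a DocumentDataset.

```python
class UppercaseModifier:
    """Hypothetical stand-in for a DocumentModifier subclass."""

    def modify_document(self, text: str) -> str:
        # Any text-to-text transformation could go here
        return text.upper()


def apply_modifier(records, modifier, text_field="text"):
    """Return new records with `text_field` rewritten by the modifier."""
    return [
        {**record, text_field: modifier.modify_document(record[text_field])}
        for record in records
    ]


records = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]
modified = apply_modifier(records, UppercaseModifier())
print(modified)
```

Only the chosen field is rewritten; other fields, and the input records themselves, are left untouched.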
Modifiers#
- class nemo_curator.modifiers.BoilerPlateStringModifier(remove_if_at_top_or_bottom: bool = True)#
Discards a sentence if it contains any of the boilerplate strings, such as “terms of use” or “privacy policy”. Source: adapted significantly from Google C4 processing.
- class nemo_curator.modifiers.FastTextLabelModifier(label: str)#
- class nemo_curator.modifiers.UnicodeReformatter(
- config: ftfy.TextFixerConfig | None = None,
- unescape_html: str | bool = 'auto',
- remove_terminal_escapes: bool = True,
- fix_encoding: bool = True,
- restore_byte_a0: bool = True,
- replace_lossy_sequences: bool = True,
- decode_inconsistent_utf8: bool = True,
- fix_c1_controls: bool = True,
- fix_latin_ligatures: bool = False,
- fix_character_width: bool = False,
- uncurl_quotes: bool = False,
- fix_line_breaks: bool = False,
- fix_surrogates: bool = True,
- remove_control_chars: bool = True,
- normalization: Literal['NFC', 'NFD', 'NFKC', 'NFKD'] | None = None,
- max_decode_length: int = 1000000,
- explain: bool = True,
)
- class nemo_curator.modifiers.PiiModifier(language: str = 'en', supported_entities: list[str] | None = None, anonymize_action: str = 'redact', batch_size: int = 2000, device: str = 'gpu', **kwargs)
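This class is the entry point to the PII de-identification module for documents stored as CSV, JSONL, or other formats. It works with the Modify functionality as shown below:

    import dask.dataframe
    import pandas as pd
    from nemo_curator import Modify
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers import PiiModifier

    # Build a small single-partition dataset with a 'text' column
    dataframe = pd.DataFrame({'text': ['Sarah and Ryan went out to play', 'Jensen is the CEO of NVIDIA']})
    dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
    dataset = DocumentDataset(dd)

    # Detect PERSON and EMAIL_ADDRESS entities and apply the 'replace' action
    modifier = PiiModifier(
        batch_size=2000,
        language='en',
        supported_entities=['PERSON', 'EMAIL_ADDRESS'],
        anonymize_action='replace',
    )

    # Wrap the modifier, run it over the dataset, and write JSONL output
    modify = Modify(modifier)
    modified_dataset = modify(dataset)
    modified_dataset.df.to_json('output_files/*.jsonl', lines=True, orient='records')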
- load_deidentifier() → PiiDeidentifier#
Helper function to load the de-identifier.
- class nemo_curator.modifiers.LineRemover(patterns: list[str])#
Removes lines from a document if the content of the line matches a given string.
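A minimal sketch of this behavior in plain Python, assuming the whole line must equal one of the patterns exactly and that lines are separated by `\n`. `remove_lines` is a hypothetical helper for illustration, not the library code.

```python
def remove_lines(text: str, patterns: list[str]) -> str:
    """Drop every line whose full content equals one of `patterns`."""
    blocked = set(patterns)
    kept = [line for line in text.split("\n") if line not in blocked]
    return "\n".join(kept)


doc = "Title\nAdvertisement\nBody text\nAdvertisement"
cleaned = remove_lines(doc, ["Advertisement"])
print(cleaned)
```

Lines that merely contain a pattern as a substring are kept under this assumption.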
- class nemo_curator.modifiers.MarkdownRemover#
Removes Markdown formatting in a document including bold, italic, underline, and URL text.
- class nemo_curator.modifiers.NewlineNormalizer#
Replaces 3 or more consecutive newline characters with only 2 newline characters.
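The rule can be sketched with a single regular expression. This is an illustration under the assumption that only `\n` counts as a newline character (Windows-style `\r\n` sequences are not handled); `normalize_newlines` is a hypothetical name.

```python
import re

# Three or more consecutive newline characters
_EXCESS_NEWLINES = re.compile(r"\n{3,}")


def normalize_newlines(text: str) -> str:
    """Collapse runs of 3+ newlines down to exactly 2."""
    return _EXCESS_NEWLINES.sub("\n\n", text)


sample = "para one\n\n\n\npara two\n\npara three\nline"
normalized = normalize_newlines(sample)
print(repr(normalized))
```

Runs of one or two newlines pass through unchanged, so ordinary line breaks and paragraph breaks are preserved.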
- class nemo_curator.modifiers.UrlRemover#
Removes all URLs in a document.
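One plausible way to do this is a regular-expression substitution. The pattern below is an assumption chosen for illustration (an `http` or `https` scheme followed by non-whitespace characters) and may differ from what the library matches; `remove_urls` is a hypothetical helper.

```python
import re

# Assumed pattern: http(s) scheme followed by any run of non-whitespace
URL_PATTERN = re.compile(r"https?://\S+", flags=re.IGNORECASE)


def remove_urls(text: str) -> str:
    """Delete every substring that matches URL_PATTERN."""
    return URL_PATTERN.sub("", text)


sample = "See https://example.com/docs for details."
stripped = remove_urls(sample)
print(repr(stripped))
```

Note that this sketch leaves the surrounding whitespace in place, so a removed URL in the middle of a sentence leaves two adjacent spaces.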
- class nemo_curator.modifiers.Slicer(left: int | str | None = 0, right: str | int | None = None, include_left: bool = True, include_right: bool = True, strip: bool = True)
Slices a document based on indices or strings.
- class nemo_curator.modifiers.QuotationRemover#
Removes quotations from a document following a few rules:
- If the document is shorter than 2 characters, it is returned unchanged.
- If the document starts and ends with a quotation mark and contains no newlines, the quotation marks are removed.
- If the document starts and ends with a quotation mark and contains newlines, the quotation marks are removed only if the first line does not end with a quotation mark.
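The rules above can be sketched as a small function. This is a literal reading of the description, not the library source: it assumes the quotation mark is the ASCII double quote and that "removed" means stripping the first and last character. `remove_quotes` is a hypothetical name.

```python
def remove_quotes(text: str, quote: str = '"') -> str:
    """Strip wrapping quotation marks according to the three rules."""
    # Rule 1: documents shorter than 2 characters are returned unchanged
    if len(text) < 2:
        return text
    # Only documents wrapped in quotation marks are candidates
    if not (text.startswith(quote) and text.endswith(quote)):
        return text
    # Rule 2: single-line document, strip the wrapping quotes
    if "\n" not in text:
        return text[1:-1]
    # Rule 3: multi-line document, strip only if the first line
    # does not itself end with a quotation mark
    first_line = text.split("\n", 1)[0]
    if first_line.endswith(quote):
        return text
    return text[1:-1]


print(remove_quotes('"hello"'))
print(remove_quotes('"first"\n"second"'))
```

The third rule keeps documents made of several separately quoted lines intact, since the opening and closing marks there do not belong to one enclosing quotation.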