Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Modifiers#

Base Class#

class nemo_curator.modifiers.DocumentModifier#
property backend: Literal['pandas', 'cudf', 'any']#

The dataframe backend the modifier operates on. Can be ‘pandas’, ‘cudf’, or ‘any’. Defaults to ‘pandas’.

Returns:

A string representing the dataframe backend the modifier needs as input

Return type:

str
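As a rough illustration of how a subclass might declare its backend, the sketch below uses a stand-in base class so that it runs without NeMo Curator installed. The `modify_document` method name and the `Lowercase` modifier are assumptions made for illustration; neither is documented in this section.

```python
from abc import ABC, abstractmethod
from typing import Literal


class DocumentModifier(ABC):
    """Stand-in for nemo_curator.modifiers.DocumentModifier (assumed shape)."""

    @abstractmethod
    def modify_document(self, text: str) -> str:
        """Return the modified text (the method name is an assumption)."""

    @property
    def backend(self) -> Literal["pandas", "cudf", "any"]:
        # Documented default backend.
        return "pandas"


class Lowercase(DocumentModifier):
    """Hypothetical modifier that does not depend on a dataframe backend."""

    def modify_document(self, text: str) -> str:
        return text.lower()

    @property
    def backend(self) -> Literal["pandas", "cudf", "any"]:
        # Override the default to signal that any backend is acceptable.
        return "any"


modifier = Lowercase()
print(modifier.backend)
print(modifier.modify_document("Hello"))
```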

Module#

class nemo_curator.Modify(
modifier: DocumentModifier,
text_field='text',
)#
call(
dataset: DocumentDataset,
) → DocumentDataset#

Performs an arbitrary operation on a dataset.

Parameters:

dataset (DocumentDataset) – The dataset to operate on
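Assuming `Modify` rewrites the column named by `text_field` with the modifier's output, the idea can be sketched on a plain list of records instead of a `DocumentDataset`. The `apply_modifier` helper and the record layout are hypothetical and only stand in for the library's dataframe logic.

```python
from typing import Callable, Dict, List


def apply_modifier(
    records: List[Dict[str, str]],
    modify_document: Callable[[str], str],
    text_field: str = "text",
) -> List[Dict[str, str]]:
    """Return new records with `text_field` rewritten by `modify_document`."""
    return [
        {**record, text_field: modify_document(record[text_field])}
        for record in records
    ]


records = [{"id": "a", "text": "  padded  "}, {"id": "b", "text": "clean"}]
modified = apply_modifier(records, str.strip)
print(modified)
```

Only the text field changes; other fields, and the input records themselves, are left as they were.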

Modifiers#

class nemo_curator.modifiers.BoilerPlateStringModifier(remove_if_at_top_or_bottom=True)#

Discards a sentence if it contains any of the boilerplate strings, such as “terms of use” or “privacy policy”. Source: adapted significantly from Google C4 processing.

class nemo_curator.modifiers.FastTextLabelModifier(label)#
class nemo_curator.modifiers.UnicodeReformatter(
config: ftfy.TextFixerConfig | None = None,
unescape_html: str | bool = 'auto',
remove_terminal_escapes: bool = True,
fix_encoding: bool = True,
restore_byte_a0: bool = True,
replace_lossy_sequences: bool = True,
decode_inconsistent_utf8: bool = True,
fix_c1_controls: bool = True,
fix_latin_ligatures: bool = False,
fix_character_width: bool = False,
uncurl_quotes: bool = False,
fix_line_breaks: bool = False,
fix_surrogates: bool = True,
remove_control_chars: bool = True,
normalization: Literal['NFC', 'NFD', 'NFKC', 'NFKD'] | None = None,
max_decode_length: int = 1000000,
explain: bool = True,
)#
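The `normalization` argument takes the name of a Unicode normalization form. As a small illustration of what those forms do, independent of `ftfy` and of `UnicodeReformatter` itself, the standard library's `unicodedata.normalize` can be used:

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by a combining acute accent
ligature = "\ufb01"     # LATIN SMALL LIGATURE FI

# NFC composes the pair into a single code point.
nfc = unicodedata.normalize("NFC", decomposed)
# NFD splits the precomposed character back into two code points.
nfd = unicodedata.normalize("NFD", "\u00e9")
# NFKC additionally applies compatibility mappings, expanding the ligature.
nfkc = unicodedata.normalize("NFKC", ligature)

print(len(decomposed), len(nfc), len(nfd), nfkc)
```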
class nemo_curator.modifiers.PiiModifier(
language: str = 'en',
supported_entities: List[str] | None = None,
anonymize_action: str = 'redact',
batch_size: int = 2000,
device: str = 'gpu',
**kwargs,
)#

This class is the entry point to using the PII de-identification module on documents stored as CSV, JSONL or other formats. It works with the Modify functionality as shown below:

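    # Import paths are assumed from the class names documented on this page.
    import dask.dataframe
    import pandas as pd

    from nemo_curator import Modify
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers import PiiModifier

    dataframe = pd.DataFrame({'text': ['Sarah and Ryan went out to play', 'Jensen is the CEO of NVIDIA']})
    dd = dask.dataframe.from_pandas(dataframe, npartitions=1)
    dataset = DocumentDataset(dd)

    modifier = PiiModifier(
        batch_size=2000,
        language='en',
        supported_entities=['PERSON', 'EMAIL_ADDRESS'],
        anonymize_action='replace',
    )

    modify = Modify(modifier)
    modified_dataset = modify(dataset)
    modified_dataset.df.to_json('output_files/*.jsonl', lines=True, orient='records')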

load_deidentifier()#

Helper function to load the de-identifier

class nemo_curator.modifiers.LineRemover(patterns: List[str])#

Removes lines from a document if the content of the line matches a given string.
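A minimal sketch of this behavior, assuming that "matches" means the whole line equals one of the given strings. The `remove_lines` function is a hypothetical stand-in, not the library's implementation.

```python
from typing import List


def remove_lines(text: str, patterns: List[str]) -> str:
    """Drop every line whose full content equals one of `patterns`."""
    blocked = set(patterns)
    return "\n".join(line for line in text.split("\n") if line not in blocked)


doc = "Header\nkeep this\nAdvertisement\nand this"
print(remove_lines(doc, ["Header", "Advertisement"]))
```

Under this assumption, a line that merely contains a pattern as a substring is kept.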

class nemo_curator.modifiers.MarkdownRemover#

Removes Markdown formatting in a document including bold, italic, underline, and URL text.

class nemo_curator.modifiers.NewlineNormalizer#

Replaces 3 or more consecutive newline characters with only 2 newline characters.
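The rule can be sketched with a single regular expression. This is an illustrative re-implementation that handles only `\n`; whether the library also treats `\r\n` sequences is not stated here.

```python
import re

# Three or more consecutive newline characters.
THREE_OR_MORE_NEWLINES = re.compile(r"\n{3,}")


def normalize_newlines(text: str) -> str:
    """Collapse runs of 3+ newlines into exactly 2."""
    return THREE_OR_MORE_NEWLINES.sub("\n\n", text)


print(repr(normalize_newlines("first\n\n\n\nsecond\n\nthird")))
```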

class nemo_curator.modifiers.UrlRemover#

Removes all URLs in a document.
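As a rough sketch, URL removal can be approximated with a regular expression. The pattern below is an assumption chosen for illustration and may not match the one the library uses; note that it also swallows punctuation attached directly to a URL.

```python
import re

# Assumed pattern: http(s) links and bare "www." links, up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)


def remove_urls(text: str) -> str:
    """Delete every substring that matches URL_PATTERN."""
    return URL_PATTERN.sub("", text)


print(repr(remove_urls("See https://example.com/docs for details")))
```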

class nemo_curator.modifiers.Slicer(
left: int | str | None = 0,
right: int | str | None = None,
include_left: bool = True,
include_right: bool = True,
strip: bool = True,
)#

Slices a document based on indices or strings.
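The sketch below shows one plausible reading of these parameters. Several choices are assumptions rather than documented behavior: a string `left` is located at its first occurrence, a string `right` at its last occurrence, a missing marker yields an empty string, and `include_left`/`include_right` only affect string bounds. The `slice_document` function is hypothetical.

```python
from typing import Optional, Union

Bound = Optional[Union[int, str]]


def slice_document(
    text: str,
    left: Bound = 0,
    right: Bound = None,
    include_left: bool = True,
    include_right: bool = True,
    strip: bool = True,
) -> str:
    # Resolve the left bound to a character index.
    if isinstance(left, str):
        found = text.find(left)  # first occurrence (assumption)
        if found == -1:
            return ""  # assumption: a missing marker yields an empty document
        start = found if include_left else found + len(left)
    else:
        start = 0 if left is None else left

    # Resolve the right bound to a character index.
    if isinstance(right, str):
        found = text.rfind(right)  # last occurrence (assumption)
        if found == -1:
            return ""
        end = found + len(right) if include_right else found
    else:
        end = len(text) if right is None else right

    result = text[start:end]
    return result.strip() if strip else result


text = "intro <start> body text <end> outro"
print(slice_document(text, "<start>", "<end>"))
print(slice_document(text, "<start>", "<end>", include_left=False, include_right=False))
```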

class nemo_curator.modifiers.QuotationRemover#

Removes quotations from a document following a few rules:

  • If the document is less than 2 characters, it is returned unchanged.

  • If the document starts and ends with a quotation mark and there are no newlines in the document, the quotation marks are removed.

  • If the document starts and ends with a quotation mark and there are newlines in the document, the quotation marks are removed only if the first line does not end with a quotation mark.
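The rules for `QuotationRemover` can be sketched as a plain function. This is an illustrative re-implementation, assuming the quotation mark is the straight double quote `"`; the `remove_quotation` name is hypothetical.

```python
def remove_quotation(text: str) -> str:
    # Rule 1: very short documents are returned unchanged.
    if len(text) < 2:
        return text
    # Only documents wrapped in quotation marks are candidates.
    if not (text.startswith('"') and text.endswith('"')):
        return text
    # Rule 2: single-line documents lose the wrapping quotation marks.
    if "\n" not in text:
        return text[1:-1]
    # Rule 3: multi-line documents lose them only if the first line
    # does not itself end with a quotation mark.
    first_line = text.split("\n", 1)[0]
    if not first_line.endswith('"'):
        return text[1:-1]
    return text


print(remove_quotation('"a quoted sentence"'))
print(repr(remove_quotation('"first"\n"second"')))
```

The third rule keeps documents that are a sequence of separately quoted lines intact, since stripping the outer characters there would unbalance the quotes.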