nemo_curator.stages.text.modifiers.string.c4

Module Contents

Classes

| Name | Description |
| --- | --- |
| BoilerPlateStringModifier | If the sentence contains any of the boilerplate strings then discard. |

API

class nemo_curator.stages.text.modifiers.string.c4.BoilerPlateStringModifier(
remove_if_at_top_or_bottom: bool = True
)

Bases: DocumentModifier

Discards a sentence if it contains any of the boilerplate strings, such as “terms of use” or “privacy policy”. Source: adapted significantly from Google's C4 processing.

_boilerplate_paragraph_indices = []

_name = 'boilerplate_string_ratio'

nemo_curator.stages.text.modifiers.string.c4.BoilerPlateStringModifier.modify_document(
text: str
) -> str
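
To make the idea of string-based boilerplate removal concrete, here is a minimal standalone sketch. It is not the NeMo Curator implementation: `ToyBoilerplateModifier` and `BOILERPLATE_PHRASES` are hypothetical names, the phrase list contains only the two examples quoted in the docstring, the blank-line paragraph split and case-insensitive substring match are assumptions, and the `remove_if_at_top_or_bottom` option is not modeled at all. The only thing borrowed from the documented class is the `modify_document(text: str) -> str` shape.

```python
# Toy stand-in for illustration only; NOT the library's implementation.
# The phrase list is an assumption built from the two docstring examples.
BOILERPLATE_PHRASES = ("terms of use", "privacy policy")


class ToyBoilerplateModifier:
    """Hypothetical class mirroring the modify_document(text) -> str interface."""

    def modify_document(self, text: str) -> str:
        # Assumption: paragraphs are separated by a blank line.
        paragraphs = text.split("\n\n")
        # Keep only paragraphs with no boilerplate phrase (case-insensitive match).
        kept = [
            p
            for p in paragraphs
            if not any(phrase in p.lower() for phrase in BOILERPLATE_PHRASES)
        ]
        return "\n\n".join(kept)


doc = (
    "Welcome to the site.\n\n"
    "Read our Privacy Policy and Terms of Use.\n\n"
    "Actual article text."
)
cleaned = ToyBoilerplateModifier().modify_document(doc)
print(cleaned)
```

The documented class is called the same way, with a string in and a string out, but its matching rules and its handling of boilerplate at the top or bottom of a document may differ from this sketch.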