nemo_curator.stages.text.filters.heuristic.code.code

Module Contents

Classes

| Name | Description |
| --- | --- |
| AlphaFilter | This filter tries to identify files that have large tensors or tables stored as raw text within code files. |
| GeneralCommentToCodeFilter | - |
| HTMLBoilerplateFilter | This filter tries to identify HTML that is largely boilerplate. |
| NumberOfLinesOfCodeFilter | - |
| PerExtensionFilter | This filter has specific conditions depending on the file extension. |
| PythonCommentToCodeFilter | - |
| TokenizerFertilityFilter | - |
| XMLHeaderFilter | This filter tries to identify files that have incorrect file extensions. |

API

class nemo_curator.stages.text.filters.heuristic.code.code.AlphaFilter(
min_alpha_ratio: float = 0.25
)

Bases: DocumentFilter

This filter tries to identify files that have large tensors or tables stored as raw text within code files. (Source: StarCoder, https://arxiv.org/abs/2305.06161)

_name = 'alpha_filter'
nemo_curator.stages.text.filters.heuristic.code.code.AlphaFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.code.code.AlphaFilter.score_document(
source: str
) -> float
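
As a rough illustration of the idea behind this filter, the sketch below scores a source string by its fraction of alphabetic characters and compares that score to a threshold. The function names `alpha_ratio` and `keep` are hypothetical, and both the scoring formula and the direction of the comparison are assumptions; only the default of 0.25 comes from the signature above.

```python
def alpha_ratio(source: str) -> float:
    """Fraction of characters that are alphabetic (assumed definition of the score)."""
    if not source:
        return 0.0
    return sum(ch.isalpha() for ch in source) / len(source)


def keep(score: float, min_alpha_ratio: float = 0.25) -> bool:
    """Keep documents whose alphabetic ratio reaches the threshold (assumed comparison)."""
    return score >= min_alpha_ratio


code_like = "def add(a, b):\n    return a + b\n"
tensor_like = "[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]"

print(keep(alpha_ratio(code_like)))    # ordinary code has plenty of letters
print(keep(alpha_ratio(tensor_like)))  # a raw numeric table has none
```

A file that is mostly digits and punctuation scores low, which is what makes the ratio a cheap signal for embedded tensors or tables.
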
class nemo_curator.stages.text.filters.heuristic.code.code.GeneralCommentToCodeFilter(
language: str,
min_comment_to_code_ratio: float = 0.01,
max_comment_to_code_ratio: float = 0.85
)

Bases: DocumentFilter

_name = 'comment_ratio'
nemo_curator.stages.text.filters.heuristic.code.code.GeneralCommentToCodeFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.code.code.GeneralCommentToCodeFilter.score_document(
source: str
) -> float
class nemo_curator.stages.text.filters.heuristic.code.code.HTMLBoilerplateFilter(
min_lang_content_ratio: float = 0.2,
min_lang_content_num_chars: int = 100
)

Bases: DocumentFilter

This filter tries to identify HTML that is largely boilerplate.

_name = 'html_boilerplate'
nemo_curator.stages.text.filters.heuristic.code.code.HTMLBoilerplateFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.code.code.HTMLBoilerplateFilter.score_document(
source: str
) -> float | None
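
One way to picture a boilerplate check is to compare the amount of visible text with the total size of the HTML. The sketch below uses only the standard library; `VisibleText`, `content_ratio`, and `keep` are hypothetical names, and the way the two thresholds are combined is an assumption rather than the behavior of `HTMLBoilerplateFilter` itself.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def content_ratio(source: str, min_lang_content_num_chars: int = 100) -> float:
    """Visible-text length divided by total length; 0.0 when too little text (assumed rule)."""
    parser = VisibleText()
    parser.feed(source)
    parser.close()
    text = "".join(parser.parts)
    if not source or len(text) < min_lang_content_num_chars:
        return 0.0
    return len(text) / len(source)


def keep(score: float, min_lang_content_ratio: float = 0.2) -> bool:
    """Keep pages whose visible-text ratio reaches the threshold (assumed comparison)."""
    return score >= min_lang_content_ratio


boilerplate = "<html><head><script>" + "var x = 1;" * 50 + "</script></head><body><p>Hi</p></body></html>"
rich = "<html><body><p>" + "word " * 100 + "</p></body></html>"
print(keep(content_ratio(boilerplate)), keep(content_ratio(rich)))
```
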
class nemo_curator.stages.text.filters.heuristic.code.code.NumberOfLinesOfCodeFilter(
min_lines: int = 10,
max_lines: int = 20000
)

Bases: DocumentFilter

_name = 'num_lines'
nemo_curator.stages.text.filters.heuristic.code.code.NumberOfLinesOfCodeFilter.keep_document(
score: int
) -> bool
nemo_curator.stages.text.filters.heuristic.code.code.NumberOfLinesOfCodeFilter.score_document(
source: str
) -> int
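
The signature suggests a line-count score checked against a lower and an upper bound. A minimal sketch under that assumption follows; `count_lines` and `keep` are hypothetical names, and the inclusive bounds are a guess, not confirmed behavior of `NumberOfLinesOfCodeFilter`.

```python
def count_lines(source: str) -> int:
    """Number of newline-separated lines (assumed definition of the score)."""
    return source.count("\n") + 1


def keep(score: int, min_lines: int = 10, max_lines: int = 20000) -> bool:
    """Keep documents whose line count lies within the bounds (assumed inclusive)."""
    return min_lines <= score <= max_lines


snippet = "a = 1\nb = 2\nc = 3"
print(count_lines(snippet), keep(count_lines(snippet)))
```
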
class nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter(
lang: str,
extension: str,
metadata_file: str = 'code_meta.csv'
)

Bases: DocumentFilter

This filter has specific conditions depending on the file extension.

_ext_to_filter = self._load_filter_csv(metadata_file, lang)
_name = 'per_extension_filter'
nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter._alphanum_fraction(
source: str
) -> float
nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter._get_filter_params(
row: dict
) -> tuple[bool, int | None, float | None, float | None, float | None]

Extract filter parameters from a CSV row.

nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter._language_format_from_dataset(
lang: str
) -> str

Convert the language field in the dataset to the language field used in the CSV file that defines the filters.

nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter._line_statistics(
source: str
) -> tuple[int, float]
nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter._load_filter_csv(
path: str,
language: str | None = None
) -> dict

Load the CSV file that specifies the filter to apply for each (lang, extension) pair.

nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter.keep_document(
score: float | None
) -> bool
nemo_curator.stages.text.filters.heuristic.code.code.PerExtensionFilter.score_document(
source: str
) -> float

Filter files based on line length and the percentage of alphanumeric characters. The filtering parameters depend on the file extension and are given by _ext_to_filter.
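
The statistics mentioned above can be sketched in a few lines. This is only an illustration: the interpretation of the `tuple[int, float]` returned by `_line_statistics` as (longest line, mean line length) is an assumption, `passes` and its three threshold parameters are hypothetical, and in the real filter the thresholds come from the metadata CSV rather than from function arguments.

```python
def line_statistics(source: str) -> tuple[int, float]:
    """Return (longest line length, mean line length); assumed meaning of the tuple."""
    lengths = [len(line) for line in source.splitlines()]
    if not lengths:
        return 0, 0.0
    return max(lengths), sum(lengths) / len(lengths)


def alphanum_fraction(source: str) -> float:
    """Fraction of characters that are letters or digits."""
    if not source:
        return 0.0
    return sum(ch.isalnum() for ch in source) / len(source)


def passes(source: str, max_line_len: int, max_mean_len: float, min_alphanum: float) -> bool:
    """Hypothetical combined check; real thresholds vary per (lang, extension)."""
    longest, mean = line_statistics(source)
    return (
        longest <= max_line_len
        and mean <= max_mean_len
        and alphanum_fraction(source) >= min_alphanum
    )


print(line_statistics("ab\ncdef"))
print(passes("x = 1\ny = 2\n", max_line_len=100, max_mean_len=50, min_alphanum=0.25))
```
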

class nemo_curator.stages.text.filters.heuristic.code.code.PythonCommentToCodeFilter(
min_comment_to_code_ratio: float = 0.01,
max_comment_to_code_ratio: float = 0.85
)

Bases: DocumentFilter

_name = 'python_comment_ratio'
nemo_curator.stages.text.filters.heuristic.code.code.PythonCommentToCodeFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.code.code.PythonCommentToCodeFilter.score_document(
source: str
) -> float
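
To give a feel for a comment-to-code ratio in Python sources, the sketch below counts the characters in `#` comments with the standard `tokenize` module and divides by the total source length. The names `comment_ratio` and `keep` are hypothetical, docstrings are ignored here for brevity, and the actual score computed by `PythonCommentToCodeFilter` may be defined differently.

```python
import io
import tokenize


def comment_ratio(source: str) -> float:
    """Characters inside '#' comments divided by total characters (assumed definition)."""
    if not source:
        return 0.0
    comment_chars = 0
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                comment_chars += len(tok.string)
    except (tokenize.TokenError, SyntaxError):
        # Malformed source: fall back to the comments seen before the error.
        pass
    return comment_chars / len(source)


def keep(score: float, min_ratio: float = 0.01, max_ratio: float = 0.85) -> bool:
    """Keep documents whose ratio lies within the bounds (assumed inclusive)."""
    return min_ratio <= score <= max_ratio


documented = "# add two numbers\ndef add(a, b):\n    return a + b\n"
print(comment_ratio(documented), keep(comment_ratio(documented)))
```

Bounding the ratio on both sides rejects files with no comments at all as well as files that are almost entirely commentary.
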
class nemo_curator.stages.text.filters.heuristic.code.code.TokenizerFertilityFilter(
path_to_tokenizer: str | None = None,
min_char_to_token_ratio: float = 2.5
)

Bases: DocumentFilter

_name = 'tokenizer_fertility'
_tokenizer = sentencepiece.SentencePieceProcessor()
nemo_curator.stages.text.filters.heuristic.code.code.TokenizerFertilityFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.code.code.TokenizerFertilityFilter.score_document(
source: str
) -> float
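
The parameter `min_char_to_token_ratio` suggests the score is the number of characters per token. The sketch below shows that calculation with a stand-in tokenizer, since a SentencePiece model file is not available here; `toy_tokenize`, `char_to_token_ratio`, and `keep` are hypothetical names, and the real filter tokenizes with the SentencePiece model loaded from `path_to_tokenizer`.

```python
import re
from typing import Callable, List


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (NOT SentencePiece): words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_to_token_ratio(source: str, tokenize: Callable[[str], List[str]]) -> float:
    """Characters per token; 0.0 when nothing was tokenized (assumed edge-case handling)."""
    tokens = tokenize(source)
    if not tokens:
        return 0.0
    return len(source) / len(tokens)


def keep(score: float, min_char_to_token_ratio: float = 2.5) -> bool:
    """Keep documents that the tokenizer encodes compactly enough (assumed comparison)."""
    return score >= min_char_to_token_ratio


dense = "foo(bar)"
readable = "configuration_value = compute_total"
print(char_to_token_ratio(dense, toy_tokenize), char_to_token_ratio(readable, toy_tokenize))
```

A low ratio means the tokenizer needs many tokens for few characters, which tends to indicate content the tokenizer handles poorly.
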
class nemo_curator.stages.text.filters.heuristic.code.code.XMLHeaderFilter(
char_prefix_search_length: int = 100
)

Bases: DocumentFilter

This filter tries to identify files that have incorrect file extensions. In many cases, these turn out to be XML files, which the filter tries to identify from the header. (Source: StarCoder, https://arxiv.org/abs/2305.06161)

_name = 'xml_header'
nemo_curator.stages.text.filters.heuristic.code.code.XMLHeaderFilter.keep_document(
score: float
) -> bool
nemo_curator.stages.text.filters.heuristic.code.code.XMLHeaderFilter.score_document(
source: str
) -> float
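
The header check described above can be sketched as a search for an XML declaration within the first `char_prefix_search_length` characters. The marker string `<?xml version=`, the 1.0/0.0 score encoding, and the names `xml_header_score` and `keep` are assumptions made for illustration; the library's implementation may differ.

```python
def xml_header_score(source: str, char_prefix_search_length: int = 100) -> float:
    """1.0 if an XML declaration appears in the prefix, else 0.0 (assumed encoding)."""
    prefix = source[:char_prefix_search_length]
    return 1.0 if "<?xml version=" in prefix else 0.0


def keep(score: float) -> bool:
    """Keep documents that do not look like XML (assumed rule)."""
    return score != 1.0


mislabeled = '<?xml version="1.0" encoding="UTF-8"?>\n<project></project>\n'
genuine = "public class Main {\n    public static void main(String[] args) {}\n}\n"
print(keep(xml_header_score(mislabeled)), keep(xml_header_score(genuine)))
```
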