nemo_curator.stages.text.filters.heuristic.code.code
Module Contents
Classes
API
Bases: DocumentFilter
This filter tries to identify code files that contain large tensors or tables stored as raw text. (Source: StarCoder https://arxiv.org/abs/2305.06161)
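One common way to detect raw numeric data in a code file is to measure how much of the file consists of letters. The sketch below illustrates that idea only. The function names, the ASCII-letter regex, and the 0.25 threshold are assumptions made for illustration and are not taken from this module.

```python
import re


def alpha_ratio(source: str) -> float:
    """Return the fraction of characters in `source` that are ASCII letters."""
    if not source:
        return 0.0
    return len(re.findall(r"[a-zA-Z]", source)) / len(source)


def looks_like_raw_data(source: str, min_alpha_ratio: float = 0.25) -> bool:
    """Flag a file whose letter ratio is below the (assumed) threshold."""
    return alpha_ratio(source) < min_alpha_ratio


# Ordinary code is mostly letters; a dumped table is mostly digits and commas.
code = "def add(a, b):\n    return a + b\n"
table = "0.12, 0.98, 1.44\n3.10, 2.71, 9.99\n" * 10

print(looks_like_raw_data(code))
print(looks_like_raw_data(table))
```

A single ratio is cheap to compute, but it can misfire on short files, so a real pipeline would likely combine it with other signals.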
Bases: DocumentFilter
This filter applies specific conditions that depend on the file extension.
Extract the filter parameters from a CSV row.
Convert the language field in the dataset to the language field used in the CSV file that defines the filters.
Load the CSV file that specifies the filter to apply for each (lang, extension) pair.
Filter files based on line length and the percentage of alphanumeric characters.
The filtering parameters depend on the file extension, given by ext_to_filter.
Bases: DocumentFilter
This filter tries to identify files that have incorrect file extensions. In many cases, these turn out to be XML files, which we try to identify from their header. (Source: StarCoder https://arxiv.org/abs/2305.06161)
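A header-based check of this kind can be sketched by searching the first few characters of a file for an XML declaration. The function name, the declaration string, and the 100-character prefix are assumptions chosen for illustration, not values confirmed by this page.

```python
def has_xml_header(source: str, prefix_length: int = 100) -> bool:
    """Return True if an XML declaration appears near the start of the file."""
    return "<?xml version=" in source[:prefix_length]


# A file saved with a code extension that is really XML.
mislabeled = '<?xml version="1.0" encoding="UTF-8"?>\n<project></project>\n'
genuine = "def main():\n    pass\n"

print(has_xml_header(mislabeled))
print(has_xml_header(genuine))
```

Limiting the search to a short prefix avoids flagging source files that merely mention an XML declaration deep inside a string or comment.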