The filter_documents utility is a generic document filtering tool used frequently throughout the example scripts provided in the examples directory. It allows users to apply any of the 30+ filters available within the NeMo Data Curator, as well as their own custom filters, to the documents within their corpora. A key requirement for using this utility is a filter configuration file, which must be a YAML file with the following format:
filter_module: ndc.filter.path.to.Filter
params:
  filter_params: <a list/dictionary of parameters that parameterize the filter>
where filter_module is the path to the implementation of the filter in Python dot notation, and filter_params is either a list or dictionary of parameters used to parameterize the operation of the filter. An example configuration file for computing language identification scores is provided in config/fasttext_langid.yaml:
filter_module: ndc.filter.classifier.filter.FastTextLangId
params:
  model_path: "./workspace/dat/lid.176.bin"
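Although the implementation of filter_documents is not shown here, a filter_module path in dot notation can be resolved with a standard importlib lookup. The sketch below is an assumption about how such a path might be resolved, and uses json.decoder.JSONDecoder as a stand-in class since the ndc package may not be installed:

```python
import importlib

def load_filter_class(dotted_path):
    # Split "package.module.ClassName" into its module path and class name
    module_path, _, class_name = dotted_path.rpartition('.')
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# A real ndc path would look like: ndc.filter.classifier.filter.FastTextLangId
cls = load_filter_class('json.decoder.JSONDecoder')
print(cls.__name__)
```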
As specified by the above path, the implementation of the filter can be found in ndc/filter/classifier/filter.py:
import fasttext

from ndc.filter.doc_filter import DocumentFilter


class FastTextLangId(DocumentFilter):

    def __init__(self, model_path=None, min_langid_score=0.3):
        self._model = None
        if model_path is not None:
            self._model = fasttext.load_model(model_path)
        else:
            raise ValueError("Must provide a valid path to a FastText model "
                             "to identify languages with this filter")
        self._lang_code = None
        self._cutoff = min_langid_score
        self._name = ['langid_score', 'language']

    def score_document(self, text):
        # Collapse newlines so fastText receives a single line of input
        pp = text.strip().replace('\n', ' ')
        label, score = self._model.predict(pp, k=1)
        score = score[0]
        # fastText labels look like '__label__en'; keep the two-letter code
        lang_code = label[0][-2:].upper()
        return [score, lang_code]

    def remove_document(self, score):
        return score[0] < self._cutoff
The above filter is derived from the abstract base class DocumentFilter (defined in ndc/filter/doc_filter.py), which requires implementations of a score_document function and a remove_document function. The score_document function computes a score for an input document, and the remove_document function compares that score against a user-defined threshold and indicates whether the document should be discarded or kept. When using the filter_documents utility, users can specify the --log-scores argument, which causes the computed document scores to be written in place as additional JSON fields of each JSON document. The following command illustrates how this can be done for the above language identification filter:
filter_documents \
  --input-data-dir=<Path to input dir of jsonl files> \
  --filter-config-file=./config/fasttext_langid.yaml \
  --log-scores \
  --log-dir=./workspace/log/lang_id
Using this command, wrapped in the appropriate srun call specified in the README, will apply the language identification filter in parallel to each document contained within the sharded .jsonl files that make up a corpus. For each document, this particular filter computes two scores (the language identification score and the language code), and because the --log-scores argument was specified, both scores update each document's metadata in place under the keys 'langid_score' and 'language' (the list of names assigned to self._name). As an example, the following document received a language code of 'AR' and a score of 0.9985690116882324 after the application of the above filter, so its JSON keys were updated as shown below.
{"text": "إذا كنت نسيت اسم العضو أو كلمة المرور , يمكنك طلب اسم العضو إلى بريدك الإلكتروني وإستعادة كلمة المرور. عندما تكتب عنوان بريدك الإلكتروني المسجل, سترسل لك التعليمات حول كيفية إستعادة كلمة المرور، مع اسم العضو.\n\nعنوان البريد الإلكتروني:\n\nجميع الأوقات بتوقيت GMT +3. الساعة الآن 05:11 PM.\n\nالإدارة غير مسئولة عن أي اتفاق تجاري أو تعاوني بين الأعضاء\nفعلى كل شخص تحمل المسئولية تجاه مايقوم به من بيع أو شراء أو اتفاق أو اعطاء معلومات",
"id": "45258759-385a-4de2-8df0-5647863ab004", "source_id": "crawl-data-CC-MAIN-2021-04-segments-1610703495901.0-warc-CC-MAIN-20210115134101-20210115164101-00011.warc.gz",
"url": "http://2zoo.com/vb/login.php?s=4d1d55bc8b32888d8ffa6ca674e4e465&do=lostpw", "language": "AR", "langid_score": 0.9985690116882324}
To further use the filter_documents utility to separate low- and high-quality documents within a corpus, users can additionally specify the --output-retained-document-dir and --output-removed-document-dir arguments. When these arguments are specified, the remove_document function is applied to each document, and based on the value it returns, the document is retained in or removed from the corpus. For the filter defined above, documents with a langid_score of less than 0.3 will be written to the directory specified by the --output-removed-document-dir argument. Should users desire to keep only the high-quality documents, they need only provide the --output-retained-document-dir argument.
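The retain/remove decision can be pictured as routing each document according to remove_document's boolean result. The following is a hypothetical sketch with a toy stand-in filter; the real utility also handles sharded file I/O, which is omitted here:

```python
class ToyScoreFilter:
    """Stand-in filter mirroring the score/threshold pattern described above."""

    def __init__(self, cutoff=0.3):
        self._cutoff = cutoff

    def score_document(self, text):
        # Toy score: fraction of alphabetic characters, purely illustrative
        return sum(c.isalpha() for c in text) / max(len(text), 1)

    def remove_document(self, score):
        return score < self._cutoff

def split_corpus(documents, doc_filter):
    # Route each document to the retained or removed set
    retained, removed = [], []
    for doc in documents:
        score = doc_filter.score_document(doc['text'])
        (removed if doc_filter.remove_document(score) else retained).append(doc)
    return retained, removed

docs = [{'text': 'mostly letters here'}, {'text': '1234 5678 !!!'}]
retained, removed = split_corpus(docs, ToyScoreFilter(cutoff=0.3))
print(len(retained), len(removed))  # 1 1
```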
Finally, by specifying the --output-document-score-dir argument, the filter_documents utility will write the computed score for each document to text files. This allows users to easily gather statistics and plot score distributions for further analysis of their data.
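Because score_document and remove_document are the only methods a filter must implement, writing a custom filter is straightforward. Below is a minimal hypothetical example; DocumentFilter is stubbed in, since the real base class (in ndc/filter/doc_filter.py) may not be installed:

```python
class DocumentFilter:
    # Stub of the abstract base class described above
    def score_document(self, text):
        raise NotImplementedError

    def remove_document(self, score):
        raise NotImplementedError


class WordCountFilter(DocumentFilter):
    """Hypothetical custom filter: discard documents shorter than min_words."""

    def __init__(self, min_words=50):
        self._cutoff = min_words
        self._name = 'word_count'

    def score_document(self, text):
        return len(text.split())

    def remove_document(self, score):
        return score < self._cutoff


f = WordCountFilter(min_words=3)
score = f.score_document("only two words here")
print(score, f.remove_document(score))  # 4 False
```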