Document filtering

The filter_documents utility is a generic document filtering tool used throughout the example scripts in the examples directory. It lets users apply any of the 30+ filters available within the NeMo Data Curator, as well as their own custom filters, to the documents in their corpora. The utility requires a filter configuration file, which must be a YAML file in the following format:

    filter_module: ndc.filter.path.to.Filter
    params:
      filter_params: < a list/dictionary of parameters that parameterize the filter >

where filter_module is the path to the filter implementation in Python dot notation, and filter_params is either a list or a dictionary of parameters that configure the filter. An example configuration file for computing language identification scores is provided in config/fasttext_langid.yaml:

    filter_module: ndc.filter.classifier.filter.FastTextLangId
    params:
      model_path: "./workspace/dat/lid.176.bin"
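To illustrate how a dotted filter_module path and its params could map onto a Python object, here is a minimal sketch using importlib. It is not the NeMo Data Curator implementation: the helper name build_filter is hypothetical, the configuration is assumed to be already parsed into a dictionary, and a standard-library class stands in for a real filter so that the snippet runs without the ndc package installed.

```python
import importlib


def build_filter(config):
    """Instantiate the class named by config['filter_module'].

    Hypothetical helper for illustration only; the real utility may
    resolve and construct filters differently.
    """
    # Split "package.module.ClassName" into module path and class name.
    module_path, class_name = config["filter_module"].rsplit(".", 1)
    module = importlib.import_module(module_path)
    filter_cls = getattr(module, class_name)

    params = config.get("params") or {}
    # A dictionary is passed as keyword arguments, a list positionally.
    if isinstance(params, dict):
        return filter_cls(**params)
    return filter_cls(*params)


# Stand-in config: fractions.Fraction plays the role of a filter class.
example_config = {
    "filter_module": "fractions.Fraction",
    "params": {"numerator": 3, "denominator": 4},
}
obj = build_filter(example_config)
print(type(obj).__name__, obj)
```

With a real configuration, the same split would separate ndc.filter.classifier.filter (the module) from FastTextLangId (the class), and model_path would be passed to the constructor as a keyword argument.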

As the filter_module path indicates, the implementation of this filter can be found in the file ndc/filter/classifier/filter.py:

    class FastTextLangId(DocumentFilter):

        def __init__(self, model_path=None, min_langid_score=0.3):
            self._model = None
            if model_path is not None:
                self._model = fasttext.load_model(model_path)
            else:
                raise ValueError("Must provide a valid path to a FastText model "
                                 "to identify languages with this filter")
            self._lang_code = None
            self._cutoff = min_langid_score
            self._name = ['langid_score', 'language']

        def score_document(self, text):
            pp = text.strip().replace('\n', ' ')
            label, score = self._model.predict(pp, k=1)
            score = score[0]
            lang_code = label[0][-2:].upper()
            return [score, lang_code]

        def remove_document(self, score):
            return score[0] < self._cutoff

The FastTextLangId filter is derived from the abstract base class DocumentFilter (defined in ndc/filter/doc_filter.py), which requires subclasses to implement a score_document function and a remove_document function. The score_document function computes a score for an input document, and the remove_document function compares that score with a user-defined threshold to indicate whether the document should be discarded or kept.

When using the filter_documents utility, users can specify the --log-scores argument, which causes the computed document scores to be written in place to additional JSON fields of each JSON document. The following command illustrates how this could be done for the above language identification filter:

    filter_documents \
      --input-data-dir=< Path to input dir of jsonl files> \
      --filter-config-file=./config/fasttext_langid.yaml \
      --log-scores \
      --log-dir=./workspace/log/lang_id

Running this command, wrapped by the appropriate srun call specified in the README, applies the language identification filter in parallel to each document in the sharded .jsonl files that make up a corpus. For each document, this filter computes two scores (the language identification score and the language code). Because the --log-scores argument was specified, both scores are written in place to each document's metadata under the keys 'langid_score' and 'language' (the list of names assigned to self._name). As an example, the following document received a language code of 'AR' and a score of 0.9985690116882324 from the above filter, so its JSON keys were updated as shown below.

{"text": "إذا كنت نسيت اسم العضو أو كلمة المرور , يمكنك طلب اسم العضو إلى بريدك الإلكتروني وإستعادة كلمة المرور. عندما تكتب عنوان بريدك الإلكتروني المسجل, سترسل لك التعليمات حول كيفية إستعادة كلمة المرور، مع اسم العضو.\n\nعنوان البريد الإلكتروني:\n\nجميع الأوقات بتوقيت GMT +3. الساعة الآن 05:11 PM.\n\nالإدارة غير مسئولة عن أي اتفاق تجاري أو تعاوني بين الأعضاء\nفعلى كل شخص تحمل المسئولية تجاه مايقوم به من بيع أو شراء أو اتفاق أو اعطاء معلومات", "id": "45258759-385a-4de2-8df0-5647863ab004", "source_id": "crawl-data-CC-MAIN-2021-04-segments-1610703495901.0-warc-CC-MAIN-20210115134101-20210115164101-00011.warc.gz", "url": "http://2zoo.com/vb/login.php?s=4d1d55bc8b32888d8ffa6ca674e4e465&do=lostpw", "language": "AR", "langid_score": 0.9985690116882324}
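The in-place update described above can be sketched in a few lines. This is an assumption about the mechanism rather than the utility's actual code: the helper log_scores is hypothetical, and the document's text field is abbreviated to a placeholder.

```python
import json


def log_scores(document, names, scores):
    """Return a copy of document with each score stored under its name.

    Hypothetical helper: pairs the filter's score names with the values
    returned by score_document, in order.
    """
    updated = dict(document)
    updated.update(zip(names, scores))
    return updated


# Names as assigned to self._name in FastTextLangId.
names = ["langid_score", "language"]
# Values in the order score_document returns them: [score, lang_code].
scores = [0.9985690116882324, "AR"]

# Placeholder text; the real document carries its full content here.
doc = {"text": "...", "id": "45258759-385a-4de2-8df0-5647863ab004"}
line = json.dumps(log_scores(doc, names, scores), ensure_ascii=False)
print(line)
```

Passing ensure_ascii=False keeps non-ASCII text such as the Arabic content above readable in the output rather than escaping it.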

To use the filter_documents utility to separate low-quality and high-quality documents in a corpus, users can additionally specify the --output-retained-document-dir and --output-removed-document-dir arguments. When these arguments are specified, the remove_document function is applied to each document, and the document is retained in or removed from the corpus based on the returned value. For the filter defined above, documents with a langid_score below 0.3 will be written to the directory specified by the --output-removed-document-dir argument. Users who only want to keep the high-quality documents need only specify the --output-retained-document-dir argument.
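The retain-or-remove decision can be sketched with a toy filter that follows the same score_document and remove_document contract. MinLengthFilter and partition are hypothetical names introduced for illustration; they are not part of NeMo Data Curator, and the real utility writes documents to the output directories rather than returning lists.

```python
class MinLengthFilter:
    """Toy stand-in following the score_document / remove_document contract.

    Not part of NeMo Data Curator; it scores a document by word count.
    """

    def __init__(self, min_words=3):
        self._cutoff = min_words

    def score_document(self, text):
        return len(text.split())

    def remove_document(self, score):
        # True means the document should be discarded.
        return score < self._cutoff


def partition(documents, doc_filter):
    """Split documents into (retained, removed) lists."""
    retained, removed = [], []
    for doc in documents:
        score = doc_filter.score_document(doc["text"])
        if doc_filter.remove_document(score):
            removed.append(doc)
        else:
            retained.append(doc)
    return retained, removed


docs = [
    {"id": "a", "text": "a reasonably long example sentence"},
    {"id": "b", "text": "too short"},
]
retained, removed = partition(docs, MinLengthFilter(min_words=3))
print([d["id"] for d in retained], [d["id"] for d in removed])
```

Keeping scoring and the threshold decision in separate methods means the same scores can be logged or analyzed without committing to a cutoff.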

Finally, when the --output-document-score-dir argument is specified, the filter_documents utility writes the computed score for each document to text files. This allows users to easily gather statistics and plot score distributions for further analysis of their data.
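As a rough sketch of the kind of analysis this enables, the snippet below summarizes a set of scores using only the standard library. The layout of the score files is not specified here, so the code assumes one numeric score per line with values in the range 0 to 1; that layout, the helper name summarize_scores, and the sample values are all assumptions made for illustration.

```python
import statistics
from collections import Counter


def summarize_scores(lines, num_bins=10):
    """Compute the mean, median, and a coarse histogram of scores.

    Assumes one numeric score in [0, 1] per line, which is a guess
    about the score file layout, not a documented format.
    """
    scores = [float(line) for line in lines if line.strip()]
    # Map each score to a bin index; a score of exactly 1.0 joins the last bin.
    histogram = Counter(min(int(s * num_bins), num_bins - 1) for s in scores)
    return statistics.mean(scores), statistics.median(scores), dict(histogram)


# Made-up sample lines standing in for the contents of a score file.
sample = ["0.95\n", "0.15\n", "0.92\n", "0.55\n"]
mean, median, bins = summarize_scores(sample)
print(mean, median, sorted(bins.items()))
```

A histogram like this can help when choosing a cutoff such as min_langid_score, since it shows how many documents fall on each side of a candidate threshold.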

© Copyright 2023, NVIDIA. Last updated on Sep 13, 2023.