Quality Assessment & Filtering
Use NeMo Curator’s tools and utilities to score and remove low-quality content with heuristics and ML classifiers, preparing your data for model training.
Large datasets often contain many documents considered “low quality.” In this context, “low quality” means data we do not want downstream models to learn from, and “high quality” is data we do want them to learn from. The metrics that define quality can vary widely.
NeMo Curator’s filtering framework is built around several key components that work within the data processing architecture:
The ScoreFilter is at the center of filtering in NeMo Curator. It applies a filter to a document and optionally saves the score as metadata:
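As a rough illustration of that behavior, here is a minimal standalone sketch in plain Python. It does not import NeMo Curator and is not the library's implementation: `Document`, `MinWordCountFilter`, `score_and_filter`, and the `score_field` parameter are hypothetical names chosen for this example. It shows a filter being applied to each document, with the score optionally saved as metadata.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    """Hypothetical stand-in for a document record: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


class MinWordCountFilter:
    """Hypothetical filter that scores a document and decides whether to keep it."""

    def __init__(self, min_words: int = 5) -> None:
        self.min_words = min_words

    def score_document(self, text: str) -> int:
        # The score is the number of whitespace-separated tokens.
        return len(text.split())

    def keep_document(self, score: int) -> bool:
        # Keep the document only if it meets the minimum word count.
        return score >= self.min_words


def score_and_filter(docs, filter_obj, score_field: Optional[str] = None):
    """Apply filter_obj to each document; store the score only if score_field is set."""
    kept = []
    for doc in docs:
        score = filter_obj.score_document(doc.text)
        if score_field is not None:
            doc.metadata[score_field] = score
        if filter_obj.keep_document(score):
            kept.append(doc)
    return kept


docs = [
    Document("too short"),
    Document("this sentence has more than enough words to pass"),
]
kept = score_and_filter(docs, MinWordCountFilter(min_words=5), score_field="word_count")
```

Passing `score_field=None` in this sketch filters without recording the score, which mirrors the "optionally saves the score" behavior described above.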
Default Executor: When you call pipeline.run() without specifying an executor, NeMo Curator automatically uses XennaExecutor() as the default. You can optionally specify a different executor by passing it as a parameter: pipeline.run(executor=my_executor).
The filter object implements two key methods:
score_document: Computes a quality score for a document
keep_document: Determines if a document should be kept based on its score
Filter text using configurable rules and metrics
Filter text using trained quality classifiers
GPU-accelerated classification with pre-trained models
NeMo Curator provides programmatic interfaces for document filtering through the Pipeline framework:
When filtering large datasets, consider these performance tips: