filters.synthetic#
Module Contents#
Classes#
| AnswerabilityFilter | Discards questions that cannot be answered from the content of the context document |
| EasinessFilter | Discards questions that retriever models can retrieve easily |
Functions#
API#
- class filters.synthetic.AnswerabilityFilter(
- base_url: str,
- api_key: str,
- model: str,
- answerability_system_prompt: str,
- answerability_user_prompt_template: str,
- num_criteria: int,
- text_fields: list[str] | None = None,
)
Bases:
nemo_curator.filters.doc_filter.DocumentFilter
Discards questions that cannot be answered from the content of the context document.
Initialization
- keep_document(scores: pandas.Series) → pandas.Series #
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (pandas.Series): The scores returned by score_document().
Returns: pandas.Series: True for each document that should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(df: pandas.DataFrame) → pandas.Series #
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: df (pandas.DataFrame): The documents to be scored.
Returns: pandas.Series: A score or set of scores representing each document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
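The two-step contract above (score each document, then turn the scores into keep/discard decisions) can be illustrated with a minimal sketch. It is not the library's implementation. It assumes, purely for illustration, that each score is a JSON string in which an LLM judge answers "Y" or "N" for keys named `criterion_1` through `criterion_<num_criteria>`; the key names, the helper `keep_by_criteria`, and the use of plain lists in place of `pandas.Series` are all hypothetical.

```python
import json


def keep_by_criteria(judgement: str, num_criteria: int) -> bool:
    """Return True only if every criterion in the judge's JSON reply is "Y".

    Hypothetical helper: the reply format is an assumption, not the
    documented behaviour of AnswerabilityFilter.
    """
    try:
        verdicts = json.loads(judgement)
    except json.JSONDecodeError:
        # Unparseable replies are treated as "not answerable" in this sketch.
        return False
    if not isinstance(verdicts, dict):
        return False
    return all(
        verdicts.get(f"criterion_{i}") == "Y" for i in range(1, num_criteria + 1)
    )


# One hypothetical judge reply per question/context pair.
scores = [
    '{"criterion_1": "Y", "criterion_2": "Y"}',
    '{"criterion_1": "Y", "criterion_2": "N"}',
    "not valid json",
]
keep_mask = [keep_by_criteria(s, num_criteria=2) for s in scores]
print(keep_mask)  # [True, False, False]
```

Requiring every criterion to pass is the strictest possible rule; a real filter might instead tolerate a missing key or a failed parse, so treat the failure handling here as a design choice of the sketch.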
- class filters.synthetic.EasinessFilter(
- base_url: str,
- api_key: str,
- model: str,
- percentile: float = 0.7,
- truncate: str = 'NONE',
- batch_size: int = 1,
- text_fields: list[str] | None = None,
)
Bases:
nemo_curator.filters.doc_filter.DocumentFilter
Discards questions that retriever models can retrieve easily.
Initialization
- keep_document(scores: pandas.Series) → pandas.Series #
Determine whether to keep a document based on its scores.
This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().
Args: scores (pandas.Series): The scores returned by score_document().
Returns: pandas.Series: True for each document that should be kept, False otherwise.
Raises: NotImplementedError: If the method is not implemented in a subclass.
- score_document(df: pandas.DataFrame) → pandas.Series #
Calculate a score for the given document text.
This method should be implemented by subclasses to define how a document’s text is evaluated and scored.
Args: df (pandas.DataFrame): The documents to be scored.
Returns: pandas.Series: A score or set of scores representing each document’s relevance or quality. The type and structure of the return value should be consistent for each subclass.
Raises: NotImplementedError: If the method is not implemented in a subclass.
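A percentile-based cut like the one suggested by the `percentile` parameter might look like the sketch below. The assumptions are stated loudly: scores are taken to be question-to-context similarity values where higher means easier to retrieve, `percentile` is read as a fraction between 0 and 1, and entries at or below the cutoff are kept. None of this is confirmed by the reference above, and the function name `keep_hard_questions` is hypothetical.

```python
import math


def keep_hard_questions(scores: list[float], percentile: float = 0.7) -> list[bool]:
    """Keep entries whose score is at or below the percentile cutoff.

    Assumption: higher score = easier to retrieve, so high scorers are dropped.
    """
    if not scores:
        return []
    ranked = sorted(scores)
    # Nearest-rank percentile: the smallest value with at least
    # `percentile` of the data at or below it.
    rank = max(1, math.ceil(percentile * len(ranked)))
    cutoff = ranked[rank - 1]
    return [s <= cutoff for s in scores]


# Hypothetical similarity scores for five question/context pairs.
similarities = [0.91, 0.35, 0.78, 0.52, 0.66]
keep_mask = keep_hard_questions(similarities, percentile=0.7)
print(keep_mask)  # [False, True, True, True, True]
```

Because the cutoff is computed from the batch itself, the same question can be kept in one batch and dropped in another; a fixed absolute threshold would avoid that at the cost of tuning it per retriever model.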
- filters.synthetic.create_client(base_url: str, api_key: str) → openai.OpenAI #