`filters.synthetic`#

Module Contents#

Classes#

`AnswerabilityFilter`	Discards questions that are not answerable by content present in the context document
`EasinessFilter`	Discards questions that are deemed easy to retrieve by retriever models

Functions#

create_client

API#

class filters.synthetic.AnswerabilityFilter( base_url: str, api_key: str, model: str, answerability_system_prompt: str, answerability_user_prompt_template: str, num_criteria: int, text_fields: list[str] | None = None, )#

Bases: nemo_curator.filters.doc_filter.DocumentFilter

Discards questions that are not answerable by content present in the context document

Initialization

keep_document(scores: pandas.Series) → pandas.Series#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Args: scores (pandas.Series): The scores returned by score_document(), one entry per document.

Returns: pandas.Series: One boolean per document: True if the document should be kept, False otherwise.

Raises: NotImplementedError: If the method is not implemented in a subclass.

score_document(df: pandas.DataFrame) → pandas.Series#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Args: df (pandas.DataFrame): The documents to be scored.

Returns: pandas.Series: A score or set of scores for each document, representing its relevance or quality. The type and structure of the values should be consistent for each subclass.

Raises: NotImplementedError: If the method is not implemented in a subclass.
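
As an illustration of the score-then-keep flow, here is a minimal sketch in plain Python. It is not the library's implementation: it assumes, hypothetically, that the model's reply for each question is a JSON object with keys `criterion_1` through `criterion_N` whose values are `"Y"` or `"N"`, and that a question is kept only when all `num_criteria` criteria are met. Plain lists stand in for `pandas.Series`, and `keep_answerable` is an illustrative name, not part of the module.

```python
import json


def keep_answerable(raw_scores: list[str], num_criteria: int) -> list[bool]:
    """Return one keep/discard flag per raw model reply (hypothetical format)."""
    flags = []
    for raw in raw_scores:
        try:
            verdict = json.loads(raw)
            # Keep only if every criterion is present and marked "Y".
            keep = all(
                verdict.get(f"criterion_{i}") == "Y"
                for i in range(1, num_criteria + 1)
            )
        except (json.JSONDecodeError, AttributeError):
            # Assumption: a reply that is not a JSON object counts as not answerable.
            keep = False
        flags.append(keep)
    return flags


replies = [
    '{"criterion_1": "Y", "criterion_2": "Y"}',
    '{"criterion_1": "Y", "criterion_2": "N"}',
    "not json",
]
print(keep_answerable(replies, num_criteria=2))  # [True, False, False]
```

Treating unparseable replies as "discard" is a conservative choice for this sketch; a real pipeline might prefer to log and retry them.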

class filters.synthetic.EasinessFilter( base_url: str, api_key: str, model: str, percentile: float = 0.7, truncate: str = 'NONE', batch_size: int = 1, text_fields: list[str] | None = None, )#

Bases: nemo_curator.filters.doc_filter.DocumentFilter

Discards questions that are deemed easy to retrieve by retriever models

Initialization

keep_document(scores: pandas.Series) → pandas.Series#

Determine whether to keep a document based on its scores.

This method should be implemented by subclasses to define the criteria for keeping or discarding a document based on the scores calculated by score_document().

Args: scores (pandas.Series): The scores returned by score_document(), one entry per document.

Returns: pandas.Series: One boolean per document: True if the document should be kept, False otherwise.

Raises: NotImplementedError: If the method is not implemented in a subclass.

score_document(df: pandas.DataFrame) → pandas.Series#

Calculate a score for the given document text.

This method should be implemented by subclasses to define how a document’s text is evaluated and scored.

Args: df (pandas.DataFrame): The documents to be scored.

Returns: pandas.Series: A score or set of scores for each document, representing its relevance or quality. The type and structure of the values should be consistent for each subclass.

Raises: NotImplementedError: If the method is not implemented in a subclass.
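
One plausible reading of the `percentile` parameter is as a cutoff on retrieval scores. The sketch below is an assumption-laden illustration, not the library's code: it assumes each score is a question-to-context similarity (higher means easier to retrieve), that `percentile` is a fraction in [0, 1], and that questions at or below the cutoff are kept. Plain lists stand in for `pandas.Series`, and `keep_hard_questions` is a hypothetical name.

```python
import math


def keep_hard_questions(scores: list[float], percentile: float = 0.7) -> list[bool]:
    """Keep questions whose score is at or below the percentile cutoff."""
    if not scores:
        return []
    ordered = sorted(scores)
    # Nearest-rank cutoff: the smallest score with at least `percentile`
    # of all scores at or below it.
    rank = max(1, math.ceil(percentile * len(ordered)))
    cutoff = ordered[rank - 1]
    return [score <= cutoff for score in scores]


similarities = [0.91, 0.35, 0.62, 0.88, 0.47]
print(keep_hard_questions(similarities))  # [False, True, True, True, True]
```

Because the cutoff is derived from the scores themselves, the same question can be kept in one batch and discarded in another; a fixed threshold would avoid that at the cost of tuning it per retriever model.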

filters.synthetic.create_client(base_url: str, api_key: str) → openai.OpenAI#
