`synthetic.async_nemotron_cc`

Module Contents

Classes

AsyncNemotronCCGenerator

Provides a collection of methods for generating synthetic data described in the Nemotron-CC paper (https://arxiv.org/abs/2412.02595).

API

class synthetic.async_nemotron_cc.AsyncNemotronCCGenerator(llm_client: nemo_curator.services.AsyncLLMClient)

Provides a collection of methods for generating synthetic data described in the Nemotron-CC paper (https://arxiv.org/abs/2412.02595).

Initialization

Initialize the AsyncNemotronCCGenerator instance.

Args:
    llm_client (AsyncLLMClient): The asynchronous language model client used for querying the model.

async distill(document: str, model: str, prompt_template: str = DISTILL_PROMPT_TEMPLATE, system_prompt: str = NEMOTRON_CC_DISTILL_SYSTEM_PROMPT, prompt_kwargs: dict | None = None, model_kwargs: dict | None = None) → list[str]

Distills the essential content from a document.

Args:
    document (str): The input document text to distill.
    model (str): The model identifier to use.
    prompt_template (str, optional): The prompt template for distillation. Defaults to DISTILL_PROMPT_TEMPLATE.
    system_prompt (str, optional): The system prompt to use. Defaults to NEMOTRON_CC_DISTILL_SYSTEM_PROMPT.
    prompt_kwargs (dict, optional): Additional keyword arguments for the prompt. Defaults to None, which is treated as {}.
    model_kwargs (dict, optional): Additional keyword arguments for the model invocation. Defaults to None, which is treated as {}.

Returns:
    list[str]: A list of responses from the LLM. The list contains more than one element only if n > 1 is set in model_kwargs.
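As a rough illustration of the calling pattern, the sketch below awaits distill and reads the first element of the returned list. It does not import NeMo Curator: StandInGenerator is a hypothetical stand-in that only mirrors the documented signature and returns canned text, and "my-model" is a placeholder identifier. With the real class, the stand-in would be replaced by an AsyncNemotronCCGenerator built from an AsyncLLMClient.

```python
import asyncio
from typing import Optional


class StandInGenerator:
    """Hypothetical stand-in mirroring the documented distill() signature."""

    async def distill(
        self,
        document: str,
        model: str,
        prompt_kwargs: Optional[dict] = None,
        model_kwargs: Optional[dict] = None,
    ) -> list[str]:
        await asyncio.sleep(0)  # stands in for the awaitable LLM round trip
        # Canned behavior: keep only the first sentence of the document.
        return [document.split(".")[0].strip() + "."]


async def main() -> str:
    # Assumption: swap this for AsyncNemotronCCGenerator(llm_client) in real use.
    generator = StandInGenerator()
    responses = await generator.distill(
        document="Water boils at 100 C at sea level. Click here to subscribe!",
        model="my-model",  # placeholder model identifier
    )
    # The method returns a list even when only one completion is requested.
    return responses[0]


distilled = asyncio.run(main())
print(distilled)
```

Because every method on the class is a coroutine, it has to be awaited inside an event loop; asyncio.run is one simple way to start one from synchronous code.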

async extract_knowledge(document: str, model: str, prompt_template: str = EXTRACT_KNOWLEDGE_PROMPT_TEMPLATE, system_prompt: str = NEMOTRON_CC_SYSTEM_PROMPT, prompt_kwargs: dict | None = None, model_kwargs: dict | None = None) → list[str]

Extracts knowledge from the provided document.

Args:
    document (str): The input document text from which to extract knowledge.
    model (str): The model identifier to use.
    prompt_template (str, optional): The prompt template for knowledge extraction. Defaults to EXTRACT_KNOWLEDGE_PROMPT_TEMPLATE.
    system_prompt (str, optional): The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
    prompt_kwargs (dict, optional): Additional keyword arguments for the prompt. Defaults to None, which is treated as {}.
    model_kwargs (dict, optional): Additional keyword arguments for the model invocation. Defaults to None, which is treated as {}.

Returns:
    list[str]: A list of responses from the LLM. The list contains more than one element only if n > 1 is set in model_kwargs.
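The signatures declare prompt_kwargs and model_kwargs with a None default rather than a literal {}, which is the usual Python idiom for avoiding a mutable default shared across calls. The sketch below shows that idiom under the assumption that the prompt is built by formatting the template with the document plus any prompt_kwargs. The helper build_prompt and both templates are hypothetical; the library's real templates and internals are not shown on this page.

```python
from typing import Optional

# Hypothetical templates, used only for this illustration.
EXAMPLE_TEMPLATE = "Extract the key facts from this text:\n{document}"
CUSTOM_TEMPLATE = "Extract the key facts in {language}:\n{document}"


def build_prompt(
    document: str,
    prompt_template: str = EXAMPLE_TEMPLATE,
    prompt_kwargs: Optional[dict] = None,
) -> str:
    # None is normalized to a fresh dict, so no state leaks between calls
    # and the caller's dict is never mutated.
    kwargs = {} if prompt_kwargs is None else dict(prompt_kwargs)
    kwargs["document"] = document
    return prompt_template.format(**kwargs)


default_prompt = build_prompt("Cats sleep a lot.")
custom_prompt = build_prompt(
    "Cats sleep a lot.",
    prompt_template=CUSTOM_TEMPLATE,
    prompt_kwargs={"language": "French"},
)
print(default_prompt)
print(custom_prompt)
```

Under this assumption, a custom prompt_template with extra placeholders needs matching keys in prompt_kwargs; otherwise str.format raises a KeyError.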

async generate_diverse_qa(document: str, model: str, prompt_template: str = DIVERSE_QA_PROMPT_TEMPLATE, system_prompt: str = NEMOTRON_CC_SYSTEM_PROMPT, prompt_kwargs: dict | None = None, model_kwargs: dict | None = None) → list[str]

Generates diverse QA pairs from the provided document.

Args:
    document (str): The input document text used to generate QA pairs.
    model (str): The model identifier to use.
    prompt_template (str, optional): The prompt template for generating QA pairs. Defaults to DIVERSE_QA_PROMPT_TEMPLATE.
    system_prompt (str, optional): The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
    prompt_kwargs (dict, optional): Additional keyword arguments for the prompt. Defaults to None, which is treated as {}.
    model_kwargs (dict, optional): Additional keyword arguments for the model invocation. Defaults to None, which is treated as {}.

Returns:
    list[str]: A list of responses from the LLM. The list contains more than one element only if n > 1 is set in model_kwargs.
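Since the return value holds one entry per requested completion, setting n in model_kwargs yields several QA variants per document. The sketch below flattens those lists into one record per variant. StandInGenerator is a hypothetical stand-in that mirrors the documented signature and fabricates placeholder strings; the record layout (doc_id, sample_id, text) is an illustrative choice, not something the library prescribes.

```python
import asyncio
from typing import Optional


class StandInGenerator:
    """Hypothetical stand-in mirroring the documented generate_diverse_qa() signature."""

    async def generate_diverse_qa(
        self,
        document: str,
        model: str,
        prompt_kwargs: Optional[dict] = None,
        model_kwargs: Optional[dict] = None,
    ) -> list[str]:
        # Emulates the documented behavior: one response per requested completion.
        n = (model_kwargs or {}).get("n", 1)
        await asyncio.sleep(0)
        return [f"Q&A variant {i + 1} for: {document}" for i in range(n)]


async def qa_records(documents: list[str], model: str, n: int) -> list[dict]:
    generator = StandInGenerator()  # assumption: replace with the real generator
    records = []
    for doc_id, document in enumerate(documents):
        responses = await generator.generate_diverse_qa(
            document=document, model=model, model_kwargs={"n": n}
        )
        # Flatten: one record per (document, completion) pair.
        for sample_id, text in enumerate(responses):
            records.append({"doc_id": doc_id, "sample_id": sample_id, "text": text})
    return records


records = asyncio.run(qa_records(["doc A", "doc B"], model="my-model", n=3))
print(len(records))
```

Keeping doc_id and sample_id on each record makes it possible to trace every generated variant back to its source document later.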

async generate_knowledge_list(document: str, model: str, prompt_template: str = KNOWLEDGE_LIST_PROMPT_TEMPLATE, system_prompt: str = NEMOTRON_CC_SYSTEM_PROMPT, prompt_kwargs: dict | None = None, model_kwargs: dict | None = None) → list[str]

Generates a list of knowledge items from the provided document.

Args:
    document (str): The input document text to process.
    model (str): The model identifier to use.
    prompt_template (str, optional): The prompt template for generating a knowledge list. Defaults to KNOWLEDGE_LIST_PROMPT_TEMPLATE.
    system_prompt (str, optional): The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
    prompt_kwargs (dict, optional): Additional keyword arguments for the prompt. Defaults to None, which is treated as {}.
    model_kwargs (dict, optional): Additional keyword arguments for the model invocation. Defaults to None, which is treated as {}.

Returns:
    list[str]: A list of responses from the LLM. The list contains more than one element only if n > 1 is set in model_kwargs.
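Because the methods are coroutines, many documents can be processed concurrently. The sketch below shows one common way to do that with asyncio.gather, using an asyncio.Semaphore to cap the number of in-flight requests. StandInGenerator, process_all, and the max_concurrent parameter are hypothetical names introduced for this illustration; the stand-in mirrors the documented signature and returns canned text instead of calling a model.

```python
import asyncio
from typing import Optional


class StandInGenerator:
    """Hypothetical stand-in mirroring the documented generate_knowledge_list() signature."""

    async def generate_knowledge_list(
        self,
        document: str,
        model: str,
        prompt_kwargs: Optional[dict] = None,
        model_kwargs: Optional[dict] = None,
    ) -> list[str]:
        await asyncio.sleep(0.01)  # stands in for network latency
        return [f"- fact from {document}"]


async def process_all(
    documents: list[str], model: str, max_concurrent: int = 2
) -> list[list[str]]:
    generator = StandInGenerator()  # assumption: replace with the real generator
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(document: str) -> list[str]:
        # At most max_concurrent calls are awaiting the model at any time.
        async with semaphore:
            return await generator.generate_knowledge_list(
                document=document, model=model
            )

    # gather returns results in the same order as the input documents.
    return await asyncio.gather(*(process_one(d) for d in documents))


results = asyncio.run(process_all(["doc A", "doc B", "doc C"], model="my-model"))
print(results)
```

The cap is a design choice for staying within a serving endpoint's rate limits; a suitable value depends on the deployment and is not specified by this page.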

async rewrite_to_wikipedia_style(document: str, model: str, prompt_template: str = WIKIPEDIA_REPHRASING_PROMPT_TEMPLATE, system_prompt: str = NEMOTRON_CC_SYSTEM_PROMPT, prompt_kwargs: dict | None = None, model_kwargs: dict | None = None) → list[str]

Rewrites a document into a Wikipedia-style narrative.

Args:
    document (str): The input document text to rewrite.
    model (str): The model identifier to use.
    prompt_template (str, optional): The prompt template for rewriting. Defaults to WIKIPEDIA_REPHRASING_PROMPT_TEMPLATE.
    system_prompt (str, optional): The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
    prompt_kwargs (dict, optional): Additional keyword arguments for the prompt. Defaults to None, which is treated as {}.
    model_kwargs (dict, optional): Additional keyword arguments for the model invocation. Defaults to None, which is treated as {}.

Returns:
    list[str]: A list of responses from the LLM. The list contains more than one element only if n > 1 is set in model_kwargs.