synthetic.nemotron_cc
#
Module Contents#
Classes#
Postprocesses the output of the Nemotron-CC Diverse QA generation pipeline. This postprocessor will sample a random number of QA pairs up to max_num_pairs. If a tokenizer is provided, the number of QA pairs will be sampled from at least 1 and at most floor(max_num_pairs * num_tokens / 150). Otherwise, the number of QA pairs will be sampled randomly strictly up to max_num_pairs. |
|
Provides a collection of methods for generating synthetic data described in the Nemotron-CC paper (https://arxiv.org/abs/2412.02595). |
|
Processes and cleans the output generated by the Nemotron-CC Knowledge List pipeline. |
API#
- class synthetic.nemotron_cc.NemotronCCDiverseQAPostprocessor(
- tokenizer: transformers.AutoTokenizer | None = None,
- text_field: str = 'text',
- response_field: str = 'response',
- max_num_pairs: int = 1,
- prefix: str = 'Here are the questions and answers based on the provided text:',
Bases:
nemo_curator.BaseModule
Postprocesses the output of the Nemotron-CC Diverse QA generation pipeline. This postprocessor will sample a random number of QA pairs up to max_num_pairs. If a tokenizer is provided, the number of QA pairs will be sampled from at least 1 and at most floor(max_num_pairs * num_tokens / 150). Otherwise, the number of QA pairs will be sampled randomly strictly up to max_num_pairs.
The generated QA pairs are shuffled and then appended to the original text.
Initialization
Args: tokenizer (Optional[AutoTokenizer]): The tokenizer to use for tokenization. If specified, the number of QA pairs will be sampled based on the token count of the text. If not specified, the number of QA pairs will be sampled randomly up to max_num_pairs. text_field (str): The field in the dataset that contains the text used to generate QA pairs. response_field (str): The field in the dataset that contains the response from the LLM. max_num_pairs (int): The maximum number of QA pairs to sample. prefix (str): The prefix of the response from the LLM.
- call(
- dataset: nemo_curator.datasets.DocumentDataset,
Performs an arbitrary operation on a dataset
Args: dataset (DocumentDataset): The dataset to operate on
- class synthetic.nemotron_cc.NemotronCCGenerator(llm_client: nemo_curator.services.LLMClient)#
Provides a collection of methods for generating synthetic data described in the Nemotron-CC paper (https://arxiv.org/abs/2412.02595).
Initialization
Initialize the NemotronCCGenerator instance.
Args: llm_client (LLMClient): The language model client used for querying the model.
- distill(
- document: str,
- model: str,
- prompt_template: str = DISTILL_PROMPT_TEMPLATE,
- system_prompt: str = NEMOTRON_CC_DISTILL_SYSTEM_PROMPT,
- prompt_kwargs: dict | None = None,
- model_kwargs: dict | None = None,
Distills the essential content from a document.
Args: document (str): The input document text to distill. model (str): The model identifier to use. prompt_template (str, optional): The prompt template for distillation. Defaults to DISTILL_PROMPT_TEMPLATE. system_prompt (str, optional): The system prompt to use. Defaults to NEMOTRON_CC_DISTILL_SYSTEM_PROMPT. prompt_kwargs (dict, optional): Additional keyword arguments for the prompt. Defaults to {}. model_kwargs (dict, optional): Additional keyword arguments for the model invocation. Defaults to {}.
Returns: List[str]: A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
- extract_knowledge(
- document: str,
- model: str,
- prompt_template: str = EXTRACT_KNOWLEDGE_PROMPT_TEMPLATE,
- system_prompt: str = NEMOTRON_CC_SYSTEM_PROMPT,
- prompt_kwargs: dict | None = None,
- model_kwargs: dict | None = None,
Extracts knowledge from the provided document.
Args: document (str): The input document text from which to extract knowledge. model (str): The model identifier to use. prompt_template (str, optional): The prompt template for knowledge extraction. Defaults to EXTRACT_KNOWLEDGE_PROMPT_TEMPLATE. system_prompt (str, optional): The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT. prompt_kwargs (dict, optional): Additional keyword arguments for the prompt. Defaults to {}. model_kwargs (dict, optional): Additional keyword arguments for the model invocation. Defaults to {}.
Returns: List[str]: A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
- generate_diverse_qa(
- document: str,
- model: str,
- prompt_template: str = DIVERSE_QA_PROMPT_TEMPLATE,
- system_prompt: str = NEMOTRON_CC_SYSTEM_PROMPT,
- prompt_kwargs: dict | None = None,
- model_kwargs: dict | None = None,
Generates diverse QA pairs from the provided document.
Args: document (str): The input document text used to generate QA pairs. model (str): The model identifier to use. prompt_template (str, optional): The prompt template for generating QA pairs. Defaults to DIVERSE_QA_PROMPT_TEMPLATE. system_prompt (str, optional): The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT. prompt_kwargs (dict, optional): Additional keyword arguments for the prompt. Defaults to {}. model_kwargs (dict, optional): Additional keyword arguments for the model invocation. Defaults to {}.
Returns: List[str]: A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
- generate_knowledge_list(
- document: str,
- model: str,
- prompt_template: str = KNOWLEDGE_LIST_PROMPT_TEMPLATE,
- system_prompt: str = NEMOTRON_CC_SYSTEM_PROMPT,
- prompt_kwargs: dict | None = None,
- model_kwargs: dict | None = None,
Generates a list of knowledge items from the provided document.
Args: document (str): The input document text to process. model (str): The model identifier to use. prompt_template (str, optional): The prompt template for generating a knowledge list. Defaults to KNOWLEDGE_LIST_PROMPT_TEMPLATE. system_prompt (str, optional): The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT. prompt_kwargs (dict, optional): Additional keyword arguments for the prompt. Defaults to {}. model_kwargs (dict, optional): Additional keyword arguments for the model invocation. Defaults to {}.
Returns: List[str]: A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
- rewrite_to_wikipedia_style(
- document: str,
- model: str,
- prompt_template: str = WIKIPEDIA_REPHRASING_PROMPT_TEMPLATE,
- system_prompt: str = NEMOTRON_CC_SYSTEM_PROMPT,
- prompt_kwargs: dict | None = None,
- model_kwargs: dict | None = None,
Rewrites a document into a Wikipedia-style narrative.
Args: document (str): The input document text to rewrite. model (str): The model identifier to use. prompt_template (str, optional): The prompt template for rewriting. Defaults to WIKIPEDIA_REPHRASING_PROMPT_TEMPLATE. system_prompt (str, optional): The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT. prompt_kwargs (dict, optional): Additional keyword arguments for the prompt. Defaults to {}. model_kwargs (dict, optional): Additional keyword arguments for the model invocation. Defaults to {}.
Returns: List[str]: A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
- class synthetic.nemotron_cc.NemotronCCKnowledgeListPostprocessor(text_field: str = 'text')#
Bases:
nemo_curator.BaseModule
Processes and cleans the output generated by the Nemotron-CC Knowledge List pipeline.
This class is responsible for postprocessing raw text responses produced by the Nemotron-CC Knowledge List generation pipeline. It removes formatting artifacts such as bullet point prefixes (”- “) and extra indentation from each line, ensuring that the final output is a clean, uniformly formatted list of knowledge items. The processing includes skipping any initial non-bullet line and merging related lines to reconstruct multi-line questions or answers.
Initialization
Constructs a Module
Args: input_backend (Literal[“pandas”, “cudf”, “any”]): The backend the input dataframe must be on for the module to work name (str, Optional): The name of the module. If None, defaults to self.class.name
- call(
- dataset: nemo_curator.datasets.DocumentDataset,
Performs an arbitrary operation on a dataset
Args: dataset (DocumentDataset): The dataset to operate on