`synthetic.nemotron_cc`#

Module Contents#

Classes#

`NemotronCCDiverseQAPostprocessor`	Postprocesses the output of the Nemotron-CC Diverse QA generation pipeline. This postprocessor will sample a random number of QA pairs up to max_num_pairs. If a tokenizer is provided, the number of QA pairs will be sampled from at least 1 and at most floor(max_num_pairs * num_tokens / 150). Otherwise, the number of QA pairs will be sampled randomly strictly up to max_num_pairs.
`NemotronCCGenerator`	Provides a collection of methods for generating synthetic data described in the Nemotron-CC paper (https://arxiv.org/abs/2412.02595).
`NemotronCCKnowledgeListPostprocessor`	Processes and cleans the output generated by the Nemotron-CC Knowledge List pipeline.

API#

class synthetic.nemotron_cc.NemotronCCDiverseQAPostprocessor( tokenizer: transformers.AutoTokenizer | None = None, text_field: str = 'text', response_field: str = 'response', max_num_pairs: int = 1, prefix: str = 'Here are the questions and answers based on the provided text:', )#

Bases: nemo_curator.BaseModule

Postprocesses the output of the Nemotron-CC Diverse QA generation pipeline. This postprocessor will sample a random number of QA pairs up to max_num_pairs. If a tokenizer is provided, the number of QA pairs will be sampled from at least 1 and at most floor(max_num_pairs * num_tokens / 150). Otherwise, the number of QA pairs will be sampled randomly strictly up to max_num_pairs.

The generated QA pairs are shuffled and then appended to the original text.

Initialization

Args: tokenizer (Optional[AutoTokenizer]): The tokenizer to use for tokenization. If specified, the number of QA pairs will be sampled based on the token count of the text. If not specified, the number of QA pairs will be sampled randomly up to max_num_pairs. text_field (str): The field in the dataset that contains the text used to generate QA pairs. response_field (str): The field in the dataset that contains the response from the LLM. max_num_pairs (int): The maximum number of QA pairs to sample. prefix (str): The prefix of the response from the LLM.

call( dataset: nemo_curator.datasets.DocumentDataset, ) → nemo_curator.datasets.DocumentDataset#

Performs an arbitrary operation on a dataset

Args: dataset (DocumentDataset): The dataset to operate on

class synthetic.nemotron_cc.NemotronCCGenerator(llm_client: nemo_curator.services.LLMClient)#

Provides a collection of methods for generating synthetic data described in the Nemotron-CC paper (https://arxiv.org/abs/2412.02595).

Initialization

Initialize the NemotronCCGenerator instance.

Args: llm_client (LLMClient): The language model client used for querying the model.

distill( document: str, model: str, prompt_template: str = DISTILL_PROMPT_TEMPLATE, system_prompt: str = NEMOTRON_CC_DISTILL_SYSTEM_PROMPT, prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Distills the essential content from a document.

Args: document (str): The input document text to distill. model (str): The model identifier to use. prompt_template (str, optional): The prompt template for distillation. Defaults to DISTILL_PROMPT_TEMPLATE. system_prompt (str, optional): The system prompt to use. Defaults to NEMOTRON_CC_DISTILL_SYSTEM_PROMPT. prompt_kwargs (dict, optional): Additional keyword arguments for the prompt. Defaults to {}. model_kwargs (dict, optional): Additional keyword arguments for the model invocation. Defaults to {}.