Synthetic Data#

class nemo_curator.synthetic.NemotronGenerator( llm_client: LLMClient, )#

Provides a collection of methods for generating synthetic data described in the Nemotron-4 340B Technical Report (https://arxiv.org/abs/2406.11704v1) and inspired by the UltraChat paper (https://arxiv.org/abs/2305.14233)

classify_math_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Math concepts taught at elementary school, middle school, high school, and univiersity.\n- Important mathematics axioms, theorems, algorithms, equations, or inequalities.\n- Representative math problems, functions, and applications.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to classify if an entity is related to math :param entity: The entity to classify :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - entity: Will be populated with the entity passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

classify_python_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Programming concepts like loops, functions, and data structures in python.\n- Important functions, objects, or libraries in python.\n- Mathematical concepts like linear algebra which can be implemented in python.\n- Basic algorithms or problems in computer science likes Greedy Search and Dynamics programming which can be addressed in python.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to classify if an entity is related to Python :param entity: The entity to classify :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - entity: Will be populated with the entity passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

convert_response_to_yaml_list( llm_response: str, model: str, prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Converts a response of an LLM to a list of strings by querying an LLM :param llm_response: The original unformatted response of the LLM :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have a {llm_response} parameter that will be populated with the llm_response value passed in this function.
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A parsed list of elements from the original LLM response

generate_closed_qa_instructions( document: str, n_openlines: str | int, model: str, prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of closed Q&A questions based on a reference document :param document: The document to use when generating questions :param n_openlines: The number of questions to generate per document. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - document: Will be populated with the document passed in this function - n_openlines: Will be populated with the n_openlines passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_dialogue( openline: str, user_model: str, assistant_model: str, n_user_turns: int = 3, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict | None = None, user_model_kwargs: dict | None = None, assistant_model_kwargs: dict | None = None, ) → list[dict]#

Prompts an LLM to generate a dialogue based on a given openline. The LLM will alternate impersonating the user and the assistant. :param openline: The openline that will comprise the first user turn. :param user_model: The model that will be impersonating the user.

Parameters:

assistant_model – The model that will be impersonating the assistant Must be available in the LLMClient passed in the constructor.
n_user_turns – The number of user turns to go through. The openline counts as 1 user turn. Therefore, if there are 3 user turns, 2 will be generated by the LLM impersonating the user.
prompt_template – A format string of the prompt to use when impersonating the user. It must have the following parameters: - converstation_history: Will be populated with a formatted history of the dialogue up to that point. Some example templates found in nemo_curator.synthetic include: - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
user_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.
assistant_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:

A conversation between a User and Assistant

generate_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of macro topics about the world :param n_macro_topics: The number of macro topics to generate. :param model: The name of the model that should be used to generate the macro topics.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_math_macro_topics( n_macro_topics: int | str, school_level: str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of macro topics about math :param n_macro_topics: The number of macro topics to generate. Can be an integer like 5 or a string like “five”. :param school_level: The school level the math questions should be targeted at. :param model: The name of the model that should be used to generate the macro topics.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function - school_level: Will be populated with the school_level passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_math_problem( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of math problems based on a topic :param topic: The topic to generate problems for. :param n_openlines: The number of problems to generate per topic. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_subtopics passed in this function - topic: Will be populated with the topic passed in this function Some example templates found in nemo_curator.synthetic include: - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_math_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a math macro topic :param macro_topic: The macro topic to generate subtopics for. :param n_subtopics: The number of subtopics to generate per macro topic :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_open_qa_from_topic( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of open Q&A questions based on a topic :param topic: The topic to generate questions for. :param n_openlines: The number of questions to generate per topic. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_subtopics passed in this function - topic: Will be populated with the topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_python_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = 'List {n_macro_topics} important concepts in the python language.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of macro topics about the Python programming language :param n_macro_topics: The number of macro topics to generate. Can be an integer like 5 or a string like “five”. :param model: The name of the model that should be used to generate the macro topics.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_python_problem( topic: str, n_openlines: str | int, model: str, language: str = 'Python', prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of coding problems based on a topic :param topic: The topic to generate problems for. :param n_openlines: The number of problems to generate per topic. :param model: The name of the model that should be used to generate the response.

Parameters:

language – The programming language to target when generating these questions.
prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_subtopics passed in this function - topic: Will be populated with the topic passed in this function - language: Will be populated with the language passed in this function Some example templates found in nemo_curator.synthetic include: - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_python_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a Python macro topic :param macro_topic: The macro topic to generate subtopics for. :param n_subtopics: The number of subtopics to generate per macro topic :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a macro topic :param macro_topic: The macro topic to generate subtopics for. :param n_subtopics: The number of subtopics to generate per macro topic :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_two_turn_prompt( openline: str, user_model: str, assistant_model: str, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict | None = None, user_model_kwargs: dict | None = None, assistant_model_kwargs: dict | None = None, ) → list[dict]#

Prompts an LLM to generate a response as an assistant, then as the user based on a given openline. The conversation will look like “User -> Assistant -> User” :param openline: The openline that will comprise the first user turn. :param user_model: The model that will be impersonating the user.

Parameters:

assistant_model – The model that will be impersonating the assistant Must be available in the LLMClient passed in the constructor.
prompt_template – A format string of the prompt to use when impersonating the user. It must have the following parameters: - converstation_history: Will be populated with a formatted history of the dialogue up to that point. Some example templates found in nemo_curator.synthetic include: - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
user_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.
assistant_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:

A conversation between a User and Assistant

generate_writing_tasks( topic: str, text_material_type: str, n_openlines: str | int, model: str, prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of writing tasks based on a topic and document type :param topic: The topic to generate writing tasks for. :param text_material_type: The type of the document the question should ask to generate (e.g., “Email”, “Poem”) :param n_openlines: The number of tasks to generate per topic and text material pair. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - topic: Will be populated with the topic passed in this function - text_material_type: Will be populated with the text_material_type passed in this function - n_openlines: Will be populated with the n_openlines passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

revise_open_qa( openline: str, n_revisions: str | int, model: str, prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to revise an open Q&A question a given number of times :param openline: An openline to revise :param n_revisions: The number of revisions to generate for the question. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - openline: Will be populated with the openline passed in this function - n_revisions: Will be populated with the n_revisions passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

revise_writing_tasks( openline: str, n_revisions: str | int, model: str, prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to revise a writing task a given number of times :param openline: An openline to revise :param n_revisions: The number of revisions to generate for the task. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - openline: Will be populated with the openline passed in this function - n_revisions: Will be populated with the n_revisions passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

run_closed_qa_pipeline( documents: list[str], n_openlines: str | int, model: str, closed_qa_prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, ignore_conversion_failure: bool = False, ) → list[tuple[int, str]]#

Runs a pipeline for automatically generating closed Q&A openlines for a dialogue :param documents: A list of documents to generate closed Q&A questions for :param n_openlines: The number of questions to generate per document. :param model: The name of the model that should be used to generate all the responses.

Parameters:

closed_qa_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - document: Will be populated with one element of the documents list passed in this function No additional parameters may be passed to this prompt template.
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on

Returns:

A list of pairs where the first element represents the index of the document used to generate the question in the documents list and the second element represents a synthetically generated closed Q&A prompt. Example: [(0, “Summarize this document”), …]

run_math_pipeline( n_macro_topics: str | int, school_level: str, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.', subtopic_prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.', math_problem_prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, additional_macro_topics: list[str] | None = None, additional_subtopics: list[str] | None = None, ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → list[str]#

Runs a pipeline for automatically generating math questions for a dialogue :param n_macro_topics: The number of macro topics to generate. :param school_level: The school level to target when generating macro topics. :param n_subtopics: The number of subtopics to generate per macro topic. :param n_openlines: The number of questions to generate per topic. :param model: The name of the model that should be used to generate all the responses.

Parameters:

macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function - school_level: Will be populated with the school_level passed in this function No additional parameters may be passed to this prompt template.
subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.
math_problem_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - topic: Will be populated with a generated topic No additional parameters may be passed to this prompt template. Some example templates found in nemo_curator.synthetic include: - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated math prompts

run_open_qa_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, n_revisions: str | int, model: str, macro_topic_prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n', subtopic_prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.', open_qa_from_topics_prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', revise_open_qa_prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, additional_macro_topics: list[str] | None = None, additional_subtopics: list[str] | None = None, ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → list[str]#

Runs a pipeline for automatically generating Open Q&A openlines for a dialogue :param n_macro_topics: The number of macro topics to generate :param n_subtopics: The number of subtopics to generate per macro topic :param n_openlines: The number of questions to generate per topic. :param n_revisions: The number of revisions to generate per original question. :param model: The name of the model that should be used to generate all the responses.

Parameters:

macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function No additional parameters may be passed to this prompt template.
subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.
open_qa_from_topics_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - topic: Will be populated with a generated topic No additional parameters may be passed to this prompt template.
revise_open_qa_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_revisions: Will be populated with the n_revisions passed in this function - openline: Will be populated with a generated open Q&A openline No additional parameters may be passed to this prompt template.
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated open Q&A prompts

run_python_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = 'List {n_macro_topics} important concepts in the python language.', subtopic_prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.', python_problem_prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, additional_macro_topics: list[str] | None = None, additional_subtopics: list[str] | None = None, ignore_conversion_failure: bool = False, combine_topics: bool = True, ) → list[str]#

Runs a pipeline for automatically generating Python questions for a dialogue :param n_macro_topics: The number of macro topics to generate. :param n_subtopics: The number of subtopics to generate per macro topic. :param n_openlines: The number of questions to generate per topic. :param model: The name of the model that should be used to generate all the responses.

Parameters:

macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function No additional parameters may be passed to this prompt template.
subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.
python_problem_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - language: Will be populated with “Python” - topic: Will be populated with a generated topic No additional parameters may be passed to this prompt template. Some example templates found in nemo_curator.synthetic include: - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated Python prompts

run_writing_pipeline( topics: list[str], text_material_types: list[str], n_openlines: str | int, n_revisions: str | int, model: str, writing_task_prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.', revise_writing_task_prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, ignore_conversion_failure: bool = False, ) → list[str]#

Runs a pipeline for automatically generating writing task openlines for a dialogue :param topics: A list of topics to generate tasks for :param text_material_types: A list of writing material types, like “Essay” or “Blog post” :param n_openlines: The number of tasks to generate per (topic, text_material_type) pair. :param n_revisions: The number of revisions to generate per original task. :param model: The name of the model that should be used to generate all the responses.

Parameters:

writing_task_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - topic: Will be populated with one element of the topics list passed in this function - text_material_type: Will be populated with one element of the text_material_types list passed in this function No additional parameters may be passed to this prompt template.
revise_writing_task_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_revisions: Will be populated with the n_revisions passed in this function - openline: Will be populated with one of the writing tasks generated in the pipeline. No additional parameters may be passed to this prompt template.
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on

Returns:

A list of synthetically generated writing task prompts

class nemo_curator.synthetic.AsyncNemotronGenerator( llm_client: AsyncLLMClient, logger: LoggerAdapter | str = './', max_concurrent_requests: int | None = None, )#

Provides a collection of methods for generating synthetic data described in the Nemotron-4 340B Technical Report (https://arxiv.org/abs/2406.11704v1) and inspired by the UltraChat paper (https://arxiv.org/abs/2305.14233)

async classify_math_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Math concepts taught at elementary school, middle school, high school, and univiersity.\n- Important mathematics axioms, theorems, algorithms, equations, or inequalities.\n- Representative math problems, functions, and applications.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to classify if an entity is related to math :param entity: The entity to classify :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - entity: Will be populated with the entity passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async classify_python_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Programming concepts like loops, functions, and data structures in python.\n- Important functions, objects, or libraries in python.\n- Mathematical concepts like linear algebra which can be implemented in python.\n- Basic algorithms or problems in computer science likes Greedy Search and Dynamics programming which can be addressed in python.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to classify if an entity is related to Python :param entity: The entity to classify :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - entity: Will be populated with the entity passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async convert_response_to_yaml_list( llm_response: str, model: str, prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Converts a response of an LLM to a list of strings by querying an LLM :param llm_response: The original unformatted response of the LLM :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have a {llm_response} parameter that will be populated with the llm_response value passed in this function.
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A parsed list of elements from the original LLM response

async generate_closed_qa_instructions( document: str, n_openlines: str | int, model: str, prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of closed Q&A questions based on a reference document :param document: The document to use when generating questions :param n_openlines: The number of questions to generate per document. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - document: Will be populated with the document passed in this function - n_openlines: Will be populated with the n_openlines passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_dialogue( openline: str, user_model: str, assistant_model: str, n_user_turns: int = 3, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict | None = None, user_model_kwargs: dict | None = None, assistant_model_kwargs: dict | None = None, ) → list[dict]#

Prompts an LLM to generate a dialogue based on a given openline. The LLM will alternate impersonating the user and the assistant. :param openline: The openline that will comprise the first user turn. :param user_model: The model that will be impersonating the user.

Parameters:

assistant_model – The model that will be impersonating the assistant Must be available in the LLMClient passed in the constructor.
n_user_turns – The number of user turns to go through. The openline counts as 1 user turn. Therefore, if there are 3 user turns, 2 will be generated by the LLM impersonating the user.
prompt_template – A format string of the prompt to use when impersonating the user. It must have the following parameters: - converstation_history: Will be populated with a formatted history of the dialogue up to that point. Some example templates found in nemo_curator.synthetic include: - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
user_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.
assistant_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:

A conversation between a User and Assistant

async generate_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of macro topics about the world :param n_macro_topics: The number of macro topics to generate. :param model: The name of the model that should be used to generate the macro topics.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_math_macro_topics( n_macro_topics: int | str, school_level: str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of macro topics about math :param n_macro_topics: The number of macro topics to generate. Can be an integer like 5 or a string like “five”. :param school_level: The school level the math questions should be targeted at. :param model: The name of the model that should be used to generate the macro topics.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function - school_level: Will be populated with the school_level passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_math_problem( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of math problems based on a topic :param topic: The topic to generate problems for. :param n_openlines: The number of problems to generate per topic. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_subtopics passed in this function - topic: Will be populated with the topic passed in this function Some example templates found in nemo_curator.synthetic include: - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_math_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a math macro topic :param macro_topic: The macro topic to generate subtopics for. :param n_subtopics: The number of subtopics to generate per macro topic :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_open_qa_from_topic( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of open Q&A questions based on a topic :param topic: The topic to generate questions for. :param n_openlines: The number of questions to generate per topic. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_subtopics passed in this function - topic: Will be populated with the topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_python_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = 'List {n_macro_topics} important concepts in the python language.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of macro topics about the Python programming language :param n_macro_topics: The number of macro topics to generate. Can be an integer like 5 or a string like “five”. :param model: The name of the model that should be used to generate the macro topics.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_python_problem( topic: str, n_openlines: str | int, model: str, language: str = 'Python', prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of coding problems based on a topic :param topic: The topic to generate problems for. :param n_openlines: The number of problems to generate per topic. :param model: The name of the model that should be used to generate the response.

Parameters:

language – The programming language to target when generating these questions.
prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_subtopics passed in this function - topic: Will be populated with the topic passed in this function - language: Will be populated with the language passed in this function Some example templates found in nemo_curator.synthetic include: - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_python_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a Python macro topic :param macro_topic: The macro topic to generate subtopics for. :param n_subtopics: The number of subtopics to generate per macro topic :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a macro topic :param macro_topic: The macro topic to generate subtopics for. :param n_subtopics: The number of subtopics to generate per macro topic :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_two_turn_prompt( openline: str, user_model: str, assistant_model: str, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict | None = None, user_model_kwargs: dict | None = None, assistant_model_kwargs: dict | None = None, ) → list[dict]#

Prompts an LLM to generate a response as an assistant, then as the user based on a given openline. The conversation will look like “User -> Assistant -> User” :param openline: The openline that will comprise the first user turn. :param user_model: The model that will be impersonating the user.

Parameters:

assistant_model – The model that will be impersonating the assistant Must be available in the LLMClient passed in the constructor.
prompt_template – A format string of the prompt to use when impersonating the user. It must have the following parameters: - converstation_history: Will be populated with a formatted history of the dialogue up to that point. Some example templates found in nemo_curator.synthetic include: - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
user_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.
assistant_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:

A conversation between a User and Assistant

async generate_writing_tasks( topic: str, text_material_type: str, n_openlines: str | int, model: str, prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of writing tasks based on a topic and document type :param topic: The topic to generate writing tasks for. :param text_material_type: The type of the document the question should ask to generate (e.g., “Email”, “Poem”) :param n_openlines: The number of tasks to generate per topic and text material pair. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - topic: Will be populated with the topic passed in this function - text_material_type: Will be populated with the text_material_type passed in this function - n_openlines: Will be populated with the n_openlines passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async revise_open_qa( openline: str, n_revisions: str | int, model: str, prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to revise an open Q&A question a given number of times :param openline: An openline to revise :param n_revisions: The number of revisions to generate for the question. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - openline: Will be populated with the openline passed in this function - n_revisions: Will be populated with the n_revisions passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async revise_writing_tasks( openline: str, n_revisions: str | int, model: str, prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to revise a writing task a given number of times :param openline: An openline to revise :param n_revisions: The number of revisions to generate for the task. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - openline: Will be populated with the openline passed in this function - n_revisions: Will be populated with the n_revisions passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async run_closed_qa_pipeline( documents: list[str], n_openlines: str | int, model: str, closed_qa_prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, ignore_conversion_failure: bool = False, trim_topics_list: bool = True, ) → list[tuple[int, str]]#

Runs a pipeline for automatically generating closed Q&A openlines for a dialogue :param documents: A list of documents to generate closed Q&A questions for :param n_openlines: The number of questions to generate per document. :param model: The name of the model that should be used to generate all the responses.

Parameters:

closed_qa_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - document: Will be populated with one element of the documents list passed in this function No additional parameters may be passed to this prompt template.
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
trim_topics_list – If True, trims the list of macro topics and subtopics to the desired number.

Returns:

A list of pairs where the first element represents the index of the document used to generate the question in the documents list and the second element represents a synthetically generated closed Q&A prompt. Example: [(0, “Summarize this document”), …]

async run_math_pipeline( n_macro_topics: str | int, school_level: str, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.', subtopic_prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.', math_problem_prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, additional_macro_topics: list[str] | None = None, additional_subtopics: list[str] | None = None, ignore_conversion_failure: bool = False, trim_topics_list: bool = True, combine_topics: bool = True, ) → list[str]#

Runs a pipeline for automatically generating math questions for a dialogue :param n_macro_topics: The number of macro topics to generate. :param school_level: The school level to target when generating macro topics. :param n_subtopics: The number of subtopics to generate per macro topic. :param n_openlines: The number of questions to generate per topic. :param model: The name of the model that should be used to generate all the responses.

Parameters:

macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function - school_level: Will be populated with the school_level passed in this function No additional parameters may be passed to this prompt template.
subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.
math_problem_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - topic: Will be populated with a generated topic No additional parameters may be passed to this prompt template. Some example templates found in nemo_curator.synthetic include: - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
trim_topics_list – If True, trims the list of macro topics and subtopics to the desired number.
combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated math prompts

async run_open_qa_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, n_revisions: str | int, model: str, macro_topic_prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n', subtopic_prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.', open_qa_from_topics_prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', revise_open_qa_prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, additional_macro_topics: list[str] | None = None, additional_subtopics: list[str] | None = None, ignore_conversion_failure: bool = False, trim_topics_list: bool = True, combine_topics: bool = True, ) → list[str]#

Runs a pipeline for automatically generating Open Q&A openlines for a dialogue :param n_macro_topics: The number of macro topics to generate :param n_subtopics: The number of subtopics to generate per macro topic :param n_openlines: The number of questions to generate per topic. :param n_revisions: The number of revisions to generate per original question. :param model: The name of the model that should be used to generate all the responses.

Parameters:

macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function No additional parameters may be passed to this prompt template.
subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.
open_qa_from_topics_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - topic: Will be populated with a generated topic No additional parameters may be passed to this prompt template.
revise_open_qa_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_revisions: Will be populated with the n_revisions passed in this function - openline: Will be populated with a generated open Q&A openline No additional parameters may be passed to this prompt template.
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
trim_topics_list – If True, trims the list of macro topics and subtopics to the desired number.
combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated open Q&A prompts

async run_python_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = 'List {n_macro_topics} important concepts in the python language.', subtopic_prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.', python_problem_prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, additional_macro_topics: list[str] | None = None, additional_subtopics: list[str] | None = None, ignore_conversion_failure: bool = False, trim_topics_list: bool = True, combine_topics: bool = True, ) → list[str]#

Runs a pipeline for automatically generating Python questions for a dialogue :param n_macro_topics: The number of macro topics to generate. :param n_subtopics: The number of subtopics to generate per macro topic. :param n_openlines: The number of questions to generate per topic. :param model: The name of the model that should be used to generate all the responses.

Parameters:

macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function No additional parameters may be passed to this prompt template.
subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.
python_problem_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - language: Will be populated with “Python” - topic: Will be populated with a generated topic No additional parameters may be passed to this prompt template. Some example templates found in nemo_curator.synthetic include: - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
trim_topics_list – If True, trims the list of macro topics and subtopics to the desired number.
combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated Python prompts

async run_writing_pipeline( topics: list[str], text_material_types: list[str], n_openlines: str | int, n_revisions: str | int, model: str, writing_task_prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.', revise_writing_task_prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, ignore_conversion_failure: bool = False, trim_topics_list: bool = True, ) → list[str]#

Runs a pipeline for automatically generating writing task openlines for a dialogue :param topics: A list of topics to generate tasks for :param text_material_types: A list of writing material types, like “Essay” or “Blog post” :param n_openlines: The number of tasks to generate per (topic, text_material_type) pair. :param n_revisions: The number of revisions to generate per original task. :param model: The name of the model that should be used to generate all the responses.

Parameters:

writing_task_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - topic: Will be populated with one element of the topics list passed in this function - text_material_type: Will be populated with one element of the text_material_types list passed in this function No additional parameters may be passed to this prompt template.
revise_writing_task_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_revisions: Will be populated with the n_revisions passed in this function - openline: Will be populated with one of the writing tasks generated in the pipeline. No additional parameters may be passed to this prompt template.
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
trim_topics_list – If True, trims the list of macro topics and subtopics to the desired number.

Returns:

A list of synthetically generated writing task prompts

class nemo_curator.synthetic.NemotronCCGenerator( llm_client: LLMClient, )#

Provides a collection of methods for generating synthetic data described in the Nemotron-CC paper (https://arxiv.org/abs/2412.02595).

distill( document: str, model: str, prompt_template: str = 'Your task is to read and paraphrase the provided text following these instructions:\n- Aim to create a condensed but accurate and informative version of the original text, not a simplistic summary.\n- Capture and preserve the crucial information, key concepts, important values, factual details in the original text, while making it more readable and accessible.\n- Retain technical terms, specialized vocabulary, and complex concepts.\n- Retain examples, explanations of reasoning processes, and supporting evidence to maintain the text\'s depth and context.\n- Only include information that is present in the original text. Do not adding new or unsubstantiated claims.\n- Write the text in plain text without formatting.\n\nHere is the text:\n{document}\n\nTask:\nAfter thoroughly reading the above text, paraphrase it in high-quality and clear English following the instructions. Begin your response with "Paraphrased Text:".', system_prompt: str = 'You are an artificial intelligence assistant. You carefully provide accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Distills the essential content from a document.

Parameters:

document (str) – The input document text to distill.
model (str) – The model identifier to use.
prompt_template (str, optional) – The prompt template for distillation. Defaults to DISTILL_PROMPT_TEMPLATE.
system_prompt (str, optional) – The system prompt to use. Defaults to NEMOTRON_CC_DISTILL_SYSTEM_PROMPT.
prompt_kwargs (dict, optional) – Additional keyword arguments for the prompt. Defaults to {}.
model_kwargs (dict, optional) – Additional keyword arguments for the model invocation. Defaults to {}.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

Return type:

List[str]

extract_knowledge( document: str, model: str, prompt_template: str = "Your task is to rewrite knowledge from the provided text following these instructions.\n- Rewrite the text as a passage or passages using easy-to-understand and high-quality English like sentences in textbooks and Wikipedia.\n- Focus on content in disciplines such as humanities, social sciences, natural sciences, technology, engineering, math, law and legal, business, management, art, education, agricultural sciences, politics, and history.\n- Disregard content that does not contain useful facts or knowledge.\n- Retain examples, explanations of reasoning processes, and supporting evidence to maintain the text's depth and context.\n- Do not add or alter details. Only restate what is already in the text.\n- Write in plain text.\n- Do not add titles, subtitles, note, or comment.\n\nText:\n{document}\n\nTask:\nRewrite facts and knowledge from the above text as a passage or passages following the instructions.", system_prompt: str = 'A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the questions.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Extracts knowledge from the provided document.

Parameters:

document (str) – The input document text from which to extract knowledge.
model (str) – The model identifier to use.
prompt_template (str, optional) – The prompt template for knowledge extraction. Defaults to EXTRACT_KNOWLEDGE_PROMPT_TEMPLATE.
system_prompt (str, optional) – The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
prompt_kwargs (dict, optional) – Additional keyword arguments for the prompt. Defaults to {}.
model_kwargs (dict, optional) – Additional keyword arguments for the model invocation. Defaults to {}.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

Return type:

List[str]

generate_diverse_qa( document: str, model: str, prompt_template: str = 'Task:\nRead the text, ask questions and answer them.\n\nFollow these instructions:\n1. Ask diverse questions that require different cognitive skills or cover different aspects of the text.\n2. Ask questions in various forms such as:\n - Yes/No questions that require determining whether a statement is true or false.\n - Open-ended questions that begin with words like what, how, when, where, why and who.\n - Multi-choice questions that offers two or more options to choose from. Include the options in the question.\n - Comparison questions that compare two quantities or objects and determine the relationship between them.\n - Reading comprehension questions that test the ability to understand and analyze the text.\n - Problem-solving questions that test the ability to solve mathematical, physical, or logical problems.\n3. Focus on asking questions about factual information, important knowledge, or concrete details in the text.\n4. Write questions and answers using clear and concise language.\n5. Use plain text. Do not use Markdown.\n6. Each question and answer pair should be on a separate line. Tag the question with "Question:" and the answer with "Answer:".\n\nText:\n{document}\n\nTask:\nAfter reading the above text, ask up to 8 questions and provide the correct answers following the instructions. Give your response in this format:\n\nHere are the questions and answers based on the provided text:\n- Question: [first question] Answer: [first answer]\n- Question: [second question] Answer: [second answer]\n....', system_prompt: str = 'A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the questions.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Generates diverse QA pairs from the provided document.

Parameters:

document (str) – The input document text used to generate QA pairs.
model (str) – The model identifier to use.
prompt_template (str, optional) – The prompt template for generating QA pairs. Defaults to DIVERSE_QA_PROMPT_TEMPLATE.
system_prompt (str, optional) – The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
prompt_kwargs (dict, optional) – Additional keyword arguments for the prompt. Defaults to {}.
model_kwargs (dict, optional) – Additional keyword arguments for the model invocation. Defaults to {}.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

Return type:

List[str]

generate_knowledge_list( document: str, model: str, prompt_template: str = 'Review the text and extract the key information. Follow these instructions:\n- Carefully read the above text and provide a concise and organized list of factual information, concrete details, key concepts, and important numbers and statistics extracted from the text.\n- Ensure each point is clear, specific, and supported by the original text.\n- Ensure the extract text is information-dense and easier to learn from.\n- Do not add titles or headings.\n\nText:\n{document}\n\nTask:\nExtract the factual information, concrete details, and key concepts from the above text following the instructions.', system_prompt: str = 'A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the questions.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Generates a list of knowledge items from the provided document.

Parameters:

document (str) – The input document text to process.
model (str) – The model identifier to use.
prompt_template (str, optional) – The prompt template for generating a knowledge list. Defaults to KNOWLEDGE_LIST_PROMPT_TEMPLATE.
system_prompt (str, optional) – The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
prompt_kwargs (dict, optional) – Additional keyword arguments for the prompt. Defaults to {}.
model_kwargs (dict, optional) – Additional keyword arguments for the model invocation. Defaults to {}.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

Return type:

List[str]

rewrite_to_wikipedia_style( document: str, model: str, prompt_template: str = 'For the following paragraph give me a diverse paraphrase of the same in high quality English language as in sentences on Wikipedia. Begin your answer on a separate line with "Here is a paraphrased version:".\n\nText: {document}', system_prompt: str = 'A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the questions.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Rewrites a document into a Wikipedia-style narrative.

Parameters:

document (str) – The input document text to rewrite.
model (str) – The model identifier to use.
prompt_template (str, optional) – The prompt template for rewriting. Defaults to WIKIPEDIA_REPHRASING_PROMPT_TEMPLATE.
system_prompt (str, optional) – The system prompt to use. Defaults to NEMOTRON_CC_SYSTEM_PROMPT.
prompt_kwargs (dict, optional) – Additional keyword arguments for the prompt. Defaults to {}.
model_kwargs (dict, optional) – Additional keyword arguments for the model invocation. Defaults to {}.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

Return type:

List[str]

class nemo_curator.synthetic.NemotronCCDiverseQAPostprocessor( tokenizer: transformers.AutoTokenizer | None = None, text_field: str = 'text', response_field: str = 'response', max_num_pairs: int = 1, prefix: str = 'Here are the questions and answers based on the provided text:', )#

Postprocesses the output of the Nemotron-CC Diverse QA generation pipeline. This postprocessor will sample a random number of QA pairs up to max_num_pairs. If a tokenizer is provided, the number of QA pairs will be sampled from at least 1 and at most floor(max_num_pairs * num_tokens / 150). Otherwise, the number of QA pairs will be sampled randomly strictly up to max_num_pairs.

The generated QA pairs are shuffled and then appended to the original text.

call( dataset: DocumentDataset, ) → DocumentDataset#

Performs an arbitrary operation on a dataset

Parameters:: dataset (DocumentDataset) – The dataset to operate on

class nemo_curator.synthetic.NemotronCCKnowledgeListPostprocessor(text_field: str = 'text')#

Processes and cleans the output generated by the Nemotron-CC Knowledge List pipeline.

This class is responsible for postprocessing raw text responses produced by the Nemotron-CC Knowledge List generation pipeline. It removes formatting artifacts such as bullet point prefixes (”- “) and extra indentation from each line, ensuring that the final output is a clean, uniformly formatted list of knowledge items. The processing includes skipping any initial non-bullet line and merging related lines to reconstruct multi-line questions or answers.

call( dataset: DocumentDataset, ) → DocumentDataset#

Performs an arbitrary operation on a dataset

Parameters:: dataset (DocumentDataset) – The dataset to operate on

class nemo_curator.synthetic.AsyncNemotronGenerator( llm_client: AsyncLLMClient, logger: LoggerAdapter | str = './', max_concurrent_requests: int | None = None, )#

Provides a collection of methods for generating synthetic data described in the Nemotron-4 340B Technical Report (https://arxiv.org/abs/2406.11704v1) and inspired by the UltraChat paper (https://arxiv.org/abs/2305.14233)

async classify_math_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Math concepts taught at elementary school, middle school, high school, and univiersity.\n- Important mathematics axioms, theorems, algorithms, equations, or inequalities.\n- Representative math problems, functions, and applications.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to classify if an entity is related to math :param entity: The entity to classify :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - entity: Will be populated with the entity passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async classify_python_entity( entity: str, model: str, prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Programming concepts like loops, functions, and data structures in python.\n- Important functions, objects, or libraries in python.\n- Mathematical concepts like linear algebra which can be implemented in python.\n- Basic algorithms or problems in computer science likes Greedy Search and Dynamics programming which can be addressed in python.\n\nYour answer should start with "Yes" or "No".', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to classify if an entity is related to Python :param entity: The entity to classify :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - entity: Will be populated with the entity passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async convert_response_to_yaml_list( llm_response: str, model: str, prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Converts a response of an LLM to a list of strings by querying an LLM :param llm_response: The original unformatted response of the LLM :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have a {llm_response} parameter that will be populated with the llm_response value passed in this function.
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A parsed list of elements from the original LLM response

async generate_closed_qa_instructions( document: str, n_openlines: str | int, model: str, prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of closed Q&A questions based on a reference document :param document: The document to use when generating questions :param n_openlines: The number of questions to generate per document. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - document: Will be populated with the document passed in this function - n_openlines: Will be populated with the n_openlines passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_dialogue( openline: str, user_model: str, assistant_model: str, n_user_turns: int = 3, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict | None = None, user_model_kwargs: dict | None = None, assistant_model_kwargs: dict | None = None, ) → list[dict]#

Prompts an LLM to generate a dialogue based on a given openline. The LLM will alternate impersonating the user and the assistant. :param openline: The openline that will comprise the first user turn. :param user_model: The model that will be impersonating the user.

Parameters:

assistant_model – The model that will be impersonating the assistant Must be available in the LLMClient passed in the constructor.
n_user_turns – The number of user turns to go through. The openline counts as 1 user turn. Therefore, if there are 3 user turns, 2 will be generated by the LLM impersonating the user.
prompt_template – A format string of the prompt to use when impersonating the user. It must have the following parameters: - converstation_history: Will be populated with a formatted history of the dialogue up to that point. Some example templates found in nemo_curator.synthetic include: - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
user_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.
assistant_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:

A conversation between a User and Assistant

async generate_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of macro topics about the world :param n_macro_topics: The number of macro topics to generate. :param model: The name of the model that should be used to generate the macro topics.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_math_macro_topics( n_macro_topics: int | str, school_level: str, model: str, prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of macro topics about math :param n_macro_topics: The number of macro topics to generate. Can be an integer like 5 or a string like “five”. :param school_level: The school level the math questions should be targeted at. :param model: The name of the model that should be used to generate the macro topics.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function - school_level: Will be populated with the school_level passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_math_problem( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of math problems based on a topic :param topic: The topic to generate problems for. :param n_openlines: The number of problems to generate per topic. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_subtopics passed in this function - topic: Will be populated with the topic passed in this function Some example templates found in nemo_curator.synthetic include: - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_math_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a math macro topic :param macro_topic: The macro topic to generate subtopics for. :param n_subtopics: The number of subtopics to generate per macro topic :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_open_qa_from_topic( topic: str, n_openlines: str | int, model: str, prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of open Q&A questions based on a topic :param topic: The topic to generate questions for. :param n_openlines: The number of questions to generate per topic. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_subtopics passed in this function - topic: Will be populated with the topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_python_macro_topics( n_macro_topics: int | str, model: str, prompt_template: str = 'List {n_macro_topics} important concepts in the python language.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of macro topics about the Python programming language :param n_macro_topics: The number of macro topics to generate. Can be an integer like 5 or a string like “five”. :param model: The name of the model that should be used to generate the macro topics.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_python_problem( topic: str, n_openlines: str | int, model: str, language: str = 'Python', prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of coding problems based on a topic :param topic: The topic to generate problems for. :param n_openlines: The number of problems to generate per topic. :param model: The name of the model that should be used to generate the response.

Parameters:

language – The programming language to target when generating these questions.
prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_subtopics passed in this function - topic: Will be populated with the topic passed in this function - language: Will be populated with the language passed in this function Some example templates found in nemo_curator.synthetic include: - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_python_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a Python macro topic :param macro_topic: The macro topic to generate subtopics for. :param n_subtopics: The number of subtopics to generate per macro topic :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_subtopics( macro_topic: str, n_subtopics: int | str, model: str, prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a macro topic :param macro_topic: The macro topic to generate subtopics for. :param n_subtopics: The number of subtopics to generate per macro topic :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_two_turn_prompt( openline: str, user_model: str, assistant_model: str, prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.", prompt_kwargs: dict | None = None, user_model_kwargs: dict | None = None, assistant_model_kwargs: dict | None = None, ) → list[dict]#

Prompts an LLM to generate a response as an assistant, then as the user based on a given openline. The conversation will look like “User -> Assistant -> User” :param openline: The openline that will comprise the first user turn. :param user_model: The model that will be impersonating the user.

Parameters:

assistant_model – The model that will be impersonating the assistant Must be available in the LLMClient passed in the constructor.
prompt_template – A format string of the prompt to use when impersonating the user. It must have the following parameters: - converstation_history: Will be populated with a formatted history of the dialogue up to that point. Some example templates found in nemo_curator.synthetic include: - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
user_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.
assistant_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:

A conversation between a User and Assistant

async generate_writing_tasks( topic: str, text_material_type: str, n_openlines: str | int, model: str, prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to generate a list of writing tasks based on a topic and document type :param topic: The topic to generate writing tasks for. :param text_material_type: The type of the document the question should ask to generate (e.g., “Email”, “Poem”) :param n_openlines: The number of tasks to generate per topic and text material pair. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - topic: Will be populated with the topic passed in this function - text_material_type: Will be populated with the text_material_type passed in this function - n_openlines: Will be populated with the n_openlines passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async revise_open_qa( openline: str, n_revisions: str | int, model: str, prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to revise an open Q&A question a given number of times :param openline: An openline to revise :param n_revisions: The number of revisions to generate for the question. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - openline: Will be populated with the openline passed in this function - n_revisions: Will be populated with the n_revisions passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async revise_writing_tasks( openline: str, n_revisions: str | int, model: str, prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.', prompt_kwargs: dict | None = None, model_kwargs: dict | None = None, ) → list[str]#

Prompts an LLM to revise a writing task a given number of times :param openline: An openline to revise :param n_revisions: The number of revisions to generate for the task. :param model: The name of the model that should be used to generate the response.

Parameters:

prompt_template – A format string of the prompt to use. It must have the following parameters: - openline: Will be populated with the openline passed in this function - n_revisions: Will be populated with the n_revisions passed in this function
prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async run_closed_qa_pipeline( documents: list[str], n_openlines: str | int, model: str, closed_qa_prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, ignore_conversion_failure: bool = False, trim_topics_list: bool = True, ) → list[tuple[int, str]]#

Runs a pipeline for automatically generating closed Q&A openlines for a dialogue :param documents: A list of documents to generate closed Q&A questions for :param n_openlines: The number of questions to generate per document. :param model: The name of the model that should be used to generate all the responses.

Parameters:

closed_qa_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - document: Will be populated with one element of the documents list passed in this function No additional parameters may be passed to this prompt template.
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
trim_topics_list – If True, trims the list of macro topics and subtopics to the desired number.

Returns:

A list of pairs where the first element represents the index of the document used to generate the question in the documents list and the second element represents a synthetically generated closed Q&A prompt. Example: [(0, “Summarize this document”), …]

async run_math_pipeline( n_macro_topics: str | int, school_level: str, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.', subtopic_prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.', math_problem_prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, additional_macro_topics: list[str] | None = None, additional_subtopics: list[str] | None = None, ignore_conversion_failure: bool = False, trim_topics_list: bool = True, combine_topics: bool = True, ) → list[str]#

Runs a pipeline for automatically generating math questions for a dialogue :param n_macro_topics: The number of macro topics to generate. :param school_level: The school level to target when generating macro topics. :param n_subtopics: The number of subtopics to generate per macro topic. :param n_openlines: The number of questions to generate per topic. :param model: The name of the model that should be used to generate all the responses.

Parameters:

macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function - school_level: Will be populated with the school_level passed in this function No additional parameters may be passed to this prompt template.
subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.
math_problem_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - topic: Will be populated with a generated topic No additional parameters may be passed to this prompt template. Some example templates found in nemo_curator.synthetic include: - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
trim_topics_list – If True, trims the list of macro topics and subtopics to the desired number.
combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated math prompts

async run_open_qa_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, n_revisions: str | int, model: str, macro_topic_prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n', subtopic_prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.', open_qa_from_topics_prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.', revise_open_qa_prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, additional_macro_topics: list[str] | None = None, additional_subtopics: list[str] | None = None, ignore_conversion_failure: bool = False, trim_topics_list: bool = True, combine_topics: bool = True, ) → list[str]#

Runs a pipeline for automatically generating Open Q&A openlines for a dialogue :param n_macro_topics: The number of macro topics to generate :param n_subtopics: The number of subtopics to generate per macro topic :param n_openlines: The number of questions to generate per topic. :param n_revisions: The number of revisions to generate per original question. :param model: The name of the model that should be used to generate all the responses.

Parameters:

macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function No additional parameters may be passed to this prompt template.
subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.
open_qa_from_topics_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - topic: Will be populated with a generated topic No additional parameters may be passed to this prompt template.
revise_open_qa_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_revisions: Will be populated with the n_revisions passed in this function - openline: Will be populated with a generated open Q&A openline No additional parameters may be passed to this prompt template.
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
trim_topics_list – If True, trims the list of macro topics and subtopics to the desired number.
combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated open Q&A prompts

async run_python_pipeline( n_macro_topics: str | int, n_subtopics: str | int, n_openlines: str | int, model: str, macro_topic_prompt_template: str = 'List {n_macro_topics} important concepts in the python language.', subtopic_prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.', python_problem_prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, additional_macro_topics: list[str] | None = None, additional_subtopics: list[str] | None = None, ignore_conversion_failure: bool = False, trim_topics_list: bool = True, combine_topics: bool = True, ) → list[str]#

Runs a pipeline for automatically generating Python questions for a dialogue :param n_macro_topics: The number of macro topics to generate. :param n_subtopics: The number of subtopics to generate per macro topic. :param n_openlines: The number of questions to generate per topic. :param model: The name of the model that should be used to generate all the responses.

Parameters:

macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function No additional parameters may be passed to this prompt template.
subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.
python_problem_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - language: Will be populated with “Python” - topic: Will be populated with a generated topic No additional parameters may be passed to this prompt template. Some example templates found in nemo_curator.synthetic include: - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
trim_topics_list – If True, trims the list of macro topics and subtopics to the desired number.
combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated Python prompts

async run_writing_pipeline( topics: list[str], text_material_types: list[str], n_openlines: str | int, n_revisions: str | int, model: str, writing_task_prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.', revise_writing_task_prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.', yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}', base_model_kwargs: dict | None = None, conversion_model_kwargs: dict | None = None, ignore_conversion_failure: bool = False, trim_topics_list: bool = True, ) → list[str]#

Runs a pipeline for automatically generating writing task openlines for a dialogue :param topics: A list of topics to generate tasks for :param text_material_types: A list of writing material types, like “Essay” or “Blog post” :param n_openlines: The number of tasks to generate per (topic, text_material_type) pair. :param n_revisions: The number of revisions to generate per original task. :param model: The name of the model that should be used to generate all the responses.

Parameters:

writing_task_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - topic: Will be populated with one element of the topics list passed in this function - text_material_type: Will be populated with one element of the text_material_types list passed in this function No additional parameters may be passed to this prompt template.
revise_writing_task_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_revisions: Will be populated with the n_revisions passed in this function - openline: Will be populated with one of the writing tasks generated in the pipeline. No additional parameters may be passed to this prompt template.
yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.
base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.
ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on
trim_topics_list – If True, trims the list of macro topics and subtopics to the desired number.

Returns:

A list of synthetically generated writing task prompts

class nemo_curator.synthetic.NemotronFormatter#

static format_conversation(conv: list[dict]) → str#

Formats a converstation between a user and assistant in the Nemotron 340B format described here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-4-340b-instruct :param conv: A conversation between a user and assistant

Returns:: A conversation formatted as text

class nemo_curator.synthetic.Mixtral8x7BFormatter#

static format_conversation(conv: list[dict]) → str#

Formats a converstation between a user and assistant in the Mixtral-8x7B format described here: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1 :param conv: A conversation between a user and assistant

Returns:: A conversation formatted as text

class nemo_curator.synthetic.NoFormat#

class nemo_curator.synthetic.YamlConversionError(message: str)#