Synthetic Data#

class nemo_curator.synthetic.NemotronGenerator(
llm_client: LLMClient,
)#

Provides a collection of methods for generating synthetic data, as described in the Nemotron-4 340B Technical Report (https://arxiv.org/abs/2406.11704v1) and inspired by the UltraChat paper (https://arxiv.org/abs/2305.14233).

classify_math_entity(
entity: str,
model: str,
prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Math concepts taught at elementary school, middle school, high school, and univiersity.\n- Important mathematics axioms, theorems, algorithms, equations, or inequalities.\n- Representative math problems, functions, and applications.\n\nYour answer should start with "Yes" or "No".',
prompt_kwargs: dict = {},
model_kwargs={},
) List[str]#

Prompts an LLM to classify if an entity is related to math.

Parameters:
  • entity – The entity to classify.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - entity: Will be populated with the entity passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
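
A minimal usage sketch in Python. The OpenAIClient wrapper, the NVIDIA endpoint URL, and the model name below are illustrative assumptions; substitute whichever LLMClient implementation and model your deployment provides.

    from openai import OpenAI

    from nemo_curator import OpenAIClient
    from nemo_curator.synthetic import NemotronGenerator

    # Wrap an OpenAI-compatible endpoint in NeMo Curator's LLMClient interface.
    openai_client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint
        api_key="<YOUR API KEY>",
    )
    client = OpenAIClient(openai_client)
    generator = NemotronGenerator(client)

    # The default prompt asks for an answer starting with "Yes" or "No".
    responses = generator.classify_math_entity(
        entity="Pythagorean theorem",
        model="nvidia/nemotron-4-340b-instruct",  # placeholder model name
    )
    print(responses[0])

The sketches for the remaining methods below reuse this generator and the same placeholder model name.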

classify_python_entity(
entity: str,
model: str,
prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Programming concepts like loops, functions, and data structures in python.\n- Important functions, objects, or libraries in python.\n- Mathematical concepts like linear algebra which can be implemented in python.\n- Basic algorithms or problems in computer science likes Greedy Search and Dynamics programming which can be addressed in python.\n\nYour answer should start with "Yes" or "No".',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to classify if an entity is related to Python.

Parameters:
  • entity – The entity to classify.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - entity: Will be populated with the entity passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

convert_response_to_yaml_list(
llm_response: str,
model: str,
prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Converts a response of an LLM to a list of strings by querying an LLM.

Parameters:
  • llm_response – The original unformatted response of the LLM.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have a {llm_response} parameter that will be populated with the llm_response value passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A parsed list of elements from the original LLM response
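
A short sketch, reusing the generator from the first example. The raw response shown is illustrative; in practice it would come from another generation method.

    raw_response = "1. Food and drinks.\n2. Technology.\n3. Sports."
    topics = generator.convert_response_to_yaml_list(
        llm_response=raw_response,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder model name
    )
    # topics should be something like ["Food and drinks.", "Technology.", "Sports."]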

generate_closed_qa_instructions(
document: str,
n_openlines: str | int,
model: str,
prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of closed Q&A questions based on a reference document.

Parameters:
  • document – The document to use when generating questions.

  • n_openlines – The number of questions to generate per document.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - document: Will be populated with the document passed in this function.
    - n_openlines: Will be populated with the n_openlines passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_dialogue(
openline: str,
user_model: str,
assistant_model: str,
n_user_turns: int = 3,
prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.",
prompt_kwargs: dict = {},
user_model_kwargs: dict = {},
assistant_model_kwargs: dict = {},
) List[dict]#

Prompts an LLM to generate a dialogue based on a given openline. The LLM will alternate between impersonating the user and the assistant.

Parameters:
  • openline – The openline that will comprise the first user turn.

  • user_model – The model that will be impersonating the user. Must be available in the LLMClient passed in the constructor.

  • assistant_model – The model that will be impersonating the assistant. Must be available in the LLMClient passed in the constructor.

  • n_user_turns – The number of user turns to go through. The openline counts as 1 user turn. Therefore, if there are 3 user turns, 2 will be generated by the LLM impersonating the user.

  • prompt_template – A format string of the prompt to use when impersonating the user. It must have the following parameters:
    - conversation_history: Will be populated with a formatted history of the dialogue up to that point.
    Some example templates found in nemo_curator.synthetic include:
    - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE
    - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE
    - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • user_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.

  • assistant_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:

A conversation between a User and Assistant
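
A dialogue-synthesis sketch under the same assumptions, using one of the bundled user-turn templates named above:

    from nemo_curator.synthetic import DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE

    dialogue = generator.generate_dialogue(
        openline="Write a poem about the moon.",
        user_model="nvidia/nemotron-4-340b-instruct",       # placeholder
        assistant_model="nvidia/nemotron-4-340b-instruct",  # placeholder
        n_user_turns=3,  # the openline plus 2 generated user turns
        prompt_template=DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE,
    )
    # dialogue is a list of dicts, one per turn of the conversation.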

generate_macro_topics(
n_macro_topics: int | str,
model: str,
prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of macro topics about the world.

Parameters:
  • n_macro_topics – The number of macro topics to generate.

  • model – The name of the model that should be used to generate the macro topics. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_macro_topics: Will be populated with the n_macro_topics passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
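
A sketch of the typical two-step pattern: generate topics, then parse the free-form response into a list.

    macro_response = generator.generate_macro_topics(
        n_macro_topics=20,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
    )
    macro_topics = generator.convert_response_to_yaml_list(
        llm_response=macro_response[0],
        model="nvidia/nemotron-4-340b-instruct",
    )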

generate_math_macro_topics(
n_macro_topics: int | str,
school_level: str,
model: str,
prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of macro topics about math.

Parameters:
  • n_macro_topics – The number of macro topics to generate. Can be an integer like 5 or a string like "five".

  • school_level – The school level the math questions should be targeted at.

  • model – The name of the model that should be used to generate the macro topics. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_macro_topics: Will be populated with the n_macro_topics passed in this function.
    - school_level: Will be populated with the school_level passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_math_problem(
topic: str,
n_openlines: str | int,
model: str,
prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of math problems based on a topic.

Parameters:
  • topic – The topic to generate problems for.

  • n_openlines – The number of problems to generate per topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_openlines: Will be populated with the n_openlines passed in this function.
    - topic: Will be populated with the topic passed in this function.
    Some example templates found in nemo_curator.synthetic include:
    - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE
    - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
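
A sketch using one of the bundled math templates, same assumptions as before:

    from nemo_curator.synthetic import MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE

    problem_response = generator.generate_math_problem(
        topic="Fractions",
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
        prompt_template=MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE,
    )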

generate_math_subtopics(
macro_topic: str,
n_subtopics: int | str,
model: str,
prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of subtopics relating to a math macro topic.

Parameters:
  • macro_topic – The macro topic to generate subtopics for.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_subtopics: Will be populated with the n_subtopics passed in this function.
    - macro_topic: Will be populated with the macro_topic passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_open_qa_from_topic(
topic: str,
n_openlines: str | int,
model: str,
prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of open Q&A questions based on a topic.

Parameters:
  • topic – The topic to generate questions for.

  • n_openlines – The number of questions to generate per topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_openlines: Will be populated with the n_openlines passed in this function.
    - topic: Will be populated with the topic passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
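
A sketch, assuming a topic produced by one of the topic-generation methods above:

    qa_response = generator.generate_open_qa_from_topic(
        topic="Cooking",
        n_openlines=10,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
    )
    questions = generator.convert_response_to_yaml_list(
        llm_response=qa_response[0],
        model="nvidia/nemotron-4-340b-instruct",
    )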

generate_python_macro_topics(
n_macro_topics: int | str,
model: str,
prompt_template: str = 'List {n_macro_topics} important concepts in the python language.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of macro topics about the Python programming language.

Parameters:
  • n_macro_topics – The number of macro topics to generate. Can be an integer like 5 or a string like "five".

  • model – The name of the model that should be used to generate the macro topics. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_macro_topics: Will be populated with the n_macro_topics passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_python_problem(
topic: str,
n_openlines: str | int,
model: str,
language='Python',
prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of coding problems based on a topic.

Parameters:
  • topic – The topic to generate problems for.

  • n_openlines – The number of problems to generate per topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • language – The programming language to target when generating these questions.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_openlines: Will be populated with the n_openlines passed in this function.
    - topic: Will be populated with the topic passed in this function.
    - language: Will be populated with the language passed in this function.
    Some example templates found in nemo_curator.synthetic include:
    - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE
    - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE
    - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
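
A sketch using one of the bundled coding templates, same assumptions as before:

    from nemo_curator.synthetic import PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE

    problem_response = generator.generate_python_problem(
        topic="List comprehensions",
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
        prompt_template=PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE,
    )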

generate_python_subtopics(
macro_topic: str,
n_subtopics: int | str,
model: str,
prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of subtopics relating to a Python macro topic.

Parameters:
  • macro_topic – The macro topic to generate subtopics for.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_subtopics: Will be populated with the n_subtopics passed in this function.
    - macro_topic: Will be populated with the macro_topic passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_subtopics(
macro_topic: str,
n_subtopics: int | str,
model: str,
prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of subtopics relating to a macro topic.

Parameters:
  • macro_topic – The macro topic to generate subtopics for.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_subtopics: Will be populated with the n_subtopics passed in this function.
    - macro_topic: Will be populated with the macro_topic passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
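
A sketch that expands one macro topic into subtopics, then parses the result:

    subtopic_response = generator.generate_subtopics(
        macro_topic="Technology",
        n_subtopics=10,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
    )
    subtopics = generator.convert_response_to_yaml_list(
        llm_response=subtopic_response[0],
        model="nvidia/nemotron-4-340b-instruct",
    )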

generate_two_turn_prompt(
openline: str,
user_model: str,
assistant_model: str,
prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.",
prompt_kwargs: dict = {},
user_model_kwargs: dict = {},
assistant_model_kwargs: dict = {},
) List[dict]#

Prompts an LLM to generate a response as an assistant, then as the user, based on a given openline. The conversation will look like "User -> Assistant -> User".

Parameters:
  • openline – The openline that will comprise the first user turn.

  • user_model – The model that will be impersonating the user. Must be available in the LLMClient passed in the constructor.

  • assistant_model – The model that will be impersonating the assistant. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use when impersonating the user. It must have the following parameters:
    - conversation_history: Will be populated with a formatted history of the dialogue up to that point.
    Some example templates found in nemo_curator.synthetic include:
    - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE
    - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE
    - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • user_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.

  • assistant_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:

A conversation between a User and Assistant
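
A sketch under the same assumptions; unlike generate_dialogue, the turn count is fixed at User -> Assistant -> User:

    conversation = generator.generate_two_turn_prompt(
        openline="What is the fastest land animal?",
        user_model="nvidia/nemotron-4-340b-instruct",       # placeholder
        assistant_model="nvidia/nemotron-4-340b-instruct",  # placeholder
    )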

generate_writing_tasks(
topic: str,
text_material_type: str,
n_openlines: str | int,
model: str,
prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of writing tasks based on a topic and document type.

Parameters:
  • topic – The topic to generate writing tasks for.

  • text_material_type – The type of document the task should ask to generate (e.g., "Email", "Poem").

  • n_openlines – The number of tasks to generate per topic and text material pair.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - topic: Will be populated with the topic passed in this function.
    - text_material_type: Will be populated with the text_material_type passed in this function.
    - n_openlines: Will be populated with the n_openlines passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
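
A sketch pairing a topic with a text material type, same assumptions as before:

    task_response = generator.generate_writing_tasks(
        topic="Climate change",
        text_material_type="Blog post",
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
    )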

revise_open_qa(
openline: str,
n_revisions: str | int,
model: str,
prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to revise an open Q&A question a given number of times.

Parameters:
  • openline – An openline to revise.

  • n_revisions – The number of revisions to generate for the question.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - openline: Will be populated with the openline passed in this function.
    - n_revisions: Will be populated with the n_revisions passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
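
A sketch that revises a single openline, same assumptions as before:

    revision_response = generator.revise_open_qa(
        openline="What is the fastest land animal?",
        n_revisions=3,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
    )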

revise_writing_tasks(
openline: str,
n_revisions: str | int,
model: str,
prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to revise a writing task a given number of times.

Parameters:
  • openline – An openline to revise.

  • n_revisions – The number of revisions to generate for the task.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - openline: Will be populated with the openline passed in this function.
    - n_revisions: Will be populated with the n_revisions passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

run_closed_qa_pipeline(
documents: List[str],
n_openlines: str | int,
model: str,
closed_qa_prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.',
yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
base_model_kwargs: dict = {},
conversion_model_kwargs: dict = {},
ignore_conversion_failure: bool = False,
) List[Tuple[int, str]]#

Runs a pipeline for automatically generating closed Q&A openlines for a dialogue.

Parameters:
  • documents – A list of documents to generate closed Q&A questions for.

  • n_openlines – The number of questions to generate per document.

  • model – The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.

  • closed_qa_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_openlines: Will be populated with the n_openlines passed in this function.
    - document: Will be populated with one element of the documents list passed in this function.
    No additional parameters may be passed to this prompt template.

  • yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - llm_response: Will be populated with the raw LLM response from each stage of the pipeline.
    No additional parameters may be passed to this prompt template.

  • base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.

  • conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.

  • ignore_conversion_failure – If True, ignores YAML conversion failures where possible and discards the data whose conversion failed.

Returns:

A list of pairs where the first element represents the index of the document used to generate the question in the documents list and the second element represents a synthetically generated closed Q&A prompt. Example: [(0, “Summarize this document”), …]
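
A pipeline sketch under the same assumptions; the document text is illustrative:

    documents = [
        "The Moon is Earth's only natural satellite. It orbits at an "
        "average distance of 384,400 km.",
    ]
    closed_qa_openlines = generator.run_closed_qa_pipeline(
        documents=documents,
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
        ignore_conversion_failure=True,  # drop items whose YAML parse fails
    )
    # Each element pairs a document index with a generated question,
    # e.g. (0, "Summarize this document").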

run_math_pipeline(
n_macro_topics: str | int,
school_level: str,
n_subtopics: str | int,
n_openlines: str | int,
model: str,
macro_topic_prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.',
subtopic_prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.',
math_problem_prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.',
yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
base_model_kwargs: dict = {},
conversion_model_kwargs: dict = {},
additional_macro_topics: List[str] = [],
additional_subtopics: List[str] = [],
ignore_conversion_failure: bool = False,
combine_topics: bool = True,
) List[str]#

Runs a pipeline for automatically generating math questions for a dialogue.

Parameters:
  • n_macro_topics – The number of macro topics to generate.

  • school_level – The school level to target when generating macro topics.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • n_openlines – The number of questions to generate per topic.

  • model – The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.

  • macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_macro_topics: Will be populated with the n_macro_topics passed in this function.
    - school_level: Will be populated with the school_level passed in this function.
    No additional parameters may be passed to this prompt template.

  • subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_subtopics: Will be populated with the n_subtopics passed in this function.
    - macro_topic: Will be populated with a generated macro topic.
    No additional parameters may be passed to this prompt template.

  • math_problem_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_openlines: Will be populated with the n_openlines passed in this function.
    - topic: Will be populated with a generated topic.
    No additional parameters may be passed to this prompt template. Some example templates found in nemo_curator.synthetic include:
    - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE
    - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE

  • yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - llm_response: Will be populated with the raw LLM response from each stage of the pipeline.
    No additional parameters may be passed to this prompt template.

  • base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.

  • conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.

  • ignore_conversion_failure – If True, ignores YAML conversion failures where possible and discards the data whose conversion failed.

  • combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated math prompts
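
An end-to-end sketch, same assumptions as before:

    math_openlines = generator.run_math_pipeline(
        n_macro_topics=20,
        school_level="university",
        n_subtopics=4,
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
    )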

run_open_qa_pipeline(
n_macro_topics: str | int,
n_subtopics: str | int,
n_openlines: str | int,
n_revisions: str | int,
model: str,
macro_topic_prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n',
subtopic_prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.',
open_qa_from_topics_prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.',
revise_open_qa_prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.',
yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
base_model_kwargs: dict = {},
conversion_model_kwargs: dict = {},
additional_macro_topics: List[str] = [],
additional_subtopics: List[str] = [],
ignore_conversion_failure: bool = False,
combine_topics: bool = True,
) List[str]#

Runs a pipeline for automatically generating Open Q&A openlines for a dialogue.

Parameters:
  • n_macro_topics – The number of macro topics to generate.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • n_openlines – The number of questions to generate per topic.

  • n_revisions – The number of revisions to generate per original question.

  • model – The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.

  • macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_macro_topics: Will be populated with the n_macro_topics passed in this function.
    No additional parameters may be passed to this prompt template.

  • subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_subtopics: Will be populated with the n_subtopics passed in this function.
    - macro_topic: Will be populated with a generated macro topic.
    No additional parameters may be passed to this prompt template.

  • open_qa_from_topics_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_openlines: Will be populated with the n_openlines passed in this function.
    - topic: Will be populated with a generated topic.
    No additional parameters may be passed to this prompt template.

  • revise_open_qa_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_revisions: Will be populated with the n_revisions passed in this function.
    - openline: Will be populated with a generated open Q&A openline.
    No additional parameters may be passed to this prompt template.

  • yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - llm_response: Will be populated with the raw LLM response from each stage of the pipeline.
    No additional parameters may be passed to this prompt template.

  • base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.

  • conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.

  • ignore_conversion_failure – If True, ignores YAML conversion failures where possible and discards the data whose conversion failed.

  • combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated open Q&A prompts
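
A sketch that also exercises the failure-handling and topic-mixing knobs:

    open_qa_openlines = generator.run_open_qa_pipeline(
        n_macro_topics=20,
        n_subtopics=4,
        n_openlines=5,
        n_revisions=2,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
        ignore_conversion_failure=True,  # drop items whose YAML parse fails
        combine_topics=False,            # generate openlines from subtopics only
    )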

run_python_pipeline(
n_macro_topics: str | int,
n_subtopics: str | int,
n_openlines: str | int,
model: str,
macro_topic_prompt_template: str = 'List {n_macro_topics} important concepts in the python language.',
subtopic_prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.',
python_problem_prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.',
yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
base_model_kwargs: dict = {},
conversion_model_kwargs: dict = {},
additional_macro_topics: List[str] = [],
additional_subtopics: List[str] = [],
ignore_conversion_failure: bool = False,
combine_topics: bool = True,
) List[str]#

Runs a pipeline for automatically generating Python questions for a dialogue.

Parameters:
  • n_macro_topics – The number of macro topics to generate.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • n_openlines – The number of questions to generate per topic.

  • model – The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.

  • macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_macro_topics: Will be populated with the n_macro_topics passed in this function.
    No additional parameters may be passed to this prompt template.

  • subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_subtopics: Will be populated with the n_subtopics passed in this function.
    - macro_topic: Will be populated with a generated macro topic.
    No additional parameters may be passed to this prompt template.

  • python_problem_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_openlines: Will be populated with the n_openlines passed in this function.
    - language: Will be populated with "Python".
    - topic: Will be populated with a generated topic.
    No additional parameters may be passed to this prompt template. Some example templates found in nemo_curator.synthetic include:
    - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE
    - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE
    - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE

  • yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - llm_response: Will be populated with the raw LLM response from each stage of the pipeline.
    No additional parameters may be passed to this prompt template.

  • base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.

  • conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.

  • ignore_conversion_failure – If True, ignores YAML conversion failures where possible and discards the data whose conversion failed.

  • combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated Python prompts
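
An end-to-end sketch, same assumptions as before:

    python_openlines = generator.run_python_pipeline(
        n_macro_topics=10,
        n_subtopics=5,
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
    )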

run_writing_pipeline(
topics: List[str],
text_material_types: List[str],
n_openlines: str | int,
n_revisions: str | int,
model: str,
writing_task_prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.',
revise_writing_task_prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.',
yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
base_model_kwargs: dict = {},
conversion_model_kwargs: dict = {},
ignore_conversion_failure: bool = False,
) List[str]#

Runs a pipeline for automatically generating writing task openlines for a dialogue.

Parameters:
  • topics – A list of topics to generate tasks for.

  • text_material_types – A list of writing material types, like "Essay" or "Blog post".

  • n_openlines – The number of tasks to generate per (topic, text_material_type) pair.

  • n_revisions – The number of revisions to generate per original task.

  • model – The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.

  • writing_task_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_openlines: Will be populated with the n_openlines passed in this function.
    - topic: Will be populated with one element of the topics list passed in this function.
    - text_material_type: Will be populated with one element of the text_material_types list passed in this function.
    No additional parameters may be passed to this prompt template.

  • revise_writing_task_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_revisions: Will be populated with the n_revisions passed in this function.
    - openline: Will be populated with one of the writing tasks generated in the pipeline.
    No additional parameters may be passed to this prompt template.

  • yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters:
    - llm_response: Will be populated with the raw LLM response from each stage of the pipeline.
    No additional parameters may be passed to this prompt template.

  • base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.

  • conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.

  • ignore_conversion_failure – If True, ignores YAML conversion failures where possible and discards the data whose conversion failed.

Returns:

A list of synthetically generated writing task prompts
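
A sketch with illustrative topics and material types; tasks are generated for each (topic, text_material_type) pair:

    writing_openlines = generator.run_writing_pipeline(
        topics=["Climate change", "Space exploration"],
        text_material_types=["Essay", "Blog post"],
        n_openlines=3,
        n_revisions=2,
        model="nvidia/nemotron-4-340b-instruct",  # placeholder
    )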

class nemo_curator.synthetic.AsyncNemotronGenerator(
llm_client: AsyncLLMClient,
logger: LoggerAdapter | str = './',
max_concurrent_requests: int | None = None,
)#

Provides a collection of methods for generating synthetic data, as described in the Nemotron-4 340B Technical Report (https://arxiv.org/abs/2406.11704v1) and inspired by the UltraChat paper (https://arxiv.org/abs/2305.14233).

async classify_math_entity(
entity: str,
model: str,
prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Math concepts taught at elementary school, middle school, high school, and univiersity.\n- Important mathematics axioms, theorems, algorithms, equations, or inequalities.\n- Representative math problems, functions, and applications.\n\nYour answer should start with "Yes" or "No".',
prompt_kwargs: dict = {},
model_kwargs={},
) List[str]#

Prompts an LLM to classify if an entity is related to math.

Parameters:
  • entity – The entity to classify.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - entity: Will be populated with the entity passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
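
A concurrency sketch for the async variant. The AsyncOpenAIClient wrapper, endpoint, and model name are illustrative assumptions, as in the synchronous examples above.

    import asyncio

    from openai import AsyncOpenAI

    from nemo_curator import AsyncOpenAIClient
    from nemo_curator.synthetic import AsyncNemotronGenerator

    async def main() -> list:
        openai_client = AsyncOpenAI(
            base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint
            api_key="<YOUR API KEY>",
        )
        client = AsyncOpenAIClient(openai_client)
        generator = AsyncNemotronGenerator(client, max_concurrent_requests=10)

        entities = ["Pythagorean theorem", "beef stew", "linear algebra"]
        # Every method on the async generator is a coroutine, so independent
        # classifications can be awaited concurrently.
        return await asyncio.gather(
            *(
                generator.classify_math_entity(
                    entity=entity,
                    model="nvidia/nemotron-4-340b-instruct",  # placeholder
                )
                for entity in entities
            )
        )

    results = asyncio.run(main())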

async classify_python_entity(
entity: str,
model: str,
prompt_template: str = 'Does the concept "{entity}" belong to one of the following categories?\n- Programming concepts like loops, functions, and data structures in python.\n- Important functions, objects, or libraries in python.\n- Mathematical concepts like linear algebra which can be implemented in python.\n- Basic algorithms or problems in computer science likes Greedy Search and Dynamics programming which can be addressed in python.\n\nYour answer should start with "Yes" or "No".',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to classify if an entity is related to Python.

Parameters:
  • entity – The entity to classify.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - entity: Will be populated with the entity passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async convert_response_to_yaml_list(
llm_response: str,
model: str,
prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Converts a response of an LLM to a list of strings by querying an LLM.

Parameters:
  • llm_response – The original unformatted response of the LLM.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have a {llm_response} parameter that will be populated with the llm_response value passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A parsed list of elements from the original LLM response

async generate_closed_qa_instructions(
document: str,
n_openlines: str | int,
model: str,
prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of closed Q&A questions based on a reference document.

Parameters:
  • document – The document to use when generating questions.

  • n_openlines – The number of questions to generate per document.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - document: Will be populated with the document passed in this function.
    - n_openlines: Will be populated with the n_openlines passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_dialogue(
openline: str,
user_model: str,
assistant_model: str,
n_user_turns: int = 3,
prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.",
prompt_kwargs: dict = {},
user_model_kwargs: dict = {},
assistant_model_kwargs: dict = {},
) List[dict]#

Prompts an LLM to generate a dialogue based on a given openline. The LLM will alternate between impersonating the user and the assistant.

Parameters:
  • openline – The openline that will comprise the first user turn.

  • user_model – The model that will be impersonating the user. Must be available in the LLMClient passed in the constructor.

  • assistant_model – The model that will be impersonating the assistant. Must be available in the LLMClient passed in the constructor.

  • n_user_turns – The number of user turns to go through. The openline counts as 1 user turn. Therefore, if there are 3 user turns, 2 will be generated by the LLM impersonating the user.

  • prompt_template – A format string of the prompt to use when impersonating the user. It must have the following parameters:
    - conversation_history: Will be populated with a formatted history of the dialogue up to that point.
    Some example templates found in nemo_curator.synthetic include:
    - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE
    - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE
    - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • user_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.

  • assistant_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:

A conversation between a User and Assistant

async generate_macro_topics(
n_macro_topics: int | str,
model: str,
prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of macro topics about the world.

Parameters:
  • n_macro_topics – The number of macro topics to generate.

  • model – The name of the model that should be used to generate the macro topics. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_macro_topics: Will be populated with the n_macro_topics passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_math_macro_topics(
n_macro_topics: int | str,
school_level: str,
model: str,
prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of macro topics about math.

Parameters:
  • n_macro_topics – The number of macro topics to generate. Can be an integer like 5 or a string like "five".

  • school_level – The school level the math questions should be targeted at.

  • model – The name of the model that should be used to generate the macro topics. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
    - n_macro_topics: Will be populated with the n_macro_topics passed in this function.
    - school_level: Will be populated with the school_level passed in this function.

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_math_problem(
topic: str,
n_openlines: str | int,
model: str,
prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of math problems based on a topic.

Parameters:
  • topic – The topic to generate problems for.

  • n_openlines – The number of problems to generate per topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
      - n_openlines: Will be populated with the n_openlines passed in this function
      - topic: Will be populated with the topic passed in this function
    Some example templates found in nemo_curator.synthetic include:
      - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE
      - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
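
A minimal sketch of generating beginner-level problems for a batch of topics, assuming a generator constructed as in the first example; the template constant is imported from nemo_curator.synthetic as noted above.

from nemo_curator.synthetic import (
    AsyncNemotronGenerator,
    MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE,
)

async def beginner_problems(
    generator: AsyncNemotronGenerator, topics: list[str], model: str
) -> list[str]:
    raw_responses = []
    for topic in topics:
        responses = await generator.generate_math_problem(
            topic=topic,
            n_openlines=5,
            model=model,
            prompt_template=MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE,
        )
        # Each response is a raw LLM string containing a list of problems.
        raw_responses.append(responses[0])
    return raw_responses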

async generate_math_subtopics(
macro_topic: str,
n_subtopics: int | str,
model: str,
prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of subtopics relating to a math macro topic.

Parameters:
  • macro_topic – The macro topic to generate subtopics for.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_open_qa_from_topic(
topic: str,
n_openlines: str | int,
model: str,
prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of open Q&A questions based on a topic.

Parameters:
  • topic – The topic to generate questions for.

  • n_openlines – The number of questions to generate per topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
      - n_openlines: Will be populated with the n_openlines passed in this function
      - topic: Will be populated with the topic passed in this function

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
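
Because the method is a coroutine, requests for many topics can be issued concurrently. A sketch using asyncio.gather, assuming a generator constructed as in the first example:

import asyncio

from nemo_curator.synthetic import AsyncNemotronGenerator

async def openlines_for_topics(
    generator: AsyncNemotronGenerator, topics: list[str], model: str
) -> list[str]:
    # Fan out one request per topic; the async client can run them concurrently.
    tasks = [
        generator.generate_open_qa_from_topic(
            topic=topic, n_openlines=10, model=model
        )
        for topic in topics
    ]
    responses = await asyncio.gather(*tasks)
    # Each call returns a list of length 1 unless n > 1 is set in model_kwargs.
    return [response[0] for response in responses]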

async generate_python_macro_topics(
n_macro_topics: int | str,
model: str,
prompt_template: str = 'List {n_macro_topics} important concepts in the python language.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of macro topics about the Python programming language.

Parameters:
  • n_macro_topics – The number of macro topics to generate. Can be an integer like 5 or a string like “five”.

  • model – The name of the model that should be used to generate the macro topics. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_python_problem(
topic: str,
n_openlines: str | int,
model: str,
language='Python',
prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of coding problems based on a topic.

Parameters:
  • topic – The topic to generate problems for.

  • n_openlines – The number of problems to generate per topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • language – The programming language to target when generating these questions.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
      - n_openlines: Will be populated with the n_openlines passed in this function
      - topic: Will be populated with the topic passed in this function
      - language: Will be populated with the language passed in this function
    Some example templates found in nemo_curator.synthetic include:
      - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE
      - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE
      - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
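
A sketch of generating intermediate-level problems with one of the bundled templates, assuming a generator constructed as in the first example; language defaults to "Python" and is substituted into the template.

from nemo_curator.synthetic import (
    AsyncNemotronGenerator,
    PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE,
)

async def python_problems(
    generator: AsyncNemotronGenerator, topic: str, model: str
) -> str:
    responses = await generator.generate_python_problem(
        topic=topic,
        n_openlines=3,
        model=model,
        prompt_template=PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE,
    )
    # A single raw response containing the list of problems.
    return responses[0]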

async generate_python_subtopics(
macro_topic: str,
n_subtopics: int | str,
model: str,
prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of subtopics relating to a Python macro topic.

Parameters:
  • macro_topic – The macro topic to generate subtopics for.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_subtopics(
macro_topic: str,
n_subtopics: int | str,
model: str,
prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of subtopics relating to a macro topic.

Parameters:
  • macro_topic – The macro topic to generate subtopics for.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with the macro_topic passed in this function

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async generate_two_turn_prompt(
openline: str,
user_model: str,
assistant_model: str,
prompt_template: str = "Here is a conversation between a user and an assistant.\n<|The Start of Assistant's Conversation with User|>\n{conversation_history}\n<|The End of Assistant's Conversation with User|>\n\nGiven the conversation above, generate a followup request or question in the tone of User. Directly give me the question without extraneous words.",
prompt_kwargs: dict = {},
user_model_kwargs: dict = {},
assistant_model_kwargs: dict = {},
) List[dict]#

Prompts an LLM to generate a response as an assistant, then as the user, based on a given openline. The conversation will look like “User -> Assistant -> User”.

Parameters:
  • openline – The openline that will comprise the first user turn.

  • user_model – The model that will be impersonating the user. Must be available in the LLMClient passed in the constructor.

  • assistant_model – The model that will be impersonating the assistant. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use when impersonating the user. It must have the following parameters:
      - conversation_history: Will be populated with a formatted history of the dialogue up to that point.
    Some example templates found in nemo_curator.synthetic include:
      - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE
      - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE
      - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • user_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.

  • assistant_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:

A conversation between a User and Assistant
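
A sketch of extending an openline into a two-turn conversation, assuming a generator constructed as in the first example; for simplicity the same model plays both roles, though separate user and assistant models are allowed.

from nemo_curator.synthetic import (
    AsyncNemotronGenerator,
    DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE,
)

async def two_turn(
    generator: AsyncNemotronGenerator, openline: str, model: str
) -> list[dict]:
    # Returns a conversation of the form User -> Assistant -> User.
    return await generator.generate_two_turn_prompt(
        openline=openline,
        user_model=model,
        assistant_model=model,
        prompt_template=DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE,
    )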

async generate_writing_tasks(
topic: str,
text_material_type: str,
n_openlines: str | int,
model: str,
prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to generate a list of writing tasks based on a topic and document type.

Parameters:
  • topic – The topic to generate writing tasks for.

  • text_material_type – The type of document the task should ask to generate (e.g., “Email”, “Poem”).

  • n_openlines – The number of tasks to generate per topic and text material pair.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters:
      - topic: Will be populated with the topic passed in this function
      - text_material_type: Will be populated with the text_material_type passed in this function
      - n_openlines: Will be populated with the n_openlines passed in this function

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
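
A sketch of requesting tasks for one (topic, text_material_type) pair, assuming a generator constructed as in the first example; the topic string is an illustrative placeholder.

from nemo_curator.synthetic import AsyncNemotronGenerator

async def email_tasks(generator: AsyncNemotronGenerator, model: str) -> str:
    responses = await generator.generate_writing_tasks(
        topic="Climate Change and Sustainable Living",  # example topic
        text_material_type="Email",
        n_openlines=5,
        model=model,
    )
    # A single raw response containing the list of tasks.
    return responses[0]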

async revise_open_qa(
openline: str,
n_revisions: str | int,
model: str,
prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to revise an open Q&A question a given number of times.

Parameters:
  • openline – An openline to revise.

  • n_revisions – The number of revisions to generate for the question.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters: - openline: Will be populated with the openline passed in this function - n_revisions: Will be populated with the n_revisions passed in this function

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
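
A short sketch of revising a previously generated openline, assuming a generator constructed as in the first example; the returned text is still a raw list that would typically be parsed in a later step.

from nemo_curator.synthetic import AsyncNemotronGenerator

async def revise(
    generator: AsyncNemotronGenerator, openline: str, model: str
) -> str:
    responses = await generator.revise_open_qa(
        openline=openline,
        n_revisions=3,
        model=model,
    )
    # Raw response containing the three revised questions as a list.
    return responses[0]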

async revise_writing_tasks(
openline: str,
n_revisions: str | int,
model: str,
prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.',
prompt_kwargs: dict = {},
model_kwargs: dict = {},
) List[str]#

Prompts an LLM to revise a writing task a given number of times.

Parameters:
  • openline – An openline to revise.

  • n_revisions – The number of revisions to generate for the task.

  • model – The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.

  • prompt_template – A format string of the prompt to use. It must have the following parameters: - openline: Will be populated with the openline passed in this function - n_revisions: Will be populated with the n_revisions passed in this function

  • prompt_kwargs – Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.

  • model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:

A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

async run_closed_qa_pipeline(
documents: List[str],
n_openlines: str | int,
model: str,
closed_qa_prompt_template: str = 'TEXT: {document}\n\nGiven the text above, can you come up with {n_openlines} questions or tasks? They can be any of the follows:\n1. Asking certain information in the text;\n2. Summarizing, repharsing or explaining the text;\n3. Writing something similar to the text;\n4. Any other reasonable requests related to the text.\n\nMake the questions or tasks as diverse as possible.',
yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
base_model_kwargs: dict = {},
conversion_model_kwargs: dict = {},
ignore_conversion_failure: bool = False,
) List[Tuple[int, str]]#

Runs a pipeline for automatically generating closed Q&A openlines for a dialogue.

Parameters:
  • documents – A list of documents to generate closed Q&A questions for.

  • n_openlines – The number of questions to generate per document.

  • model – The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.

  • closed_qa_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - document: Will be populated with one element of the documents list passed in this function No additional parameters may be passed to this prompt template.

  • yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.

  • base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.

  • conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.

  • ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on

Returns:

A list of pairs where the first element represents the index of the document used to generate the question in the documents list and the second element represents a synthetically generated closed Q&A prompt. Example: [(0, “Summarize this document”), …]
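
A sketch of running the pipeline over a small document batch, assuming a generator constructed as in the first example; the (index, question) pairs let you join each question back to its source document.

from nemo_curator.synthetic import AsyncNemotronGenerator

async def closed_qa(
    generator: AsyncNemotronGenerator, documents: list[str], model: str
) -> list[tuple[int, str]]:
    pairs = await generator.run_closed_qa_pipeline(
        documents=documents,
        n_openlines=4,
        model=model,
        # Discard questions whose YAML conversion fails instead of raising.
        ignore_conversion_failure=True,
    )
    # Each pair is (document_index, question).
    for doc_index, question in pairs:
        print(doc_index, question)
    return pairs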

async run_math_pipeline(
n_macro_topics: str | int,
school_level: str,
n_subtopics: str | int,
n_openlines: str | int,
model: str,
macro_topic_prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass the mathematics knowledge taughted in {school_level}? Your answer should be a list of topics. Make the topics as diverse as possible.',
subtopic_prompt_template: str = 'List {n_subtopics} mathemathics topics that encompass various aspects of "{macro_topic}". Your answer should be a list of topics. Make the topics as diverse as possible.',
math_problem_prompt_template: str = 'Generate {n_openlines} mathematics problems which are related to "{topic}" or can be addressed using "{topic}". Your answer should be a list of problems. Make them as diverse as possible.',
yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
base_model_kwargs: dict = {},
conversion_model_kwargs: dict = {},
additional_macro_topics: List[str] = [],
additional_subtopics: List[str] = [],
ignore_conversion_failure: bool = False,
combine_topics: bool = True,
) List[str]#

Runs a pipeline for automatically generating math questions for a dialogue.

Parameters:
  • n_macro_topics – The number of macro topics to generate.

  • school_level – The school level to target when generating macro topics.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • n_openlines – The number of questions to generate per topic.

  • model – The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.

  • macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function - school_level: Will be populated with the school_level passed in this function No additional parameters may be passed to this prompt template.

  • subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.

  • math_problem_prompt_template – A format string of the prompt to use. It must have the following parameters:
      - n_openlines: Will be populated with the n_openlines passed in this function
      - topic: Will be populated with a generated topic
    No additional parameters may be passed to this prompt template. Some example templates found in nemo_curator.synthetic include:
      - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE
      - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE

  • yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.

  • base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.

  • conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.

  • ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on

  • combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated math prompts
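
A sketch of running the full math pipeline and handling parsing failures explicitly, assuming a generator constructed as in the first example; YamlConversionError is the exception class documented at the end of this page.

from nemo_curator.synthetic import AsyncNemotronGenerator, YamlConversionError

async def math_openlines(
    generator: AsyncNemotronGenerator, model: str
) -> list[str]:
    try:
        return await generator.run_math_pipeline(
            n_macro_topics=20,
            school_level="high school",
            n_subtopics=3,
            n_openlines=5,
            model=model,
        )
    except YamlConversionError:
        # A list-parsing stage failed. Either retry, or rerun with
        # ignore_conversion_failure=True to drop unparsable results.
        raise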

async run_open_qa_pipeline(
n_macro_topics: str | int,
n_subtopics: str | int,
n_openlines: str | int,
n_revisions: str | int,
model: str,
macro_topic_prompt_template: str = 'Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible.For example, 1. Food and drinks. \n2. Technology.\n',
subtopic_prompt_template: str = 'Can you generate {n_subtopics} comprehensive topics that encompass various aspects of {macro_topic}? Your answer should be a list of topics. Make the topics as diverse as possible.',
open_qa_from_topics_prompt_template: str = 'Can you generate {n_openlines} questions or requests related to {topic}? The questions and requests should be as diverse possible. Your answer should be a list.',
revise_open_qa_prompt_template: str = 'Question: {openline}\n\nCan you revise the question above to include more contexts or details? The revised questions can be any of the follows:\n1. Adding some context to the original question. The context might state the importance of the question, explain background knowledge, or add other reasonable information.\n2. Change the questions into a different format or style, e.g., imperative statements, length requirements for the answer, etc.\n3. Elongated questions that require to elaborate on specific topic or discuss a certain point.\n4. Any other related questions or statements.\n\nThe revised question should contain two, three, or four sentences. You should generate {n_revisions} revised questions or statements in a list. Make them as diverse as possible.',
yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
base_model_kwargs: dict = {},
conversion_model_kwargs: dict = {},
additional_macro_topics: List[str] = [],
additional_subtopics: List[str] = [],
ignore_conversion_failure: bool = False,
combine_topics: bool = True,
) List[str]#

Runs a pipeline for automatically generating open Q&A openlines for a dialogue.

Parameters:
  • n_macro_topics – The number of macro topics to generate.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • n_openlines – The number of questions to generate per topic.

  • n_revisions – The number of revisions to generate per original question.

  • model – The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.

  • macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function No additional parameters may be passed to this prompt template.

  • subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.

  • open_qa_from_topics_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_openlines: Will be populated with the n_openlines passed in this function - topic: Will be populated with a generated topic No additional parameters may be passed to this prompt template.

  • revise_open_qa_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_revisions: Will be populated with the n_revisions passed in this function - openline: Will be populated with a generated open Q&A openline No additional parameters may be passed to this prompt template.

  • yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.

  • base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.

  • conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.

  • ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on

  • combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated open Q&A prompts
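
A sketch showing the topic-seeding and topic-combination knobs, assuming a generator constructed as in the first example; the seed topics are illustrative placeholders.

from nemo_curator.synthetic import AsyncNemotronGenerator

async def open_qa_openlines(
    generator: AsyncNemotronGenerator, model: str
) -> list[str]:
    return await generator.run_open_qa_pipeline(
        n_macro_topics=20,
        n_subtopics=5,
        n_openlines=10,
        n_revisions=2,
        model=model,
        # Seed the topic stage with hand-picked topics (example values).
        additional_macro_topics=["Cooking", "Space exploration"],
        # Generate openlines only from subtopics, not the macro topics.
        combine_topics=False,
        ignore_conversion_failure=True,
    )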

async run_python_pipeline(
n_macro_topics: str | int,
n_subtopics: str | int,
n_openlines: str | int,
model: str,
macro_topic_prompt_template: str = 'List {n_macro_topics} important concepts in the python language.',
subtopic_prompt_template: str = 'List {n_subtopics} important concepts related to "{macro_topic}" in the python language.',
python_problem_prompt_template: str = 'Generate {n_openlines} {language} coding problems related to "{topic}". These problems should be suitable for beginners who just learnt "{topic}". Your answer should be a list of problems. Make them as diverse as possible.',
yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
base_model_kwargs: dict = {},
conversion_model_kwargs: dict = {},
additional_macro_topics: List[str] = [],
additional_subtopics: List[str] = [],
ignore_conversion_failure: bool = False,
combine_topics: bool = True,
) List[str]#

Runs a pipeline for automatically generating Python questions for a dialogue.

Parameters:
  • n_macro_topics – The number of macro topics to generate.

  • n_subtopics – The number of subtopics to generate per macro topic.

  • n_openlines – The number of questions to generate per topic.

  • model – The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.

  • macro_topic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_macro_topics: Will be populated with the n_macro_topics passed in this function No additional parameters may be passed to this prompt template.

  • subtopic_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_subtopics: Will be populated with the n_subtopics passed in this function - macro_topic: Will be populated with a generated macro topic No additional parameters may be passed to this prompt template.

  • python_problem_prompt_template – A format string of the prompt to use. It must have the following parameters:
      - n_openlines: Will be populated with the n_openlines passed in this function
      - language: Will be populated with “Python”
      - topic: Will be populated with a generated topic
    No additional parameters may be passed to this prompt template. Some example templates found in nemo_curator.synthetic include:
      - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE
      - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE
      - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE

  • yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.

  • base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.

  • conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.

  • ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on

  • combine_topics – If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:

A list of synthetically generated Python prompts
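
A sketch of running the Python pipeline with one of the bundled non-default problem templates, assuming a generator constructed as in the first example.

from nemo_curator.synthetic import (
    AsyncNemotronGenerator,
    PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE,
)

async def python_openlines(
    generator: AsyncNemotronGenerator, model: str
) -> list[str]:
    return await generator.run_python_pipeline(
        n_macro_topics=10,
        n_subtopics=5,
        n_openlines=8,
        model=model,
        # Target advanced problems instead of the beginner default.
        python_problem_prompt_template=PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE,
        ignore_conversion_failure=True,
    )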

async run_writing_pipeline(
topics: List[str],
text_material_types: List[str],
n_openlines: str | int,
n_revisions: str | int,
model: str,
writing_task_prompt_template: str = 'Can you generate {n_openlines} tasks, each of which requires to create a "{text_material_type}" related to {topic}? Each task should be concise and include one or two sentences only. The tasks should be as diverse as possible. Your answer should be a list of tasks.',
revise_writing_task_prompt_template: str = 'TASK: {openline}\n\nCan you revise the task above to include more detailed requirements? These requirements can be any of the follows:\n1. Require to elaborate on a specific topic or discuss a certain point.\n2. Require to include some examples, data points, or references.\n3. Require to follow specific formats or styles, e.g., no more than 300 words, including specific words, etc.\n4. Any other reasonable requests to make the task more detailed.\n\nThe revised task should contain two, three, or four sentences. You should generate {n_revisions} revised tasks in a list. Make the tasks as diverse as possible.',
yaml_conversion_prompt_template: str = 'The following document contains a list of items. Parse the list of items into a yaml list of strings. Do not parse any other part of the document. There should be no additional formatting to your response, just the yaml list of strings.\n\n {llm_response}',
base_model_kwargs: dict = {},
conversion_model_kwargs: dict = {},
ignore_conversion_failure: bool = False,
) List[str]#

Runs a pipeline for automatically generating writing task openlines for a dialogue.

Parameters:
  • topics – A list of topics to generate tasks for.

  • text_material_types – A list of writing material types, like “Essay” or “Blog post”.

  • n_openlines – The number of tasks to generate per (topic, text_material_type) pair.

  • n_revisions – The number of revisions to generate per original task.

  • model – The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.

  • writing_task_prompt_template – A format string of the prompt to use. It must have the following parameters:
      - n_openlines: Will be populated with the n_openlines passed in this function
      - topic: Will be populated with one element of the topics list passed in this function
      - text_material_type: Will be populated with one element of the text_material_types list passed in this function
    No additional parameters may be passed to this prompt template.

  • revise_writing_task_prompt_template – A format string of the prompt to use. It must have the following parameters: - n_revisions: Will be populated with the n_revisions passed in this function - openline: Will be populated with one of the writing tasks generated in the pipeline. No additional parameters may be passed to this prompt template.

  • yaml_conversion_prompt_template – A format string of the prompt to use. It must have the following parameters: - llm_response: Will be populated with the raw LLM response from each stage of the pipeline No additional parameters may be passed to this prompt template.

  • base_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.

  • conversion_model_kwargs – Any additional keyword arguments that should be passed to the LLMClient.query_model call for the yaml conversion stages of the pipeline.

  • ignore_conversion_failure – Ignores yaml conversion failures when able and discards the data that conversion was attempted on

Returns:

A list of synthetically generated writing task prompts
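
A sketch of the writing pipeline over a small topic/material grid, assuming a generator constructed as in the first example; the topics and material types are illustrative placeholders.

from nemo_curator.synthetic import AsyncNemotronGenerator

async def writing_openlines(
    generator: AsyncNemotronGenerator, model: str
) -> list[str]:
    return await generator.run_writing_pipeline(
        topics=["Climate Change", "Space Exploration"],  # example topics
        text_material_types=["Essay", "Blog post", "Poem"],
        n_openlines=5,
        n_revisions=2,
        model=model,
        ignore_conversion_failure=True,
    )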

class nemo_curator.synthetic.NemotronFormatter#
static format_conversation(conv: List[dict]) str#

Formats a conversation between a user and an assistant in the Nemotron-4 340B format described here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-4-340b-instruct

Parameters:
  • conv – A conversation between a user and assistant

Returns:

A conversation formatted as text
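
A minimal sketch of formatting a conversation such as the one produced by generate_two_turn_prompt; the role/content dictionary keys follow the usual chat convention and are an assumption here, as is the made-up content.

from nemo_curator.synthetic import NemotronFormatter

# Example conversation in the documented List[dict] shape (contents made up).
conv = [
    {"role": "user", "content": "Write a haiku about GPUs."},
    {"role": "assistant", "content": "Silicon lightning / a thousand threads in lockstep / warm wind from the fans"},
    {"role": "user", "content": "Now make it rhyme."},
]
text = NemotronFormatter.format_conversation(conv)
print(text)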

class nemo_curator.synthetic.Mixtral8x7BFormatter#
static format_conversation(
conv: List[dict],
) str#

Formats a conversation between a user and an assistant in the Mixtral-8x7B format described here: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

Parameters:
  • conv – A conversation between a user and assistant

Returns:

A conversation formatted as text

class nemo_curator.synthetic.NoFormat#
class nemo_curator.synthetic.YamlConversionError(message)#