synthetic.nemotron#

Module Contents#

Classes#

NemotronFormatter

Represents a way of formatting a conversation with an LLM so that it can respond appropriately

NemotronGenerator

Provides a collection of methods for generating synthetic data described in the Nemotron-4 340B Technical Report (https://arxiv.org/abs/2406.11704v1) and inspired by the UltraChat paper (https://arxiv.org/abs/2305.14233)

API#

class synthetic.nemotron.NemotronFormatter#

Bases: nemo_curator.services.conversation_formatter.ConversationFormatter

Represents a way of formatting a conversation with an LLM so that it can respond appropriately

PROMPT_PREFIX = <Multiline-String>#
static format_conversation(conv: list[dict]) → str#

Formats a conversation between a user and an assistant in the Nemotron 340B format described here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-4-340b-instruct

Args:
    conv: A conversation between a user and an assistant.

Returns:
    A conversation formatted as text.
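
A minimal usage sketch. The role/content message keys below follow the common OpenAI-style convention and are an assumption; the conversation content is illustrative:

    from nemo_curator.synthetic import NemotronFormatter

    # Illustrative conversation; each turn is a dict with "role" and "content".
    conv = [
        {"role": "user", "content": "Write a poem about GPUs."},
        {"role": "assistant", "content": "Cores aglow in parallel..."},
        {"role": "user", "content": "Make it shorter."},
    ]
    text = NemotronFormatter.format_conversation(conv)
    print(text)  # prompt-ready text in the Nemotron-4 340B chat format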

class synthetic.nemotron.NemotronGenerator(
llm_client: nemo_curator.services.model_client.LLMClient,
)#

Provides a collection of methods for generating synthetic data described in the Nemotron-4 340B Technical Report (https://arxiv.org/abs/2406.11704v1) and inspired by the UltraChat paper (https://arxiv.org/abs/2305.14233)

Initialization
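
A minimal construction sketch. The OpenAIClient wrapper, endpoint, and key below are illustrative assumptions; any LLMClient implementation can be passed:

    from openai import OpenAI
    from nemo_curator import OpenAIClient
    from nemo_curator.synthetic import NemotronGenerator

    # Placeholder endpoint and key; substitute your own service.
    openai_client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="<api key>",
    )
    client = OpenAIClient(openai_client)
    generator = NemotronGenerator(client)

The method examples below reuse this generator and a placeholder model name.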

classify_math_entity(
entity: str,
model: str,
prompt_template: str = DEFAULT_MATH_CLASSIFICATION_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to classify if an entity is related to math.

Args:
    entity: The entity to classify.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - entity: Will be populated with the entity passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
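
Continuing the construction sketch above (the entity and model name are placeholders):

    responses = generator.classify_math_entity(
        entity="Area of triangles",
        model="nvidia/nemotron-4-340b-instruct",
    )
    print(responses[0])  # a free-text judgment from the LLM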

classify_python_entity(
entity: str,
model: str,
prompt_template: str = DEFAULT_PYTHON_CLASSIFICATION_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to classify if an entity is related to Python.

Args:
    entity: The entity to classify.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - entity: Will be populated with the entity passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

convert_response_to_yaml_list(
llm_response: str,
model: str,
prompt_template: str = DEFAULT_YAML_CONVERSION_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Converts a response of an LLM to a list of strings by querying an LLM.

Args:
    llm_response: The original unformatted response of the LLM.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have a {llm_response} parameter that will be populated with the llm_response value passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A parsed list of elements from the original LLM response.
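
A hedged sketch, reusing the generator from above; the raw response and the parsed result shown in the comment are made up:

    raw_response = "1. What is calculus?\n2. Who invented algebra?"
    questions = generator.convert_response_to_yaml_list(
        llm_response=raw_response,
        model="nvidia/nemotron-4-340b-instruct",
    )
    # If the conversion model cooperates, questions is a plain list,
    # e.g. ["What is calculus?", "Who invented algebra?"]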

generate_closed_qa_instructions(
document: str,
n_openlines: str | int,
model: str,
prompt_template: str = DEFAULT_CLOSED_QA_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to generate a list of closed Q&A questions based on a reference document.

Args:
    document: The document to use when generating questions.
    n_openlines: The number of questions to generate per document.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - document: Will be populated with the document passed in this function.
        - n_openlines: Will be populated with the n_openlines passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
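
An illustrative call, reusing the generator from above (document text and model name are placeholders):

    document = "Climate change is driven primarily by greenhouse gas emissions..."
    raw = generator.generate_closed_qa_instructions(
        document=document,
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",
    )
    # raw[0] holds the LLM's unparsed list of questions; pass it through
    # convert_response_to_yaml_list to obtain individual openlines.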

generate_dialogue(
openline: str,
user_model: str,
assistant_model: str,
n_user_turns: int = 3,
prompt_template: str = DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
user_model_kwargs: dict | None = None,
assistant_model_kwargs: dict | None = None,
) → list[dict]#

Prompts an LLM to generate a dialogue based on a given openline. The LLM will alternate impersonating the user and the assistant.

Args:
    openline: The openline that will comprise the first user turn.
    user_model: The model that will be impersonating the user. Must be available in the LLMClient passed in the constructor.
    assistant_model: The model that will be impersonating the assistant. Must be available in the LLMClient passed in the constructor.
    n_user_turns: The number of user turns to go through. The openline counts as 1 user turn. Therefore, if there are 3 user turns, 2 will be generated by the LLM impersonating the user.
    prompt_template: A format string of the prompt to use when impersonating the user. It must have the following parameters:
        - conversation_history: Will be populated with a formatted history of the dialogue up to that point.
        Some example templates found in nemo_curator.synthetic include:
        - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE
        - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE
        - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    user_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.
    assistant_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:
    A conversation between a User and Assistant.
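
A sketch of generating a three-user-turn dialogue; the same placeholder model plays both roles here, though different models may be used:

    dialogue = generator.generate_dialogue(
        openline="Explain overfitting in machine learning.",
        user_model="nvidia/nemotron-4-340b-instruct",
        assistant_model="nvidia/nemotron-4-340b-instruct",
        n_user_turns=3,
    )
    # dialogue alternates user/assistant turns, starting with the openline.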

generate_macro_topics(
n_macro_topics: int | str,
model: str,
prompt_template: str = DEFAULT_MACRO_TOPICS_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to generate a list of macro topics about the world.

Args:
    n_macro_topics: The number of macro topics to generate.
    model: The name of the model that should be used to generate the macro topics. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_macro_topics: Will be populated with the n_macro_topics passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
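
An illustrative call. Like the other generate_* methods, the result is a raw LLM response, which typically needs convert_response_to_yaml_list to become a clean list:

    raw_topics = generator.generate_macro_topics(
        n_macro_topics=20,
        model="nvidia/nemotron-4-340b-instruct",
    )
    topics = generator.convert_response_to_yaml_list(
        llm_response=raw_topics[0],
        model="nvidia/nemotron-4-340b-instruct",
    )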

generate_math_macro_topics(
n_macro_topics: int | str,
school_level: str,
model: str,
prompt_template: str = DEFAULT_MATH_MACRO_TOPICS_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to generate a list of macro topics about math.

Args:
    n_macro_topics: The number of macro topics to generate. Can be an integer like 5 or a string like “five”.
    school_level: The school level the math questions should be targeted at.
    model: The name of the model that should be used to generate the macro topics. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_macro_topics: Will be populated with the n_macro_topics passed in this function.
        - school_level: Will be populated with the school_level passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_math_problem(
topic: str,
n_openlines: str | int,
model: str,
prompt_template: str = MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to generate a list of math problems based on a topic.

Args:
    topic: The topic to generate problems for.
    n_openlines: The number of problems to generate per topic.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_openlines: Will be populated with the n_openlines passed in this function.
        - topic: Will be populated with the topic passed in this function.
        Some example templates found in nemo_curator.synthetic include:
        - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE
        - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
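
A sketch selecting the beginner template variant (topic string is illustrative):

    from nemo_curator.synthetic import MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE

    raw_problems = generator.generate_math_problem(
        topic="Fractions",
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",
        prompt_template=MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE,
    )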

generate_math_subtopics(
macro_topic: str,
n_subtopics: int | str,
model: str,
prompt_template: str = DEFAULT_MATH_SUBTOPICS_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a math macro topic.

Args:
    macro_topic: The macro topic to generate subtopics for.
    n_subtopics: The number of subtopics to generate per macro topic.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_subtopics: Will be populated with the n_subtopics passed in this function.
        - macro_topic: Will be populated with the macro_topic passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_open_qa_from_topic(
topic: str,
n_openlines: str | int,
model: str,
prompt_template: str = DEFAULT_OPEN_QA_FROM_TOPICS_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to generate a list of open Q&A questions based on a topic.

Args:
    topic: The topic to generate questions for.
    n_openlines: The number of questions to generate per topic.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_openlines: Will be populated with the n_openlines passed in this function.
        - topic: Will be populated with the topic passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_python_macro_topics(
n_macro_topics: int | str,
model: str,
prompt_template: str = DEFAULT_PYTHON_MACRO_TOPICS_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to generate a list of macro topics about the Python programming language.

Args:
    n_macro_topics: The number of macro topics to generate. Can be an integer like 5 or a string like “five”.
    model: The name of the model that should be used to generate the macro topics. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_macro_topics: Will be populated with the n_macro_topics passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_python_problem(
topic: str,
n_openlines: str | int,
model: str,
language: str = 'Python',
prompt_template: str = PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to generate a list of coding problems based on a topic.

Args:
    topic: The topic to generate problems for.
    n_openlines: The number of problems to generate per topic.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    language: The programming language to target when generating these questions.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_openlines: Will be populated with the n_openlines passed in this function.
        - topic: Will be populated with the topic passed in this function.
        - language: Will be populated with the language passed in this function.
        Some example templates found in nemo_curator.synthetic include:
        - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE
        - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE
        - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
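
A sketch selecting the intermediate template (topic string is illustrative; language defaults to 'Python'):

    from nemo_curator.synthetic import PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE

    raw_problems = generator.generate_python_problem(
        topic="List comprehensions",
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",
        prompt_template=PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE,
    )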

generate_python_subtopics(
macro_topic: str,
n_subtopics: int | str,
model: str,
prompt_template: str = DEFAULT_PYTHON_SUBTOPICS_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a Python macro topic.

Args:
    macro_topic: The macro topic to generate subtopics for.
    n_subtopics: The number of subtopics to generate per macro topic.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_subtopics: Will be populated with the n_subtopics passed in this function.
        - macro_topic: Will be populated with the macro_topic passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_subtopics(
macro_topic: str,
n_subtopics: int | str,
model: str,
prompt_template: str = DEFAULT_SUBTOPICS_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to generate a list of subtopics relating to a macro topic.

Args:
    macro_topic: The macro topic to generate subtopics for.
    n_subtopics: The number of subtopics to generate per macro topic.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_subtopics: Will be populated with the n_subtopics passed in this function.
        - macro_topic: Will be populated with the macro_topic passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

generate_two_turn_prompt(
openline: str,
user_model: str,
assistant_model: str,
prompt_template: str = DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
user_model_kwargs: dict | None = None,
assistant_model_kwargs: dict | None = None,
) → list[dict]#

Prompts an LLM to generate a response as an assistant, then as the user, based on a given openline. The conversation will look like “User -> Assistant -> User”.

Args:
    openline: The openline that will comprise the first user turn.
    user_model: The model that will be impersonating the user. Must be available in the LLMClient passed in the constructor.
    assistant_model: The model that will be impersonating the assistant. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use when impersonating the user. It must have the following parameters:
        - conversation_history: Will be populated with a formatted history of the dialogue up to that point.
        Some example templates found in nemo_curator.synthetic include:
        - DIALOGUE_NORMAL_USER_TURN_PROMPT_TEMPLATE
        - DIALOGUE_COMPLEX_USER_TURN_PROMPT_TEMPLATE
        - DIALOGUE_CONCISE_USER_TURN_PROMPT_TEMPLATE
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    user_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the user.
    assistant_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the assistant.

Returns:
    A conversation between a User and Assistant.
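
An illustrative call; the result is a short User -> Assistant -> User conversation suitable as a multi-turn prompt:

    two_turn = generator.generate_two_turn_prompt(
        openline="What is the capital of France?",
        user_model="nvidia/nemotron-4-340b-instruct",
        assistant_model="nvidia/nemotron-4-340b-instruct",
    )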

generate_writing_tasks(
topic: str,
text_material_type: str,
n_openlines: str | int,
model: str,
prompt_template: str = DEFAULT_WRITING_TASK_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to generate a list of writing tasks based on a topic and document type.

Args:
    topic: The topic to generate writing tasks for.
    text_material_type: The type of the document the question should ask to generate (e.g., “Email”, “Poem”).
    n_openlines: The number of tasks to generate per topic and text material pair.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - topic: Will be populated with the topic passed in this function.
        - text_material_type: Will be populated with the text_material_type passed in this function.
        - n_openlines: Will be populated with the n_openlines passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
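
An illustrative call (topic and material type are placeholders):

    raw_tasks = generator.generate_writing_tasks(
        topic="Climate change",
        text_material_type="Poem",
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",
    )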

revise_open_qa(
openline: str,
n_revisions: str | int,
model: str,
prompt_template: str = DEFAULT_REVISE_OPEN_QA_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to revise an open Q&A question a given number of times.

Args:
    openline: An openline to revise.
    n_revisions: The number of revisions to generate for the question.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - openline: Will be populated with the openline passed in this function.
        - n_revisions: Will be populated with the n_revisions passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.
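
A sketch revising a single openline (revise_writing_tasks below is called the same way):

    raw_revisions = generator.revise_open_qa(
        openline="What is the capital of France?",
        n_revisions=3,
        model="nvidia/nemotron-4-340b-instruct",
    )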

revise_writing_tasks(
openline: str,
n_revisions: str | int,
model: str,
prompt_template: str = DEFAULT_REVISE_WRITING_TASK_PROMPT_TEMPLATE,
prompt_kwargs: dict | None = None,
model_kwargs: dict | None = None,
) → list[str]#

Prompts an LLM to revise a writing task a given number of times.

Args:
    openline: An openline to revise.
    n_revisions: The number of revisions to generate for the task.
    model: The name of the model that should be used to generate the response. Must be available in the LLMClient passed in the constructor.
    prompt_template: A format string of the prompt to use. It must have the following parameters:
        - openline: Will be populated with the openline passed in this function.
        - n_revisions: Will be populated with the n_revisions passed in this function.
    prompt_kwargs: Any additional keyword arguments that should be passed to the prompt template. None are needed for the default template.
    model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call.

Returns:
    A list of responses from the LLM. The list is only greater than length 1 if n > 1 is set in model_kwargs.

run_closed_qa_pipeline(
documents: list[str],
n_openlines: str | int,
model: str,
closed_qa_prompt_template: str = DEFAULT_CLOSED_QA_PROMPT_TEMPLATE,
yaml_conversion_prompt_template: str = DEFAULT_YAML_CONVERSION_PROMPT_TEMPLATE,
base_model_kwargs: dict | None = None,
conversion_model_kwargs: dict | None = None,
ignore_conversion_failure: bool = False,
) → list[tuple[int, str]]#

Runs a pipeline for automatically generating closed Q&A openlines for a dialogue.

Args:
    documents: A list of documents to generate closed Q&A questions for.
    n_openlines: The number of questions to generate per document.
    model: The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.
    closed_qa_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_openlines: Will be populated with the n_openlines passed in this function.
        - document: Will be populated with one element of the documents list passed in this function.
        No additional parameters may be passed to this prompt template.
    yaml_conversion_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - llm_response: Will be populated with the raw LLM response from each stage of the pipeline.
        No additional parameters may be passed to this prompt template.
    base_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
    conversion_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the YAML conversion stages of the pipeline.
    ignore_conversion_failure: If True, ignores YAML conversion failures when possible and discards the data for which conversion was attempted.

Returns:
    A list of pairs where the first element is the index into the documents list of the document used to generate the question, and the second element is a synthetically generated closed Q&A prompt. Example: [(0, “Summarize this document”), …]
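
A sketch of the full pipeline, reusing the generator from above; with ignore_conversion_failure=True, documents whose intermediate responses cannot be parsed are dropped:

    documents = ["A document about photosynthesis...", "A document about Rome..."]
    closed_qa_questions = generator.run_closed_qa_pipeline(
        documents=documents,
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",
        ignore_conversion_failure=True,
    )
    # e.g. [(0, "Summarize the role of chlorophyll."), ...]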

run_math_pipeline(
n_macro_topics: str | int,
school_level: str,
n_subtopics: str | int,
n_openlines: str | int,
model: str,
macro_topic_prompt_template: str = DEFAULT_MATH_MACRO_TOPICS_PROMPT_TEMPLATE,
subtopic_prompt_template: str = DEFAULT_MATH_SUBTOPICS_PROMPT_TEMPLATE,
math_problem_prompt_template: str = MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE,
yaml_conversion_prompt_template: str = DEFAULT_YAML_CONVERSION_PROMPT_TEMPLATE,
base_model_kwargs: dict | None = None,
conversion_model_kwargs: dict | None = None,
additional_macro_topics: list[str] | None = None,
additional_subtopics: list[str] | None = None,
ignore_conversion_failure: bool = False,
combine_topics: bool = True,
) → list[str]#

Runs a pipeline for automatically generating math questions for a dialogue.

Args:
    n_macro_topics: The number of macro topics to generate.
    school_level: The school level to target when generating macro topics.
    n_subtopics: The number of subtopics to generate per macro topic.
    n_openlines: The number of questions to generate per topic.
    model: The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.
    macro_topic_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_macro_topics: Will be populated with the n_macro_topics passed in this function.
        - school_level: Will be populated with the school_level passed in this function.
        No additional parameters may be passed to this prompt template.
    subtopic_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_subtopics: Will be populated with the n_subtopics passed in this function.
        - macro_topic: Will be populated with a generated macro topic.
        No additional parameters may be passed to this prompt template.
    math_problem_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_openlines: Will be populated with the n_openlines passed in this function.
        - topic: Will be populated with a generated topic.
        No additional parameters may be passed to this prompt template.
        Some example templates found in nemo_curator.synthetic include:
        - MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE
        - MATH_PROBLEM_BEGINNER_PROMPT_TEMPLATE
    yaml_conversion_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - llm_response: Will be populated with the raw LLM response from each stage of the pipeline.
        No additional parameters may be passed to this prompt template.
    base_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
    conversion_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the YAML conversion stages of the pipeline.
    ignore_conversion_failure: If True, ignores YAML conversion failures when possible and discards the data for which conversion was attempted.
    combine_topics: If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:
    A list of synthetically generated math prompts.
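
An illustrative end-to-end call (counts, school level, and model name are placeholders):

    math_questions = generator.run_math_pipeline(
        n_macro_topics=20,
        school_level="university",
        n_subtopics=4,
        n_openlines=5,
        model="nvidia/nemotron-4-340b-instruct",
        ignore_conversion_failure=True,
    )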

run_open_qa_pipeline(
n_macro_topics: str | int,
n_subtopics: str | int,
n_openlines: str | int,
n_revisions: str | int,
model: str,
macro_topic_prompt_template: str = DEFAULT_MACRO_TOPICS_PROMPT_TEMPLATE,
subtopic_prompt_template: str = DEFAULT_SUBTOPICS_PROMPT_TEMPLATE,
open_qa_from_topics_prompt_template: str = DEFAULT_OPEN_QA_FROM_TOPICS_PROMPT_TEMPLATE,
revise_open_qa_prompt_template: str = DEFAULT_REVISE_OPEN_QA_PROMPT_TEMPLATE,
yaml_conversion_prompt_template: str = DEFAULT_YAML_CONVERSION_PROMPT_TEMPLATE,
base_model_kwargs: dict | None = None,
conversion_model_kwargs: dict | None = None,
additional_macro_topics: list[str] | None = None,
additional_subtopics: list[str] | None = None,
ignore_conversion_failure: bool = False,
combine_topics: bool = True,
) → list[str]#

Runs a pipeline for automatically generating open Q&A openlines for a dialogue.

Args:
    n_macro_topics: The number of macro topics to generate.
    n_subtopics: The number of subtopics to generate per macro topic.
    n_openlines: The number of questions to generate per topic.
    n_revisions: The number of revisions to generate per original question.
    model: The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.
    macro_topic_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_macro_topics: Will be populated with the n_macro_topics passed in this function.
        No additional parameters may be passed to this prompt template.
    subtopic_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_subtopics: Will be populated with the n_subtopics passed in this function.
        - macro_topic: Will be populated with a generated macro topic.
        No additional parameters may be passed to this prompt template.
    open_qa_from_topics_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_openlines: Will be populated with the n_openlines passed in this function.
        - topic: Will be populated with a generated topic.
        No additional parameters may be passed to this prompt template.
    revise_open_qa_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_revisions: Will be populated with the n_revisions passed in this function.
        - openline: Will be populated with a generated open Q&A openline.
        No additional parameters may be passed to this prompt template.
    yaml_conversion_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - llm_response: Will be populated with the raw LLM response from each stage of the pipeline.
        No additional parameters may be passed to this prompt template.
    base_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
    conversion_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the YAML conversion stages of the pipeline.
    ignore_conversion_failure: If True, ignores YAML conversion failures when possible and discards the data for which conversion was attempted.
    combine_topics: If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:
    A list of synthetically generated open Q&A prompts.
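
An illustrative call, analogous to the math pipeline above but with a revision stage:

    open_qa_questions = generator.run_open_qa_pipeline(
        n_macro_topics=20,
        n_subtopics=4,
        n_openlines=5,
        n_revisions=2,
        model="nvidia/nemotron-4-340b-instruct",
        ignore_conversion_failure=True,
    )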

run_python_pipeline(
n_macro_topics: str | int,
n_subtopics: str | int,
n_openlines: str | int,
model: str,
macro_topic_prompt_template: str = DEFAULT_PYTHON_MACRO_TOPICS_PROMPT_TEMPLATE,
subtopic_prompt_template: str = DEFAULT_PYTHON_SUBTOPICS_PROMPT_TEMPLATE,
python_problem_prompt_template: str = PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE,
yaml_conversion_prompt_template: str = DEFAULT_YAML_CONVERSION_PROMPT_TEMPLATE,
base_model_kwargs: dict | None = None,
conversion_model_kwargs: dict | None = None,
additional_macro_topics: list[str] | None = None,
additional_subtopics: list[str] | None = None,
ignore_conversion_failure: bool = False,
combine_topics: bool = True,
) → list[str]#

Runs a pipeline for automatically generating Python questions for a dialogue.

Args:
    n_macro_topics: The number of macro topics to generate.
    n_subtopics: The number of subtopics to generate per macro topic.
    n_openlines: The number of questions to generate per topic.
    model: The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.
    macro_topic_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_macro_topics: Will be populated with the n_macro_topics passed in this function.
        No additional parameters may be passed to this prompt template.
    subtopic_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_subtopics: Will be populated with the n_subtopics passed in this function.
        - macro_topic: Will be populated with a generated macro topic.
        No additional parameters may be passed to this prompt template.
    python_problem_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_openlines: Will be populated with the n_openlines passed in this function.
        - language: Will be populated with “Python”.
        - topic: Will be populated with a generated topic.
        No additional parameters may be passed to this prompt template.
        Some example templates found in nemo_curator.synthetic include:
        - PYTHON_PROBLEM_BEGINNER_PROMPT_TEMPLATE
        - PYTHON_PROBLEM_INTERMEDIATE_PROMPT_TEMPLATE
        - PYTHON_PROBLEM_ADVANCED_PROMPT_TEMPLATE
    yaml_conversion_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - llm_response: Will be populated with the raw LLM response from each stage of the pipeline.
        No additional parameters may be passed to this prompt template.
    base_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
    conversion_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the YAML conversion stages of the pipeline.
    ignore_conversion_failure: If True, ignores YAML conversion failures when possible and discards the data for which conversion was attempted.
    combine_topics: If True, mixes the macro topics with the subtopics when generating openlines. If False, only the subtopics are used.

Returns:
    A list of synthetically generated Python prompts.

run_writing_pipeline(
topics: list[str],
text_material_types: list[str],
n_openlines: str | int,
n_revisions: str | int,
model: str,
writing_task_prompt_template: str = DEFAULT_WRITING_TASK_PROMPT_TEMPLATE,
revise_writing_task_prompt_template: str = DEFAULT_REVISE_WRITING_TASK_PROMPT_TEMPLATE,
yaml_conversion_prompt_template: str = DEFAULT_YAML_CONVERSION_PROMPT_TEMPLATE,
base_model_kwargs: dict | None = None,
conversion_model_kwargs: dict | None = None,
ignore_conversion_failure: bool = False,
) → list[str]#

Runs a pipeline for automatically generating writing task openlines for a dialogue.

Args:
    topics: A list of topics to generate tasks for.
    text_material_types: A list of writing material types, like “Essay” or “Blog post”.
    n_openlines: The number of tasks to generate per (topic, text_material_type) pair.
    n_revisions: The number of revisions to generate per original task.
    model: The name of the model that should be used to generate all the responses. Must be available in the LLMClient passed in the constructor.
    writing_task_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_openlines: Will be populated with the n_openlines passed in this function.
        - topic: Will be populated with one element of the topics list passed in this function.
        - text_material_type: Will be populated with one element of the text_material_types list passed in this function.
        No additional parameters may be passed to this prompt template.
    revise_writing_task_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - n_revisions: Will be populated with the n_revisions passed in this function.
        - openline: Will be populated with one of the writing tasks generated in the pipeline.
        No additional parameters may be passed to this prompt template.
    yaml_conversion_prompt_template: A format string of the prompt to use. It must have the following parameters:
        - llm_response: Will be populated with the raw LLM response from each stage of the pipeline.
        No additional parameters may be passed to this prompt template.
    base_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the normal stages of the pipeline.
    conversion_model_kwargs: Any additional keyword arguments that should be passed to the LLMClient.query_model call for the YAML conversion stages of the pipeline.
    ignore_conversion_failure: If True, ignores YAML conversion failures when possible and discards the data for which conversion was attempted.

Returns:
    A list of synthetically generated writing task prompts.
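
An illustrative call (topics, material types, and model name are placeholders):

    writing_tasks = generator.run_writing_pipeline(
        topics=["Climate change", "Space exploration"],
        text_material_types=["Essay", "Blog post"],
        n_openlines=5,
        n_revisions=2,
        model="nvidia/nemotron-4-340b-instruct",
        ignore_conversion_failure=True,
    )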