Closed Q&A Generation Pipeline
This pipeline generates closed-ended questions about a given document, following the approach used for Nemotron-4 340B. A closed-ended question can be answered directly from the content of the provided document, whereas an open-ended question requires broader knowledge.
Setup Steps
Set up the LLM Client
Configure your LLM client. This example uses the OpenAI Python client pointed at the NVIDIA API endpoint:
from openai import OpenAI

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>",
)
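Hard-coding the key is fine for a quick test, but it is easy to commit by accident. A minimal sketch of reading it from the environment instead, assuming a variable named `NVIDIA_API_KEY`; both that name and the `load_api_key` helper are illustrative choices, not something NeMo Curator or the NVIDIA API requires:

```python
import os


def load_api_key(env_var: str = "NVIDIA_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    The variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return api_key
```

The result can then be passed as `api_key=load_api_key()` when constructing the `OpenAI` client.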
Create the NeMo Curator Client Wrapper
Wrap the client with NeMo Curator’s client wrapper:
from nemo_curator.services import OpenAIClient
llm_client = OpenAIClient(openai_client)
Initialize the Generator
Create the NemotronGenerator instance:
from nemo_curator.synthetic import NemotronGenerator
generator = NemotronGenerator(llm_client)
Configure Generation Parameters
Set up your model and generation parameters:
model = "mistralai/mixtral-8x7b-instruct-v0.1"
document = "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal."
n_openlines = 5  # number of questions to generate for the document
Generate Questions from Document
Use the generator to create closed-ended questions:
closed_qa_responses = generator.generate_closed_qa_instructions(
    document=document,
    n_openlines=n_openlines,
    model=model,
)
# Parse the responses to extract individual questions
closed_qa_questions = generator.convert_response_to_yaml_list(
    closed_qa_responses[0],
    model=model,
)
print(closed_qa_questions[0])
# Output:
# "Which President of the United States gave this speech?"
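Generated questions can contain near-duplicates or stray whitespace, so a light cleanup pass before using them may help. The following is a minimal sketch in plain Python; `clean_questions` is a hypothetical helper written for this example, not part of NeMo Curator, and the sample list is made up for illustration:

```python
def clean_questions(questions: list[str]) -> list[str]:
    """Normalize whitespace, drop empty entries, and remove duplicates.

    Duplicates are detected case-insensitively; the first occurrence wins
    and the original order is preserved.
    """
    seen: set[str] = set()
    cleaned: list[str] = []
    for question in questions:
        # Collapse runs of whitespace and trim both ends.
        text = " ".join(str(question).split())
        key = text.casefold()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


# Hypothetical parsed output, used only to illustrate the helper.
sample = [
    "Which President of the United States gave this speech?",
    "  which president of the united states gave this speech? ",
    "",
]
print(clean_questions(sample))
```

In the example above, the helper would be applied as `clean_questions(closed_qa_questions)`.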
Run the End-to-End Pipeline
To process multiple documents, run the complete pipeline, which generates and parses the questions in a single call:
documents = [
    "Four score and seven years ago our fathers brought forth on this continent...",
    "We hold these truths to be self-evident, that all men are created equal...",
    # Add more documents as needed
]
closed_qa_questions = generator.run_closed_qa_pipeline(
    documents=documents,
    n_openlines=n_openlines,
    model=model,
)
print(closed_qa_questions[0])
# Output: a (document index, question) tuple
# (0, "Which President of the United States gave this speech?")
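Because each result carries the index of its source document, the questions can be regrouped per document afterwards. A minimal sketch, assuming the pipeline returns a flat list of `(document_index, question)` tuples as in the example output above; `group_questions_by_document` is a hypothetical helper, not a NeMo Curator function:

```python
from collections import defaultdict


def group_questions_by_document(documents, indexed_questions):
    """Map each document to the questions generated for it.

    `indexed_questions` is assumed to be a list of
    (document_index, question) tuples.
    """
    grouped = defaultdict(list)
    for doc_index, question in indexed_questions:
        grouped[doc_index].append(question)
    # Keep the original document order; documents without questions
    # get an empty list.
    return [
        {"document": document, "questions": grouped.get(i, [])}
        for i, document in enumerate(documents)
    ]
```

Calling `group_questions_by_document(documents, closed_qa_questions)` would then yield one record per document, which is convenient for writing the results out as JSON Lines.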