Wikipedia Style Rewrite Pipeline
This pipeline rewrites documents in a Wikipedia-like style, improving line spacing and punctuation and giving the text a more scholarly tone. It uses a language model to transform low-quality text into well-formatted, encyclopedia-style content that is better suited for training datasets.
Before You Start
LLM Client Setup: The NemotronCCGenerator requires an LLMClient instance to interface with language models. Refer to the LLM services documentation for details on configuring your client with specific model providers.
Setup Steps
Set up the LLM Client
Configure your LLM client. This example uses the OpenAI Python client pointed at NVIDIA's API endpoint:
from openai import OpenAI
openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>"
)
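Pasting the key into the script works for a quick test, but a script that is shared or committed usually reads it from the environment instead. The following is a minimal sketch, assuming the key is stored in an environment variable named `NVIDIA_API_KEY`. Both that variable name and the `load_api_key` helper are illustrative and not part of NeMo Curator.

```python
import os


def load_api_key(env_var: str = "NVIDIA_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    The helper and the default variable name are hypothetical; use whatever
    name your deployment already defines.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return api_key
```

The returned value can then be passed as `api_key=load_api_key()` when constructing the `OpenAI` client.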
Create the NeMo Curator Client Wrapper
Wrap the OpenAI client in NeMo Curator's OpenAIClient wrapper:
from nemo_curator import OpenAIClient
client = OpenAIClient(openai_client)
Initialize the Generator
Create the NemotronCCGenerator instance:
from nemo_curator.synthetic import NemotronCCGenerator
generator = NemotronCCGenerator(client)
Configure Generation Parameters
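Choose a model and set the generation parameters. Here temperature and top_p control sampling randomness, and max_tokens caps the length of each response: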
model = "nv-mistralai/mistral-nemo-12b-instruct"
model_kwargs = {
    "temperature": 0.5,
    "top_p": 0.9,
    "max_tokens": 512,
}
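A typo in these values usually surfaces only as an API error at request time, so it can help to check them up front. The sketch below uses a hypothetical `check_generation_kwargs` helper. The bounds it enforces follow common OpenAI-compatible conventions and are an assumption; a particular model endpoint may accept a different range.

```python
def check_generation_kwargs(kwargs: dict) -> dict:
    """Return a copy of `kwargs` after basic range checks (hypothetical helper).

    The bounds are assumed from common OpenAI-compatible conventions and may
    not match every model endpoint.
    """
    temperature = kwargs.get("temperature", 1.0)
    top_p = kwargs.get("top_p", 1.0)
    max_tokens = kwargs.get("max_tokens")

    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature out of range: {temperature}")
    if not 0.0 < top_p <= 1.0:
        raise ValueError(f"top_p out of range: {top_p}")
    if max_tokens is not None and (not isinstance(max_tokens, int) or max_tokens <= 0):
        raise ValueError(f"max_tokens must be a positive integer: {max_tokens}")
    return dict(kwargs)


# The values from this guide pass the checks unchanged.
model_kwargs = check_generation_kwargs(
    {"temperature": 0.5, "top_p": 0.9, "max_tokens": 512}
)
```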
Rewrite Documents to Wikipedia Style
Use the generator to transform text into Wikipedia-style content. The method returns a list of responses, so the example prints the first one:
Python Example
document = "The moon is bright. It shines at night."
responses = generator.rewrite_to_wikipedia_style(
    document=document,
    model=model,
    model_kwargs=model_kwargs
)
print(responses[0])
# Output:
# The lunar surface has a high albedo, which means it reflects a significant amount of sunlight.
Note
The output shown is illustrative. Actual outputs will vary based on the input text and model parameters.