Wikipedia Style Rewrite Pipeline

This pipeline rewrites documents in a style similar to Wikipedia, improving line spacing, punctuation, and scholarly tone. It uses a language model to transform low-quality text into well-formatted, encyclopedia-style content that is better suited for use in training datasets.

Before You Start

  • LLM Client Setup: The NemotronCCGenerator requires an LLMClient instance to interface with language models. Refer to the LLM services documentation for details on configuring your client with specific model providers.


Setup Steps

Set up the LLM Client

Configure your LLM client. This example uses the OpenAI Python client pointed at NVIDIA's API endpoint:

from openai import OpenAI

openai_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="<insert NVIDIA API key>"
)
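Pasting the key directly into source code makes it easy to leak through version control. A minimal sketch of one alternative, assuming the key is stored in an environment variable: the variable name `NVIDIA_API_KEY` and the helper `read_api_key` are hypothetical names chosen for illustration, not something the library requires.

```python
import os


def read_api_key(env_var: str = "NVIDIA_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is unset or empty.

    The default variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

The returned value can then be passed as `api_key=read_api_key()` when constructing the `OpenAI` client shown above.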

Create the NeMo Curator Client Wrapper

Wrap the client with NeMo Curator’s client wrapper:

from nemo_curator import OpenAIClient

client = OpenAIClient(openai_client)

Initialize the Generator

Create the NemotronCCGenerator instance:

from nemo_curator.synthetic import NemotronCCGenerator

generator = NemotronCCGenerator(client)

Configure Generation Parameters

Set up your model and generation parameters:

model = "nv-mistralai/mistral-nemo-12b-instruct"
model_kwargs = {
    "temperature": 0.5,
    "top_p": 0.9,
    "max_tokens": 512,
}
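A typo in a sampling parameter usually surfaces only as an API error at request time. The sketch below checks the values up front. The helper `check_model_kwargs` is hypothetical, and the accepted ranges (temperature in [0, 2], top_p in (0, 1], positive integer max_tokens) are assumptions based on common OpenAI-compatible endpoints; the model you call may enforce different limits.

```python
def check_model_kwargs(kwargs: dict) -> dict:
    """Return kwargs unchanged if the sampling values look sane, else raise ValueError.

    The ranges below are assumptions for illustration, not limits documented
    for any particular model.
    """
    temperature = kwargs.get("temperature", 1.0)
    top_p = kwargs.get("top_p", 1.0)
    max_tokens = kwargs.get("max_tokens", 1)
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature out of range: {temperature}")
    if not 0.0 < top_p <= 1.0:
        raise ValueError(f"top_p out of range: {top_p}")
    if not (isinstance(max_tokens, int) and max_tokens > 0):
        raise ValueError(f"max_tokens must be a positive integer: {max_tokens}")
    return kwargs


# Same values as the configuration above, validated before use.
model_kwargs = check_model_kwargs(
    {"temperature": 0.5, "top_p": 0.9, "max_tokens": 512}
)
```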

Rewrite Documents to Wikipedia Style

Use the generator to transform text into Wikipedia-style content:

document = "The moon is bright. It shines at night."

responses = generator.rewrite_to_wikipedia_style(
    document=document,
    model=model,
    model_kwargs=model_kwargs
)

print(responses[0])
# Output:
# The lunar surface has a high albedo, which means it reflects a significant amount of sunlight.

Note

The output shown is illustrative. Actual outputs will vary based on the input text and model parameters.
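The call above handles one document at a time. To process several documents, one option is a plain loop that keeps the first response for each. The sketch below assumes only what the example shows: a `rewrite_to_wikipedia_style` method that takes `document`, `model`, and `model_kwargs` and returns an indexable list of responses. The helper `rewrite_all` and the `EchoGenerator` class are hypothetical; `EchoGenerator` is an offline stand-in so the sketch runs without an API key, and it would be replaced by the real `NemotronCCGenerator` instance in practice.

```python
def rewrite_all(generator, documents, model, model_kwargs):
    """Rewrite each document in turn and collect the first response for each."""
    rewritten = []
    for document in documents:
        responses = generator.rewrite_to_wikipedia_style(
            document=document,
            model=model,
            model_kwargs=model_kwargs,
        )
        rewritten.append(responses[0])
    return rewritten


class EchoGenerator:
    """Offline stand-in for NemotronCCGenerator (hypothetical, for illustration only)."""

    def rewrite_to_wikipedia_style(self, document, model, model_kwargs):
        # A real generator would call the language model here.
        return [f"[rewritten] {document}"]


results = rewrite_all(
    EchoGenerator(),
    ["The moon is bright.", "It shines at night."],
    model="demo-model",
    model_kwargs={},
)
print(results)
# ['[rewritten] The moon is bright.', '[rewritten] It shines at night.']
```

The loop is sequential, so total runtime grows linearly with the number of documents.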