Using Reward Models
NIM LLM supports deploying large language reward models in addition to chat and completion models. Reward models are often used to score the outputs of another large language model, either to fine-tune that model further or to filter synthetically generated datasets.
To send text to a reward model, use the chat/completions endpoint, as you would for other kinds of models. Include the prompt that was used to generate the text as the first user message, and the response from the model as the assistant message. The reward model scores the provided response, taking into account the query that generated it. For example:
from openai import OpenAI

# Connect to the NIM endpoint through the OpenAI-compatible client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
original_query = "I am going to Paris, what should I see?"
original_response = "Ah, Paris, the City of Light! There are so many amazing things to see and do in this beautiful city ..."
messages = [
    {"role": "user", "content": original_query},
    {"role": "assistant", "content": original_response}
]
response = client.chat.completions.create(
    model="nvidia/nemotron-4-340b-reward",
    messages=messages,
    stream=False
)
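When many prompt/response pairs need scoring, it can help to wrap the request in a small helper. The following is a minimal sketch, not part of NIM or the OpenAI client: `build_reward_messages` and `score_response` are hypothetical names, and `client` is assumed to be an `OpenAI` instance configured as in the example above.

```python
from typing import Any


def build_reward_messages(query: str, answer: str) -> list[dict[str, str]]:
    """Pair the original prompt (user) with the response to be scored (assistant)."""
    return [
        {"role": "user", "content": query},
        {"role": "assistant", "content": answer},
    ]


def score_response(client: Any, model: str, query: str, answer: str) -> str:
    """Send one prompt/response pair to the reward model and return the raw message content.

    Assumption: `client` behaves like openai.OpenAI pointed at the NIM endpoint.
    """
    completion = client.chat.completions.create(
        model=model,
        messages=build_reward_messages(query, answer),
        stream=False,
    )
    return completion.choices[0].message.content
```

With the client from the example above, `score_response(client, "nvidia/nemotron-4-340b-reward", original_query, original_response)` would return the same message content as `response.choices[0].message.content`.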
The response from NIM includes attribute and score pairs in the message content, where a regular chat completion model would return its generated text. The attributes that a reward model scores responses on are specific to each reward model. Reward models that are trained using the HelpSteer dataset (such as nemotron-4-340b-reward) score responses on the following attributes:
- Helpfulness 
- Correctness 
- Coherence 
- Complexity 
- Verbosity 
You can use this response in your downstream applications. For example, you may want to parse the scores into a Python dictionary:
response_content = response.choices[0].message.content
reward_pairs = [pair.split(":") for pair in response_content.split(",")]
reward_dict = {attribute: float(score) for attribute, score in reward_pairs}
print(reward_dict)
# Prints:
# {'helpfulness': 1.2578125, 'correctness': 0.43359375, 'coherence': 3.34375, 'complexity': 0.045166015625, 'verbosity': 0.6953125}
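For filtering a dataset, a slightly more defensive parser plus a threshold check may be useful. This is a minimal sketch: it assumes the `attribute:score,attribute:score` layout implied by the parsing code above, and `parse_reward_scores`, `is_helpful_enough`, and the threshold of `1.0` are hypothetical choices for illustration, not values recommended by NIM.

```python
def parse_reward_scores(content: str) -> dict[str, float]:
    """Parse 'name:score,name:score' text into a dict, tolerating stray whitespace.

    Raises ValueError if a fragment has no numeric score after the colon.
    """
    scores: dict[str, float] = {}
    for pair in content.split(","):
        if not pair.strip():
            continue  # skip empty fragments such as a trailing comma
        attribute, _, value = pair.partition(":")
        scores[attribute.strip()] = float(value)
    return scores


def is_helpful_enough(scores: dict[str, float], min_helpfulness: float = 1.0) -> bool:
    """Return True when the 'helpfulness' score meets the (assumed) threshold."""
    return scores.get("helpfulness", float("-inf")) >= min_helpfulness


sample = "helpfulness:1.2578125, correctness:0.43359375, coherence:3.34375"
sample_scores = parse_reward_scores(sample)
keep_sample = is_helpful_enough(sample_scores)
```

Which attribute to threshold on, and at what value, depends on the reward model and the dataset being filtered.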