Use Reward Models with NVIDIA NIM for LLMs#

NIM for LLMs supports deploying large language reward models in addition to chat and completion models. Reward models are typically used to score the outputs of another large language model, for example to fine-tune that model or to filter synthetically generated datasets.

When deploying NIMs with reward models, specify the environment variables NIM_REWARD_MODEL, NIM_REWARD_LOGITS_RANGE, and NIM_REWARD_MODEL_STRING as described in Environment Variables.

To send text to a reward model, use the chat/completions endpoint. Include the prompt that was used to generate the text as the first user content, and include the response from the model as the assistant content. The reward model scores the provided model response, taking into account the query that generated it.

Review the prerequisites and common setup steps before deploying a reward model. For information about supported configurations, refer to supported models.

Deploy a Reward Model#

This example uses the HuggingFace endpoint for the Llama-3.3-Nemotron-70B-Reward-Principle model and deploys it using the multi-LLM compatible NIM container:

# Choose a container name for bookkeeping
export CONTAINER_NAME=reward-nim

# Set the multi-LLM NIM repository
export Repository=nim/nvidia/llm-nim

# Set the tag to latest or a specific version (for example, 1.15.0)
export TAG=latest

# Choose the multi-LLM NIM image from NGC
export IMG_NAME="nvcr.io/$Repository:$TAG"

# Set HF_TOKEN for downloading the model from the HuggingFace repository
export HF_TOKEN=hf_xxxxxx

# Choose the reward model from HuggingFace
export NIM_MODEL_NAME=hf://nvidia/Llama-3.3-Nemotron-70B-Reward-Principle

# Choose a served model name
export NIM_SERVED_MODEL_NAME=nvidia/Llama-3.3-Nemotron-70B-Reward-Principle

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Add write permissions to the NIM cache for downloading model assets
chmod -R a+w "$LOCAL_NIM_CACHE"

# Start the LLM NIM with reward model
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MODEL_NAME=$NIM_MODEL_NAME \
  -e NIM_SERVED_MODEL_NAME=$NIM_SERVED_MODEL_NAME \
  -e NIM_REWARD_MODEL=1 \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
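
The container can take a while to download and load the model weights before it starts serving requests. The following is a minimal sketch for waiting until the NIM is ready; it assumes the readiness endpoint /v1/health/ready on the mapped port 8000, and the polling interval and timeout values are arbitrary:

import time
import requests

# Poll the NIM readiness endpoint until the model is loaded and ready to serve.
# Assumes the container from the previous step is mapped to localhost:8000.
url = "http://localhost:8000/v1/health/ready"
deadline = time.time() + 1800  # wait up to 30 minutes for model download and load

while time.time() < deadline:
    try:
        if requests.get(url, timeout=5).status_code == 200:
            print("Reward NIM is ready")
            break
    except requests.ConnectionError:
        pass  # server is not accepting connections yet
    time.sleep(10)
else:
    raise TimeoutError("NIM did not become ready in time")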

Usage Example#

from openai import OpenAI

# The NIM exposes an OpenAI-compatible API; no real API key is required
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# The prompt that produced the response to be scored
original_query = "I am going to Paris, what should I see?"
# The generated response that the reward model evaluates
original_response = "Ah, Paris, the City of Light! There are so many amazing things to see and do in this beautiful city ..."

# The reward model expects the original prompt as the user content,
# followed by the generated text as the assistant content
messages = [
    {"role": "user", "content": original_query},
    {"role": "assistant", "content": original_response}
]

response = client.chat.completions.create(
    model="nvidia/Llama-3.3-Nemotron-70B-Reward-Principle",
    messages=messages,
    stream=False
)

# Parse the reward scores, returned as "attribute:score" pairs in the message content
response_content = response.choices[0].message.content
reward_pairs = [pair.split(":") for pair in response_content.split(",")]
reward_dict = {attribute.strip(): float(score) for attribute, score in reward_pairs}
print(reward_dict)

Expected Output:

{'helpfulness': 1.2578125, 'correctness': 0.43359375, 'coherence': 3.34375, 'complexity': 0.045166015625, 'verbosity': 0.6953125}

The response from NIM includes attribute and score pairs in the message content, where a regular chat completion model would return its generated text. The attributes that a reward model scores responses on are specific to each reward model. The reward model in the example above is trained using the HelpSteer3 dataset and scores responses according to the following attributes (a short filtering sketch follows the list):

  • Helpfulness

  • Correctness

  • Coherence

  • Complexity

  • Verbosity
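
These per-attribute scores can be applied directly to the dataset-filtering use case mentioned above. The sketch below scores several candidate responses for the same prompt and keeps only those above a helpfulness threshold. The score_response helper, the candidate list, and the threshold are illustrative, not part of the NIM API; only the chat.completions call mirrors the usage example above.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
MODEL = "nvidia/Llama-3.3-Nemotron-70B-Reward-Principle"

def score_response(prompt: str, candidate: str) -> dict[str, float]:
    """Return the reward model's attribute scores for one candidate response."""
    result = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": candidate},
        ],
    )
    # Parse the "attribute:score" pairs exactly as in the usage example above
    content = result.choices[0].message.content
    return {k.strip(): float(v) for k, v in (pair.split(":") for pair in content.split(","))}

prompt = "I am going to Paris, what should I see?"
candidates = [
    "Paris has many famous sights, such as the Eiffel Tower and the Louvre.",
    "I don't know.",
]

# Keep only candidates whose helpfulness score clears an (arbitrary) threshold
kept = [c for c in candidates if score_response(prompt, c)["helpfulness"] >= 1.0]
print(kept)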