Prompt Optimization Task#

Build confidence in LLM-as-a-Judge evaluations by using prompt optimization to improve the judge prompt and evaluate its effectiveness.

Prerequisites#

  • A model target. Refer to LLM Model Endpoint for more information.

  • A labeled dataset for prompt optimization in JSON or JSONL format.

  • Your dataset uploaded to NeMo Data Store using the Hugging Face CLI or SDK (see the sketch after this list).
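
The upload step can be done with the Hugging Face Hub SDK pointed at the NeMo Data Store's Hugging Face-compatible endpoint. The following is a minimal sketch; the endpoint URL, token handling, namespace, repository name, and file name are placeholders that depend on your deployment.

from huggingface_hub import HfApi

# Point the Hugging Face Hub client at the NeMo Data Store endpoint.
# The URL and token below are placeholders for your deployment.
api = HfApi(endpoint="http://<data-store-host>/v1/hf", token="<token>")

# Create the dataset repository if it does not exist yet.
repo_id = "<namespace>/<name>"
api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)

# Upload the labeled dataset; reference it later in the job config as
# hf://datasets/<namespace>/<name>/<file-path>
api.upload_file(
    path_or_fileobj="prompt_optimization.jsonl",
    path_in_repo="prompt_optimization.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)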


Optimizer Types#

Optimizer types for Prompt Optimization#

| Type | Library | Description |
|------|---------|-------------|
| MIPROv2 | DSPy | Optimize the instruction and few-shot examples, or the instruction only (0-shot), using Bayesian Optimization. |


MIPROv2#

Use MIPROv2 (Multiprompt Instruction PRoposal Optimizer Version 2) to optimize an LLM prompt. MIPROv2 uses Bayesian Optimization to propose instruction and few-shot example candidates that are tailored to the dynamics of the task. The miprov2 task in Evaluator is powered by the DSPy library.

Any metric available for Custom Evaluation Metrics is compatible with prompt optimization. MIPROv2 requires a metric that computes a single boolean score. If your metric computes a non-boolean (float) score, set metric_threshold; if it computes multiple scores, set metric_threshold_score to the score to use for evaluation. See MIPROv2 Optimizer Parameters for more information.
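
For example, a metric that returns a float score could be thresholded by adding metric_threshold to the optimizer block. This fragment is illustrative only; the threshold value is an assumption and is not needed for the boolean number-check metric used in the example below.

"optimizer": {
  "type": "miprov2",
  "instruction": "<initial prompt to optimize>",
  "signature": "question, reference, model_output -> similarity_score: int",
  "metric_threshold": 0.5
}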

The following configuration is an example of optimizing an LLM-as-a-Judge prompt. The instruction is the initial judge prompt, which is tasked with scoring the similarity between a golden example (reference) and a cached output from another model (model_output). For this use case, the number-check metric is used as the optimization metric to check that the labeled similarity score from the dataset and the similarity score generated by the target are within an acceptable threshold of each other. Iterate on the prompt optimization until you are satisfied with the generated prompt and evaluation score. You can then use the resulting optimized prompt in the template for LLM-as-a-Judge evaluation.

Your job can use a different initial instruction and signature. The signature is required to define semantic roles for the inputs and outputs of your custom dataset. Modify the metrics template according to your dataset.

{
  "type": "custom",
  "tasks": {
    "llm-judge-prompt": {
      "type": "prompt-optimization",
      "params": {
        "optimizer": {
          "type": "miprov2",
          "instruction": "Your task is to evaluate the semantic similarity between two responses using a score between 0 and 10. Respond in the following format SIMILARITY: 4.",
          "signature": "question, reference, model_output -> similarity_score: int"
        }
      },
      "metrics": {
        "number-check": {
        "type": "number-check",
          "params": {
            "check": [
              "absolute difference",
              "{{item.similarity_score | trim}}",
              "{{similarity_score | trim}}",
              "epsilon",
              1
            ]
          }
        }
      },
      "dataset": {
        "files_url": "hf://datasets/<namespace>/<name>/<file-path>"
      }
    }
  }
}

The labeled dataset for prompt optimization must be in JSON or JSONL format and contain at least two examples. Each example provides the fields referenced by the signature and the metrics template:

{
  "question": "What is the capital of France?",
  "reference": "Paris",
  "model_output": "Paris",
  "similarity_score": "10"
}
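
Because at least two examples are required, a JSONL dataset has one record per line. The second record below is taken from the few-shot examples in the result that follows and is shown only to illustrate the format.

{"question": "What is the capital of France?", "reference": "Paris", "model_output": "Paris", "similarity_score": "10"}
{"question": "What is breve coffee?", "reference": "a coffee drink made with espresso and steamed half-and-half instead of milk", "model_output": "Cafe au Lait", "similarity_score": "1"}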

Example result with few-shot examples included in optimized_prompt.

{
  "tasks": {
    "llm-judge-prompt": {
      "metrics": {
        "miprov2": {
          "scores": {
            "baseline": {
              "value": 0.263
            },
            "optimized": {
              "value": 0.667
            }
          }
        }
      },
      "data": {
        "baseline_prompt": "Your task is to evaluate the semantic similarity between two responses using a score between 0 and 10. Respond in the following format SIMILARITY: 4.",
        "optimized_prompt": "Evaluate the semantic similarity between two responses using a score between 0 and 10. Recommend a prompt that prompts the Language Model to provide a score that is closer to 0 to highlight the difference between the two responses.\n{\"question\":\"What is the capital of France?\",\"reference\":\"Paris\",\"model_output\":\"Paris\",\"similarity_score\":\"10\"}\n{\"question\":\"What is breve coffee?\",\"reference\":\"a coffee drink made with espresso and steamed half-and-half instead of milk\",\"model_output\":\"Cafe au Lait\",\"similarity_score\":\"1\"}"
      }
    }
  }
}

Parameters#

Target Model Parameters#

These parameters control the target model’s generation behavior:

| Name | Description | Type | Default | Valid Values |
|------|-------------|------|---------|--------------|
| max_tokens | Maximum number of tokens to generate. | Integer | 6144 | - |
| temperature | Sampling temperature for generation. | Float | 0.5 | 0.0–2.0 |
| top_p | Nucleus sampling parameter. | Float | 0.95 | 0.01–1.0 |
| stop | Stop generating further tokens when the string is generated. | Array of strings | - | - |
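
As a sketch, the fragment below uses the documented defaults with an illustrative stop string. Where this block is placed depends on your target and task configuration, so treat the placement and the stop value as assumptions rather than a fixed schema.

{
  "max_tokens": 6144,
  "temperature": 0.5,
  "top_p": 0.95,
  "stop": ["<|endoftext|>"]
}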

MIPROv2 Optimizer Parameters#

The full list of parameters supported by MIPROv2 is available in the dspy.MIPROv2 library documentation. Set these parameters in the task.params.optimizer section:

| Name | Description | Type | Default |
|------|-------------|------|---------|
| auto | The optimization intensity, which controls the number of optimization trials and other internal parameters. Can be set to light, medium, or heavy. Heavier settings perform more optimization trials but require more computation. | String | light |
| instruction | The initial instruction to optimize. | String | - |
| max_bootstrapped_demos | The maximum number of examples to generate using your program. | Integer | 4 |
| max_labeled_demos | The maximum number of examples to use directly from your training set. | Integer | 4 |
| metric_threshold | The threshold value used to evaluate for optimization. Required for metrics that compute float scores. | Float | - |
| metric_threshold_score | Specifies which score is used to evaluate for optimization. Required for metrics that compute multiple scores. | String | - |
| seed | A seed for the algorithm and dataset split. | Integer | 9 |
| signature | Inline DSPy signature required to define semantic roles for inputs and outputs. The signature must match the dataset structure and follow the DSPy format with -> separating inputs and outputs, for example: context, question -> answer | String | Required |
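
For illustration, a more fully specified optimizer block might look like the following. Apart from the instruction and signature, which are reused from the example above, the values shown are simply the defaults from the table.

"optimizer": {
  "type": "miprov2",
  "auto": "light",
  "instruction": "Your task is to evaluate the semantic similarity between two responses using a score between 0 and 10. Respond in the following format SIMILARITY: 4.",
  "signature": "question, reference, model_output -> similarity_score: int",
  "max_bootstrapped_demos": 4,
  "max_labeled_demos": 4,
  "seed": 9
}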

Metrics#

Metrics in Prompt Optimization#

| Metric Name | Description | Value Range |
|-------------|-------------|-------------|
| baseline | The accuracy score from evaluating with the provided instruction and the configured metric. | 0.0–1.0 |
| optimized | The accuracy score from evaluating with the optimized instruction and the configured metric. | 0.0–1.0 |

Results Data#

Results Data in Prompt Optimization#

| Data | Description | Type |
|------|-------------|------|
| baseline_prompt | The provided instruction. | String |
| optimized_prompt | The optimized instruction (with few-shot examples) generated from prompt optimization. | String |