Prompt Optimization Task#
Build confidence in LLM-as-a-Judge evaluations by using prompt optimization to improve the judge prompt and evaluate its effectiveness.
Prerequisites#
- A model target. Refer to LLM Model Endpoint for more information.
- A labeled dataset for prompt optimization in JSON or JSONL format.
- Your dataset uploaded to NeMo Data Store using the Hugging Face CLI or SDK (see the upload sketch after this list).
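If you still need to upload the dataset, the following is a minimal sketch using the Hugging Face SDK. The NeMo Data Store endpoint, token, repository ID, and file names are placeholders for your deployment, and the Hugging Face-compatible `/v1/hf` path is an assumption to verify against your Data Store documentation.

```python
from huggingface_hub import HfApi

# Placeholder values -- replace with your NeMo Data Store endpoint, token, and names.
NDS_URL = "http://nemo-data-store.example.com/v1/hf"  # assumed HF-compatible path
REPO_ID = "my-namespace/judge-optimization-data"

api = HfApi(endpoint=NDS_URL, token="<token>")

# Create the dataset repository if it does not already exist, then upload the labeled file.
api.create_repo(repo_id=REPO_ID, repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="judge_dataset.jsonl",
    path_in_repo="judge_dataset.jsonl",
    repo_id=REPO_ID,
    repo_type="dataset",
)
```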
Optimizer Types#
MIPROv2#
Use MIPROv2 (Multiprompt Instruction PRoposal Optimizer Version 2) to optimize an LLM prompt. MIPROv2 uses Bayesian Optimization to propose instruction and few-shot example candidates that are tailored to the dynamics of the task. The `miprov2` task in Evaluator is powered by the DSPy library.
Any metric available for Custom Evaluation Metrics is compatible with prompt optimization. MIPROv2 requires a metric that computes a single boolean score. Set `metric_threshold` when using a metric that computes a non-boolean score, such as a float, and set `metric_threshold_score` to the score to use for evaluation when using a metric that computes multiple scores. See MIPROv2 Optimizer Parameters for more information.
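For example, if your chosen metric computes a float score, a sketch of an optimizer block with a threshold set might look like the following; the threshold value is illustrative, not a recommendation.

```json
"optimizer": {
  "type": "miprov2",
  "instruction": "Your task is to evaluate the semantic similarity between two responses using a score between 0 and 10. Respond in the following format SIMILARITY: 4.",
  "signature": "question, reference, model_output -> similarity_score: int",
  "metric_threshold": 0.5
}
```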
The following configuration is an example of optimizing an LLM-as-a-Judge prompt. The `instruction` is the initial judge prompt, which is tasked with scoring the similarity between a golden example (`reference`) and a cached output from another model (`model_output`). For this use case, the `number-check` metric is used as the optimization metric to check that the labeled similarity score from the dataset and the similarity score generated by the target are within an acceptable threshold of each other. Iterate on the prompt optimization until you are satisfied with the generated prompt and evaluation score. The resulting optimized prompt can then be used in the template for LLM-as-a-Judge evaluation.
Your job can have a different initial instruction and signature. The signature is required to define the semantic roles of the inputs and outputs in your custom dataset. Modify the metrics template according to your dataset.
{
"type": "custom",
"tasks": {
"llm-judge-prompt": {
"type": "prompt-optimization",
"params": {
"optimizer": {
"type": "miprov2",
"instruction": "Your task is to evaluate the semantic similarity between two responses using a score between 0 and 10. Respond in the following format SIMILARITY: 4.",
"signature": "question, reference, model_output -> similarity_score: int"
}
},
"metrics": {
"number-check": {
"type": "number-check",
"params": {
"check": [
"absolute difference",
"{{item.similarity_score | trim}}",
"{{similarity_score | trim}}",
"epsilon",
1
]
}
}
},
"dataset": {
"files_url": "hf://datasets/<namespace>/<name>/<file-path>"
}
}
}
}
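To run the optimization, submit this configuration as an evaluation job. The following is a minimal sketch that assumes the standard NeMo Evaluator REST endpoints (`POST /v1/evaluation/configs` and `POST /v1/evaluation/jobs`); the base URL, namespace, resource names, and target name are placeholders, and the exact payload shape may differ in your Evaluator version.

```python
import json

import requests

EVALUATOR_URL = "http://nemo-evaluator.example.com"  # placeholder base URL
NAMESPACE = "my-namespace"                           # placeholder namespace

# Load the prompt-optimization configuration shown above from a local file.
with open("prompt_optimization_config.json") as f:
    config_spec = json.load(f)

# Register the configuration as an evaluation config resource.
requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/configs",
    json={"name": "llm-judge-prompt-opt", "namespace": NAMESPACE, **config_spec},
    timeout=30,
).raise_for_status()

# Create a job that pairs the config with an existing model target (see Prerequisites).
job = requests.post(
    f"{EVALUATOR_URL}/v1/evaluation/jobs",
    json={
        "namespace": NAMESPACE,
        "config": f"{NAMESPACE}/llm-judge-prompt-opt",
        "target": f"{NAMESPACE}/my-model-target",
    },
    timeout=30,
).json()
print(job["id"])
```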
The labeled dataset for prompt optimization must be in JSON or JSONL format and contain at least two examples. The following is an example record:
{
"question": "What is the capital of France?",
"reference": "Paris",
"model_output": "Paris",
"similarity_score": "10"
}
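In JSONL format, a minimal dataset that meets the two-example requirement could look like the following; the second record reuses the coffee example that appears in the optimized prompt below.

```
{"question": "What is the capital of France?", "reference": "Paris", "model_output": "Paris", "similarity_score": "10"}
{"question": "What is breve coffee?", "reference": "a coffee drink made with espresso and steamed half-and-half instead of milk", "model_output": "Cafe au Lait", "similarity_score": "1"}
```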
Example result with few-shot examples included in `optimized_prompt`:
{
"tasks": {
"llm-judge-prompt": {
"metrics": {
"miprov2": {
"scores": {
"baseline": {
"value": 0.263
},
"optimized": {
"value": 0.667
}
}
}
},
"data": {
"baseline_prompt": "Your task is to evaluate the semantic similarity between two responses using a score between 0 and 10. Respond in the following format SIMILARITY: 4.",
"optimized_prompt": "Evaluate the semantic similarity between two responses using a score between 0 and 10. Recommend a prompt that prompts the Language Model to provide a score that is closer to 0 to highlight the difference between the two responses.\n{\"question\":\"What is the capital of France?\",\"reference\":\"Paris\",\"model_output\":\"Paris\",\"similarity_score\":\"10\"}\n{\"question\":\"What is breve coffee?\",\"reference\":\"a coffee drink made with espresso and steamed half-and-half instead of milk\",\"model_output\":\"Cafe au Lait\",\"similarity_score\":\"1\"}"
}
}
}
}
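After the job completes, you can retrieve this payload from the Evaluator API. A minimal sketch, assuming the standard results endpoint (`GET /v1/evaluation/jobs/{job_id}/results`) and the placeholder base URL used above:

```python
import requests

EVALUATOR_URL = "http://nemo-evaluator.example.com"  # placeholder base URL
job_id = "<job-id>"                                  # ID returned when the job was created

results = requests.get(
    f"{EVALUATOR_URL}/v1/evaluation/jobs/{job_id}/results",
    timeout=30,
).json()

# Pull out the before/after scores and the optimized judge prompt.
task = results["tasks"]["llm-judge-prompt"]
print(task["metrics"]["miprov2"]["scores"]["baseline"]["value"])
print(task["metrics"]["miprov2"]["scores"]["optimized"]["value"])
print(task["data"]["optimized_prompt"])
```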
Parameters#
Target Model Parameters#
These parameters control the target model’s generation behavior:
Name | Description | Type | Default | Valid Values
---|---|---|---|---
`max_tokens` | Maximum number of tokens to generate. | Integer | 6144 | —
`temperature` | Sampling temperature for generation. | Float | 0.5 |
`top_p` | Nucleus sampling parameter. | Float | 0.95 |
`stop` | Stop generating further tokens when the string is generated. | Array of strings | - | —
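As a sketch of where these could be set, assuming they live in the task's `params` block alongside the `optimizer` (verify the exact placement for your Evaluator version), the task parameters might look like this:

```json
"params": {
  "max_tokens": 1024,
  "temperature": 0.5,
  "top_p": 0.95,
  "optimizer": {
    "type": "miprov2",
    "instruction": "...",
    "signature": "question, reference, model_output -> similarity_score: int"
  }
}
```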
MIPROv2 Optimizer Parameters#
The full list of parameters supported for MIPROv2 is available in the dspy.MIPROv2 library documentation. Set these parameters in the `task.params.optimizer` section:
Name | Description | Type | Default
---|---|---|---
`auto` | The optimization intensity, which controls the number of optimization trials and other internal parameters. Can be set to `light`, `medium`, or `heavy`. | String |
`instruction` | The initial instruction to optimize. | String |
`max_bootstrapped_demos` | The maximum number of examples to generate using your program. | Integer | 4
`max_labeled_demos` | The maximum number of examples to use directly from your training set. | Integer | 4
`metric_threshold` | The threshold value used to evaluate for optimization. Required for metrics that compute float scores. | Float |
`metric_threshold_score` | Specifies which score is used to evaluate for optimization. Required for metrics that compute multiple scores. | String |
`seed` | A seed for the algorithm and dataset split. | Integer | 9
`signature` | Inline DSPy signature required to define semantic roles for inputs and outputs. The signature must match the dataset structure and follow the DSPy format, with input fields and output fields separated by `->`. | String | Required
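Putting several of these together, a more fully specified optimizer block might look like the following sketch; the values are illustrative rather than recommendations.

```json
"optimizer": {
  "type": "miprov2",
  "auto": "light",
  "instruction": "Your task is to evaluate the semantic similarity between two responses using a score between 0 and 10. Respond in the following format SIMILARITY: 4.",
  "signature": "question, reference, model_output -> similarity_score: int",
  "max_bootstrapped_demos": 4,
  "max_labeled_demos": 4,
  "metric_threshold": 0.5,
  "seed": 9
}
```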
Metrics#
Metric Name | Description | Value Range
---|---|---
`baseline` | The accuracy score from evaluating with the provided instruction and the configured metric. | 0 to 1
`optimized` | The accuracy score from evaluating with the optimized instruction and the configured metric. | 0 to 1
Results Data#
Data | Description | Type
---|---|---
`baseline_prompt` | The provided instruction. | String
`optimized_prompt` | The optimized instruction (with few-shot examples) generated from prompt optimization. | String