Prompt Optimization Task#

Build confidence in LLM-as-a-Judge evaluations by using prompt optimization to improve the judge prompt and evaluate its effectiveness.

Prerequisites#

  • A model target. Refer to LLM Model Endpoint for more information.

  • A labeled dataset for prompt optimization in JSON or JSONL format.

  • Your dataset uploaded to NeMo Data Store using the Hugging Face CLI or SDK (see the sketch after this list).
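
The upload step can be done with the Hugging Face Hub SDK pointed at the NeMo Data Store's Hugging Face-compatible endpoint. The following is a minimal sketch; the endpoint URL, token handling, namespace, repository name, and file name are placeholders that depend on your deployment.

from huggingface_hub import HfApi

# Point the Hugging Face Hub client at the NeMo Data Store endpoint.
# The URL and token below are placeholders for your deployment.
api = HfApi(endpoint="http://<data-store-host>/v1/hf", token="<token>")

# Create the dataset repository if it does not exist yet.
repo_id = "<namespace>/<name>"
api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)

# Upload the labeled dataset; reference it later in the job config as
# hf://datasets/<namespace>/<name>/<file-path>
api.upload_file(
    path_or_fileobj="prompt_optimization.jsonl",
    path_in_repo="prompt_optimization.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)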


Optimizer Types#

Optimizer types for Prompt Optimization#

| Type | Library | Description |
|------|---------|-------------|
| MIPROv2 | DSPy | Optimize the instruction and few-shot examples, or the instruction only (0-shot), using Bayesian Optimization. |


MIPROv2#

Use MIPROv2 (Multiprompt Instruction PRoposal Optimizer Version 2) to optimize an LLM prompt. MIPROv2 uses Bayesian Optimization to propose instruction and few-shot example candidates that are tailored to the dynamics of the task. The miprov2 task in Evaluator is powered by the DSPy library.

Any metric available for Custom Evaluation Metrics is compatible with prompt optimization. MIPROv2 requires a metric that computes a single boolean score. If your metric computes a non-boolean (float) score, set metric_threshold; if it computes multiple scores, set metric_threshold_score to the score to use for evaluation. See MIPROv2 Optimizer Parameters for more information.
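
For example, a metric that returns a float score could be thresholded by adding metric_threshold to the optimizer block. This fragment is illustrative only; the threshold value is an assumption and is not needed for the boolean number-check metric used in the example below.

"optimizer": {
  "type": "miprov2",
  "instruction": "<initial prompt to optimize>",
  "signature": "question, reference, model_output -> similarity_score: int",
  "metric_threshold": 0.5
}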

The following configuration is an example of optimizing an LLM-as-a-Judge prompt. The instruction is the initial judge prompt, which is tasked with scoring the similarity between a golden example (reference) and a cached output from another model (model_output). For this use case, the number-check metric is used as the optimization metric to check that the labeled similarity score from the dataset and the similarity score generated by the target are within an acceptable threshold of each other. Iterate on the prompt optimization until you are satisfied with the generated prompt and evaluation score. You can then use the resulting optimized prompt in the template for LLM-as-a-Judge evaluation.

Your job can use a different initial instruction and signature. The signature is required to define semantic roles for the inputs and outputs of your custom dataset. Modify the metrics template according to your dataset.

{
  "type": "custom",
  "tasks": {
    "llm-judge-prompt": {
      "type": "prompt-optimization",
      "params": {
        "optimizer": {
          "type": "miprov2",
          "instruction": "Your task is to evaluate the semantic similarity between two responses using a score between 0 and 10. Respond in the following format SIMILARITY: 4.",
          "signature": "question, reference, model_output -> similarity_score: int"
        }
      },
      "metrics": {
        "number-check": {
        "type": "number-check",
          "params": {
            "check": [
              "absolute difference",
              "{{item.similarity_score | trim}}",
              "{{similarity_score | trim}}",
              "epsilon",
              1
            ]
          }
        }
      },
      "dataset": {
        "files_url": "hf://datasets/<namespace>/<name>/<file-path>"
      }
    }
  }
}

The labeled dataset for prompt optimization must be in JSON or JSONL format and contain at least two examples. Each example provides the fields referenced by the signature and the metrics template:

{
  "question": "What is the capital of France?",
  "reference": "Paris",
  "model_output": "Paris",
  "similarity_score": "10"
}
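
Because at least two examples are required, a JSONL dataset has one record per line. The second record below is taken from the few-shot examples in the result that follows and is shown only to illustrate the format.

{"question": "What is the capital of France?", "reference": "Paris", "model_output": "Paris", "similarity_score": "10"}
{"question": "What is breve coffee?", "reference": "a coffee drink made with espresso and steamed half-and-half instead of milk", "model_output": "Cafe au Lait", "similarity_score": "1"}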

Example result with few-shot examples included in optimized_prompt.

{
  "tasks": {
    "llm-judge-prompt": {
      "metrics": {
        "miprov2": {
          "scores": {
            "baseline": {
              "value": 0.263
            },
            "optimized": {
              "value": 0.667
            }
          }
        }
      },
      "data": {
        "baseline_prompt": "Your task is to evaluate the semantic similarity between two responses using a score between 0 and 10. Respond in the following format SIMILARITY: 4.",
        "optimized_prompt": "Evaluate the semantic similarity between two responses using a score between 0 and 10. Recommend a prompt that prompts the Language Model to provide a score that is closer to 0 to highlight the difference between the two responses.\n{\"question\":\"What is the capital of France?\",\"reference\":\"Paris\",\"model_output\":\"Paris\",\"similarity_score\":\"10\"}\n{\"question\":\"What is breve coffee?\",\"reference\":\"a coffee drink made with espresso and steamed half-and-half instead of milk\",\"model_output\":\"Cafe au Lait\",\"similarity_score\":\"1\"}"
      }
    }
  }
}

Parameters#

Target Model Parameters#

These parameters control the target model’s generation behavior:

| Name | Description | Type | Default | Valid Values |
|------|-------------|------|---------|--------------|
| max_tokens | Maximum number of tokens to generate. | Integer | 6144 | - |
| temperature | Sampling temperature for generation. | Float | 0.5 | 0.0–2.0 |
| top_p | Nucleus sampling parameter. | Float | 0.95 | 0.01–1.0 |
| stop | Stop generating further tokens when the string is generated. | Array of strings | - | - |
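
As a sketch, the fragment below uses the documented defaults with an illustrative stop string. Where this block is placed depends on your target and task configuration, so treat the placement and the stop value as assumptions rather than a fixed schema.

{
  "max_tokens": 6144,
  "temperature": 0.5,
  "top_p": 0.95,
  "stop": ["<|endoftext|>"]
}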

MIPROv2 Optimizer Parameters#

The full list of parameters supported by MIPROv2 is available in the dspy.MIPROv2 library documentation. Set these parameters in the task.params.optimizer section:

| Name | Description | Type | Default |
|------|-------------|------|---------|
| auto | The optimization intensity, which controls the number of optimization trials and other internal parameters. Can be set to light, medium, or heavy. Heavier settings perform more optimization trials but require more computation. | String | light |
| instruction | The initial instruction to optimize. | String | - |
| max_bootstrapped_demos | The maximum number of examples to generate using your program. | Integer | 4 |
| max_labeled_demos | The maximum number of examples to use directly from your training set. | Integer | 4 |
| metric_threshold | The threshold value used to evaluate for optimization. Required for metrics that compute float scores. | Float | - |
| metric_threshold_score | Specifies which score is used to evaluate for optimization. Required for metrics that compute multiple scores. | String | - |
| seed | A seed for the algorithm and dataset split. | Integer | 9 |
| signature | Inline DSPy signature required to define semantic roles for inputs and outputs. The signature must match the dataset structure and follow the DSPy format with -> separating inputs and outputs, for example: context, question -> answer | String | Required |
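
For illustration, a more fully specified optimizer block might look like the following. Apart from the instruction and signature, which are reused from the example above, the values shown are simply the defaults from the table.

"optimizer": {
  "type": "miprov2",
  "auto": "light",
  "instruction": "Your task is to evaluate the semantic similarity between two responses using a score between 0 and 10. Respond in the following format SIMILARITY: 4.",
  "signature": "question, reference, model_output -> similarity_score: int",
  "max_bootstrapped_demos": 4,
  "max_labeled_demos": 4,
  "seed": 9
}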

Metrics#

Metrics in Prompt Optimization#

| Metric Name | Description | Value Range |
|-------------|-------------|-------------|
| baseline | The accuracy score from evaluating with the provided instruction and the configured metric. | 0.0–1.0 |
| optimized | The accuracy score from evaluating with the optimized instruction and the configured metric. | 0.0–1.0 |

Results Data#

Results Data in Prompt Optimization#

| Data | Description | Type |
|------|-------------|------|
| baseline_prompt | The provided instruction. | String |
| optimized_prompt | The optimized instruction (with few-shot examples) generated from prompt optimization. | String |