Model Evaluation How-To Guides

This section provides task-focused procedures for running eval/model_eval. For your first run, start with Getting Started With Model Evaluation. For agent-driven sessions, read Use The Model Evaluation Skill With Confidence first.

Choose A Guide

Discover The Step

List the step, read its contract, and decide whether it applies to the task.

Discover The Model Evaluation Step

Run A Hosted Evaluation

Run benchmarks against an already-running, OpenAI-compatible endpoint.

Evaluate A Deployed Checkpoint

Choose a deployment path, deploy the endpoint, and point the step at it.
