After defining your success criteria, the next step is designing evaluations to measure LLM performance against those criteria. This is a vital part of the prompt engineering cycle.

This guide focuses on how to develop your test cases.
When deciding which method to use to grade evals, choose the fastest, most reliable, most scalable method:
- Code-based grading: Fastest, most reliable, and extremely scalable, but it lacks nuance for more complex judgements that require less rule-based rigidity. Examples: exact match (`output == golden_answer`) or string match (`key_phrase in output`); see the sketch after this list.
- Human grading: Most flexible and highest quality, but slow and expensive. Avoid if possible.
- LLM-based grading: Fast, flexible, scalable, and suitable for complex judgement. Test to ensure reliability first, then scale; see the sketch after this list.
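As a minimal sketch of the first and third approaches, the snippet below shows a code-based grader (exact match and key-phrase checks) next to an LLM-based grader that asks Claude to judge an answer against a rubric. The function names, rubric wording, prompt phrasing, and model alias are illustrative assumptions, not a prescribed implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def grade_by_code(output: str, golden_answer: str, key_phrase: str) -> bool:
    """Code-based grading: exact match or key-phrase check (fast, rule-based)."""
    return output.strip() == golden_answer or key_phrase in output


def grade_by_llm(question: str, output: str, rubric: str) -> bool:
    """LLM-based grading: ask the model to judge the output against a rubric.

    The model alias and prompt wording here are assumptions for illustration.
    """
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumption: any capable model works
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Question:\n{question}\n\n"
                f"Answer to grade:\n{output}\n\n"
                f"Rubric:\n{rubric}\n\n"
                "Does the answer satisfy the rubric? "
                "Reply with only 'correct' or 'incorrect'."
            ),
        }],
    )
    return response.content[0].text.strip().lower().startswith("correct")


# Example usage with made-up test data
print(grade_by_code("Paris", golden_answer="Paris", key_phrase="Paris"))  # True
print(grade_by_llm(
    "What is the capital of France?",
    "It's Paris.",
    rubric="The answer must name Paris as the capital of France.",
))
```

Before relying on an LLM grader at scale, spot-check its verdicts against a handful of human-graded examples to confirm it is reliable.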
For more code examples of human-, code-, and LLM-graded evals, see the Anthropic Cookbook.