Test & evaluate

Define success criteria and build evaluations

Building a successful LLM-based application starts with clearly defining your success criteria and then designing evaluations to measure performance against them. This cycle is central to prompt engineering.

Flowchart of prompt engineering: test cases, preliminary prompt, iterative testing and refinement, final validation, ship

Define your success criteria

Good success criteria are:

Specific: Clearly define what you want to achieve. Instead of "good performance," specify "accurate sentiment classification."
Measurable: Use quantitative metrics or well-defined qualitative scales. Numbers provide clarity and scalability, but qualitative measures can be valuable if consistently applied along with quantitative measures.
- Even "hazy" topics such as ethics and safety can be quantified:
  Safety criteria
  Bad Safe outputs
  Good Less than 0.1% of outputs out of 10,000 trials flagged for toxicity by our content filter.

	Safety criteria
Bad	Safe outputs
Good	Less than 0.1% of outputs out of 10,000 trials flagged for toxicity by our content filter.

Was this page helpful?

Test & evaluate

Define success criteria and build evaluations

Flowchart of prompt engineering: test cases, preliminary prompt, iterative testing and refinement, final validation, ship

Define your success criteria

Good success criteria are:

Specific: Clearly define what you want to achieve. Instead of "good performance," specify "accurate sentiment classification."
Measurable: Use quantitative metrics or well-defined qualitative scales. Numbers provide clarity and scalability, but qualitative measures can be valuable if consistently applied along with quantitative measures.
- Even "hazy" topics such as ethics and safety can be quantified:
  Safety criteria
  Bad Safe outputs
  Good Less than 0.1% of outputs out of 10,000 trials flagged for toxicity by our content filter.

	Safety criteria
Bad	Safe outputs
Good	Less than 0.1% of outputs out of 10,000 trials flagged for toxicity by our content filter.

Was this page helpful?

Define success criteria and build evaluations

Define your success criteria

Define success criteria and build evaluations

Define your success criteria

Common success criteria

Build evaluations

Eval design principles

Example evals

Grade your evaluations

Tips for LLM-based grading

Next steps

Define your success criteria

Define your success criteria

Example metrics and measurement methods

Example task fidelity criteria for sentiment analysis

Common success criteria

Task fidelity

Consistency

Relevance and coherence

Tone and style

Privacy preservation

Context utilization

Latency

Price

Example multidimensional criteria for sentiment analysis

Build evaluations

Eval design principles

Example edge cases

Example evals

Task fidelity (sentiment analysis) - exact match evaluation

Consistency (FAQ bot) - cosine similarity evaluation

Relevance and coherence (summarization) - ROUGE-L evaluation

Tone and style (customer service) - LLM-based Likert scale

Privacy preservation (medical chatbot) - LLM-based binary classification

Context utilization (conversation assistant) - LLM-based ordinal scale

Grade your evaluations

Tips for LLM-based grading

Example: LLM-based grading

Next steps