
    Create strong empirical evaluations

    After defining your success criteria, the next step is designing evaluations to measure LLM performance against those criteria. This is a vital part of the prompt engineering cycle.

    This guide focuses on how to develop your test cases.

    Building evals and test cases

    Eval design principles

    1. Be task-specific: Design evals that mirror your real-world task distribution. Don't forget to factor in edge cases!
    2. Automate when possible: Structure questions to allow for automated grading (e.g., multiple-choice, string match, code-graded, LLM-graded).
    3. Prioritize volume over quality: More questions with slightly lower-signal automated grading are better than fewer questions with high-quality, hand-graded evals.

    Example evals

    Writing hundreds of test cases by hand is hard! Get Claude to help you generate more from a baseline set of example test cases.
    If you don't know which eval methods might be useful for assessing your success criteria, you can also brainstorm with Claude!
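
    For example, here is a minimal sketch of using the Anthropic Python SDK to expand a baseline set of test cases. The seed cases, prompt wording, and model ID are illustrative assumptions, not part of this guide:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A small baseline set of hand-written test cases (illustrative placeholders).
seed_cases = [
    {"input": "Summarize this support ticket about a late delivery.", "expected_topic": "shipping"},
    {"input": "Summarize this support ticket about a billing error.", "expected_topic": "billing"},
]

prompt = f"""Here are example test cases for a support-ticket summarization eval:

{json.dumps(seed_cases, indent=2)}

Generate 10 more test cases in the same JSON format, covering edge cases
such as multilingual tickets, very long tickets, and ambiguous requests.
Return only a JSON array."""

response = client.messages.create(
    model="claude-sonnet-4-5",  # model ID is an assumption; substitute any current Claude model
    max_tokens=2048,
    messages=[{"role": "user", "content": prompt}],
)

# Parse the generated cases and append them to your eval set.
generated_cases = json.loads(response.content[0].text)
print(f"Generated {len(generated_cases)} new test cases")
```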

    Grading evals

    When deciding which method to use to grade evals, choose the fastest, most reliable, most scalable method:

    1. Code-based grading: Fastest and most reliable, extremely scalable, but lacks nuance for more complex judgements that require less rule-based rigidity (see the sketch after this list).
      • Exact match: output == golden_answer
      • String match: key_phrase in output
    2. Human grading: Most flexible and high quality, but slow and expensive. Avoid if possible.
    3. LLM-based grading: Fast and flexible, scalable, and suitable for complex judgement. Test to ensure reliability first, then scale.
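
    As a sketch, the two code-based checks above take only a few lines of Python; the test cases below are placeholders, not real eval data:

```python
def exact_match(output: str, golden_answer: str) -> bool:
    # Pass only if the output is exactly the golden answer (ignoring surrounding whitespace).
    return output.strip() == golden_answer.strip()


def string_match(output: str, key_phrase: str) -> bool:
    # Pass if the required key phrase appears anywhere in the output.
    return key_phrase in output


# Illustrative test cases; in practice these come from your eval set.
test_cases = [
    {"output": "Paris", "golden_answer": "Paris"},
    {"output": "The capital of France is Paris.", "key_phrase": "Paris"},
]

passes = [
    exact_match(tc["output"], tc["golden_answer"]) if "golden_answer" in tc
    else string_match(tc["output"], tc["key_phrase"])
    for tc in test_cases
]
print(f"Accuracy: {sum(passes) / len(passes):.0%}")
```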

    Tips for LLM-based grading

    • Have detailed, clear rubrics: "The answer should always mention 'Acme Inc.' in the first sentence. If it does not, the answer is automatically graded as 'incorrect.'"
      A given use case, or even a specific success criterion for that use case, might require several rubrics for holistic evaluation.
    • Empirical or specific: For example, instruct the LLM to output only 'correct' or 'incorrect', or to judge on a scale of 1-5. Purely qualitative evaluations are hard to assess quickly and at scale.
    • Encourage reasoning: Ask the LLM to think first before deciding on an evaluation score, and then discard the reasoning. This increases evaluation performance, particularly for tasks requiring complex judgement.
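
    Putting these tips together, a minimal sketch of an LLM-graded eval with the Anthropic Python SDK might look like this; the rubric, tag names, and model ID are illustrative assumptions:

```python
import anthropic

client = anthropic.Anthropic()


def llm_grade(output_to_grade: str, rubric: str) -> bool:
    """Grade a single output against a rubric; returns True if judged correct."""
    grader_prompt = f"""Grade the answer below against this rubric:

<rubric>{rubric}</rubric>

<answer>{output_to_grade}</answer>

Think through your reasoning inside <thinking> tags, then output only
'correct' or 'incorrect' inside <verdict> tags."""

    response = client.messages.create(
        model="claude-sonnet-4-5",  # model ID is an assumption; substitute a current Claude model
        max_tokens=1024,
        messages=[{"role": "user", "content": grader_prompt}],
    )
    text = response.content[0].text
    # Keep only the final verdict; the <thinking> reasoning is discarded.
    verdict = text.split("<verdict>")[-1].split("</verdict>")[0].strip().lower()
    return verdict == "correct"


# Illustrative usage with a rubric like the one above.
rubric = "The answer must mention 'Acme Inc.' in the first sentence; otherwise it is incorrect."
print(llm_grade("Acme Inc. reported strong growth this quarter.", rubric))
```

    Because the verdict is constrained to 'correct' or 'incorrect', results can be aggregated automatically across hundreds of test cases.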

    Next steps

    Brainstorm evaluations

    Learn how to craft prompts that maximize your eval scores.

    Evals cookbook

    More code examples of human-, code-, and LLM-graded evals.
