Loading...
    • Messages
    • Managed Agents
    • Admin
    Search...
    ⌘K
    Use cases
    OverviewTicket routingCustomer support agentContent moderationLegal summarization
    Prompt engineering
    OverviewPrompting best practicesConsole prompting tools
    Test and evaluate
    Define success and build evaluationsUsing the Evaluation Tool in ConsoleReducing latency
    Strengthen guardrails
    Reduce hallucinationsIncrease output consistencyMitigate jailbreaksReduce prompt leak
    Reference
    Glossary
    Log in
    Mitigate jailbreaks
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...
    Loading...

    Solutions

    • AI agents
    • Code modernization
    • Coding
    • Customer support
    • Education
    • Financial services
    • Government
    • Life sciences

    Partners

    • Amazon Bedrock
    • Google Cloud's Vertex AI

    Learn

    • Blog
    • Courses
    • Use cases
    • Connectors
    • Customer stories
    • Engineering at Anthropic
    • Events
    • Powered by Claude
    • Service partners
    • Startups program

    Company

    • Anthropic
    • Careers
    • Economic Futures
    • Research
    • News
    • Responsible Scaling Policy
    • Security and compliance
    • Transparency

    Learn

    • Blog
    • Courses
    • Use cases
    • Connectors
    • Customer stories
    • Engineering at Anthropic
    • Events
    • Powered by Claude
    • Service partners
    • Startups program

    Help and security

    • Availability
    • Status
    • Support
    • Discord

    Terms and policies

    • Privacy policy
    • Responsible disclosure policy
    • Terms of service: Commercial
    • Terms of service: Consumer
    • Usage policy
    Best practices/Strengthen guardrails

    Mitigate jailbreaks and prompt injections

    Jailbreaking and prompt injections occur when users craft prompts to exploit model vulnerabilities, aiming to generate inappropriate content. While Claude is inherently resilient to such attacks, here are additional steps to strengthen your guardrails, particularly against uses that either violate our Terms of Service or Usage Policy.

    • Harmlessness screens: Use a lightweight model like Claude Haiku 4.5 to pre-screen user inputs. Use structured outputs to constrain the response to a simple classification.

    • Input validation: Filter prompts for jailbreaking patterns. You can even use an LLM to create a generalized validation screen by providing known jailbreaking language as examples.

    • Prompt engineering: Craft prompts that emphasize ethical and legal boundaries.

    Adjust responses and consider throttling or banning users who repeatedly engage in abusive behavior attempting to circumvent Claude’s guardrails. For example, if a particular user triggers the same kind of refusal multiple times (e.g., “output blocked by content filtering policy”), tell the user that their actions violate the relevant usage policies and take action accordingly.

    • Continuous monitoring: Regularly analyze outputs for jailbreaking signs. Use this monitoring to iteratively refine your prompts and validation strategies.

    Advanced: Chain safeguards

    Combine strategies for robust protection. Here's an enterprise-grade example with tool use:

    By layering these strategies, you create a robust defense against jailbreaking and prompt injections, ensuring your Claude-powered applications maintain the highest standards of safety and compliance.

    Was this page helpful?

    • Advanced: Chain safeguards
    • Bot system prompt
    • Prompt within harmlessness_screen tool