• Messages
  • Managed Agents
  • Admin

Search...
⌘K
Use cases
OverviewTicket routingCustomer support agentContent moderationLegal summarization
Prompt engineering
OverviewPrompting best practicesPrompting Claude Fable 5Prompting Claude Opus 4.8Console prompting tools
Test and evaluate
Define success and build evaluationsUsing the Evaluation Tool in ConsoleReducing latency
Strengthen guardrails
Reduce hallucinationsIncrease output consistencyMitigate jailbreaksReduce prompt leak
Reference
Glossary

Log in
Mitigate jailbreaks
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...

Solutions

  • AI agents
  • Code modernization
  • Coding
  • Customer support
  • Education
  • Financial services
  • Government
  • Life sciences

Partners

  • Claude on AWS
  • Google Cloud's Vertex AI

Learn

  • Blog
  • Courses
  • Use cases
  • Connectors
  • Customer stories
  • Engineering at Anthropic
  • Events
  • Powered by Claude
  • Service partners
  • Startups program

Company

  • Anthropic
  • Careers
  • Economic Futures
  • Research
  • News
  • Responsible Scaling Policy
  • Security and compliance
  • Transparency

Learn

  • Blog
  • Courses
  • Use cases
  • Connectors
  • Customer stories
  • Engineering at Anthropic
  • Events
  • Powered by Claude
  • Service partners
  • Startups program

Help and security

  • Availability
  • Status
  • Support
  • Discord

Terms and policies

  • Privacy policy
  • Responsible disclosure policy
  • Terms of service: Commercial
  • Terms of service: Consumer
  • Usage policy
Best practices/Strengthen guardrails

Mitigate jailbreaks and prompt injections

Jailbreaking and prompt injection are attempts to make Claude ignore its guidelines or your instructions. While Claude is inherently resilient to such attacks, the additional steps on this page strengthen your guardrails, particularly against uses that violate our Terms of Service or Usage Policy.

These attacks fall into two categories with different threat models:

  • Jailbreaks and direct prompt injection, where the user of your application is the adversary and crafts inputs intended to bypass your guardrails.
  • Indirect prompt injection, where the user is trusted but Claude processes third-party content (web pages, emails, documents, tool results) that contains adversarial instructions.

Jailbreaks and direct prompt injection

In this threat model, a user is deliberately crafting inputs to manipulate your application into producing content or taking actions you don't want it to. These mitigations strengthen your application's guardrails:

  • Harmlessness screens: Use a lightweight model like Claude Haiku 4.5 to pre-screen user input before it reaches your main conversation. Use structured outputs to constrain the response to a simple classification.

  • Input validation: Filter user input for known injection patterns before it reaches Claude. You can use an LLM to create a generalized validation screen by providing known jailbreaking language as examples.

  • Prompt engineering: Craft system prompts that emphasize ethical and legal boundaries, and that explicitly tell Claude how to refuse.

  • Respond to repeat offenders: Adjust responses and consider throttling or banning users who repeatedly attempt to circumvent your application's guardrails. For example, if a particular user triggers the same kind of refusal multiple times (such as "output blocked by content filtering policy"), tell the user that their actions violate the relevant usage policies and take action accordingly.

Indirect prompt injection

In this threat model, you're protecting your users from instructions embedded in content that Claude reads on their behalf: the body of an inbound email, a fetched web page, OCR output from an uploaded file, or the result of a tool call. An attacker who can influence that content may embed instructions that try to redirect Claude.

Structure your application so that Claude can reliably distinguish untrusted content from your instructions:

  • Put untrusted content only in tool results. Deliver third-party content to Claude inside tool_result blocks, never in system prompts or plain user text blocks. Claude is trained to treat instructions that appear inside tool results with appropriate skepticism. See Handle tool calls for the tool_result format.

  • Tell Claude what the content is and where it came from. In the tool's description, or in the structure of the result itself, make the nature and source of the content explicit: for example, that it is the body of an inbound email from an unknown sender, or OCR text extracted from a user-uploaded image. This context helps Claude calibrate how much to trust embedded directives.

  • State the policy in your system prompt. Tell Claude explicitly that content returned from tools, documents, or searches is untrusted data and must never override the system prompt or the user's original request.


If you're using the computer use tool, Anthropic runs additional classifiers that detect potential prompt injections in screenshots and steer Claude to ask for user confirmation before acting. See that page for details and opt-out information.

Continuous monitoring

Regularly analyze outputs for signs of successful injection. Use this monitoring to iteratively refine your prompts, validation, and filtering strategies.

Advanced: Chain safeguards

Combine strategies for robust protection. Here's an enterprise-grade example with tool use:

By layering these strategies, you create a robust defense against jailbreaking and prompt injections, ensuring your Claude-powered applications maintain the highest standards of safety and compliance.

Was this page helpful?

  • Jailbreaks and direct prompt injection
  • Indirect prompt injection
  • Continuous monitoring
  • Advanced: Chain safeguards
  • Bot system prompt
  • Prompt within harmlessness_screen tool
  • JSON-encode untrusted content. Where possible, wrap third-party strings in a JSON object rather than concatenating them into free-form text. JSON escaping provides unambiguous delimiters between the untrusted payload and the surrounding structure, so an attacker cannot close a quote or tag to "break out" into an instruction context.

  • Don't put your own instructions in tool results. Because Claude treats tool-result content as untrusted data, instructions you place there may be ignored or flagged as a potential injection. Send your instructions in a user turn that follows the tool_result block. On Claude Opus 4.8 and later, you can also use a mid-conversation system message.

  • Limit Claude's access to sensitive data and actions. Apply the principle of least privilege so that a successful injection can do minimal damage: don't give Claude access to secrets it doesn't need, run tools in sandboxed environments, and scope permissions as narrowly as possible.

  • Screen tool outputs before Claude acts on them. Apply the same lightweight-model screening pattern you use for user input to the content your tools return. Run each tool, pass its raw output to a small classifier call with Claude Haiku 4.5, and only return the content as a tool_result block if the screen reports no injection attempt. Use structured outputs so the classifier's verdict is a parseable value your application can branch on.

    You can also apply the input-validation patterns from the previous section to tool results before passing them to Claude.

  • Red-team your own agent. Before deploying, test your workflow with documents, emails, and tool outputs that deliberately contain injection attempts, and confirm that Claude ignores them and that your screening and confirmation steps catch the rest.