Prompt caching optimizes your API usage by letting requests resume from specific prefixes in your prompts. This significantly reduces processing time and costs for repetitive tasks or prompts with consistent elements.
Prompt caching stores KV cache representations and cryptographic hashes of cached content, but does not store the raw text of prompts or responses. This may be suitable for customers who require ZDR-type data retention commitments. See cache lifetime for details.
There are two ways to enable prompt caching:
- Automatic caching: Add a single cache_control field at the top level of your request. The system automatically applies the cache breakpoint to the last cacheable block and moves it forward as conversations grow. Best for multi-turn conversations where the growing message history should be cached automatically.
- Explicit cache breakpoints: Place cache_control directly on individual content blocks for fine-grained control over exactly what gets cached.

The simplest way to start is with automatic caching:
curl https://api.anthropic.com/v1/messages \
-H "content-type: application/json" \
-H "x-api-key: $ANTHROPIC_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "claude-opus-4-6",
"max_tokens": 1024,
"cache_control": {"type": "ephemeral"},
"system": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.",
"messages": [
{
"role": "user",
"content": "Analyze the major themes in Pride and Prejudice."
}
]
}'

With automatic caching, the system caches all content up to and including the last cacheable block. On subsequent requests with the same prefix, cached content is reused automatically.
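For readers calling the API from Python, the same request body can be assembled as a plain dictionary. This is a minimal sketch that only builds and serializes the payload with the standard library; the helper name `build_auto_cache_request` is hypothetical, and actually sending the request is left out.

```python
import json


def build_auto_cache_request(system, user_text, model="claude-opus-4-6", ttl=None):
    """Build a request body with top-level (automatic) cache_control.

    Hypothetical helper for illustration; it performs no network call.
    """
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl  # for example "1h" for the 1-hour cache
    return {
        "model": model,
        "max_tokens": 1024,
        "cache_control": cache_control,
        "system": system,
        "messages": [{"role": "user", "content": user_text}],
    }


body = build_auto_cache_request(
    "You are an AI assistant tasked with analyzing literary works.",
    "Analyze the major themes in Pride and Prejudice.",
)
print(json.dumps(body, indent=2))
```

The serialized output can be passed as the `-d` payload of the curl command above.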
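When you send a request with prompt caching enabled:
1. The system checks whether the prompt prefix, up to the cache breakpoint, is already cached from a recent request.
2. If it is, the cached version is reused, which reduces processing time and cost.
3. Otherwise, the full prompt is processed and the prefix is cached once the response begins.

This is especially useful for repetitive tasks with consistent instructions, prompts that carry large amounts of context, and long multi-turn conversations.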
By default, the cache has a 5-minute lifetime. The cache is refreshed for no additional cost each time the cached content is used.
If you find that 5 minutes is too short, Anthropic also offers a 1-hour cache duration at additional cost.
For more information, see 1-hour cache duration.
Prompt caching caches the full prefix
Prompt caching references the entire prompt - tools, system, and messages (in that order) up to and including the block designated with cache_control.
Prompt caching introduces a new pricing structure. The table below shows the price per million tokens for each supported model:
| Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok |
| Claude Opus 4.5 | $5 / MTok | $6.25 / MTok | $10 / MTok | $0.50 / MTok | $25 / MTok |
| Claude Opus 4.1 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
| Claude Opus 4 | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
| Claude Sonnet 4.6 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
| Claude Sonnet 4.5 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
| Claude Sonnet 4 | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
| Claude Sonnet 3.7 (deprecated) | $3 / MTok | $3.75 / MTok | $6 / MTok | $0.30 / MTok | $15 / MTok |
| Claude Haiku 4.5 | $1 / MTok | $1.25 / MTok | $2 / MTok | $0.10 / MTok | $5 / MTok |
| Claude Haiku 3.5 | $0.80 / MTok | $1 / MTok | $1.6 / MTok | $0.08 / MTok | $4 / MTok |
| Claude Opus 3 (deprecated) | $15 / MTok | $18.75 / MTok | $30 / MTok | $1.50 / MTok | $75 / MTok |
| Claude Haiku 3 | $0.25 / MTok | $0.30 / MTok | $0.50 / MTok | $0.03 / MTok | $1.25 / MTok |
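The table above reflects the following pricing multipliers for prompt caching:
- 5-minute cache write tokens are 1.25 times the base input token price.
- 1-hour cache write tokens are 2 times the base input token price.
- Cache read tokens are 0.1 times the base input token price.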
These multipliers stack with other pricing modifiers such as the Batch API discount, long context pricing, and data residency. See pricing for full details.
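To make the pricing arithmetic concrete, here is a rough sketch that estimates the input-side cost of a request. The multipliers (1.25, 2, and 0.1 times the base input price) are read off the table above; the function name and the example token counts are assumptions for illustration, and other modifiers such as the Batch API discount are ignored.

```python
# Multipliers relative to the base input token price, as implied by the table.
WRITE_5M = 1.25
WRITE_1H = 2.0
READ = 0.1


def input_cost_usd(base_per_mtok, uncached=0, write_5m=0, write_1h=0, read=0):
    """Estimate the input-side cost in USD for one request.

    Arguments other than base_per_mtok are token counts (not millions).
    """
    weighted = uncached + write_5m * WRITE_5M + write_1h * WRITE_1H + read * READ
    return base_per_mtok * weighted / 1_000_000


# Illustrative numbers: a $3 / MTok base price, a 100,000-token prefix,
# and a 50-token uncached question.
first_call = input_cost_usd(3.0, uncached=50, write_5m=100_000)
later_call = input_cost_usd(3.0, uncached=50, read=100_000)
print(f"first call: ${first_call:.5f}")  # first call: $0.37515
print(f"later call: ${later_call:.5f}")  # later call: $0.03015
```

Under these assumptions, the first call costs slightly more than an uncached request would, and each later call within the cache lifetime costs roughly a tenth as much.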
Prompt caching (both automatic and explicit) is currently supported on:
Automatic caching is the simplest way to enable prompt caching. Instead of placing cache_control on individual content blocks, add a single cache_control field at the top level of your request body. The system automatically applies the cache breakpoint to the last cacheable block.
curl https://api.anthropic.com/v1/messages \
-H "content-type: application/json" \
-H "x-api-key: $ANTHROPIC_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "claude-opus-4-6",
"max_tokens": 1024,
"cache_control": {"type": "ephemeral"},
"system": "You are a helpful assistant that remembers our conversation.",
"messages": [
{"role": "user", "content": "My name is Alex. I work on machine learning."},
{"role": "assistant", "content": "Nice to meet you, Alex! How can I help with your ML work today?"},
{"role": "user", "content": "What did I say I work on?"}
]
}'

With automatic caching, the cache point moves forward automatically as conversations grow. Each new request caches everything up to the last cacheable block, and previous content is read from cache.
| Request | Content | Cache behavior |
|---|---|---|
| Request 1 | System + User(1) + Asst(1) + User(2) ◀ cache | Everything written to cache |
| Request 2 | System + User(1) + Asst(1) + User(2) + Asst(2) + User(3) ◀ cache | System through User(2) read from cache; Asst(2) + User(3) written to cache |
| Request 3 | System + User(1) + Asst(1) + User(2) + Asst(2) + User(3) + Asst(3) + User(4) ◀ cache | System through User(3) read from cache; Asst(3) + User(4) written to cache |
The cache breakpoint automatically moves to the last cacheable block in each request, so you don't need to update any cache_control markers as the conversation grows.
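The read/write split in the table can be reproduced with a small simulation. This is a simplified model, not the service's actual logic: it assumes the cache has not expired, that minimum token thresholds are met, and that the longest prefix shared with the previous request is what gets read from cache. The function name is hypothetical.

```python
def split_read_write(previous_blocks, current_blocks):
    """Return (read_from_cache, written_to_cache) for the current request.

    Simplified model: the shared prefix with the previous request is read,
    and everything after it (up to the last block) is written.
    """
    shared = 0
    for old, new in zip(previous_blocks, current_blocks):
        if old != new:
            break
        shared += 1
    return current_blocks[:shared], current_blocks[shared:]


req1 = ["System", "User(1)", "Asst(1)", "User(2)"]
req2 = req1 + ["Asst(2)", "User(3)"]
req3 = req2 + ["Asst(3)", "User(4)"]

previous = []
for number, request in enumerate([req1, req2, req3], start=1):
    read, written = split_read_write(previous, request)
    print(f"Request {number}: read={read} written={written}")
    previous = request
```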
By default, automatic caching uses a 5-minute TTL. You can specify a 1-hour TTL at 2x the base input token price:
{ "cache_control": { "type": "ephemeral", "ttl": "1h" } }

Automatic caching is compatible with explicit cache breakpoints. When used together, the automatic cache breakpoint uses one of the 4 available breakpoint slots.
This lets you combine both approaches. For example, use explicit breakpoints to cache your system prompt and tools independently, while automatic caching handles the conversation:
{
"model": "claude-opus-4-6",
"max_tokens": 1024,
"cache_control": { "type": "ephemeral" },
"system": [
{
"type": "text",
"text": "You are a helpful assistant.",
"cache_control": { "type": "ephemeral" }
}
],
"messages": [{ "role": "user", "content": "What are the key terms?" }]
}

Automatic caching uses the same underlying caching infrastructure. Pricing, minimum token thresholds, context ordering requirements, and the 20-block lookback window all apply the same as with explicit breakpoints.
- If the last block already has an explicit cache_control with the same TTL, automatic caching is a no-op.
- If the last block has an explicit cache_control with a different TTL, the API returns a 400 error.

Automatic caching is available on the Claude API and Azure AI Foundry (preview). Support for Amazon Bedrock and Google Vertex AI is coming later.
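The two TTL edge cases (same TTL is a no-op, different TTL is a 400 error) can be checked locally before sending a request. The sketch below is only an illustration of the documented behavior; the real check happens server-side, the function name is hypothetical, and treating an omitted `ttl` as `"5m"` is an assumption based on the 5-minute default.

```python
def resolve_automatic_breakpoint(top_level, last_block):
    """Return "apply", "no-op", or "error-400" for automatic caching.

    top_level is the request-level cache_control dict; last_block is the
    last cacheable content block in the request.
    """
    explicit = last_block.get("cache_control")
    if explicit is None:
        return "apply"
    # Assumption: an omitted ttl means the default 5-minute lifetime.
    same_ttl = explicit.get("ttl", "5m") == top_level.get("ttl", "5m")
    return "no-op" if same_ttl else "error-400"


plain = {"type": "text", "text": "Hello"}
marked_5m = {"type": "text", "text": "Hello", "cache_control": {"type": "ephemeral"}}
marked_1h = {"type": "text", "text": "Hello", "cache_control": {"type": "ephemeral", "ttl": "1h"}}

print(resolve_automatic_breakpoint({"type": "ephemeral"}, plain))      # apply
print(resolve_automatic_breakpoint({"type": "ephemeral"}, marked_5m))  # no-op
print(resolve_automatic_breakpoint({"type": "ephemeral"}, marked_1h))  # error-400
```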
For more control over caching, you can place cache_control directly on individual content blocks. This is useful when you need to cache different sections that change at different frequencies, or need fine-grained control over exactly what gets cached.
Place static content (tool definitions, system instructions, context, examples) at the beginning of your prompt. Mark the end of the reusable content for caching using the cache_control parameter.
Cache prefixes are created in the following order: tools, system, then messages. This order forms a hierarchy where each level builds upon the previous ones.
You can use just one cache breakpoint at the end of your static content, and the system will automatically find the longest matching sequence of cached blocks. Understanding how this works helps you optimize your caching strategy.
Three core principles:
Cache keys are cumulative: When you explicitly cache a block with cache_control, the cache hash key is generated by hashing all previous blocks in the conversation sequentially. This means the cache for each block depends on all content that came before it.
Backward sequential checking: The system checks for cache hits by working backwards from your explicit breakpoint, checking each previous block in reverse order. This ensures you get the longest possible cache hit.
20-block lookback window: The system only checks up to 20 blocks before each explicit cache_control breakpoint. After checking 20 blocks without a match, it stops checking and moves to the next explicit breakpoint (if any).
Example: Understanding the lookback window
Consider a conversation with 30 content blocks where you set cache_control only on block 30:
If you send block 31 with no changes to previous blocks: The system checks block 30 (match!). You get a cache hit at block 30, and only block 31 needs processing.
If you modify block 25 and send block 31: The system checks backwards from block 30 → 29 → 28... → 25 (no match) → 24 (match!). Since block 24 hasn't changed, you get a cache hit at block 24, and only blocks 25-30 need reprocessing.
If you modify block 5 and send block 31: The system checks backwards from block 30 → 29 → 28... → 11 (check #20). After 20 checks without finding a match, it stops looking. Since block 5 is beyond the 20-block window, no cache hit occurs and all blocks need reprocessing. However, if you had set an explicit cache_control breakpoint on block 5, the system would continue checking from that breakpoint: block 5 (no match) → block 4 (match!). This allows a cache hit at block 4, demonstrating why you should place breakpoints before editable content.
Key takeaway: Always set an explicit cache breakpoint at the end of your conversation to maximize your chances of cache hits. Additionally, set breakpoints just before content blocks that might be editable to ensure those sections can be cached independently.
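The three scenarios above can be reproduced with a toy model of cumulative hashing and the 20-block lookback. This is a sketch under stated assumptions, not the service's implementation: SHA-256 stands in for the unspecified hash, an entry is assumed to exist for every block boundary of the earlier request, and all names are hypothetical.

```python
import hashlib

LOOKBACK = 20  # blocks checked per breakpoint, per the description above


def prefix_hashes(blocks):
    """Cumulative hash after each block: entry i covers blocks[0..i]."""
    running = hashlib.sha256()
    hashes = []
    for block in blocks:
        data = block.encode("utf-8")
        running.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguity
        running.update(data)
        hashes.append(running.copy().hexdigest())
    return hashes


def find_cache_hit(blocks, breakpoints, cached):
    """Return the 1-based block number of the cache hit, or 0 if there is none."""
    hashes = prefix_hashes(blocks)
    for bp in sorted(breakpoints, reverse=True):
        lowest = max(1, bp - LOOKBACK + 1)
        for position in range(bp, lowest - 1, -1):  # walk backwards from the breakpoint
            if hashes[position - 1] in cached:
                return position
    return 0


original = [f"block {i}" for i in range(1, 31)]
cached = set(prefix_hashes(original))

edit_25 = original[:24] + ["block 25 (edited)"] + original[25:] + ["block 31"]
edit_5 = original[:4] + ["block 5 (edited)"] + original[5:] + ["block 31"]

print(find_cache_hit(original + ["block 31"], [30], cached))  # 30
print(find_cache_hit(edit_25, [30], cached))                  # 24
print(find_cache_hit(edit_5, [30], cached))                   # 0 (outside the window)
print(find_cache_hit(edit_5, [5, 30], cached))                # 4
```

The last two calls show why an extra breakpoint near editable content matters: the same edit yields no hit with one breakpoint and a hit at block 4 with two.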
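You can define up to 4 cache breakpoints if you want to cache different sections that change at different frequencies, have more control over exactly what gets cached, or ensure caching for content more than 20 blocks before your final breakpoint.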
Important limitation: If your prompt has more than 20 content blocks before your cache breakpoint, and you modify content earlier than those 20 blocks, you won't get a cache hit unless you add additional explicit breakpoints closer to that content.
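Cache breakpoints themselves don't add any cost. You are only charged for:
- Cache writes: 25% more than the base input token price for the 5-minute TTL (or 2 times the base price for the 1-hour TTL).
- Cache reads: 10% of the base input token price.
- Regular input tokens: any content that is not cached.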
Adding more cache_control breakpoints doesn't increase your costs - you still pay the same amount based on what content is actually cached and read. The breakpoints simply give you control over what sections can be cached independently.
The minimum cacheable prompt length is:
Shorter prompts cannot be cached, even if marked with cache_control. Any requests to cache fewer than this number of tokens will be processed without caching. To see if a prompt was cached, see the response usage fields.
For concurrent requests, note that a cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent requests.
Currently, "ephemeral" is the only supported cache type, which by default has a 5-minute lifetime.
Most blocks in the request can be cached. This includes:
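- Tools: tool definitions in the tools array
- System messages: content blocks in the system array
- Text messages: content blocks in the messages.content array, for both user and assistant turns
- Images and documents: content blocks in the messages.content array, in user turns
- Tool use and tool results: content blocks in the messages.content array, in both user and assistant turns

Each of these elements can be cached, either automatically or by marking them with cache_control.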
While most request blocks can be cached, there are some exceptions:
Thinking blocks cannot be cached directly with cache_control. However, thinking blocks CAN be cached alongside other content when they appear in previous assistant turns. When cached this way, they DO count as input tokens when read from cache.
Sub-content blocks (like citations) themselves cannot be cached directly. Instead, cache the top-level block.
In the case of citations, the top-level document content blocks that serve as the source material for citations can be cached. This allows you to use prompt caching with citations effectively by caching the documents that citations will reference.
Empty text blocks cannot be cached.
Modifications to cached content can invalidate some or all of the cache.
As described in Structuring your prompt, the cache follows the hierarchy: tools → system → messages. Changes at each level invalidate that level and all subsequent levels.
The following table shows which parts of the cache are invalidated by different types of changes. ✘ indicates that the cache is invalidated, while ✓ indicates that the cache remains valid.
| What changes | Tools cache | System cache | Messages cache | Impact |
|---|---|---|---|---|
| Tool definitions | ✘ | ✘ | ✘ | Modifying tool definitions (names, descriptions, parameters) invalidates the entire cache |
| Web search toggle | ✓ | ✘ | ✘ | Enabling/disabling web search modifies the system prompt |
| Citations toggle | ✓ | ✘ | ✘ | Enabling/disabling citations modifies the system prompt |
| Speed setting | ✓ | ✘ | ✘ | Switching between speed: "fast" and standard speed invalidates system and message caches |
| Tool choice | ✓ | ✓ | ✘ | Changes to tool_choice parameter only affect message blocks |
| Images | ✓ | ✓ | ✘ | Adding/removing images anywhere in the prompt affects message blocks |
| Thinking parameters | ✓ | ✓ | ✘ | Changes to extended thinking settings (enable/disable, budget) affect message blocks |
| Non-tool results passed to extended thinking requests | ✓ | ✓ | ✘ | When non-tool results are passed in requests while extended thinking is enabled, all previously-cached thinking blocks are stripped from context, and any messages in context that follow those thinking blocks are removed from the cache. For more details, see Caching with thinking blocks. |
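Because invalidation cascades down the tools → system → messages hierarchy, the table reduces to "which level is hit first". The lookup below is a small sketch of that idea; the dictionary keys are hypothetical labels for the table rows, not API parameters, and the last table row is omitted because it removes only part of the message cache.

```python
LEVELS = ("tools", "system", "messages")

# First cache level invalidated by each kind of change, per the table above.
FIRST_INVALIDATED = {
    "tool_definitions": "tools",
    "web_search_toggle": "system",
    "citations_toggle": "system",
    "speed_setting": "system",
    "tool_choice": "messages",
    "images": "messages",
    "thinking_parameters": "messages",
}


def invalidated_levels(change):
    """Return the cache levels invalidated by a change, in hierarchy order."""
    start = LEVELS.index(FIRST_INVALIDATED[change])
    return list(LEVELS[start:])


print(invalidated_levels("tool_definitions"))  # ['tools', 'system', 'messages']
print(invalidated_levels("citations_toggle"))  # ['system', 'messages']
print(invalidated_levels("tool_choice"))       # ['messages']
```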
Monitor cache performance using these API response fields, within usage in the response (or message_start event if streaming):
- cache_creation_input_tokens: Number of tokens written to the cache when creating a new entry.
- cache_read_input_tokens: Number of tokens retrieved from the cache for this request.
- input_tokens: Number of input tokens which were not read from or used to create a cache (that is, tokens after the last cache breakpoint).

Understanding the token breakdown
The input_tokens field represents only the tokens that come after the last cache breakpoint in your request - not all the input tokens you sent.
To calculate total input tokens:
total_input_tokens = cache_read_input_tokens + cache_creation_input_tokens + input_tokens

Spatial explanation:
- cache_read_input_tokens = tokens before breakpoint already cached (reads)
- cache_creation_input_tokens = tokens before breakpoint being cached now (writes)
- input_tokens = tokens after your last breakpoint (not eligible for cache)

Example: If you have a request with 100,000 tokens of cached content (read from cache), 0 tokens of new content being cached, and 50 tokens in your user message (after the cache breakpoint):
- cache_read_input_tokens: 100,000
- cache_creation_input_tokens: 0
- input_tokens: 50

This is important for understanding both costs and rate limits, as input_tokens will typically be much smaller than your total input when using caching effectively.
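The formula translates directly into a helper that works on the `usage` object of a response, parsed into a dictionary. The function names are hypothetical, and the cache-hit ratio is an extra convenience metric rather than something the API reports.

```python
def total_input_tokens(usage):
    """Sum the three input token fields from a response's usage object."""
    return (
        usage.get("cache_read_input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0)
        + usage.get("input_tokens", 0)
    )


def cache_hit_ratio(usage):
    """Fraction of all input tokens that were served from cache."""
    total = total_input_tokens(usage)
    return usage.get("cache_read_input_tokens", 0) / total if total else 0.0


usage = {
    "cache_read_input_tokens": 100_000,
    "cache_creation_input_tokens": 0,
    "input_tokens": 50,
}
print(total_input_tokens(usage))  # 100050
print(f"{cache_hit_ratio(usage):.4f}")
```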
When using extended thinking with prompt caching, thinking blocks have special behavior:
Automatic caching alongside other content: While thinking blocks cannot be explicitly marked with cache_control, they get cached as part of the request content when you make subsequent API calls with tool results. This commonly happens during tool use when you pass thinking blocks back to continue the conversation.
Input token counting: When thinking blocks are read from cache, they count as input tokens in your usage metrics. This is important for cost calculation and token budgeting.
Cache invalidation patterns:
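- The cache remains valid when only tool results are provided as user messages.
- The cache is invalidated when non-tool-result user content is added, which causes all previous thinking blocks to be stripped.
- This caching behavior occurs even without explicit cache_control markers.

For more details on cache invalidation, see What invalidates the cache.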
Example with tool use:
Request 1: User: "What's the weather in Paris?"
Response: [thinking_block_1] + [tool_use block 1]
Request 2:
User: ["What's the weather in Paris?"],
Assistant: [thinking_block_1] + [tool_use block 1],
User: [tool_result_1, cache=True]
Response: [thinking_block_2] + [text block 2]
# Request 2 caches its request content (not the response)
# The cache includes: user message, thinking_block_1, tool_use block 1, and tool_result_1
Request 3:
User: ["What's the weather in Paris?"],
Assistant: [thinking_block_1] + [tool_use block 1],
User: [tool_result_1, cache=True],
Assistant: [thinking_block_2] + [text block 2],
User: [Text response, cache=True]
# Non-tool-result user block causes all thinking blocks to be ignored
# This request is processed as if thinking blocks were never present

When a non-tool-result user block is included, it designates a new assistant loop and all previous thinking blocks are removed from context.
For more detailed information, see the extended thinking documentation.
Starting February 5, 2026, prompt caching will use workspace-level isolation instead of organization-level isolation. Caches will be isolated per workspace, ensuring data separation between workspaces within the same organization. This change applies to the Claude API and Azure AI Foundry (preview); Amazon Bedrock and Google Vertex AI will maintain organization-level cache isolation. If you use multiple workspaces, review your caching strategy to account for this change.
Organization Isolation: Caches are isolated between organizations. Different organizations never share caches, even if they use identical prompts.
Exact Matching: Cache hits require 100% identical prompt segments, including all text and images up to and including the block marked with cache control.
Output Token Generation: Prompt caching has no effect on output token generation. The response you receive will be identical to what you would get if prompt caching was not used.
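To optimize prompt caching performance:
- Place static content (tool definitions, system instructions, context, examples) at the beginning of your prompt.
- Set a cache breakpoint at the end of the conversation, and just before content blocks that might be edited.
- Monitor cache performance through the usage fields in the response, and adjust your strategy as needed.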
Tailor your prompt caching strategy to your scenario:
If experiencing unexpected behavior:
- Ensure cached sections are identical and that cache_control markers are in the same locations across calls.
- Verify that tool_choice and image usage remain consistent between calls.
- For prompts with more than 20 content blocks, add cache_control parameters earlier in the prompt to ensure all content can be cached.
- Verify that the keys in your tool_use content blocks have stable ordering, as some languages (for example, Swift, Go) randomize key order during JSON conversion, breaking caches.

Changes to tool_choice or the presence/absence of images anywhere in the prompt will invalidate the cache, requiring a new cache entry to be created. For more details on cache invalidation, see What invalidates the cache.
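One way to get stable key ordering is to serialize with sorted keys before sending. The sketch below shows the idea in Python with the standard `json` module; the tool name `get_weather` and the helper name are hypothetical, and sorting keys is one possible approach rather than a documented requirement.

```python
import json


def canonical_json(value):
    """Serialize with sorted keys so equal content yields equal bytes."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


# Same content, different key order (as a randomizing serializer might produce).
first = {"name": "get_weather", "input": {"city": "Paris", "unit": "celsius"}}
second = {"input": {"unit": "celsius", "city": "Paris"}, "name": "get_weather"}

print(json.dumps(first) == json.dumps(second))          # False: order leaks through
print(canonical_json(first) == canonical_json(second))  # True: stable serialization
```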
If you find that 5 minutes is too short, Anthropic also offers a 1-hour cache duration at additional cost.
To use the extended cache, include ttl in the cache_control definition like this:
"cache_control": {
"type": "ephemeral",
"ttl": "1h"
}

The response will include detailed cache information like the following:
{
"usage": {
"input_tokens": 2048,
"cache_read_input_tokens": 1800,
"cache_creation_input_tokens": 556,
"output_tokens": 503,
"cache_creation": {
"ephemeral_5m_input_tokens": 456,
"ephemeral_1h_input_tokens": 100
}
}
}

Note that the current cache_creation_input_tokens field equals the sum of the values in the cache_creation object.
If you have prompts that are used at a regular cadence (that is, system prompts that are used more frequently than every 5 minutes), continue to use the 5-minute cache, since this will continue to be refreshed at no additional charge.
The 1-hour cache is best used when a prompt is likely to be reused less often than every 5 minutes but more often than once an hour.
The 5-minute and 1-hour cache behave the same with respect to latency. You will generally see improved time-to-first-token for long documents.
You can use both 1-hour and 5-minute cache controls in the same request, but with an important constraint: Cache entries with longer TTL must appear before shorter TTLs (that is, a 1-hour cache entry must appear before any 5-minute cache entries).
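The ordering constraint is easy to check before sending a request. This is a minimal sketch, assuming TTLs are given in prompt order as the strings `"1h"` and `"5m"`; the function name is hypothetical and the real validation is performed by the API.

```python
def ttl_order_is_valid(ttls):
    """Check that no 1-hour breakpoint appears after a 5-minute one.

    ttls lists the cache_control TTLs in prompt order, for example ["1h", "5m"].
    """
    seen_short = False
    for ttl in ttls:
        if ttl == "5m":
            seen_short = True
        elif ttl == "1h" and seen_short:
            return False
    return True


print(ttl_order_is_valid(["1h", "1h", "5m"]))  # True
print(ttl_order_is_valid(["5m", "1h"]))        # False
```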
When mixing TTLs, the API determines three billing locations in your prompt:
- A: The token count at the highest cache hit (or 0 if no hits).
- B: The token count at the highest 1-hour cache_control block after A (or equals A if none exist).
- C: The token count at the last cache_control block.

If B and/or C are larger than A, they will necessarily be cache misses, because A is the highest cache hit.
You'll be charged for:
- Cache read tokens for A.
- 1-hour cache write tokens for (B - A).
- 5-minute cache write tokens for (C - B).

[Figure: the input tokens of 3 requests, each with different cache hits and cache misses and, as a result, different calculated pricing.]
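The A/B/C billing rule can be written out as a short calculation. This sketch assumes the multipliers implied by the pricing table (0.1 for reads, 2 for 1-hour writes, 1.25 for 5-minute writes); the function name and the example positions are illustrative, and tokens after C are billed as regular input and are not included here.

```python
def mixed_ttl_charges(a, b, c, base_per_mtok):
    """Split cached-prefix charges (USD) across the three billing locations.

    a: tokens at the highest cache hit; b: tokens at the highest 1-hour
    breakpoint after a; c: tokens at the last breakpoint.
    """
    if not 0 <= a <= b <= c:
        raise ValueError("expected 0 <= A <= B <= C")
    per_token = base_per_mtok / 1_000_000
    return {
        "cache_read": a * per_token * 0.1,
        "write_1h": (b - a) * per_token * 2.0,
        "write_5m": (c - b) * per_token * 1.25,
    }


# Illustrative positions: hit at 1,000 tokens, 1-hour breakpoint at 3,000,
# last breakpoint at 4,000, with a $3 / MTok base price.
charges = mixed_ttl_charges(1_000, 3_000, 4_000, base_per_mtok=3.0)
for name, usd in charges.items():
    print(f"{name}: ${usd:.5f}")
```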
To help you get started with prompt caching, the prompt caching cookbook provides detailed examples and best practices.
The following code snippets showcase various prompt caching patterns. These examples demonstrate how to implement caching in different scenarios, helping you understand the practical applications of this feature: