Extended thinking gives Claude enhanced reasoning capabilities for complex tasks, while providing varying levels of transparency into its step-by-step thought process before it delivers its final answer.
For Claude Opus 4.6, we recommend using adaptive thinking (thinking: {type: "adaptive"}) with the effort parameter instead of the manual thinking mode described on this page. The manual thinking: {type: "enabled", budget_tokens: N} configuration is deprecated on Opus 4.6 and will be removed in a future model release.
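For example, a minimal adaptive thinking request might look like the following sketch (Python SDK; it assumes `claude-opus-4-6` as the model alias and omits the optional effort parameter, which is covered in Adaptive thinking):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Adaptive thinking: no budget_tokens; the model decides how much to think.
response = client.messages.create(
    model="claude-opus-4-6",  # assumed alias for Claude Opus 4.6
    max_tokens=16000,
    thinking={"type": "adaptive"},
    messages=[
        {"role": "user", "content": "Are there an infinite number of prime numbers such that n mod 4 == 3?"}
    ],
)
print(response.content)
```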
Extended thinking is supported in the following models:
- Claude Opus 4.6 (claude-opus-4-6) — adaptive thinking recommended; manual mode (type: "enabled") is deprecated
- Claude Opus 4.5 (claude-opus-4-5-20251101)
- Claude Opus 4.1 (claude-opus-4-1-20250805)
- Claude Opus 4 (claude-opus-4-20250514)
- Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
- Claude Sonnet 4 (claude-sonnet-4-20250514)
- Claude Sonnet 3.7 (claude-3-7-sonnet-20250219) (deprecated)
- Claude Haiku 4.5 (claude-haiku-4-5-20251001)

API behavior differs across Claude Sonnet 3.7 and Claude 4 models, but the API shapes remain exactly the same.
For more information, see Differences in thinking across model versions.
When extended thinking is turned on, Claude creates thinking content blocks where it outputs its internal reasoning. Claude incorporates insights from this reasoning before crafting a final response.
The API response will include thinking content blocks, followed by text content blocks.
Here's an example of the default response format:
{
"content": [
{
"type": "thinking",
"thinking": "Let me analyze this step by step...",
"signature": "WaUjzkypQ2mUEVM36O2TxuC06KN8xyfbJwyem2dw3URve/op91XWHOEBLLqIOMfFG/UvLEczmEsUjavL...."
},
{
"type": "text",
"text": "Based on my analysis..."
}
]
}

For more information about the response format of extended thinking, see the Messages API Reference.
Here is an example of using extended thinking in the Messages API:
curl https://api.anthropic.com/v1/messages \
--header "x-api-key: $ANTHROPIC_API_KEY" \
--header "anthropic-version: 2023-06-01" \
--header "content-type: application/json" \
--data \
'{
"model": "claude-sonnet-4-5",
"max_tokens": 16000,
"thinking": {
"type": "enabled",
"budget_tokens": 10000
},
"messages": [
{
"role": "user",
"content": "Are there an infinite number of prime numbers such that n mod 4 == 3?"
}
]
}'

To turn on extended thinking, add a thinking object, setting type to enabled and budget_tokens to the token budget you want to allow for extended thinking. For Claude Opus 4.6, we recommend using type: "adaptive" instead — see Adaptive thinking for details. While type: "enabled" with budget_tokens is still supported on Opus 4.6, it is deprecated and will be removed in a future release.
The budget_tokens parameter determines the maximum number of tokens Claude is allowed to use for its internal reasoning process. In Claude 4 and later models, this limit applies to full thinking tokens, and not to the summarized output. Larger budgets can improve response quality by enabling more thorough analysis for complex problems, although Claude may not use the entire budget allocated, especially at ranges above 32k.
budget_tokens is deprecated on Claude Opus 4.6 and will be removed in a future model release. We recommend using adaptive thinking with the effort parameter to control thinking depth instead.
Claude Opus 4.6 supports up to 128K output tokens. Earlier models support up to 64K output tokens.
budget_tokens must be set to a value less than max_tokens. However, when using interleaved thinking with tools, you can exceed this limit as the token limit becomes your entire context window (200k tokens).
With extended thinking enabled, the Messages API for Claude 4 models returns a summary of Claude's full thinking process. Summarized thinking provides the full intelligence benefits of extended thinking, while preventing misuse.
Here are some important considerations for summarized thinking:
- Claude Sonnet 3.7 continues to return full thinking output.
- In rare cases where you need access to full thinking output for Claude 4 models, contact our sales team.
You can stream extended thinking responses using server-sent events (SSE).
When streaming is enabled for extended thinking, you receive thinking content via thinking_delta events.
For more documentation on streaming via the Messages API, see Streaming Messages.
Here's how to handle streaming with thinking:
curl https://api.anthropic.com/v1/messages \
--header "x-api-key: $ANTHROPIC_API_KEY" \
--header "anthropic-version: 2023-06-01" \
--header "content-type: application/json" \
--data \
'{
"model": "claude-sonnet-4-5",
"max_tokens": 16000,
"stream": true,
"thinking": {
"type": "enabled",
"budget_tokens": 10000
},
"messages": [
{
"role": "user",
"content": "What is the greatest common divisor of 1071 and 462?"
}
]
}'

Example streaming output:
event: message_start
data: {"type": "message_start", "message": {"id": "msg_01...", "type": "message", "role": "assistant", "content": [], "model": "claude-sonnet-4-5", "stop_reason": null, "stop_sequence": null}}
event: content_block_start
data: {"type": "content_block_start", "index": 0, "content_block": {"type": "thinking", "thinking": ""}}
event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "thinking_delta", "thinking": "I need to find the GCD of 1071 and 462 using the Euclidean algorithm.\n\n1071 = 2 × 462 + 147"}}
event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "thinking_delta", "thinking": "\n462 = 3 × 147 + 21\n147 = 7 × 21 + 0\n\nSo GCD(1071, 462) = 21"}}
// Additional thinking deltas...
event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "signature_delta", "signature": "EqQBCgIYAhIM1gbcDa9GJwZA2b3hGgxBdjrkzLoky3dl1pkiMOYds..."}}
event: content_block_stop
data: {"type": "content_block_stop", "index": 0}
event: content_block_start
data: {"type": "content_block_start", "index": 1, "content_block": {"type": "text", "text": ""}}
event: content_block_delta
data: {"type": "content_block_delta", "index": 1, "delta": {"type": "text_delta", "text": "The greatest common divisor of 1071 and 462 is **21**."}}
// Additional text deltas...
event: content_block_stop
data: {"type": "content_block_stop", "index": 1}
event: message_delta
data: {"type": "message_delta", "delta": {"stop_reason": "end_turn", "stop_sequence": null}}
event: message_stop
data: {"type": "message_stop"}When using streaming with thinking enabled, you might notice that text sometimes arrives in larger chunks alternating with smaller, token-by-token delivery. This is expected behavior, especially for thinking content.
The streaming system needs to process content in batches for optimal performance, which can result in this "chunky" delivery pattern, with possible delays between streaming events. We're continuously working to improve this experience, with future updates focused on making thinking content stream more smoothly.
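As a sketch of one way to consume these events with the Python SDK (the event and delta types match the stream shown above):

```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[
        {"role": "user", "content": "What is the greatest common divisor of 1071 and 462?"}
    ],
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)  # reasoning
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)  # final answer
    # If you don't need incremental events, you can instead call:
    # message = stream.get_final_message()
```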
Extended thinking can be used alongside tool use, allowing Claude to reason through tool selection and results processing.
When using extended thinking with tool use, be aware of the following limitations:
- Tool choice limitation: Tool use with thinking only supports tool_choice: {"type": "auto"} (the default) or tool_choice: {"type": "none"}. Using tool_choice: {"type": "any"} or tool_choice: {"type": "tool", "name": "..."} will result in an error because these options force tool use, which is incompatible with extended thinking.
- Preserving thinking blocks: During tool use, you must pass thinking blocks back to the API for the last assistant message. Include the complete unmodified block back to the API to maintain reasoning continuity.
You cannot toggle thinking in the middle of an assistant turn, including during tool use loops. The entire assistant turn should operate in a single thinking mode:
From the model's perspective, tool use loops are part of the assistant turn. An assistant turn doesn't complete until Claude finishes its full response, which may include multiple tool calls and results.
For example, this sequence is all part of a single assistant turn:
User: "What's the weather in Paris?"
Assistant: [thinking] + [tool_use: get_weather]
User: [tool_result: "20°C, sunny"]
Assistant: [text: "The weather in Paris is 20°C and sunny"]Even though there are multiple API messages, the tool use loop is conceptually part of one continuous assistant response.
When a mid-turn thinking conflict occurs (such as toggling thinking on or off during a tool use loop), the API automatically disables thinking for that request in order to preserve model quality and keep the model on-distribution.
This means that attempting to toggle thinking mid-turn won't cause an error, but thinking will be silently disabled for that request. To confirm whether thinking was active, check for the presence of thinking blocks in the response.
Best practice: Plan your thinking strategy at the start of each turn rather than trying to toggle mid-turn.
Example: Toggling thinking after completing a turn
User: "What's the weather?"
Assistant: [tool_use] (thinking disabled)
User: [tool_result]
Assistant: [text: "It's sunny"]
User: "What about tomorrow?"
Assistant: [thinking] + [text: "..."] (thinking enabled - new turn)

By completing the assistant turn before toggling thinking, you ensure that thinking is actually enabled for the new request.
Toggling thinking modes also invalidates prompt caching for message history. For more details, see the Extended thinking with prompt caching section.
During tool use, you must pass thinking blocks back to the API, complete and unmodified. This is critical for maintaining the model's reasoning flow and conversation integrity.
While you can omit thinking blocks from prior assistant role turns, we suggest always passing back all thinking blocks to the API for any multi-turn conversation. The API will automatically strip any thinking blocks that are no longer needed.
When toggling thinking modes during a conversation, remember that the entire assistant turn (including tool use loops) must operate in a single thinking mode. For more details, see Toggling thinking modes in conversations.
When Claude invokes tools, it is pausing its construction of a response to await external information. When tool results are returned, Claude will continue building that existing response. This necessitates preserving thinking blocks during tool use, for a couple of reasons:
- Reasoning continuity: The thinking blocks capture Claude's step-by-step reasoning that led to tool requests. When you post tool results, including the original thinking ensures Claude can continue its reasoning from where it left off.
- Context maintenance: While tool results appear as user messages in the API structure, they're part of a continuous reasoning flow. Preserving thinking blocks maintains this conceptual flow across multiple API calls. For more information on context management, see our guide on context windows.
Important: When providing thinking blocks, the entire sequence of consecutive thinking blocks must match the outputs generated by the model during the original request; you cannot rearrange or modify the sequence of these blocks.
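Here is a minimal sketch of that round trip using the Python SDK. The get_weather tool and its hard-coded result are hypothetical; the key detail is that first.content is passed back verbatim as the assistant message, thinking block and signature included:

```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",  # hypothetical tool for illustration
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

first = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
)

tool_use = next(block for block in first.content if block.type == "tool_use")

# Pass the ENTIRE assistant content back unmodified (thinking block,
# signature, and tool_use block), then append the tool result.
second = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    tools=tools,
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": first.content},
        {"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "content": "20°C, sunny",  # hypothetical tool output
        }]},
    ],
)
print(second.content)
```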
Extended thinking with tool use in Claude 4 models supports interleaved thinking, which enables Claude to think between tool calls and make more sophisticated reasoning after receiving tool results.
With interleaved thinking, Claude can:

- Reason about the results of a tool call before deciding what to do next
- Chain multiple tool calls with reasoning steps in between
- Make more nuanced decisions based on intermediate results
For Claude Opus 4.6, interleaved thinking is automatically enabled when using adaptive thinking — no beta header is needed.
For Claude 4 models, add the beta header interleaved-thinking-2025-05-14 to your API request to enable interleaved thinking.
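For example (a sketch; the header is passed through the Python SDK's extra_headers, and tools would be included as in any tool use request):

```python
import anthropic

client = anthropic.Anthropic()

# Only the beta header changes; thinking, tools, and messages are as usual.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
)
```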
Here are some important considerations for interleaved thinking:
- budget_tokens can exceed the max_tokens parameter, as it represents the total budget across all thinking blocks within one assistant turn.
- Except on Claude Opus 4.6 with adaptive thinking (where it is automatic), interleaved thinking requires the beta header interleaved-thinking-2025-05-14.
- On the Claude API, you can pass interleaved-thinking-2025-05-14 in requests to any model, with no effect.
- On third-party platforms (such as Amazon Bedrock and Vertex AI), if you pass interleaved-thinking-2025-05-14 to any model aside from Claude Opus 4.6, Claude Opus 4.5, Claude Opus 4.1, Opus 4, or Sonnet 4, your request will fail.

Prompt caching with thinking has several important considerations:
Extended thinking tasks often take longer than 5 minutes to complete. Consider using the 1-hour cache duration to maintain cache hits across longer thinking sessions and multi-step workflows.
- Thinking block context removal: thinking blocks from previous turns are removed from context, which can affect cache hits
- Cache invalidation patterns: changes to thinking parameters (such as toggling thinking on or off) invalidate message cache breakpoints
While thinking blocks are removed for caching and context calculations, they must be preserved when continuing conversations with tool use, especially with interleaved thinking.
When using extended thinking with tool use, thinking blocks exhibit specific caching behavior that affects token counting:
How this works is easiest to see in a detailed example flow:
Request 1:
User: "What's the weather in Paris?"Response 1:
[thinking_block_1] + [tool_use block 1]

Request 2:
User: ["What's the weather in Paris?"],
Assistant: [thinking_block_1] + [tool_use block 1],
User: [tool_result_1, cache=True]

Response 2:
[thinking_block_2] + [text block 2]

Request 2 writes a cache of the request content (not the response). The cache includes the original user message, the first thinking block, tool use block, and the tool result.
Request 3:
User: ["What's the weather in Paris?"],
Assistant: [thinking_block_1] + [tool_use block 1],
User: [tool_result_1, cache=True],
Assistant: [thinking_block_2] + [text block 2],
User: [Text response, cache=True]

For Claude Opus 4.5 and later (including Claude Opus 4.6), all previous thinking blocks are kept by default. For older models, because a non-tool-result user block was included, all previous thinking blocks are ignored. On those older models, this request will be processed the same as:
User: ["What's the weather in Paris?"],
Assistant: [tool_use block 1],
User: [tool_result_1, cache=True],
Assistant: [text block 2],
User: [Text response, cache=True]

Key points:

- Caching follows the cache_control markers you set on request content, and a cached prefix includes any thinking blocks it covers
- On models prior to Claude Opus 4.5, including a non-tool-result user message causes all previous thinking blocks to be ignored, as shown above
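As a sketch, the cache=True annotations in the flow above correspond to a cache_control marker on the relevant content block. The helper below builds the Request 2 user message from the flow (the tool_use_id and result string are placeholders):

```python
def tool_result_with_cache(tool_use_id: str, result: str) -> dict:
    """Build the tool result message from Request 2 above, placing the
    cache breakpoint ("cache=True") on the tool_result block."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,  # id from the tool_use block in Response 1
            "content": result,
            "cache_control": {"type": "ephemeral"},
        }],
    }
```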
In older Claude models (prior to Claude Sonnet 3.7), if the sum of prompt tokens and max_tokens exceeded the model's context window, the system would automatically adjust max_tokens to fit within the context limit. This meant you could set a large max_tokens value and the system would silently reduce it as needed.
With Claude 3.7 and 4 models, max_tokens (which includes your thinking budget when thinking is enabled) is enforced as a strict limit. The system will now return a validation error if prompt tokens + max_tokens exceeds the context window size.
You can read through our guide on context windows for a more thorough deep dive.
When calculating context window usage with thinking enabled, there are some considerations to be aware of:
- Thinking blocks from previous turns are stripped and not counted toward your context window
- Thinking from the current turn counts toward your max_tokens limit for that turn

The diagram below demonstrates the specialized token management when extended thinking is enabled:
The effective context window is calculated as:
context window =
(current input tokens - previous thinking tokens) +
(thinking tokens + encrypted thinking tokens + text output tokens)

We recommend using the token counting API to get accurate token counts for your specific use case, especially when working with multi-turn conversations that include thinking.
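For example, a sketch using the Python SDK's count_tokens method (including the same thinking configuration you plan to send keeps the count consistent with your real request):

```python
import anthropic

client = anthropic.Anthropic()

count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[
        {"role": "user", "content": "Are there an infinite number of prime numbers such that n mod 4 == 3?"}
    ],
)
print(count.input_tokens)  # counts input-side tokens only
```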
When using extended thinking with tool use, thinking blocks must be explicitly preserved and returned with the tool results.
The effective context window calculation for extended thinking with tool use becomes:
context window =
(current input tokens + previous thinking tokens + tool use tokens) +
(thinking tokens + encrypted thinking tokens + text output tokens)

The diagram below illustrates token management for extended thinking with tool use:
Given the context window and max_tokens behavior with extended thinking in Claude 3.7 and 4 models, you may need to:

- Monitor token usage more closely
- Adjust max_tokens values as your prompt length changes

This change has been made to provide more predictable and transparent behavior, especially as maximum token limits have increased significantly.
Full thinking content is encrypted and returned in the signature field. This field is used to verify that thinking blocks were generated by Claude when passed back to the API.
It is only strictly necessary to send back thinking blocks when using tools with extended thinking. Otherwise you can omit thinking blocks from previous turns, or let the API strip them for you if you pass them back.
If sending back thinking blocks, we recommend passing everything back as you received it for consistency and to avoid potential issues.
Here are some important considerations on thinking encryption:
- When streaming, the signature arrives via a signature_delta inside a content_block_delta event just before the content_block_stop event.
- signature values are significantly longer in Claude 4 models than in previous models.
- The signature field is an opaque field and should not be interpreted or parsed - it exists solely for verification purposes.
- signature values are compatible across platforms (Claude APIs, Amazon Bedrock, and Vertex AI). Values generated on one platform will be compatible with another.

Occasionally Claude's internal reasoning will be flagged by our safety systems. When this occurs, we encrypt some or all of the thinking block and return it to you as a redacted_thinking block. redacted_thinking blocks are decrypted when passed back to the API, allowing Claude to continue its response without losing context.
When building customer-facing applications that use extended thinking:

- Be aware that redacted thinking blocks contain encrypted content that isn't human-readable
- Consider providing a simple explanation such as: "Some of Claude's internal reasoning has been automatically encrypted for safety reasons. This doesn't affect the quality of responses."
- If showing thinking blocks to users, you can filter out redacted blocks while preserving normal thinking blocks
Here's an example showing both normal and redacted thinking blocks:
{
"content": [
{
"type": "thinking",
"thinking": "Let me analyze this step by step...",
"signature": "WaUjzkypQ2mUEVM36O2TxuC06KN8xyfbJwyem2dw3URve/op91XWHOEBLLqIOMfFG/UvLEczmEsUjavL...."
},
{
"type": "redacted_thinking",
"data": "EmwKAhgBEgy3va3pzix/LafPsn4aDFIT2Xlxh0L5L8rLVyIwxtE3rAFBa8cr3qpPkNRj2YfWXGmKDxH4mPnZ5sQ7vB9URj2pLmN3kF8/dW5hR7xJ0aP1oLs9yTcMnKVf2wRpEGjH9XZaBt4UvDcPrQ..."
},
{
"type": "text",
"text": "Based on my analysis..."
}
]
}

Seeing redacted thinking blocks in your output is expected behavior. The model can still use this redacted reasoning to inform its responses while maintaining safety guardrails.
If you need to test redacted thinking handling in your application, you can use this special test string as your prompt: ANTHROPIC_MAGIC_STRING_TRIGGER_REDACTED_THINKING_46C9A13E193C177646C7398A98432ECCCE4C1253D5E2D82641AC0E52CC2876CB
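For example, a sketch of a test request (any thinking-enabled request with this string as the prompt should return redacted_thinking blocks):

```python
import anthropic

client = anthropic.Anthropic()

MAGIC_STRING = "ANTHROPIC_MAGIC_STRING_TRIGGER_REDACTED_THINKING_46C9A13E193C177646C7398A98432ECCCE4C1253D5E2D82641AC0E52CC2876CB"

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": MAGIC_STRING}],
)

# Expect at least one block of type "redacted_thinking" with opaque data.
print([block.type for block in response.content])
```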
When passing thinking and redacted_thinking blocks back to the API in a multi-turn conversation, you must include the complete unmodified block back to the API for the last assistant turn. This is critical for maintaining the model's reasoning flow. We suggest always passing back all thinking blocks to the API. For more details, see the Preserving thinking blocks section.
The Messages API handles thinking differently across Claude Sonnet 3.7 and Claude 4 models, primarily in redaction and summarization behavior.
See the table below for a condensed comparison:
| Feature | Claude Sonnet 3.7 | Claude 4 Models (pre-Opus 4.5) | Claude Opus 4.5 | Claude Opus 4.6 (adaptive thinking) |
|---|---|---|---|---|
| Thinking Output | Returns full thinking output | Returns summarized thinking | Returns summarized thinking | Returns summarized thinking |
| Interleaved Thinking | Not supported | Supported with interleaved-thinking-2025-05-14 beta header | Supported with interleaved-thinking-2025-05-14 beta header | Automatic with adaptive thinking (no beta header needed) |
| Thinking Block Preservation | Not preserved across turns | Not preserved across turns | Preserved by default | Preserved by default |
Starting with Claude Opus 4.5 (and continuing in Claude Opus 4.6), thinking blocks from previous assistant turns are preserved in model context by default. This differs from earlier models, which remove thinking blocks from prior turns.
Benefits of thinking block preservation:

- Reasoning continuity: Claude retains access to its prior reasoning across turns, which helps in long, tool-heavy agentic workflows
- Improved cache efficiency: message history no longer changes when thinking blocks would otherwise be stripped, so prompt cache hits are more stable

Important considerations:

- Preserved thinking blocks count toward context window usage on subsequent turns
- As always, pass thinking blocks back complete and unmodified
For earlier models (Claude Sonnet 4.5, Opus 4.1, etc.), thinking blocks from previous turns continue to be removed from context. The existing behavior described in the Extended thinking with prompt caching section applies to those models.
For complete pricing information including base rates, cache writes, cache hits, and output tokens, see the pricing page.
The thinking process incurs charges for:

- Tokens used during thinking (billed as output tokens)
- Thinking blocks included in subsequent requests (billed as input tokens)
- Standard text output tokens
When extended thinking is enabled, a specialized system prompt is automatically included to support this feature.
When using summarized thinking:

- Input tokens: tokens in your original request (excludes thinking tokens from previous turns)
- Output tokens (billed): the original thinking tokens that Claude generated internally
- Output tokens (visible): the summarized thinking tokens you see in the response
- No charge: tokens used to generate the summary
The billed output token count will not match the visible token count in the response. You are billed for the full thinking process, not the summary you see.
- The SDKs require streaming when max_tokens is greater than 21,333 to avoid HTTP timeouts on long-running requests. This is a client-side validation, not an API restriction. If you don't need to process events incrementally, use .stream() with .get_final_message() (Python) or .finalMessage() (TypeScript) to get the complete Message object without handling individual events — see Streaming Messages for details. When streaming, be prepared to handle both thinking and text content blocks as they arrive.
- Thinking isn't compatible with temperature or top_k modifications, as well as forced tool use.
- When thinking is enabled, you can set top_p to values between 1 and 0.95.

Explore practical examples of thinking in our cookbook.
Learn prompt engineering best practices for extended thinking.