The advisor tool lets a faster, lower-cost executor model consult a higher-intelligence advisor model mid-generation for strategic guidance. The advisor reads the full conversation, produces a plan or course correction (typically 400 to 700 text tokens, 1,400 to 1,800 tokens total including thinking), and the executor continues with the task.
This pattern fits long-horizon agentic workloads (coding agents, computer use, multi-step research pipelines) where most turns are mechanical but having an excellent plan is crucial. You get close to advisor-solo quality while the bulk of token generation happens at executor-model rates.
The advisor tool is in beta. Include the beta header advisor-tool-2026-03-01
in your requests. To request access or share feedback, contact your Anthropic
account team.
This feature is eligible for Zero Data Retention (ZDR). When your organization has a ZDR arrangement, data sent through this feature is not stored after the API response is returned.
Early benchmarks show meaningful gains across the supported executor-advisor pairings, but results are task-dependent. Evaluate on your own workload.
The advisor is a weaker fit for single-turn Q&A (nothing to plan), pure pass-through model pickers where your users already choose their own cost and quality tradeoff, or workloads where every turn genuinely requires the advisor model's full capability.
The executor model (the top-level model field) and the advisor model (the model field inside the tool definition) must form a valid pair. The advisor must be at least as capable as the executor.
| Executor models | Advisor models |
|---|---|
| Claude Haiku 4.5 (claude-haiku-4-5-20251001) | Claude Opus 4.6 (claude-opus-4-6) |
| Claude Sonnet 4.6 (claude-sonnet-4-6) | Claude Opus 4.6 (claude-opus-4-6) |
| Claude Opus 4.6 (claude-opus-4-6) | Claude Opus 4.6 (claude-opus-4-6) |
If you request an invalid pair, the API returns a 400 invalid_request_error naming the unsupported combination.
The advisor tool is available in beta on the Claude API (Anthropic).
When you add the advisor tool to your tools array, the executor model decides when to call it, just like any other tool. When the executor invokes the advisor:
1. The executor emits a server_tool_use block with name: "advisor" and an empty input. The executor signals timing; the server supplies context.
2. The server runs the advisor model over the full conversation transcript.
3. The advice returns to the executor in an advisor_tool_result block.

All of this happens inside a single /v1/messages request. No extra round trips on your side.
The advisor itself runs without tools and without context management. Its thinking blocks are dropped before the result returns; only the advice text reaches the executor.
| Parameter | Type | Default | Description |
|---|---|---|---|
| type | string | required | Must be "advisor_20260301". |
| name | string | required | Must be "advisor". |
| model | string | required | The advisor model ID, such as "claude-opus-4-6". Billed at this model's rates for the sub-inference. |
| max_uses | integer | unlimited | Maximum number of advisor calls allowed in a single request. Once the executor reaches this cap, further advisor calls return an advisor_tool_result_error with error_code: "max_uses_exceeded" and the executor continues without further advice. This is a per-request cap, not a per-conversation cap; see the conversation-level cap guidance below. |
| caching | object or null | null (off) | Enables prompt caching for the advisor's own transcript across calls within a conversation. See Advisor prompt caching. |
The caching object has the shape {"type": "ephemeral", "ttl": "5m" | "1h"}. Unlike cache_control on content blocks, this is not a breakpoint marker; it is an on/off switch. The server decides where cache boundaries go.
When the advisor is invoked, a server_tool_use block is followed by an advisor_tool_result block in the assistant's content:
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "Let me consult the advisor on this."
},
{
"type": "server_tool_use",
"id": "srvtoolu_abc123",
"name": "advisor",
"input": {}
},
{
"type": "advisor_tool_result",
"tool_use_id": "srvtoolu_abc123",
"content": {
"type": "advisor_result",
"text": "Use a channel-based coordination pattern. The tricky part is draining in-flight work during shutdown: close the input channel first, then wait on a WaitGroup..."
}
},
{
"type": "text",
"text": "Here's the implementation. I'm using a channel-based coordination pattern to avoid writer starvation..."
}
]
}

The server_tool_use.input is always empty. The server constructs the advisor's view from the full transcript automatically; nothing the executor puts in input reaches the advisor.
The advisor_tool_result.content field is a discriminated union. Which variant you receive depends on the advisor model:
| Variant | Fields | Returned when |
|---|---|---|
| advisor_result | text | The advisor model returns plaintext (for example, Claude Opus 4.6). |
| advisor_redacted_result | encrypted_content | The advisor model returns encrypted output. |
With advisor_result, the text field contains human-readable advice. With advisor_redacted_result, the encrypted_content field contains an opaque blob that you cannot read; on the next turn, the server decrypts it and renders the plaintext into the executor's prompt.
In both cases, round-trip the content verbatim on subsequent turns. If you switch advisor models mid-conversation, branch on content.type to handle both shapes.
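A minimal sketch of that branching for client-side logging; the helper name is illustrative, and the dict shapes follow the examples on this page:

```python
def describe_advisor_result(block: dict) -> str:
    """Summarize an advisor_tool_result block for logs, handling every variant."""
    content = block["content"]
    if content["type"] == "advisor_result":
        return content["text"]  # human-readable advice
    if content["type"] == "advisor_redacted_result":
        # Opaque client-side; the server decrypts it on the next turn.
        return "<redacted advisor output>"
    if content["type"] == "advisor_tool_result_error":
        return f"<advisor error: {content['error_code']}>"
    return "<unrecognized variant>"
```

Whatever you do with the text locally, the block you send back on the next turn must be the one you received, unmodified.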
If the advisor call fails, the result carries an error:
{
"type": "advisor_tool_result",
"tool_use_id": "srvtoolu_abc123",
"content": {
"type": "advisor_tool_result_error",
"error_code": "overloaded"
}
}

The executor sees the error and continues without further advice. The request itself does not fail.
| error_code | Meaning |
|---|---|
| max_uses_exceeded | The request reached the max_uses cap set on the tool definition. Further advisor calls in the same request return this error. |
| too_many_requests | The advisor sub-inference was rate-limited. |
| overloaded | The advisor sub-inference hit capacity limits. |
| prompt_too_long | The transcript exceeded the advisor model's context window. |
| execution_time_exceeded | The advisor sub-inference timed out. |
| unavailable | Any other advisor failure. |
Advisor rate limits draw from the same per-model bucket as direct calls to the advisor model. A rate limit on the advisor appears as too_many_requests inside the tool result; a rate limit on the executor fails the whole request with HTTP 429.
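A sketch of how the two failure surfaces differ in client code, assuming the Python SDK raises anthropic.RateLimitError on HTTP 429 and exposes these beta blocks with attribute access:

```python
import anthropic

client = anthropic.Anthropic()

try:
    response = client.beta.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        betas=["advisor-tool-2026-03-01"],
        tools=[{"type": "advisor_20260301", "name": "advisor", "model": "claude-opus-4-6"}],
        messages=[{"role": "user", "content": "Plan the refactor."}],
    )
except anthropic.RateLimitError:
    # Executor-side rate limit: the whole request failed with HTTP 429.
    # Back off and retry the entire request.
    raise

# Advisor-side rate limits never fail the request; they surface in the content.
for block in response.content:
    if (
        block.type == "advisor_tool_result"
        and block.content.type == "advisor_tool_result_error"
        and block.content.error_code == "too_many_requests"
    ):
        # The executor already continued without advice; decide whether the
        # unadvised answer is acceptable or the turn should be retried later.
        pass
```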
Pass the full assistant content, including advisor_tool_result blocks, back to the API on subsequent turns:
import anthropic
client = anthropic.Anthropic()
tools = [
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
}
]
messages = [
{
"role": "user",
"content": "Build a concurrent worker pool in Go with graceful shutdown.",
}
]
response = client.beta.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
betas=["advisor-tool-2026-03-01"],
tools=tools,
messages=messages,
)
# Append the full response content, including any advisor_tool_result blocks
messages.append({"role": "assistant", "content": response.content})
# Continue the conversation
messages.append({"role": "user", "content": "Now add a max-in-flight limit of 10."})
response = client.beta.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
betas=["advisor-tool-2026-03-01"],
tools=tools,
messages=messages,
)

If you omit the advisor tool from tools on a follow-up turn while the message history still contains advisor_tool_result blocks, the API returns a 400 invalid_request_error.
The advisor tool has no built-in conversation-level cap. To limit advisor
calls across a conversation, count them client-side. When you reach your
ceiling, remove the advisor tool from your tools array and strip all
advisor_tool_result blocks from your message history to avoid a
400 invalid_request_error.
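A sketch of that bookkeeping, assuming the history is kept as plain dicts (for example via block.model_dump() when appending response content). It also drops the paired server_tool_use blocks so no dangling tool use remains:

```python
MAX_ADVISOR_CALLS = 5  # your own conversation-level ceiling

def advisor_calls_so_far(messages: list[dict]) -> int:
    """Count advisor invocations recorded in the conversation history."""
    return sum(
        1
        for message in messages
        if message["role"] == "assistant"
        for block in message["content"]
        if block.get("type") == "server_tool_use" and block.get("name") == "advisor"
    )

def strip_advisor_blocks(messages: list[dict]) -> None:
    """Remove advisor result blocks (and their paired tool-use blocks) in place."""
    for message in messages:
        if message["role"] != "assistant":
            continue
        message["content"] = [
            block
            for block in message["content"]
            if block.get("type") != "advisor_tool_result"
            and not (block.get("type") == "server_tool_use" and block.get("name") == "advisor")
        ]

if advisor_calls_so_far(messages) >= MAX_ADVISOR_CALLS:
    tools = [tool for tool in tools if tool.get("type") != "advisor_20260301"]
    strip_advisor_blocks(messages)
```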
The advisor sub-inference does not stream. The executor's stream pauses while the advisor runs, then the full result arrives in a single event.
The server_tool_use block with name: "advisor" signals that an advisor call is starting. The pause begins when that block closes (content_block_stop). During the pause, the stream is quiet except for standard SSE ping keepalives emitted roughly every 30 seconds; short advisor calls may show no pings.
When the advisor finishes, the advisor_tool_result arrives fully formed in a single content_block_start event (no deltas). Executor output then resumes streaming.
A message_delta event follows with the updated usage.iterations array reflecting the advisor's token counts.
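In a raw event loop this looks roughly like the following, assuming the SDK's beta streaming interface passes these event types through unchanged:

```python
with client.beta.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    betas=["advisor-tool-2026-03-01"],
    tools=tools,
    messages=messages,
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            block = event.content_block
            if block.type == "server_tool_use" and block.name == "advisor":
                print("\n[advisor consulted; stream pausing]")
            elif block.type == "advisor_tool_result":
                # Arrives fully formed; this block emits no deltas.
                print("\n[advisor returned; executor resuming]")
        elif event.type == "content_block_delta" and event.delta.type == "text_delta":
            print(event.delta.text, end="", flush=True)
```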
Advisor calls run as a separate sub-inference billed at the advisor model's rates. Usage is reported in the usage.iterations[] array:
{
"usage": {
"input_tokens": 412,
"cache_read_input_tokens": 0,
"cache_creation_input_tokens": 0,
"output_tokens": 531,
"iterations": [
{
"type": "message",
"input_tokens": 412,
"cache_read_input_tokens": 0,
"cache_creation_input_tokens": 0,
"output_tokens": 89
},
{
"type": "advisor_message",
"model": "claude-opus-4-6",
"input_tokens": 823,
"cache_read_input_tokens": 0,
"cache_creation_input_tokens": 0,
"output_tokens": 1612
},
{
"type": "message",
"input_tokens": 1348,
"cache_read_input_tokens": 412,
"cache_creation_input_tokens": 0,
"output_tokens": 442
}
]
}
}

Top-level usage fields reflect executor tokens only. Advisor tokens are not rolled into the top-level totals because they are billed at a different rate. Iterations with type: "advisor_message" are billed at the advisor model's rates; iterations with type: "message" are billed at the executor model's rates.
The aggregation rules differ by field. Top-level output_tokens is the sum of all executor iterations. Top-level input_tokens and cache_read_input_tokens reflect the first executor iteration only; subsequent executor iterations' inputs are not re-summed because they include prior output tokens. Use usage.iterations for a full per-iteration breakdown when building cost-tracking logic.
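A sketch of such cost-tracking logic; the per-million-token rates below are placeholders rather than published pricing, and cache token rates are omitted for brevity:

```python
# Placeholder rates in dollars per million tokens; substitute real pricing.
RATES = {
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-opus-4-6": {"input": 15.00, "output": 75.00},
}

def request_cost(usage: dict, executor_model: str) -> float:
    """Price each iteration at its own model's rates and sum."""
    total = 0.0
    for iteration in usage["iterations"]:
        # advisor_message iterations carry a model field; message iterations
        # run on the top-level executor model.
        model = iteration.get("model", executor_model)
        rates = RATES[model]
        total += iteration["input_tokens"] / 1e6 * rates["input"]
        total += iteration["output_tokens"] / 1e6 * rates["output"]
    return total
```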
Advisor output is typically 400 to 700 text tokens, or 1,400 to 1,800 tokens total including thinking. The cost savings come from the advisor not generating your full final output; the executor does that at its lower rate.
The top-level max_tokens applies to executor output only. It does not bound advisor sub-inference tokens. The advisor's tokens also do not draw from any task budget applied to the executor.
There are two independent caching layers.
The advisor_tool_result block is cacheable like any other content block. A cache_control breakpoint placed after it on a subsequent turn will hit. The executor's prompt always contains the plaintext advice regardless of whether your client received text or encrypted_content, so caching behavior is identical for both result variants.
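For example, when appending the assistant turn to history you might mark a breakpoint on its final block, so the cached prefix covers the advice (a sketch; breakpoint placement is your choice):

```python
assistant_turn = {
    "role": "assistant",
    "content": [block.model_dump() for block in response.content],
}
# Breakpoint at the end of the turn; the cached prefix includes the advisor result.
assistant_turn["content"][-1]["cache_control"] = {"type": "ephemeral"}
messages.append(assistant_turn)
```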
Set caching on the tool definition to enable prompt caching for the advisor's own transcript across calls within the same conversation:
tools = [
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
"caching": {"type": "ephemeral", "ttl": "5m"},
}
]

The advisor's prompt on the Nth call is the (N-1)th call's prompt with one more segment appended, so the prefix is stable across calls. With caching enabled, each advisor call writes a cache entry; the next call reads up to that point and pays only for the delta. You'll see cache_read_input_tokens become nonzero on the second and later advisor_message iterations.
When to enable it: The cache write costs more than the reads save when the advisor is called two or fewer times per conversation. Caching breaks even at roughly three advisor calls and improves from there. Enable it for long agent loops; keep it off for short tasks.
Keep it consistent: Set caching once and leave it for the whole conversation. Toggling it off and on mid-conversation causes cache misses.
clear_thinking with a keep
value other than "all" shifts the advisor's quoted transcript each turn,
causing advisor-side cache misses. This is a cost degradation only; advice
quality is unaffected. When extended thinking is enabled without explicit
clear_thinking configuration, the API defaults to
keep: {type: "thinking_turns", value: 1}, which triggers this behavior.
Set keep: "all" to preserve advisor cache stability.
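A sketch of the stable configuration; the context management beta name and the edit type string here are assumptions based on the context editing API and may differ from the shipped values:

```python
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    # Second beta name is an assumption; check the context editing docs.
    betas=["advisor-tool-2026-03-01", "context-management-2025-06-27"],
    context_management={
        "edits": [
            {
                "type": "clear_thinking_20251015",  # assumed edit type string
                "keep": "all",  # preserves the advisor-side cache prefix
            }
        ]
    },
    tools=tools,
    messages=messages,
)
```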
The advisor tool composes with other server-side and client-side tools. Add them all to the same tools array:
tools = [
{
"type": "web_search_20250305",
"name": "web_search",
"max_uses": 5,
},
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6",
},
{
"name": "run_bash",
"description": "Run a bash command",
"input_schema": {
"type": "object",
"properties": {"command": {"type": "string"}},
},
},
]

The executor can search the web, call the advisor, and use your custom tools in the same turn. The advisor's plan can inform which tools the executor reaches for next.
| Feature | Interaction |
|---|---|
| Batch processing | Supported. usage.iterations is reported per item. |
| Token counting | Returns the executor's first-iteration input tokens only. For a rough advisor estimate, call count_tokens with model set to the advisor model and the same messages. |
| Context editing | clear_tool_uses is not yet fully compatible with advisor tool blocks; full support is planned for a follow-up release. With clear_thinking, see the caching warning above. |
| pause_turn | A dangling advisor call ends the response with stop_reason: "pause_turn" and the server_tool_use block as the last content block. The advisor executes on resumption. See Server tools. |
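The resumption pattern matches other server tools: append the paused turn verbatim and re-send. A sketch:

```python
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    betas=["advisor-tool-2026-03-01"],
    tools=tools,
    messages=messages,
)
while response.stop_reason == "pause_turn":
    # The dangling advisor call executes when the turn is resumed.
    messages.append({"role": "assistant", "content": response.content})
    response = client.beta.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        betas=["advisor-tool-2026-03-01"],
        tools=tools,
        messages=messages,
    )
```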
The advisor tool ships with a built-in description that nudges the executor to call it near the start of complex tasks and when it hits difficulty. For research tasks, no additional prompting is typically needed.
On coding and agent tasks, the advisor produces higher intelligence at similar cost when it reduces total tool calls and conversation length. Two timings drive this improvement: an early call before the executor commits to an approach, and a completion check before it declares the task done.
If your agent exposes other planner-like tools (for example, a todo list tool), prompt the model to call the advisor before those tools so the advisor's plan funnels into them. The suggested system prompt below reinforces the early-call pattern; add your own funnel-in sentence pointing at whichever planner tools your agent exposes.
For coding tasks where you want consistent advisor timing and around two to three calls per task, prepend the following blocks to your executor system prompt before any other sentences that mention the advisor. On internal coding evaluations this pattern produced the highest intelligence at near-Sonnet cost.
Timing guidance:
You have access to an `advisor` tool backed by a stronger reviewer model. It takes NO parameters — when you call advisor(), your entire conversation history is automatically forwarded. They see the task, every tool call you've made, every result you've seen.
Call advisor BEFORE substantive work — before writing, before committing to an interpretation, before building on an assumption. If the task requires orientation first (finding files, fetching a source, seeing what's there), do that, then call advisor. Orientation is not substantive work. Writing, editing, and declaring an answer are.
Also call advisor:
- When you believe the task is complete. BEFORE this call, make your deliverable durable: write the file, save the result, commit the change. The advisor call takes time; if the session ends during it, a durable result persists and an unwritten one doesn't.
- When stuck — errors recurring, approach not converging, results that don't fit.
- When considering a change of approach.
On tasks longer than a few steps, call advisor at least once before committing to an approach and once before declaring done. On short reactive tasks where the next action is dictated by tool output you just read, you don't need to keep calling — the advisor adds most of its value on the first call, before the approach crystallizes.

How the executor should treat the advice (place directly after the timing block):
Give the advice serious weight. If you follow a step and it fails empirically, or you have primary-source evidence that contradicts a specific claim (the file says X, the paper states Y), adapt. A passing self-test is not evidence the advice is wrong — it's evidence your test doesn't check what the advice is checking.
If you've already retrieved data pointing one way and the advisor points another: don't silently switch. Surface the conflict in one more advisor call — "I found X, you suggest Y, which constraint breaks the tie?" The advisor saw your evidence but may have underweighted it; a reconcile call is cheaper than committing to the wrong branch.

Advisor output is the advisor's largest cost driver. To reduce that cost, prepend a single conciseness instruction to the system prompt before any other sentence that mentions the advisor. In internal testing, the following line cut total advisor output tokens by roughly 35 to 45 percent without changing call frequency:
The advisor should respond in under 100 words and use enumerated steps, not explanations.

Pair this with the timing block above for the strongest cost-versus-quality tradeoff.
For coding tasks, pairing a Sonnet executor at medium effort with an Opus advisor achieves intelligence comparable to Sonnet at default effort, at lower cost. For maximum intelligence, keep the executor at default effort.
- To stop advisor use mid-conversation, remove the advisor tool from tools and strip all advisor_tool_result blocks from your message history to avoid a 400 invalid_request_error.
- Enable caching only for conversations where you expect three or more advisor calls.
- max_tokens applies to executor output only. It does not bound advisor tokens.

The same first request, as a curl example:
curl https://api.anthropic.com/v1/messages \
--header "x-api-key: $ANTHROPIC_API_KEY" \
--header "anthropic-version: 2023-06-01" \
--header "anthropic-beta: advisor-tool-2026-03-01" \
--header "content-type: application/json" \
--data '{
"model": "claude-sonnet-4-6",
"max_tokens": 4096,
"tools": [
{
"type": "advisor_20260301",
"name": "advisor",
"model": "claude-opus-4-6"
}
],
"messages": [{
"role": "user",
"content": "Build a concurrent worker pool in Go with graceful shutdown."
}]
}'