This feature is eligible for Zero Data Retention (ZDR). When your organization has a ZDR arrangement, data sent through this feature is not stored after the API response is returned.
Server-side compaction is the recommended strategy for managing context in long-running conversations and agentic workflows. It handles context management automatically with minimal integration work.

Compaction extends the effective context length of long-running conversations and tasks by automatically summarizing older context as you approach the context window limit. This is about more than staying under the token cap: as a conversation grows longer, the model has a harder time staying focused across the entire history. By replacing stale content with a concise summary, compaction keeps the active context focused and performant.

For a deeper look at why long context degrades and how compaction helps, see Effective context engineering.

Compaction is in beta. Include the beta header compact-2026-01-12 in your API requests to use this feature.

Compaction is supported on the following models:
- claude-mythos-preview
- claude-opus-4-7
- claude-opus-4-6
- claude-sonnet-4-6

With compaction enabled, the API automatically summarizes your conversation when Claude approaches the configured token threshold. The API summarizes the older conversation history and returns the summary as a compaction block. On subsequent requests, append the response to your messages; the API automatically drops all message blocks that precede the compaction block and continues the conversation from the summary.

Enable compaction by adding the compact_20260112 strategy to context_management.edits in your Messages API request.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `type` | string | required | Must be `"compact_20260112"` |
| `trigger` | object | 150,000 tokens | When to trigger compaction. Must be at least 50,000 tokens. |
| `pause_after_compaction` | boolean | `false` | Whether to pause after the compaction summary is generated |
| `instructions` | string | `null` | Custom summarization prompt. When provided, completely replaces the default prompt. |
Configure when compaction triggers using the trigger parameter:

By default, compaction uses the following summarization prompt:
```
You have written a partial transcript for the initial task above. Please write a summary of the transcript. The purpose of this summary is to provide continuity so you can continue to make progress towards solving the task in a future context, where the raw history above may not be accessible and will be replaced with this summary. Write down anything that would be helpful, including the state, next steps, learnings etc. You must wrap your summary in a <summary></summary> block.
```

You can provide custom instructions via the instructions parameter to completely replace this prompt. Custom instructions do not supplement the default; they replace it entirely:
Use pause_after_compaction to pause after the API generates the compaction summary. This lets you add extra content blocks (for example, to preserve recent messages or insert steering instructions) before the API continues its response.

When enabled, the API returns a message with a compaction stop reason after generating the compaction block:

When the model works on a long task with many tool-use iterations, total token consumption can grow substantially. You can combine pause_after_compaction with a compaction counter to estimate cumulative usage and wrap up the task gracefully once a budget is reached:
```python
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]
TRIGGER_THRESHOLD = 100_000
TOTAL_TOKEN_BUDGET = 3_000_000
n_compactions = 0

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [
            {
                "type": "compact_20260112",
                "trigger": {"type": "input_tokens", "value": TRIGGER_THRESHOLD},
                "pause_after_compaction": True,
            }
        ]
    },
)

if response.stop_reason == "compaction":
    n_compactions += 1
    messages.append({"role": "assistant", "content": response.content})

    # Estimate total tokens consumed; prompt wrap-up if over budget
    if n_compactions * TRIGGER_THRESHOLD >= TOTAL_TOKEN_BUDGET:
        messages.append(
            {
                "role": "user",
                "content": "Please wrap up your current work and summarize the final state.",
            }
        )
```

When compaction is triggered, the API returns a compaction block at the start of the assistant's response.
Long-running conversations may trigger multiple compactions. The last compaction block reflects the final state of the prompt, replacing the content that precedes it with the generated summary.
```json
{
  "content": [
    {
      "type": "compaction",
      "content": "Summary of the conversation: The user requested help building a web scraper..."
    },
    {
      "type": "text",
      "text": "Based on our conversation so far..."
    }
  ]
}
```

You must pass the compaction block back to the API in subsequent requests to continue the conversation with the shortened prompt. The simplest approach is to append the entire response content to your messages:
When the API receives a compaction block, all content blocks that precede it are ignored.
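To illustrate how this works, the helper below is a sketch (`effective_blocks` is a hypothetical name, not part of the SDK): under the rules above, the blocks that remain effective in a prompt are the last compaction block and everything after it.

```python
def effective_blocks(content: list[dict]) -> list[dict]:
    """Return the blocks the API treats as effective: the last
    compaction block (if any) plus every block after it."""
    last = None
    for i, block in enumerate(content):
        if block.get("type") == "compaction":
            last = i
    return content if last is None else content[last:]


content = [
    {"type": "text", "text": "old reply"},
    {"type": "compaction", "content": "summary v1"},
    {"type": "text", "text": "newer reply"},
    {"type": "compaction", "content": "summary v2"},
    {"type": "text", "text": "latest reply"},
]
# Only "summary v2" and "latest reply" remain effective
print(effective_blocks(content))
```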
When streaming a response with compaction enabled, compaction blocks stream differently from text blocks: you receive a content_block_start event, then a single content_block_delta containing the complete summary content (there is no incremental streaming), and finally a content_block_stop event.
Compaction works well with prompt caching. You can add a cache_control breakpoint on the compaction block to cache the summary content. The original content that was compacted away is ignored.
```json
{
  "role": "assistant",
  "content": [
    {
      "type": "compaction",
      "content": "[summary text]",
      "cache_control": { "type": "ephemeral" }
    },
    {
      "type": "text",
      "text": "Based on our conversation..."
    }
  ]
}
```

When compaction occurs, the summary becomes new content that must be written to the cache. Without an additional cache breakpoint, this also invalidates any cached system prompt, which then has to be re-cached along with the compaction summary.
To maximize cache hits, add a cache_control breakpoint at the end of your system prompt. This caches the system prompt separately from the conversation, so it survives compaction. This approach is especially beneficial for long system prompts, which remain cached even across multiple compaction events within a conversation.

Compaction requires an additional sampling step, which has implications for rate limits and billing. The API returns detailed usage information in the response:
```json
{
  "usage": {
    "input_tokens": 23000,
    "output_tokens": 1000,
    "iterations": [
      {
        "type": "compaction",
        "input_tokens": 180000,
        "output_tokens": 3500
      },
      {
        "type": "message",
        "input_tokens": 23000,
        "output_tokens": 1000
      }
    ]
  }
}
```

The iterations array shows usage for each sampling iteration. When compaction occurs, you see a compaction iteration followed by the main message iteration. In this example, the top-level input_tokens and output_tokens exactly match the message iteration because there is only one non-compaction iteration. The final iteration's token counts reflect the effective context size after compaction.
The top-level input_tokens and output_tokens do not include compaction iteration usage; they reflect the sum of all non-compaction iterations. To compute the total tokens consumed and billed for a request, sum all entries in the usage.iterations array.

If you previously relied on usage.input_tokens and usage.output_tokens for cost tracking or auditing, update your tracking logic to aggregate over usage.iterations when compaction is enabled. The iterations array is only populated when a new compaction is triggered during the request; re-applying a previous compaction block incurs no additional compaction cost, and in that case the top-level usage fields remain accurate.
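As a concrete sketch of that aggregation (the `total_tokens` helper is hypothetical, not part of the SDK), the function below sums over usage.iterations when it is populated and falls back to the top-level fields otherwise:

```python
def total_tokens(usage: dict) -> tuple[int, int]:
    """Sum billed input/output tokens across all sampling iterations.
    Falls back to the top-level fields when `iterations` is absent,
    i.e. when no new compaction was triggered by the request."""
    iterations = usage.get("iterations")
    if not iterations:
        return usage["input_tokens"], usage["output_tokens"]
    return (
        sum(it["input_tokens"] for it in iterations),
        sum(it["output_tokens"] for it in iterations),
    )


# The usage payload from the example above
usage = {
    "input_tokens": 23000,
    "output_tokens": 1000,
    "iterations": [
        {"type": "compaction", "input_tokens": 180000, "output_tokens": 3500},
        {"type": "message", "input_tokens": 23000, "output_tokens": 1000},
    ],
}
# Totals include the compaction iteration: (203000, 4500)
print(total_tokens(usage))
```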
When using server tools such as web search, the compaction trigger is checked at the start of each sampling iteration. Depending on your trigger threshold and how much output is generated, compaction may occur multiple times within a single request.

The token counting endpoint (/v1/messages/count_tokens) applies existing compaction blocks in the prompt but does not trigger new compaction. Use it to check the effective token count after a previous compaction:

Below is a complete example of a long-running conversation with compaction:

Below is an example that uses pause_after_compaction to preserve the previous exchange and the current user message (three messages in total) verbatim, rather than summarizing them:
```python
# Basic usage: enable compaction with default settings
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Help me build a website"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)

# Append the response (including any compaction block) to continue the conversation
messages.append({"role": "assistant", "content": response.content})
```

```python
# Configure a custom trigger threshold
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [
            {
                "type": "compact_20260112",
                "trigger": {"type": "input_tokens", "value": 150000},
            }
        ]
    },
)
```

```python
# Provide custom summarization instructions (replaces the default prompt)
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [
            {
                "type": "compact_20260112",
                "instructions": "Focus on preserving code snippets, variable names, and technical decisions.",
            }
        ]
    },
)
```

```python
# Pause after compaction, then continue the request
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [{"type": "compact_20260112", "pause_after_compaction": True}]
    },
)

# Check if compaction triggered a pause
if response.stop_reason == "compaction":
    # Response contains only the compaction block
    messages.append({"role": "assistant", "content": response.content})

    # Continue the request
    response = client.beta.messages.create(
        betas=["compact-2026-01-12"],
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=messages,
        context_management={"edits": [{"type": "compact_20260112"}]},
    )
```

```python
# Pass the compaction block back to continue the conversation
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)

# After receiving a response with a compaction block
messages.append({"role": "assistant", "content": response.content})

# Continue the conversation
messages.append({"role": "user", "content": "Now add error handling"})

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)
```

```python
# Streaming with compaction
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

with client.beta.messages.stream(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "compaction":
                print("Compaction started...")
            elif event.content_block.type == "text":
                print("Text response started...")
        elif event.type == "content_block_delta":
            if event.delta.type == "compaction_delta":
                print(f"Compaction complete: {len(event.delta.content)} chars")
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)

    # Get the final accumulated message
    message = stream.get_final_message()
    messages.append({"role": "assistant", "content": message.content})
```

```python
# Cache the system prompt separately from the conversation
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": "You are a helpful coding assistant...",
            "cache_control": {
                "type": "ephemeral"
            },  # Cache the system prompt separately
        }
    ],
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)
```

```python
# Count tokens with existing compaction blocks applied
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

count_response = client.beta.messages.count_tokens(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)

print(f"Current tokens: {count_response.input_tokens}")
print(f"Original tokens: {count_response.context_management.original_input_tokens}")
```

```python
# Complete example: a long-running conversation with compaction
import anthropic

client = anthropic.Anthropic()

messages: list[dict] = []


def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})

    response = client.beta.messages.create(
        betas=["compact-2026-01-12"],
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=messages,
        context_management={
            "edits": [
                {
                    "type": "compact_20260112",
                    "trigger": {"type": "input_tokens", "value": 100000},
                }
            ]
        },
    )

    # Append response (compaction blocks are automatically included)
    messages.append({"role": "assistant", "content": response.content})

    # Return the text content
    return next(block.text for block in response.content if block.type == "text")


# Run a long conversation
print(chat("Help me build a Python web scraper"))
print(chat("Add support for JavaScript-rendered pages"))
print(chat("Now add rate limiting and error handling"))
# ... continue as long as needed
```

```python
# Pause after compaction and preserve the last three messages verbatim
from typing import Any

import anthropic

client = anthropic.Anthropic()

messages: list[dict[str, Any]] = []


def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})

    response = client.beta.messages.create(
        betas=["compact-2026-01-12"],
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=messages,
        context_management={
            "edits": [
                {
                    "type": "compact_20260112",
                    "trigger": {"type": "input_tokens", "value": 100000},
                    "pause_after_compaction": True,
                }
            ]
        },
    )

    # Check if compaction occurred and paused
    if response.stop_reason == "compaction":
        # Get the compaction block from the response
        compaction_block = response.content[0]

        # Preserve the prior exchange + current user message (3 messages)
        # by including them after the compaction block
        preserved_messages = messages[-3:] if len(messages) >= 3 else messages

        # Build new message list: compaction + preserved messages
        new_assistant_content = [compaction_block]
        messages_after_compaction = [
            {"role": "assistant", "content": new_assistant_content}
        ] + preserved_messages

        # Continue the request with the compacted context + preserved messages
        response = client.beta.messages.create(
            betas=["compact-2026-01-12"],
            model="claude-opus-4-7",
            max_tokens=4096,
            messages=messages_after_compaction,
            context_management={"edits": [{"type": "compact_20260112"}]},
        )

        # Update our message list to reflect the compaction
        messages.clear()
        messages.extend(messages_after_compaction)

    # Append the final response
    messages.append({"role": "assistant", "content": response.content})

    # Return the text content
    return next(block.text for block in response.content if block.type == "text")


# Run a long conversation
print(chat("Help me build a Python web scraper"))
print(chat("Add support for JavaScript-rendered pages"))
print(chat("Now add rate limiting and error handling"))
# ... continue as long as needed
```

Explore other strategies for managing conversation context, such as tool result clearing and thinking block clearing.