This feature is eligible for Zero Data Retention (ZDR). When your organization has a ZDR arrangement, data sent through this feature is not stored after the API response is returned.
Server-side compaction is the recommended strategy for managing context in long-running conversations and agentic workflows. It handles context management automatically with minimal integration work.

Compaction extends the effective context length of long-running conversations and tasks by automatically summarizing older context as you approach the context window limit. This is about more than staying under the token cap: as conversations grow longer, models struggle to maintain focus across the entire history. By replacing stale content with a concise summary, compaction keeps the active context focused and efficient.

For a deeper look at why long contexts degrade and how compaction helps, see Effective context engineering.

This is well suited for long-running conversations and multi-step agentic workflows.
Compaction is in beta. Include the beta header compact-2026-01-12 in your API requests to use this feature.

Compaction is supported on the following models:
- claude-mythos-preview
- claude-opus-4-7
- claude-opus-4-6
- claude-sonnet-4-6

With compaction enabled, Claude automatically summarizes your conversation when it approaches the configured token threshold. The API returns a compaction block in the response. On subsequent requests, append the response to your messages; the API automatically drops all message blocks that precede the compaction block and continues the conversation from the summary.
Enable compaction by adding the compact_20260112 strategy to context_management.edits in your Messages API request:
```python
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Help me build a website"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)

# Append the response (including any compaction block) to continue the conversation
messages.append({"role": "assistant", "content": response.content})
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| type | string | required | Must be "compact_20260112" |
| trigger | object | 150,000 tokens | When to trigger compaction. Must be at least 50,000 tokens. |
| pause_after_compaction | boolean | false | Whether to pause after the compaction summary is generated |
| instructions | string | null | Custom summarization prompt. When provided, fully replaces the default prompt. |
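Putting these parameters together, a compaction edit might look like the following sketch. The threshold and instruction text here are illustrative values chosen for this example, not defaults:

```python
# Sketch: a compaction edit combining every parameter (values are illustrative).
compaction_edit = {
    "type": "compact_20260112",
    # Compact once input exceeds 120k tokens (must be at least 50,000)
    "trigger": {"type": "input_tokens", "value": 120_000},
    # Pause after the summary is generated so extra content can be appended
    "pause_after_compaction": True,
    # Fully replaces the default summarization prompt
    "instructions": "Preserve file paths, open TODOs, and key technical decisions.",
}

# Pass this as the context_management argument to messages.create()
context_management = {"edits": [compaction_edit]}
print(context_management["edits"][0]["trigger"]["value"])  # 120000
```
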
Configure when compaction triggers with the trigger parameter:
```python
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [
            {
                "type": "compact_20260112",
                "trigger": {"type": "input_tokens", "value": 150000},
            }
        ]
    },
)
```

By default, compaction uses the following summarization prompt:
```
You have written a partial transcript for the initial task above. Please write a summary of the transcript. The purpose of this summary is to provide continuity so you can continue to make progress towards solving the task in a future context, where the raw history above may not be accessible and will be replaced with this summary. Write down anything that would be helpful, including the state, next steps, learnings etc. You must wrap your summary in a <summary></summary> block.
```

You can provide custom instructions via the instructions parameter. Custom instructions do not supplement the default prompt; they replace it entirely:
```python
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [
            {
                "type": "compact_20260112",
                "instructions": "Focus on preserving code snippets, variable names, and technical decisions.",
            }
        ]
    },
)
```

Use pause_after_compaction to pause after the API generates the compaction summary. This lets you add extra content blocks (for example, to preserve recent messages or specific instruction-oriented messages) before the API continues its response.
When enabled, the API returns a message with a compaction stop reason after generating the compaction block:
```python
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [{"type": "compact_20260112", "pause_after_compaction": True}]
    },
)

# Check if compaction triggered a pause
if response.stop_reason == "compaction":
    # Response contains only the compaction block
    messages.append({"role": "assistant", "content": response.content})
    # Continue the request
    response = client.beta.messages.create(
        betas=["compact-2026-01-12"],
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=messages,
        context_management={"edits": [{"type": "compact_20260112"}]},
    )
```

When the model works through a long task with many tool-use iterations, total token consumption can grow significantly. You can combine pause_after_compaction with a compaction counter to estimate cumulative usage and wrap up the task gracefully once a budget is reached:
```python
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

TRIGGER_THRESHOLD = 100_000
TOTAL_TOKEN_BUDGET = 3_000_000
n_compactions = 0

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [
            {
                "type": "compact_20260112",
                "trigger": {"type": "input_tokens", "value": TRIGGER_THRESHOLD},
                "pause_after_compaction": True,
            }
        ]
    },
)

if response.stop_reason == "compaction":
    n_compactions += 1
    messages.append({"role": "assistant", "content": response.content})
    # Estimate total tokens consumed; prompt wrap-up if over budget
    if n_compactions * TRIGGER_THRESHOLD >= TOTAL_TOKEN_BUDGET:
        messages.append(
            {
                "role": "user",
                "content": "Please wrap up your current work and summarize the final state.",
            }
        )
```

When compaction triggers, the API returns a compaction block at the start of the assistant response.
A long-running conversation may go through multiple compactions. The last compaction block reflects the final state of the prompt, replacing everything before it with the generated summary.
```json
{
  "content": [
    {
      "type": "compaction",
      "content": "Summary of the conversation: The user requested help building a web scraper..."
    },
    {
      "type": "text",
      "text": "Based on our conversation so far..."
    }
  ]
}
```

You must pass compaction blocks back to the API on subsequent requests to continue the conversation with the shortened prompt. The simplest approach is to append the entire response content to your messages:
```python
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)

# After receiving a response with a compaction block
messages.append({"role": "assistant", "content": response.content})

# Continue the conversation
messages.append({"role": "user", "content": "Now add error handling"})
response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)
```

When the API receives a compaction block, all content blocks before it are ignored, so you can keep sending the full conversation history unchanged; the API drops the superseded blocks for you.
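Because everything before the latest compaction block is ignored anyway, you can also trim those messages client-side to shrink the request payload. A minimal sketch, assuming messages whose content is a list of plain block dicts (the helper name is ours, not part of the SDK):

```python
def trim_before_compaction(messages: list[dict]) -> list[dict]:
    """Drop messages that precede the turn containing the latest compaction block.

    Assumes each message's content is either a string or a list of block dicts
    (e.g. response content converted to dicts via model_dump()).
    """
    last_idx = None
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if isinstance(content, list) and any(
            isinstance(block, dict) and block.get("type") == "compaction"
            for block in content
        ):
            last_idx = i
    # Keep everything from the latest compaction turn onward
    return messages if last_idx is None else messages[last_idx:]


history = [
    {"role": "user", "content": "old question"},
    {
        "role": "assistant",
        "content": [
            {"type": "compaction", "content": "summary"},
            {"type": "text", "text": "Based on our conversation..."},
        ],
    },
    {"role": "user", "content": "new question"},
]
print(len(trim_before_compaction(history)))  # 2
```

Either way the model sees the same effective prompt; trimming only reduces bytes on the wire.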
When streaming a response with compaction enabled, compaction blocks stream differently from text blocks. You receive a content_block_start event, then a single content_block_delta containing the complete summary content (no incremental streaming), and finally a content_block_stop event.
```python
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

with client.beta.messages.stream(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "compaction":
                print("Compaction started...")
            elif event.content_block.type == "text":
                print("Text response started...")
        elif event.type == "content_block_delta":
            if event.delta.type == "compaction_delta":
                print(f"Compaction complete: {len(event.delta.content)} chars")
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)

    # Get the final accumulated message
    message = stream.get_final_message()
    messages.append({"role": "assistant", "content": message.content})
```

Compaction works well with prompt caching. You can add a cache_control breakpoint on the compaction block to cache the summary content. The original pre-compaction content is ignored.
```json
{
  "role": "assistant",
  "content": [
    {
      "type": "compaction",
      "content": "[summary text]",
      "cache_control": { "type": "ephemeral" }
    },
    {
      "type": "text",
      "text": "Based on our conversation..."
    }
  ]
}
```

When compaction occurs, the summary becomes new content that must be written to the cache. Without an additional cache breakpoint, this also invalidates any cached system prompt, which then has to be re-cached along with the compaction summary.
To maximize cache hits, add a cache_control breakpoint at the end of your system prompt. This keeps the system prompt cache separate from the conversation, so it stays cached when compaction occurs:
```python
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": "You are a helpful coding assistant...",
            "cache_control": {
                "type": "ephemeral"
            },  # Cache the system prompt separately
        }
    ],
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)
```

This approach is especially beneficial for long system prompts, which remain cached even as multiple compaction events occur over the conversation.
Compaction requires an additional sampling step, which affects rate limits and billing. The API returns detailed usage information in the response:
```json
{
  "usage": {
    "input_tokens": 23000,
    "output_tokens": 1000,
    "iterations": [
      {
        "type": "compaction",
        "input_tokens": 180000,
        "output_tokens": 3500
      },
      {
        "type": "message",
        "input_tokens": 23000,
        "output_tokens": 1000
      }
    ]
  }
}
```

The iterations array shows usage for each sampling iteration. When compaction occurs, you see a compaction iteration followed by the main message iteration. In this example, the top-level input_tokens and output_tokens exactly match the message iteration because there is only one non-compaction iteration. The token counts of the last iteration reflect the effective context size after compaction.
The top-level input_tokens and output_tokens do not include compaction iteration usage; they reflect the sum of all non-compaction iterations. To calculate the total tokens consumed and billed for a request, sum all entries in the usage.iterations array.

If you previously relied on usage.input_tokens and usage.output_tokens for cost tracking or auditing, update your tracking logic to aggregate over usage.iterations when compaction is enabled. The iterations array is only populated when a new compaction is triggered during the request; re-applying a previous compaction block incurs no additional compaction cost, and in that case the top-level usage fields remain accurate.
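That aggregation can be sketched as follows. The field names match the usage payload shown above; the helper itself is ours, not part of the SDK:

```python
def total_usage(usage: dict) -> tuple[int, int]:
    """Sum billed input/output tokens across all sampling iterations.

    Falls back to the top-level fields when no new compaction occurred
    (iterations absent or empty), matching the behavior described above.
    """
    iterations = usage.get("iterations") or []
    if not iterations:
        return usage["input_tokens"], usage["output_tokens"]
    total_in = sum(it["input_tokens"] for it in iterations)
    total_out = sum(it["output_tokens"] for it in iterations)
    return total_in, total_out


# Using the example usage payload shown above:
usage = {
    "input_tokens": 23000,
    "output_tokens": 1000,
    "iterations": [
        {"type": "compaction", "input_tokens": 180000, "output_tokens": 3500},
        {"type": "message", "input_tokens": 23000, "output_tokens": 1000},
    ],
}
print(total_usage(usage))  # (203000, 4500)
```
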
When using server tools such as web search, the compaction trigger is checked at the start of each sampling iteration. Depending on your trigger threshold and how much output is generated, compaction may occur multiple times within a single request.

The token counting endpoint (/v1/messages/count_tokens) applies any existing compaction blocks in the prompt but does not trigger new compaction. Use it to check the effective token count after a previous compaction:
```python
import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Hello, Claude"}]

count_response = client.beta.messages.count_tokens(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)

print(f"Current tokens: {count_response.input_tokens}")
print(f"Original tokens: {count_response.context_management.original_input_tokens}")
```

Here is a complete example of a long-running conversation with compaction:
```python
import anthropic

client = anthropic.Anthropic()

messages: list[dict] = []


def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})
    response = client.beta.messages.create(
        betas=["compact-2026-01-12"],
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=messages,
        context_management={
            "edits": [
                {
                    "type": "compact_20260112",
                    "trigger": {"type": "input_tokens", "value": 100000},
                }
            ]
        },
    )
    # Append the response (the compaction block is included automatically)
    messages.append({"role": "assistant", "content": response.content})
    # Return the text content
    return next(block.text for block in response.content if block.type == "text")


# Run a long conversation
print(chat("Help me build a Python web scraper"))
print(chat("Add support for JavaScript-rendered pages"))
print(chat("Now add rate limiting and error handling"))
# ... continue as needed
```

Here is an example using pause_after_compaction to preserve the previous exchange and the current user message (three messages in total) verbatim rather than having them summarized:
```python
from typing import Any

import anthropic

client = anthropic.Anthropic()

messages: list[dict[str, Any]] = []


def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})
    response = client.beta.messages.create(
        betas=["compact-2026-01-12"],
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=messages,
        context_management={
            "edits": [
                {
                    "type": "compact_20260112",
                    "trigger": {"type": "input_tokens", "value": 100000},
                    "pause_after_compaction": True,
                }
            ]
        },
    )

    # Check whether compaction occurred and paused
    if response.stop_reason == "compaction":
        # Get the compaction block from the response
        compaction_block = response.content[0]
        # Preserve the previous exchange + current user message (3 messages)
        # by including them after the compaction block
        preserved_messages = messages[-3:] if len(messages) >= 3 else messages
        # Build the new message list: compaction + preserved messages
        new_assistant_content = [compaction_block]
        messages_after_compaction = [
            {"role": "assistant", "content": new_assistant_content}
        ] + preserved_messages
        # Continue the request with the compacted context + preserved messages
        response = client.beta.messages.create(
            betas=["compact-2026-01-12"],
            model="claude-opus-4-7",
            max_tokens=4096,
            messages=messages_after_compaction,
            context_management={"edits": [{"type": "compact_20260112"}]},
        )
        # Update our message list to reflect the compaction
        messages.clear()
        messages.extend(messages_after_compaction)

    # Append the final response
    messages.append({"role": "assistant", "content": response.content})
    # Return the text content
    return next(block.text for block in response.content if block.type == "text")


# Run a long conversation
print(chat("Help me build a Python web scraper"))
print(chat("Add support for JavaScript-rendered pages"))
print(chat("Now add rate limiting and error handling"))
# ... continue as needed
```

Explore a practical implementation that uses background threads and prompt caching to manage long-running conversations with just-in-time in-session memory compaction.
Learn about context window sizes and management strategies.

Explore other strategies for managing conversation context, such as tool result clearing and thinking block clearing.