
Compaction

Server-side context compaction for managing long conversations that approach the context window limit.


This feature is eligible for Zero Data Retention (ZDR). When your organization has a ZDR arrangement, data sent through this feature is not stored after the API response is returned.

Server-side compaction is the recommended strategy for managing context in long-running conversations and agentic workflows. It handles context management automatically, with minimal integration effort.

Compaction extends the effective context length of long-running conversations and tasks by automatically summarizing older context as the conversation approaches the context window limit. This is about more than staying under the token cap: as conversations grow longer, models struggle to stay focused across the entire history. By replacing stale content with a concise summary, compaction keeps the active context focused and performant.

For a deeper look at why long contexts degrade and how compaction helps, see Effective context engineering.

This is a good fit for:

  • Chat-style multi-turn conversations where a user works in a single chat over a long period
  • Task-oriented prompts that require substantial follow-up work (typically tool use) and may exceed the context window

Compaction is in beta. Include the beta header compact-2026-01-12 in your API requests to use this feature.

Supported models

Compaction is supported on the following models:

  • Claude Mythos Preview (claude-mythos-preview)
  • Claude Opus 4.7 (claude-opus-4-7)
  • Claude Opus 4.6 (claude-opus-4-6)
  • Claude Sonnet 4.6 (claude-sonnet-4-6)

How compaction works

With compaction enabled, Claude automatically summarizes your conversation when it approaches the configured token threshold. The API:

  1. Detects when input tokens exceed your specified trigger threshold.
  2. Generates a summary of the conversation so far.
  3. Creates a compaction block containing the summary.
  4. Continues responding using the compacted context.

On subsequent requests, append the response to your messages. The API automatically drops all message blocks that precede the compaction block and continues the conversation from the summary.

[Flow diagram: when input tokens exceed the trigger threshold, Claude generates a summary in a compaction block and continues responding with the compacted context.]

Basic usage

Enable compaction by adding the compact_20260112 strategy to context_management.edits in your Messages API request.

Parameters

  • type (string, required): must be "compact_20260112"
  • trigger (object, default: 150,000 tokens): when compaction triggers. Must be at least 50,000 tokens.
  • pause_after_compaction (boolean, default: false): whether to pause after the compaction summary is generated
  • instructions (string, default: null): custom summarization prompt. When provided, it fully replaces the default prompt.
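For illustration, the parameters above can be combined into a single edit. The threshold and instruction text below are hypothetical examples, not recommendations:

```python
# Hypothetical fully-specified compaction edit combining all four parameters
compaction_edit = {
    "type": "compact_20260112",
    "trigger": {"type": "input_tokens", "value": 120_000},  # must be >= 50,000
    "pause_after_compaction": True,
    "instructions": "Preserve file paths, open TODOs, and key decisions.",
}

# Passed to the Messages API as:
context_management = {"edits": [compaction_edit]}
```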

Trigger configuration

Use the trigger parameter to configure when compaction triggers:

Custom summarization instructions

By default, compaction uses the following summarization prompt:

You have written a partial transcript for the initial task above. Please write a summary of the transcript. The purpose of this summary is to provide continuity so you can continue to make progress towards solving the task in a future context, where the raw history above may not be accessible and will be replaced with this summary. Write down anything that would be helpful, including the state, next steps, learnings etc. You must wrap your summary in a <summary></summary> block.

You can provide custom instructions via the instructions parameter to replace this prompt entirely. Custom instructions do not supplement the default; they replace it completely:

Pausing after compaction

Use pause_after_compaction to pause after the API generates the compaction summary. This lets you add extra content blocks before the API continues responding (for example, to preserve recent messages or specific instruction-bearing messages).

When enabled, the API returns a message with a compaction stop reason after generating the compaction block:

Enforcing a total token budget

When the model works on a long task with many tool-use iterations, total token consumption can grow significantly. You can combine pause_after_compaction with a compaction counter to estimate cumulative usage and wrap up the task gracefully once a budget is reached:

Python
import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Hello, Claude"}]
TRIGGER_THRESHOLD = 100_000
TOTAL_TOKEN_BUDGET = 3_000_000
n_compactions = 0

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [
            {
                "type": "compact_20260112",
                "trigger": {"type": "input_tokens", "value": TRIGGER_THRESHOLD},
                "pause_after_compaction": True,
            }
        ]
    },
)

if response.stop_reason == "compaction":
    n_compactions += 1
    messages.append({"role": "assistant", "content": response.content})

    # Estimate total tokens consumed; prompt wrap-up if over budget
    if n_compactions * TRIGGER_THRESHOLD >= TOTAL_TOKEN_BUDGET:
        messages.append(
            {
                "role": "user",
                "content": "Please wrap up your current work and summarize the final state.",
            }
        )

Working with compaction blocks

When compaction triggers, the API returns a compaction block at the start of the assistant response.

Long-running conversations may lead to multiple compactions. The final compaction block reflects the final state of the prompt, replacing everything before it with the generated summary.

Output
{
  "content": [
    {
      "type": "compaction",
      "content": "Summary of the conversation: The user requested help building a web scraper..."
    },
    {
      "type": "text",
      "text": "Based on our conversation so far..."
    }
  ]
}

Passing compaction blocks back

You must pass the compaction block back to the API on subsequent requests to continue the conversation with the shortened prompt. The simplest approach is to append the entire response content to your messages:

When the API receives a compaction block, all content blocks before it are ignored. You can either:

  • Keep the original messages in the list and let the API handle dropping the compacted content
  • Manually remove the compacted messages and include only the compaction block and everything after it
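The second, manual-trimming option can be sketched as a small helper. This assumes raw-dict content blocks with a "type" key, as in the API's JSON representation; adapt it for SDK objects as needed:

```python
def trim_before_compaction(blocks: list[dict]) -> list[dict]:
    """Keep only the last compaction block and everything after it.

    If no compaction block is present, the list is returned unchanged.
    """
    last_compaction = None
    for i, block in enumerate(blocks):
        if block.get("type") == "compaction":
            last_compaction = i
    return blocks if last_compaction is None else blocks[last_compaction:]
```

For example, given [text, compaction, text], the helper drops the leading text block and keeps the compaction block and everything after it.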

Streaming

When streaming a response with compaction enabled, compaction blocks stream differently from text blocks: you receive a content_block_start event when compaction begins, then a single content_block_delta containing the complete summary content (no incremental streaming), and finally a content_block_stop event.

Prompt caching

Compaction works well with prompt caching. You can add a cache_control breakpoint on the compaction block to cache the summary content. The original, compacted-away content is ignored.

{
  "role": "assistant",
  "content": [
    {
      "type": "compaction",
      "content": "[summary text]",
      "cache_control": { "type": "ephemeral" }
    },
    {
      "type": "text",
      "text": "Based on our conversation..."
    }
  ]
}

Maximizing cache hits with system prompts

When compaction occurs, the summary becomes new content that must be written to the cache. Without an additional cache breakpoint, this also invalidates any cached system prompt, which must then be re-cached along with the compaction summary.

To maximize cache hits, add a cache_control breakpoint at the end of your system prompt. This caches the system prompt separately from the conversation, so when compaction occurs:

  • The system prompt cache stays valid and is read from cache
  • Only the compaction summary needs to be written as a new cache entry

This approach is especially beneficial for long system prompts, which stay cached even across multiple compaction events in a conversation.

Understanding usage

Compaction requires an additional sampling step, which has implications for rate limits and billing. The API returns detailed usage information in the response:

Output
{
  "usage": {
    "input_tokens": 23000,
    "output_tokens": 1000,
    "iterations": [
      {
        "type": "compaction",
        "input_tokens": 180000,
        "output_tokens": 3500
      },
      {
        "type": "message",
        "input_tokens": 23000,
        "output_tokens": 1000
      }
    ]
  }
}

The iterations array shows usage for each sampling iteration. When compaction occurs, you see a compaction iteration followed by the main message iteration. In this example, the top-level input_tokens and output_tokens exactly match the message iteration because there is only one non-compaction iteration. The final iteration's token counts reflect the effective content size after compaction.

The top-level input_tokens and output_tokens do not include compaction iteration usage; they reflect the sum of all non-compaction iterations. To compute the total tokens consumed and billed for a request, sum over all entries in the usage.iterations array.

If you previously relied on usage.input_tokens and usage.output_tokens for cost tracking or auditing, you will need to update your tracking logic to aggregate over usage.iterations when compaction is enabled. The iterations array is only populated when a new compaction is triggered during the request. Re-applying a previous compaction block incurs no additional compaction cost, and in that case the top-level usage fields remain accurate.
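That aggregation can be sketched as follows, assuming the usage object has been converted to a plain dict (the iterations key is absent when no new compaction was triggered):

```python
def total_billed_tokens(usage: dict) -> tuple[int, int]:
    """Sum input and output tokens across every sampling iteration.

    Falls back to the top-level fields when the iterations array is
    absent, i.e. when no new compaction occurred during the request.
    """
    iterations = usage.get("iterations")
    if not iterations:
        return usage["input_tokens"], usage["output_tokens"]
    return (
        sum(it["input_tokens"] for it in iterations),
        sum(it["output_tokens"] for it in iterations),
    )
```

With the sample response above, this yields 203,000 input and 4,500 output tokens, versus the 23,000 and 1,000 reported in the top-level fields.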

Combining with other features

Server tools

When using server tools (such as web search), the compaction trigger is checked at the start of each sampling iteration. Depending on your trigger threshold and how much output is generated, compaction may occur multiple times within a single request.

Token counting

The token counting endpoint (/v1/messages/count_tokens) applies existing compaction blocks in the prompt but does not trigger new compaction. Use it to check the effective token count after a previous compaction:

Examples

Here is a complete example of a long-running conversation with compaction:

Here is an example that uses pause_after_compaction to preserve the previous exchange and the current user message (three messages in total) verbatim, rather than summarizing them:

Current limitations

  • Summarization uses the same model: the model specified in your request is used for summarization. There is no option to use a different (for example, cheaper) model for summarization.

Next steps

Session memory compaction cookbook

Explore a practical implementation that uses a background thread and prompt caching to manage long-running conversations with real-time session memory compaction.

Context windows

Learn about context window sizes and management strategies.

Python (basic usage)

import anthropic

client = anthropic.Anthropic()

messages = [{"role": "user", "content": "Help me build a website"}]

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)

# Append the response (including any compaction block) to continue the conversation
messages.append({"role": "assistant", "content": response.content})

Python (trigger configuration)

import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Hello, Claude"}]
response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [
            {
                "type": "compact_20260112",
                "trigger": {"type": "input_tokens", "value": 150000},
            }
        ]
    },
)

Python (custom summarization instructions)

import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Hello, Claude"}]
response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [
            {
                "type": "compact_20260112",
                "instructions": "Focus on preserving code snippets, variable names, and technical decisions.",
            }
        ]
    },
)

Python (pausing after compaction)

import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Hello, Claude"}]
response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={
        "edits": [{"type": "compact_20260112", "pause_after_compaction": True}]
    },
)

# Check if compaction triggered a pause
if response.stop_reason == "compaction":
    # Response contains only the compaction block
    messages.append({"role": "assistant", "content": response.content})

    # Continue the request
    response = client.beta.messages.create(
        betas=["compact-2026-01-12"],
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=messages,
        context_management={"edits": [{"type": "compact_20260112"}]},
    )

Python (passing compaction blocks back)

import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Hello, Claude"}]
response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)
# After receiving a response with a compaction block
messages.append({"role": "assistant", "content": response.content})

# Continue the conversation
messages.append({"role": "user", "content": "Now add error handling"})

response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)

Python (streaming)

import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Hello, Claude"}]

with client.beta.messages.stream(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "compaction":
                print("Compaction started...")
            elif event.content_block.type == "text":
                print("Text response started...")

        elif event.type == "content_block_delta":
            if event.delta.type == "compaction_delta":
                print(f"Compaction complete: {len(event.delta.content)} chars")
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)

    # Get the final accumulated message
    message = stream.get_final_message()
    messages.append({"role": "assistant", "content": message.content})

Python (caching the system prompt separately)

import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Hello, Claude"}]
response = client.beta.messages.create(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": "You are a helpful coding assistant...",
            "cache_control": {
                "type": "ephemeral"
            },  # Cache the system prompt separately
        }
    ],
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)

Python (token counting)

import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Hello, Claude"}]
count_response = client.beta.messages.count_tokens(
    betas=["compact-2026-01-12"],
    model="claude-opus-4-7",
    messages=messages,
    context_management={"edits": [{"type": "compact_20260112"}]},
)

print(f"Current tokens: {count_response.input_tokens}")
print(f"Original tokens: {count_response.context_management.original_input_tokens}")

Python (complete long-running conversation)

import anthropic

client = anthropic.Anthropic()

messages: list[dict] = []


def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})

    response = client.beta.messages.create(
        betas=["compact-2026-01-12"],
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=messages,
        context_management={
            "edits": [
                {
                    "type": "compact_20260112",
                    "trigger": {"type": "input_tokens", "value": 100000},
                }
            ]
        },
    )

    # Append response (compaction blocks are automatically included)
    messages.append({"role": "assistant", "content": response.content})

    # Return the text content
    return next(block.text for block in response.content if block.type == "text")


# Run a long conversation
print(chat("Help me build a Python web scraper"))
print(chat("Add support for JavaScript-rendered pages"))
print(chat("Now add rate limiting and error handling"))
# ... continue as long as needed

Python (preserving recent messages with pause_after_compaction)

from typing import Any

import anthropic

client = anthropic.Anthropic()

messages: list[dict[str, Any]] = []


def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})

    response = client.beta.messages.create(
        betas=["compact-2026-01-12"],
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=messages,
        context_management={
            "edits": [
                {
                    "type": "compact_20260112",
                    "trigger": {"type": "input_tokens", "value": 100000},
                    "pause_after_compaction": True,
                }
            ]
        },
    )

    # Check if compaction occurred and paused
    if response.stop_reason == "compaction":
        # Get the compaction block from the response
        compaction_block = response.content[0]

        # Preserve the prior exchange + current user message (3 messages)
        # by including them after the compaction block
        preserved_messages = messages[-3:] if len(messages) >= 3 else messages

        # Build new message list: compaction + preserved messages
        new_assistant_content = [compaction_block]
        messages_after_compaction = [
            {"role": "assistant", "content": new_assistant_content}
        ] + preserved_messages

        # Continue the request with the compacted context + preserved messages
        response = client.beta.messages.create(
            betas=["compact-2026-01-12"],
            model="claude-opus-4-7",
            max_tokens=4096,
            messages=messages_after_compaction,
            context_management={"edits": [{"type": "compact_20260112"}]},
        )

        # Update our message list to reflect the compaction
        messages.clear()
        messages.extend(messages_after_compaction)

    # Append the final response
    messages.append({"role": "assistant", "content": response.content})

    # Return the text content
    return next(block.text for block in response.content if block.type == "text")


# Run a long conversation
print(chat("Help me build a Python web scraper"))
print(chat("Add support for JavaScript-rendered pages"))
print(chat("Now add rate limiting and error handling"))
# ... continue as long as needed
Context editing

Explore other strategies for managing conversation context, such as tool result clearing and thinking block clearing.