Speculative Prompt Caching
This cookbook demonstrates "Speculative Prompt Caching" - a pattern that reduces time-to-first-token (TTFT) by warming up the cache while users are still formulating their queries.
Without Speculative Caching:
- User types their question (3 seconds)
- User submits question
- API loads context into cache AND generates response
With Speculative Caching:
- User starts typing (cache warming begins immediately)
- User continues typing (cache warming completes in background)
- User submits question
- API uses warm cache to generate response
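The core of the pattern fits in a few lines: run a throwaway one-token request (which creates the prompt cache entry) concurrently with waiting for the user's input, then send the real request against the identical context. Below is a minimal sketch, where get_user_question stands in for your UI's input handling and the model/max_tokens defaults are illustrative; the rest of this notebook builds and measures the full version.
```python
import asyncio

from anthropic import AsyncAnthropic


async def speculative_ask(
    client: AsyncAnthropic,
    context_message: dict,
    get_user_question,
    model: str = "claude-sonnet-4-5",
):
    """Warm the prompt cache while the user types, then answer with a warm cache."""

    async def warm_cache() -> None:
        # A throwaway 1-token request whose only purpose is to create the
        # prompt cache entry for the large context block (which must carry
        # cache_control: {"type": "ephemeral"}).
        await client.messages.create(
            model=model, max_tokens=1, messages=[context_message]
        )

    # Kick off cache warming and wait for the user's question concurrently.
    warm_task = asyncio.create_task(warm_cache())
    question = await get_user_question()
    await warm_task  # ensure the cache entry exists before the real request

    # Reuse the identical context so the request prefix matches the cache,
    # and append the user's question after the cached block.
    full_message = {
        "role": context_message["role"],
        "content": context_message["content"] + [{"type": "text", "text": question}],
    }
    return await client.messages.create(
        model=model, max_tokens=4096, messages=[full_message]
    )
```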
Setup
First, let's install the required packages:
%pip install anthropic httpx --quiet

Note: you may need to restart the kernel to use updated packages.
import asyncio
import copy
import datetime
import time
import httpx
from anthropic import AsyncAnthropic
# Configuration constants
MODEL = "claude-sonnet-4-5"
SQLITE_SOURCES = {
"btree.h": "https://sqlite.org/src/raw/18e5e7b2124c23426a283523e5f31a4bff029131b795bb82391f9d2f3136fc50?at=btree.h",
"btree.c": "https://sqlite.org/src/raw/63ca6b647342e8cef643863cd0962a542f133e1069460725ba4461dcda92b03c?at=btree.c",
}
DEFAULT_CLIENT_ARGS = {
"system": "You are an expert systems programmer helping analyze database internals.",
"max_tokens": 4096,
"temperature": 0,
}

Helper Functions
Let's set up the functions to download our large context and prepare messages:
async def get_sqlite_sources() -> dict[str, str]:
print("Downloading SQLite source files...")
source_files = {}
start_time = time.time()
async with httpx.AsyncClient(timeout=30.0) as client:
tasks = []
async def download_file(filename: str, url: str) -> tuple[str, str]:
response = await client.get(url, follow_redirects=True)
response.raise_for_status()
print(f"Successfully downloaded {filename}")
return filename, response.text
for filename, url in SQLITE_SOURCES.items():
tasks.append(download_file(filename, url))
results = await asyncio.gather(*tasks)
source_files = dict(results)
duration = time.time() - start_time
print(f"Downloaded {len(source_files)} files in {duration:.2f} seconds")
return source_files
async def create_initial_message():
sources = await get_sqlite_sources()
# Prepare the initial message with the source code as context.
    # A timestamp is included to prevent cache sharing across different runs.
initial_message = {
"role": "user",
"content": [
{
"type": "text",
"text": f"""
Current time: {datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
Source to Analyze:
btree.h:
```c
{sources["btree.h"]}
```
btree.c:
```c
{sources["btree.c"]}
```""",
"cache_control": {"type": "ephemeral"},
}
],
}
return initial_message
async def sample_one_token(client: AsyncAnthropic, messages: list):
"""Send a single-token request to warm up the cache"""
args = copy.deepcopy(DEFAULT_CLIENT_ARGS)
args["max_tokens"] = 1
await client.messages.create(
messages=messages,
model=MODEL,
**args,
)
def print_query_statistics(response, query_type: str) -> None:
print(f"\n{query_type} query statistics:")
print(f"\tInput tokens: {response.usage.input_tokens}")
print(f"\tOutput tokens: {response.usage.output_tokens}")
print(f"\tCache read input tokens: {getattr(response.usage, 'cache_read_input_tokens', '---')}")
print(
f"\tCache creation input tokens: {getattr(response.usage, 'cache_creation_input_tokens', '---')}"
    )

Example 1: Standard Prompt Caching (Without Speculative Caching)
First, let's see how standard prompt caching works. The user types their question, then we send the entire context + question to the API:
async def standard_prompt_caching_demo():
client = AsyncAnthropic()
# Prepare the large context
initial_message = await create_initial_message()
# Simulate user typing time (in real app, this would be actual user input)
print("User is typing their question...")
await asyncio.sleep(3) # Simulate 3 seconds of typing
user_question = "What is the purpose of the BtShared structure?"
print(f"User submitted: {user_question}")
# Now send the full request (context + question)
full_message = copy.deepcopy(initial_message)
full_message["content"].append(
{"type": "text", "text": f"Answer the user's question: {user_question}"}
)
print("\nSending request to API...")
start_time = time.time()
# Measure time to first token
first_token_time = None
async with client.messages.stream(
messages=[full_message],
model=MODEL,
**DEFAULT_CLIENT_ARGS,
) as stream:
async for text in stream.text_stream:
if first_token_time is None and text.strip():
first_token_time = time.time() - start_time
print(f"\n🕐 Time to first token: {first_token_time:.2f} seconds")
break
# Get the full response
response = await stream.get_final_message()
total_time = time.time() - start_time
print(f"Total response time: {total_time:.2f} seconds")
print_query_statistics(response, "Standard Caching")
    return first_token_time, total_time

# Run the standard demo
standard_ttft, standard_total = await standard_prompt_caching_demo()

Downloading SQLite source files...
Successfully downloaded btree.h
Successfully downloaded btree.c
Downloaded 2 files in 0.30 seconds
User is typing their question...
User submitted: What is the purpose of the BtShared structure?

Sending request to API...

🕐 Time to first token: 20.87 seconds
Total response time: 28.32 seconds

Standard Caching query statistics:
    Input tokens: 22
    Output tokens: 362
    Cache read input tokens: 0
    Cache creation input tokens: 151629
Example 2: Speculative Prompt Caching
Now let's see how speculative prompt caching improves TTFT by warming the cache while the user is typing:
async def speculative_prompt_caching_demo():
client = AsyncAnthropic()
# The user has a large amount of context they want to interact with,
# in this case it's the sqlite b-tree implementation (~150k tokens).
initial_message = await create_initial_message()
# Start speculative caching while user is typing
print("User is typing their question...")
print("🔥 Starting cache warming in background...")
    # While the user is typing out their question, we sample a single token
    # against the context the user is going to be interacting with, with
    # explicit prompt caching enabled, to warm up the cache.
cache_task = asyncio.create_task(sample_one_token(client, [initial_message]))
# Simulate user typing time
await asyncio.sleep(3) # Simulate 3 seconds of typing
user_question = "What is the purpose of the BtShared structure?"
print(f"User submitted: {user_question}")
# Ensure cache warming is complete
await cache_task
print("✅ Cache warming completed!")
    # Prepare messages for the cached query. We reuse exactly the same
    # initial message that was cached to guarantee a cache hit.
cached_message = copy.deepcopy(initial_message)
cached_message["content"].append(
{"type": "text", "text": f"Answer the user's question: {user_question}"}
)
print("\nSending request to API (with warm cache)...")
start_time = time.time()
# Measure time to first token
first_token_time = None
async with client.messages.stream(
messages=[cached_message],
model=MODEL,
**DEFAULT_CLIENT_ARGS,
) as stream:
async for text in stream.text_stream:
if first_token_time is None and text.strip():
first_token_time = time.time() - start_time
print(f"\n🚀 Time to first token: {first_token_time:.2f} seconds")
break
# Get the full response
response = await stream.get_final_message()
total_time = time.time() - start_time
print(f"Total response time: {total_time:.2f} seconds")
print_query_statistics(response, "Speculative Caching")
    return first_token_time, total_time

# Run the speculative caching demo
speculative_ttft, speculative_total = await speculative_prompt_caching_demo()

Downloading SQLite source files...
Successfully downloaded btree.h
Successfully downloaded btree.c
Downloaded 2 files in 0.36 seconds
User is typing their question...
🔥 Starting cache warming in background...
User submitted: What is the purpose of the BtShared structure?
✅ Cache warming completed!

Sending request to API (with warm cache)...

🚀 Time to first token: 1.94 seconds
Total response time: 8.40 seconds

Speculative Caching query statistics:
    Input tokens: 22
    Output tokens: 330
    Cache read input tokens: 151629
    Cache creation input tokens: 0
Performance Comparison
Let's compare the results to see the benefit of speculative caching:
print("=" * 60)
print("PERFORMANCE COMPARISON")
print("=" * 60)
print("\nStandard Prompt Caching:")
print(f" Time to First Token: {standard_ttft:.2f} seconds")
print(f" Total Response Time: {standard_total:.2f} seconds")
print("\nSpeculative Prompt Caching:")
print(f" Time to First Token: {speculative_ttft:.2f} seconds")
print(f" Total Response Time: {speculative_total:.2f} seconds")
ttft_improvement = (standard_ttft - speculative_ttft) / standard_ttft * 100
total_improvement = (standard_total - speculative_total) / standard_total * 100
print("\n🎯 IMPROVEMENTS:")
print(
f" TTFT Improvement: {ttft_improvement:.1f}% ({standard_ttft - speculative_ttft:.2f}s faster)"
)
print(
f" Total Time Improvement: {total_improvement:.1f}% ({standard_total - speculative_total:.2f}s faster)"
)

============================================================
PERFORMANCE COMPARISON
============================================================

Standard Prompt Caching:
  Time to First Token: 20.87 seconds
  Total Response Time: 28.32 seconds

Speculative Prompt Caching:
  Time to First Token: 1.94 seconds
  Total Response Time: 8.40 seconds

🎯 IMPROVEMENTS:
  TTFT Improvement: 90.7% (18.93s faster)
  Total Time Improvement: 70.4% (19.92s faster)
Key Takeaways
- Speculative caching dramatically reduces TTFT by warming the cache while users are typing
- The pattern is most effective with large contexts (roughly 1,024+ tokens, the minimum cacheable prompt length) that are reused across queries
- Implementation is simple - just send a 1-token request while the user is typing
- Cache warming happens in parallel with user input, effectively "hiding" the cache creation time
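Because the cache entry stays warm for several minutes after it is last used, the same warmed context can also serve follow-up questions. Below is a sketch of a follow-up helper that reuses the functions defined above; the follow-up question text is only an example, and note that the same initial_message object must be reused, since create_initial_message() embeds a fresh timestamp on every call.
```python
async def ask_follow_up(client: AsyncAnthropic, initial_message: dict, question: str):
    """Ask another question against the already-cached context."""
    # Reuse the exact cached context and append the new question after it,
    # so the cached prefix is read instead of being re-processed.
    message = copy.deepcopy(initial_message)
    message["content"].append(
        {"type": "text", "text": f"Answer the user's question: {question}"}
    )
    response = await client.messages.create(
        messages=[message],
        model=MODEL,
        **DEFAULT_CLIENT_ARGS,
    )
    print_query_statistics(response, "Follow-up")
    return response


# Example usage: build the context once, warm it, then ask several questions.
# client = AsyncAnthropic()
# initial_message = await create_initial_message()
# await sample_one_token(client, [initial_message])
# await ask_follow_up(client, initial_message, "How are overflow pages handled?")
```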
Best Practices
- Start cache warming as early as possible (e.g., when a user focuses an input field)
- Use exactly the same context for warming and actual requests to ensure cache hits
- Monitor cache_read_input_tokens to verify cache hits
- Add timestamps to prevent unwanted cache sharing across sessions
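For the monitoring point, a small check like the sketch below (using the same usage fields that print_query_statistics reads) can flag requests where the speculative warm-up did not pay off, for example because the context changed between warming and the real request. The 1,024-token threshold is an illustrative default.
```python
def verify_cache_hit(response, min_cached_tokens: int = 1024) -> bool:
    """Return True if a meaningful prefix was read from the prompt cache."""
    usage = response.usage
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if cached >= min_cached_tokens:
        return True
    # A miss means the prefix was (re)written rather than read, which usually
    # indicates the warmed context and the real request did not match exactly.
    print(f"⚠️ Cache miss: read {cached} cached tokens, created {created}")
    return False
```
In the speculative demo above this returns True, with cache_read_input_tokens on the order of 150k.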
