
Halve your Claude API bill: when prompt caching actually works and when it silently doesn't

Prompt caching can knock 50 to 90 percent off your Claude API bill for the right workloads, but five common mistakes leave you paying full price while thinking you are saving money. Here is how to verify it is actually working.


You run a customer support agent built on Claude. Every request includes the same 8,000-token system prompt with your product documentation, plus the same set of tool definitions, plus the actual customer message. You are paying for 8,500 input tokens on every single request. The monthly bill is starting to hurt.

Anthropic's prompt caching feature is built for exactly this workload. Enable it on the stable parts of your prompt and you will pay full price on the first request, then 10 percent of the input token cost on every subsequent request within the cache TTL. If your agent handles 100 messages an hour during business hours, the maths works out to roughly an 80 percent reduction on input token spend. If you get it wrong, the maths works out to exactly 100 percent of your current bill plus a 25 percent surcharge on cache writes, and you will not see an error.

This post covers how to turn caching on properly, how to verify it is actually working on every request, and the five ways it silently fails.

Prerequisites

  • An Anthropic API key with access to Claude 3.5 Sonnet, Claude 3 Opus, or Claude 3 Haiku. Caching works on Sonnet and Opus with a 1024-token minimum, and on Haiku with a 2048-token minimum.
  • One API call currently in production that you want to optimise. Ideally it has a large-ish static context (system prompt, retrieval documents, tool definitions) and a small dynamic part (the actual user message or query).
  • The Anthropic Python SDK at version 0.30 or later, or the TypeScript SDK at 0.28 or later. Both added prompt caching support through the cache_control field.
  • A log or spreadsheet where you can record the three token count fields from every response (cache_creation_input_tokens, cache_read_input_tokens, input_tokens) so you can confirm the cache is hitting.
  • Estimated setup time for a single call: 15 minutes. Time to verify it is saving money: 48 hours of production traffic.

How the cache actually works

Prompt caching lets you mark specific content blocks in your request with cache_control: {"type": "ephemeral"}. Anthropic's servers hash the content of that block along with everything before it in the request. The next time a request comes in with the same hash prefix, the model reads from the cache instead of re-processing those tokens.

The pricing model decides whether caching is a win for you:

Operation                            Cost relative to normal input
Cache write (first request)          1.25x
Cache read (subsequent requests)     0.1x
Normal input tokens                  1.0x

Break-even point: a cache write costs an extra 0.25x over a normal request, while each cache read saves 0.9x, so a single read within the TTL already puts you slightly ahead (1.25x + 0.1x = 1.35x versus 2.0x for two uncached requests), and by the third or fourth read the saving dominates. The standard cache TTL is 5 minutes. Anthropic also offers an extended 1-hour cache at a higher write cost (2x instead of 1.25x).
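The break-even arithmetic is worth sanity-checking for your own numbers. A minimal sketch, pure arithmetic with the multipliers from the table above (no API calls; the helper names are illustrative):

```python
def caching_cost(total_requests: int, write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Cost of N requests with caching, in units of one uncached prefix:
    one cache write, then cache reads for the rest."""
    if total_requests == 0:
        return 0.0
    return write_mult + read_mult * (total_requests - 1)

def uncached_cost(total_requests: int) -> float:
    """Same N requests, paying the full prefix price every time."""
    return float(total_requests)

# One write plus one read (1.35x) already beats two uncached requests (2.0x).
for n in range(1, 5):
    print(n, caching_cost(n), uncached_cost(n))
```

Swap in the 2x write multiplier to model the extended 1-hour cache.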

How to build it

Step 1: Mark the stable parts of your prompt

The cache works by prefix matching. Everything up to and including a cache_control breakpoint is eligible for caching. Everything after it is not. The rule: put your breakpoint after the content you want cached, and make sure everything before it is byte-identical across requests.

Here is a typical support agent with a cached system prompt and cached tool definitions:

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a customer support agent for Acme Ltd.
Answer questions using only the product documentation below.

PRODUCT DOCUMENTATION:
<long document goes here, roughly 6000 tokens>
"""

TOOLS = [
    {
        "name": "lookup_order",
        "description": "Look up order details by order ID",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    # ... more tools
]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=TOOLS,
    messages=[
        {"role": "user", "content": "Where is my order #12345?"}
    ],
)

print(f"Cache creation: {response.usage.cache_creation_input_tokens}")
print(f"Cache read:     {response.usage.cache_read_input_tokens}")
print(f"Regular input:  {response.usage.input_tokens}")

On the first request, you will see something like:

Cache creation: 6400
Cache read:     0
Regular input:  120

On the second request with the same system prompt but a different user message:

Cache creation: 0
Cache read:     6400
Regular input:  95

That is the cache working. If cache_read_input_tokens stays at 0 on your second and third requests, something is wrong.

Step 2: Verify the cache is actually being hit

This is where most people stop and assume it is working. It probably is not. Log the three token counts on every request for the first 48 hours and plot the ratio of cache reads to total input tokens. You want to see this climb towards 85 to 95 percent on a workload with a stable prefix. If it plateaus at 0 or randomly jumps between cached and uncached, something in your prefix is changing.

Add this to your logging:

def log_cache_metrics(response):
    u = response.usage
    total_input = (
        u.input_tokens
        + (u.cache_creation_input_tokens or 0)
        + (u.cache_read_input_tokens or 0)
    )
    hit_ratio = (u.cache_read_input_tokens or 0) / total_input if total_input else 0
    print(f"Cache hit ratio: {hit_ratio:.1%}")
    print(
        f"Tokens: {u.cache_read_input_tokens} cached, "
        f"{u.cache_creation_input_tokens} written, "
        f"{u.input_tokens} fresh"
    )
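To watch the ratio climb over a window of traffic, aggregate the per-request counts rather than eyeballing individual responses. A minimal sketch (the three field names are the real usage fields; the helper and the sample numbers are illustrative):

```python
def cumulative_hit_ratio(usages: list[dict]) -> float:
    """Fraction of all input tokens served from cache across a batch of requests."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(
        u.get("input_tokens", 0)
        + u.get("cache_creation_input_tokens", 0)
        + u.get("cache_read_input_tokens", 0)
        for u in usages
    )
    return read / total if total else 0.0

# First request writes the cache, the next two read it.
usages = [
    {"input_tokens": 120, "cache_creation_input_tokens": 6400, "cache_read_input_tokens": 0},
    {"input_tokens": 95, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 6400},
    {"input_tokens": 110, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 6400},
]
print(f"{cumulative_hit_ratio(usages):.1%}")  # → 65.6%
```

Over a longer run with a stable prefix, the first-request write amortises away and this number should approach the 85 to 95 percent target.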

Step 3: Do the maths before you roll out

Before you ship caching to production, run a quick spreadsheet. Take your current monthly input token count and split it into "stable prefix" and "dynamic suffix". Estimate how many requests land within a 5-minute window. If the stable prefix is over 5,000 tokens and you routinely have more than 3 requests per TTL window, caching will halve your bill. If the prefix is under 2,000 tokens or your traffic is extremely spiky, it probably is not worth the setup time.
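That spreadsheet is simple enough to express directly. A sketch of the estimate, assuming every request after the first in a TTL window is a cache hit (optimistic, so treat the result as an upper bound; the example traffic numbers are made up):

```python
def monthly_input_cost_units(
    requests_per_window: int,
    windows_per_month: int,
    prefix_tokens: int,
    suffix_tokens: int,
    cached: bool,
) -> float:
    """Input cost in token units (1 unit = 1 token at the normal input rate)."""
    if not cached:
        return float(windows_per_month * requests_per_window * (prefix_tokens + suffix_tokens))
    per_window = (
        1.25 * prefix_tokens                                # one cache write per window
        + 0.1 * prefix_tokens * (requests_per_window - 1)   # reads for the rest
        + suffix_tokens * requests_per_window               # dynamic part is never cached
    )
    return windows_per_month * per_window

# Support agent: 8 requests per 5-minute window, ~2,600 active windows/month.
before = monthly_input_cost_units(8, 2600, 8000, 500, cached=False)
after = monthly_input_cost_units(8, 2600, 8000, 500, cached=True)
print(f"saving: {1 - after / before:.0%}")  # → saving: 71%
```

Multiply the unit counts by your model's per-token input price to get pounds and pence.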

The five silent failures

1. Something in your cached block changes between requests

The most common failure. Your system prompt contains the current date, or a session ID, or a user-specific greeting, or a timestamp for "as of" staleness checks. Each request writes a new cache entry. You never get a read. Fix: move all the changing content after the cache_control breakpoint, or into the messages array.
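One way to catch this failure before it costs you money is to fingerprint the exact bytes you intend to cache on every request and alert when the fingerprint changes. A minimal sketch (the hashing scheme is illustrative; Anthropic's server-side hash is not exposed):

```python
import hashlib
import json

def prefix_fingerprint(system_blocks: list[dict], tools: list[dict]) -> str:
    """Stable fingerprint of everything up to the cache breakpoint.
    If this changes between requests, every request is a cache write."""
    payload = json.dumps({"system": system_blocks, "tools": tools}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

stable = [{"type": "text", "text": "You are a support agent."}]
drifted = [{"type": "text", "text": "You are a support agent. Today is 2024-11-02."}]

assert prefix_fingerprint(stable, []) == prefix_fingerprint(stable, [])
assert prefix_fingerprint(stable, []) != prefix_fingerprint(drifted, [])
```

Log the fingerprint alongside the usage fields from Step 2; a fingerprint that changes on every request explains a hit ratio stuck at zero.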

2. Your cached block is below the minimum token threshold

For Claude 3.5 Sonnet and Claude 3 Opus, the minimum cacheable content is 1024 tokens. For Claude 3 Haiku, it is 2048 tokens. If your system prompt is 800 tokens, the cache_control marker is silently ignored and you pay full price with no error. Fix: check your token count with the Anthropic token counter before wiring up caching, and do not bother if your prefix is below the threshold.
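You can guard against this at startup. The SDK's token counting endpoint gives an exact figure; as a rough offline heuristic, English prose runs about four characters per token. A sketch using the heuristic (the thresholds come from the paragraph above; the 4-characters-per-token rule is an approximation, not the tokenizer):

```python
MIN_CACHEABLE = {"sonnet": 1024, "opus": 1024, "haiku": 2048}

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English prose."""
    return len(text) // 4

def cacheable(text: str, model_family: str) -> bool:
    """Below the minimum, cache_control is silently ignored: not an error,
    just full price."""
    return estimate_tokens(text) >= MIN_CACHEABLE[model_family]

short_prompt = "x" * 3200   # ~800 tokens: the marker would be ignored on Sonnet
long_prompt = "x" * 24000   # ~6,000 tokens: fine everywhere

print(cacheable(short_prompt, "sonnet"))  # False
print(cacheable(long_prompt, "haiku"))    # True
```

Treat a borderline heuristic result (within a few hundred tokens of the threshold) as a prompt to do the exact count before relying on the cache.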

3. Your cache TTL expired between requests

Standard cache entries live for 5 minutes. If your support agent handles one request every 10 minutes, every request is a cache miss. Two fixes: use the extended 1-hour cache (pay 2x on writes instead of 1.25x), or send a lightweight keep-alive request every 4 minutes to refresh the cache.
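Whether the keep-alive trick pays off depends on the length of your idle gaps, and the comparison is plain arithmetic. A sketch, assuming each keep-alive is a cache read at the 0.1x rate that resets the 5-minute TTL, and that a miss costs a full re-write:

```python
import math

READ_MULT, WRITE_MULT = 0.1, 1.25

def pings_needed(gap_min: float, ttl: float = 5.0, ping_every: float = 4.0) -> int:
    """Keep-alive reads needed to bridge one idle gap without a cache miss."""
    if gap_min <= ttl:
        return 0
    return math.ceil((gap_min - ttl) / ping_every)

def keepalive_wins(gap_min: float, prefix_tokens: int = 8000) -> bool:
    """Compare keep-alive reads against the extra cost of one cache re-write."""
    ping_cost = pings_needed(gap_min) * READ_MULT * prefix_tokens
    miss_penalty = (WRITE_MULT - READ_MULT) * prefix_tokens  # re-write vs read
    return ping_cost < miss_penalty

print(keepalive_wins(10))   # True: 2 pings at 0.1x beat a 1.15x re-write
print(keepalive_wins(60))   # False: 14 pings cost more than eating the miss
```

For gaps much beyond half an hour, the extended 1-hour cache is the cleaner answer.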

4. You put the breakpoint in the wrong place

The cache works on the whole prefix up to and including the breakpoint, not just the block with the marker. If you mark your tool definitions with cache_control but the system prompt above them changes between requests, the cache never hits because the prefix hash is different. Fix: put the breakpoint at the last stable point in your request, not the first.

5. You have more than four cache breakpoints

Anthropic allows up to four cache_control markers per request. If you have five, the extras are silently ignored. Fix: count them and consolidate.
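A request-builder lint for this failure takes only a few lines. A sketch that walks the system, tools, and message content arrays counting cache_control markers (the structure mirrors the request shape used in Step 1):

```python
def count_breakpoints(system_blocks: list[dict], tools: list[dict], messages: list[dict]) -> int:
    """Count cache_control markers across the parts of a request that accept them."""
    count = sum("cache_control" in b for b in system_blocks)
    count += sum("cache_control" in t for t in tools)
    for m in messages:
        content = m.get("content")
        if isinstance(content, list):  # string content can't carry a marker
            count += sum("cache_control" in b for b in content)
    return count

system = [{"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}}]
tools = [{"name": "lookup_order", "cache_control": {"type": "ephemeral"}}]
assert count_breakpoints(system, tools, []) == 2  # fine: the limit is four
```

Run this in a unit test or a pre-send assertion so a fifth marker fails loudly in CI instead of silently in production.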

When it's not worth turning on

Prompt caching is designed for the "same context, different final message" pattern. If your application is genuinely unique on every request (a summarisation tool where each input is a different document), caching gives you nothing and the 25 percent surcharge on cache writes actively hurts. If your traffic is bursty with long idle gaps, the 5-minute TTL will bite more than you think. Extended 1-hour caching helps but doubles the write cost.

Run the maths on your actual pattern before turning it on as a default, and check the usage fields on every response for at least a week afterwards to make sure it is actually saving money. The numbers in the invoice are the only ones that count.

The same feature on OpenAI

The OpenAI API has its own version of prompt caching, called automatic prompt caching, which is on by default for GPT-4o and later. It works differently: there is no explicit breakpoint marker and no cache_control field; any prompt over 1,024 tokens is cached automatically when its prefix matches an earlier request. TTL behaviour differs too (typically around 5 to 10 minutes, and not documented as precisely as Anthropic's).

The failure modes in this post mostly do not apply to OpenAI because there is nothing for you to configure wrong. The flip side is that the savings are smaller (50 percent discount on cached input, not 90 percent), and you cannot control where the cache boundary lives. If you are on OpenAI, you get a small automatic saving with zero setup work. If you are on Anthropic, you get a bigger saving with this post's 15 minutes of setup.
