In a transformer, the forward pass over the prompt (prefill) is a major driver of inference cost and latency. Prompt caching reuses stored key/value tensors from attention layers when a long prefix repeats—so the model skips redundant computation on those tokens. Major providers now expose this as a product feature, not just an internal optimization.
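The reuse described above can be sketched as a toy prefix cache: store the (simulated) KV tensors keyed by token prefix, then on a new request find the longest stored prefix and run prefill only on the remainder. This is an illustration of the idea, not any provider's implementation; the class and names are hypothetical.

```python
# Toy sketch of prefix KV reuse (illustrative only; real servers cache
# KV tensors in paged blocks, not Python dicts).
class PrefixKVCache:
    def __init__(self):
        self._store = {}  # tuple(token ids) -> simulated KV tensors

    def put(self, tokens, kv):
        self._store[tuple(tokens)] = kv

    def longest_prefix(self, tokens):
        """Return (matched_len, kv) for the longest stored prefix, else (0, None)."""
        for n in range(len(tokens), 0, -1):
            kv = self._store.get(tuple(tokens[:n]))
            if kv is not None:
                return n, kv
        return 0, None

cache = PrefixKVCache()
cache.put([1, 2, 3, 4], "kv-for-1234")

hit_len, kv = cache.longest_prefix([1, 2, 3, 4, 5, 6])
# Only the tokens after the matched prefix still need prefill compute.
to_prefill = [1, 2, 3, 4, 5, 6][hit_len:]
```

When the same long system prompt leads every request, the matched prefix covers most of the input and prefill shrinks to the per-request tail.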
Why caching changes both $ and kWh
Commercial APIs price cached input tokens well below uncached input tokens because the provider avoids repeating the expensive prefill work. Discounts vary by model family and are published on each vendor’s pricing page—treat them as a proxy for avoided compute, not a literal energy meter. For carbon accounting at the customer level, the honest approach is still to allocate using billed tokens × your methodology coefficients (see how we estimate CO₂e), while using cache hit rates as a reduction lever in sustainability narratives.
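The pricing logic above reduces to simple blended-rate arithmetic. A minimal sketch, where the per-token price, discount, and CO₂e coefficient are made-up placeholders rather than any vendor's real numbers:

```python
def blended_input_cost(tokens, hit_rate, price_per_token, cached_discount):
    """Cost of input tokens when a fraction of them hit the prompt cache."""
    cached = tokens * hit_rate
    uncached = tokens - cached
    return uncached * price_per_token + cached * price_per_token * (1 - cached_discount)

# 1M input tokens, 60% cache hit rate, $3 per million uncached, 50% cached discount
cost_usd = blended_input_cost(1_000_000, 0.60, 3.0 / 1_000_000, 0.50)

# Carbon allocation stays tied to billed tokens x a methodology coefficient;
# the coefficient below is purely hypothetical.
co2e_g = 1_000_000 * 0.000002  # g CO2e per billed token (placeholder)
```

Note that the cache hit rate lowers the dollar figure but, under a billed-tokens methodology, the CO₂e allocation is unchanged; the hit rate enters the sustainability story as a reduction lever, not as a line item.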
Provider models differ
Some stacks apply automatic prefix caching when prompts exceed a minimum size and prefixes match exactly; others ask developers to mark cache breakpoints (e.g. ephemeral cache control on long system prompts). Provider documentation explains that what gets cached are the key/value (KV) projections inside attention—the same artifacts that open inference servers reuse as KV blocks across requests when they implement prefix caching. Your integration choices (stable system prompts, tool definitions ordered consistently) directly affect hit rate.
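The integration point is worth making concrete: exact-prefix caches only fire when the shared portion of the request is byte-identical across calls. A minimal sketch, with a hypothetical system prompt and tool list, of keeping the stable parts fixed and the variable user turn last:

```python
# Sketch: keep the shared prefix (system prompt + tools) byte-identical
# across requests; only the final user turn varies.
SYSTEM_PROMPT = "You are a support assistant. Follow the playbook strictly."
TOOLS = [
    {"name": "search_kb", "description": "Search the knowledge base."},
    {"name": "create_ticket", "description": "Open a support ticket."},
]

def build_request(user_message):
    return {
        "system": SYSTEM_PROMPT,                          # never interpolated per-request
        "tools": sorted(TOOLS, key=lambda t: t["name"]),  # deterministic ordering
        "messages": [{"role": "user", "content": user_message}],  # variable part last
    }

a = build_request("Reset my password")
b = build_request("Update billing email")
# Everything before the user turn is identical, so an exact-prefix cache
# can reuse the prefill work for the system prompt and tool definitions.
shared_prefix_identical = (a["system"], a["tools"]) == (b["system"], b["tools"])
```

Per-request interpolation into the system prompt (timestamps, user names) is the classic hit-rate killer: it changes the prefix on every call and silently disables caching.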
What to measure internally
- Cache hit ratio on long shared prefixes (support playbooks, legal clauses, RAG corpora).
- Time-to-first-token improvements—often a side effect of skipping prefill work.
- Tokens still generated—caching inputs does not shrink completion length; output tokens often dominate energy on open-ended tasks (token vs hardware).
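The three metrics above can be computed from request logs. A sketch over hypothetical log records, whose `prompt_tokens` / `cached_tokens` fields mirror the usage counters many APIs report (field names here are assumptions, not a specific vendor's schema):

```python
# Hypothetical per-request logs with usage counters and time-to-first-token.
logs = [
    {"prompt_tokens": 4000, "cached_tokens": 3800, "completion_tokens": 250, "ttft_ms": 180},
    {"prompt_tokens": 4000, "cached_tokens": 0,    "completion_tokens": 240, "ttft_ms": 950},
    {"prompt_tokens": 1200, "cached_tokens": 1024, "completion_tokens": 600, "ttft_ms": 210},
]

prompt_total = sum(r["prompt_tokens"] for r in logs)
cached_total = sum(r["cached_tokens"] for r in logs)
hit_ratio = cached_total / prompt_total            # share of input work skipped

hits = [r for r in logs if r["cached_tokens"] > 0]
avg_ttft_hit_ms = sum(r["ttft_ms"] for r in hits) / len(hits)

output_total = sum(r["completion_tokens"] for r in logs)  # unaffected by input caching
```

Tracking `output_total` alongside the hit ratio keeps the third point honest: a high input-cache hit rate coexisting with long completions means generation, not prefill, dominates the remaining cost and energy.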
Research angle
Academic evaluations of prompt caching for long-horizon agents highlight KV reuse as a tool for cost and compute control when the same prefixes recur. That aligns with how production teams should think: structure prompts for repeatability, then verify with billing dashboards and latency metrics.
Sources & further reading
- OpenAI — Prompt Caching 201 (KV cache mechanics)
- vLLM — Automatic prefix caching design
- arXiv — Don’t Break the Cache (prompt caching for long-horizon agentic tasks)
- arXiv — How Hungry is AI? (benchmarking energy & carbon of LLM inference)
External pages are independent; carbon-llm does not endorse or control third-party content.
Disclaimer. Pricing and features change; confirm current provider docs before relying on discount percentages in external reporting.