
How prompt caching cuts LLM prefill work—and why it matters for CO₂e

KV reuse in transformers skips redundant attention compute on repeated prefixes; provider pricing reflects the avoided work, and open-source inference stacks implement prefix caching in the same way.

In a transformer, the forward pass over the prompt (prefill) is a major driver of inference cost and latency. Prompt caching reuses stored key/value tensors from attention layers when a long prefix repeats—so the model skips redundant computation on those tokens. Major providers now expose this as a product feature, not just an internal optimization.
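The mechanics can be sketched with a toy cache that stores per-prefix K/V state and only "prefills" the suffix that has not been seen before. This is an illustration of the idea, not a real inference engine; the placeholder strings stand in for actual key/value tensors, and real servers cache fixed-size KV blocks rather than every prefix length.

```python
# Toy sketch of prefix caching: reuse stored K/V state for the longest
# matching prefix, so only the new suffix incurs prefill work.
def prefill(tokens, kv_cache):
    """Return the number of tokens that needed fresh prefill compute."""
    # Find the longest prefix whose K/V state is already cached.
    reused = 0
    for i in range(len(tokens), 0, -1):
        if tuple(tokens[:i]) in kv_cache:
            reused = i
            break
    # "Compute" K/V for the remaining suffix, caching each new prefix.
    for j in range(reused, len(tokens)):
        kv_cache[tuple(tokens[: j + 1])] = f"kv_state_{j}"  # placeholder tensor
    return len(tokens) - reused

cache = {}
system = list(range(1000))                 # long shared system prompt
cold = prefill(system + [1, 2], cache)     # full prefill: 1002 tokens
warm = prefill(system + [3, 4], cache)     # cache hit on prefix: 2 tokens
```

A real implementation caches by block and evicts under memory pressure, but the cost asymmetry is the same: the second request pays only for the unshared suffix.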


Why caching changes both $ and kWh

Commercial APIs price cached input tokens well below uncached input tokens because the provider avoids repeating the expensive prefill work. Discounts vary by model family and are published on each vendor’s pricing page—treat them as a proxy for avoided compute, not a literal energy meter. For carbon accounting at the customer level, the honest approach is still to allocate using billed tokens × your methodology coefficients (see how we estimate CO₂e), while using cache hit rates as a reduction lever in sustainability narratives.
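That allocation approach can be sketched as a small calculation. All numbers below are illustrative placeholders, not published coefficients or real discount rates; substitute your own methodology values and the discount your provider actually documents.

```python
# Sketch of customer-level CO2e allocation from billed tokens.
# Coefficients and the cached-token discount are hypothetical placeholders.
GRAMS_CO2E_PER_1K_INPUT = 0.5    # hypothetical methodology coefficient
GRAMS_CO2E_PER_1K_OUTPUT = 1.5   # hypothetical: output tokens cost more
CACHED_INPUT_DISCOUNT = 0.9      # hypothetical: 90% of prefill work avoided

def estimate_co2e_grams(uncached_in, cached_in, out):
    # Cached input tokens are billed (and allocated) at a steep discount,
    # treating the discount as a proxy for avoided prefill compute.
    effective_in = uncached_in + cached_in * (1 - CACHED_INPUT_DISCOUNT)
    return (effective_in * GRAMS_CO2E_PER_1K_INPUT
            + out * GRAMS_CO2E_PER_1K_OUTPUT) / 1000

baseline = estimate_co2e_grams(100_000, 0, 20_000)        # no cache hits
with_cache = estimate_co2e_grams(10_000, 90_000, 20_000)  # 90% of input cached
```

Note that output tokens are unaffected by caching, which is why they often dominate the estimate on open-ended tasks.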

Provider models differ

Some stacks apply automatic prefix caching when prompts exceed a minimum size and prefixes match exactly; others ask developers to mark cache breakpoints (e.g., ephemeral cache control on long system prompts). Open documentation explains that caching targets the KV projections inside attention—the same layer where open inference servers implement prefix caching to reuse KV blocks across requests. Your integration choices (stable system prompts, tool definitions ordered consistently) directly affect the hit rate.
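For the explicit-breakpoint style, the shape of a request looks roughly like this. The model name and prompt text are placeholders, and the payload is shown as a plain dict rather than a specific SDK call; check your provider's docs for the exact field names and placement rules.

```python
# Sketch of marking a cache breakpoint on a long, stable system prompt,
# following the "ephemeral" cache-control pattern some providers document.
LONG_SYSTEM_PROMPT = "You are a support agent. Follow the playbook. " * 100

payload = {
    "model": "example-model",  # placeholder model name
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Breakpoint: content up to and including this block is cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
```

Keeping the system block byte-for-byte identical across requests is what makes the prefix cacheable at all; any variation before the breakpoint forces a cold prefill.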

What to measure internally

  • Cache hit ratio on long shared prefixes (support playbooks, legal clauses, RAG corpora).
  • Time-to-first-token improvements—often a side effect of skipping prefill work.
  • Tokens still generated—caching inputs does not shrink completion length; output tokens often dominate energy on open-ended tasks (token vs hardware).
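The three metrics above can be aggregated from per-request usage records. The field names here (cached_tokens, uncached_tokens, ttft_ms) are assumptions for illustration; map them to whatever your provider's usage object actually returns.

```python
# Sketch: aggregate cache metrics from per-request usage records.
# Field names are assumed, not a specific provider's schema.
requests = [
    {"cached_tokens": 0,    "uncached_tokens": 1200, "ttft_ms": 950},  # cold
    {"cached_tokens": 1100, "uncached_tokens": 100,  "ttft_ms": 210},  # warm
    {"cached_tokens": 1100, "uncached_tokens": 140,  "ttft_ms": 230},  # warm
]

total_input = sum(r["cached_tokens"] + r["uncached_tokens"] for r in requests)
hit_ratio = sum(r["cached_tokens"] for r in requests) / total_input
mean_ttft_ms = sum(r["ttft_ms"] for r in requests) / len(requests)
```

Tracking hit ratio alongside time-to-first-token makes it easy to confirm that warm requests are actually skipping prefill rather than just being billed at a discount.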

Research angle

Academic evaluations of prompt caching for long-horizon agents highlight KV reuse as a tool for cost and compute control when the same prefixes recur. That aligns with how production teams should think: structure prompts for repeatability, then verify with billing dashboards and latency metrics.

Sources & further reading

External pages are independent; carbon-llm does not endorse or control third-party content.

Disclaimer. Pricing and features change; confirm current provider docs before relying on discount percentages in external reporting.