Blog · 9 min read

Token-based vs hardware-based LLM CO₂ estimates: when to use which

Coefficients per 1k tokens vs duration × power × grid intensity — trade-offs, uncertainty, and how to keep both comparable in your methodology note.

There are two broad ways to approximate inference emissions: token × coefficient (activity × factor) or hardware proxies (duration, power draw, grid carbon intensity). Neither is "truth"; both are models. The right choice depends on the data you actually have and on what you need to defend in a methodology note.

Token-based estimates

You multiply normalized token counts by a grams-CO₂e-per-1k-tokens factor, ideally sourced from vendor LCAs, cloud disclosures, or peer-reviewed benchmarks, with explicit confidence labels. Strength: cheap to integrate, a stable API contract, easy to aggregate per tenant. Limit: factors bundle hardware, region, and fleet mix unless you split them out in your own model.
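A minimal sketch of the token-based approach. The factor value, its confidence label, and its source string are illustrative assumptions, not published numbers; in practice each factor would come from a vendor LCA or disclosure.

```python
# Hypothetical token-based estimator. The factor and its metadata are
# placeholders, not real vendor-published values.
from dataclasses import dataclass

@dataclass
class EmissionFactor:
    grams_co2e_per_1k_tokens: float  # bundled factor (hardware + region + fleet mix)
    confidence: str                  # explicit confidence label, e.g. "low"/"medium"/"high"
    source: str                      # provenance: vendor LCA, disclosure, benchmark

def token_based_estimate(total_tokens: int, factor: EmissionFactor) -> float:
    """Grams CO2e for a normalized token count: tokens/1000 x factor."""
    return (total_tokens / 1000) * factor.grams_co2e_per_1k_tokens

# Example with an assumed factor of 2.5 g CO2e per 1k tokens:
factor = EmissionFactor(2.5, "medium", "vendor disclosure (assumed)")
print(token_based_estimate(120_000, factor))  # 300.0 g CO2e
```

Because the factor carries its own confidence and source fields, per-tenant aggregation can preserve provenance instead of flattening everything into one number.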

Hardware-based estimates

When you can observe inference duration and assume a server power envelope plus a grid intensity (e.g. by country), you can build an energy × carbon intensity story. Strength: speaks the language of infrastructure teams. Limit: sensitive to assumptions (GPU model, utilization, PUE) and often harder to standardize across tenants.
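The energy × carbon intensity story can be sketched like this. Every default below (power envelope, utilization, PUE, grid intensity) is an assumption for illustration; the whole point of the limits above is that these are the parameters your result is sensitive to.

```python
# Hypothetical hardware-based estimator. All defaults are assumptions
# to be replaced with measured or disclosed values.

def hardware_based_estimate(
    duration_s: float,
    power_draw_w: float = 700.0,    # assumed GPU board power envelope
    utilization: float = 0.6,       # assumed average utilization
    pue: float = 1.2,               # assumed data-center PUE
    grid_g_per_kwh: float = 400.0,  # assumed grid intensity (g CO2e / kWh)
) -> float:
    """Energy (kWh) x grid carbon intensity -> grams CO2e."""
    # watts x seconds -> watt-seconds; divide by 3.6e6 to get kWh
    energy_kwh = (power_draw_w * utilization * pue * duration_s) / 3_600_000
    return energy_kwh * grid_g_per_kwh

# 30 seconds of inference under the assumptions above:
print(round(hardware_based_estimate(30.0), 3))  # 1.68 g CO2e
```

Varying one parameter at a time (say, utilization 0.3 vs 0.9, or a cleaner grid) is exactly the sensitivity analysis this method is best suited for.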

Keeping comparisons honest

If you expose both methods, report them side by side with distinct confidence tiers — not as duplicate "official" totals. Many teams use tokens for recurring Scope 3-style totals and hardware for sensitivity analysis or internal capacity planning.
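One way to keep the two estimates comparable without conflating them is to report each with its own confidence tier and declared role. The structure and values here are illustrative, not a reporting standard.

```python
# Hypothetical side-by-side report: each method keeps its own
# confidence tier and role, so neither reads as "the" official total.
report = {
    "token_based": {
        "grams_co2e": 300.0,          # illustrative value
        "confidence": "medium",
        "role": "recurring Scope 3-style total",
    },
    "hardware_based": {
        "grams_co2e": 260.0,          # illustrative value
        "confidence": "low",
        "role": "sensitivity analysis / capacity planning",
    },
}

for method, entry in report.items():
    print(f"{method}: {entry['grams_co2e']} g CO2e "
          f"({entry['confidence']} confidence; {entry['role']})")
```

The point of the shape, not the numbers: a downstream dashboard can render two labeled rows instead of silently averaging or summing methodologically different figures.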

Disclaimer. Method choice interacts with your reporting boundary and materiality. Involve sustainability and legal stakeholders for CSRD-facing narratives.