
Why “grams of CO₂ per AI query” estimates disagree — and how to read them

Inference vs training amortization, model choice, and methodology boundaries: why headlines conflict, and what transparent reporting actually requires.

Headlines about “grams of CO₂ per ChatGPT query” rarely agree. That is not always because someone is wrong; more often, different studies answer different questions: inference-only electricity vs training amortization, short prompts vs long documents, frontier models vs smaller endpoints, and which grid carbon intensity or LCA boundary they assume.

What gets bundled into “one query”

Some estimates isolate the inference electricity for a single exchange. Others amortize training energy and hardware manufacturing across expected lifetime queries. A third family uses vendor or cloud disclosures tied to specific regions and fleet efficiency. Comparing a “median text query” from one provider to a worst-case marketing blog without reading the methodology is how you get order-of-magnitude spreads.
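To make the boundary effect concrete, here is a minimal sketch in which the same usage yields different per-query grams depending on what gets bundled in. Every constant (per-query watt-hours, grid intensity, training footprint, lifetime query count) is an illustrative assumption, not a measured value from any provider.

```python
# Minimal sketch of why methodology boundaries change the answer.
# All numbers are illustrative placeholders, not measurements.

WH_PER_QUERY_INFERENCE = 0.3        # assumed inference electricity per query (Wh)
GRID_G_CO2_PER_KWH = 400.0          # assumed grid carbon intensity (gCO2/kWh)
TRAINING_KG_CO2 = 500_000.0         # assumed total training footprint (kgCO2)
LIFETIME_QUERIES = 10_000_000_000   # assumed queries served over model lifetime

def grams_inference_only() -> float:
    """Boundary 1: inference electricity only."""
    return WH_PER_QUERY_INFERENCE / 1000.0 * GRID_G_CO2_PER_KWH

def grams_with_training_amortized() -> float:
    """Boundary 2: inference plus training spread across lifetime queries."""
    training_g_per_query = TRAINING_KG_CO2 * 1000.0 / LIFETIME_QUERIES
    return grams_inference_only() + training_g_per_query

print(f"inference only:     {grams_inference_only():.3f} g CO2/query")
print(f"training amortized: {grams_with_training_amortized():.3f} g CO2/query")
```

With these placeholder inputs the two boundaries already differ noticeably, and swapping in a dirtier grid, a larger model, or a shorter amortization window moves the result by multiples; that sensitivity, not arithmetic error, is where most headline disagreement comes from.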

Why ranges can still be useful

For product and sustainability teams, the lesson is not to pick the smallest number. It is to document your boundary: which model, which token counts, which coefficient source, and which confidence label (measured, benchmarked, estimated). Peer-reviewed work and public LCAs increasingly converge on transparent, repeatable methods rather than headline grams.
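One lightweight way to enforce that discipline is to make the boundary a first-class record that travels with every published number, rather than a footnote. The shape below is our own suggestion, not a standard schema; the confidence labels mirror the measured/benchmarked/estimated distinction above.

```python
# Sketch of a boundary record attached to every published estimate.
# Field names are a suggestion for illustration, not a standard schema.
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class EstimateBoundary:
    model: str                            # which model or endpoint the number refers to
    input_tokens: int                     # token counts the coefficient was applied to
    output_tokens: int
    coefficient_source: str               # citation or URL for the gCO2-per-token figure
    includes_training_amortization: bool  # is training/hardware spread into the number?
    confidence: Literal["measured", "benchmarked", "estimated"]

note = EstimateBoundary(
    model="example-small-model",          # hypothetical identifier
    input_tokens=250,
    output_tokens=600,
    coefficient_source="internal benchmark, 2024-11 (hypothetical)",
    includes_training_amortization=False,
    confidence="estimated",
)
```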

What we do at carbon-llm

Our public /api/v1/estimate endpoint and browser tools use token-derived coefficients with traceable sources and explicit limitations, so you can explain a number in a methodology note rather than only display it in a dashboard.
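As a hedged illustration of how such an endpoint might be called from a reporting script: the /api/v1/estimate path comes from this post, but the host, request fields, and response shape below are assumptions for illustration only; consult the API documentation for the actual schema.

```python
# Hedged example of calling the public estimate endpoint.
# The payload and host are assumptions, not the documented API schema.
import json
import urllib.request

BASE_URL = "https://carbon-llm.example"  # placeholder host

payload = {
    "model": "example-small-model",      # hypothetical model identifier
    "input_tokens": 250,
    "output_tokens": 600,
}
req = urllib.request.Request(
    f"{BASE_URL}/api/v1/estimate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    estimate = json.load(resp)

# We assume the response carries the number plus its provenance, so the same
# fields can be quoted directly in a methodology note.
print(estimate)
```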


Disclaimer. Figures are indicative models, not meter readings. Use them for awareness, prioritization, and disclosure support, not as a substitute for your own materiality and legal review.