
How much CO₂ does an AI query really use? Vendor data, research ranges, and CSRD-ready logging

Why ChatGPT and Gemini “grams per query” figures disagree — Epoch AI, published methodology, grid factors — and what to log for ESRS E1 / Scope 3 instead of chasing a headline.

If you search for “grams of CO₂ per ChatGPT query”, you will see answers that look incompatible — from under a tenth of a gram to several grams for a single exchange. That confusion is not just social media noise: different studies measure different boundaries (inference only vs amortized training), different workloads (short chat vs long context + tools), and different grids. For sustainability and finance teams, the actionable move is not picking the catchiest headline; it is adopting a repeatable activity dataset (tokens, model, environment) that your CSRD narrative can defend.

What recent vendor and research estimates actually say

Independent modeling groups have updated bottom-up estimates as models and hardware improved. Epoch AI, revisiting ChatGPT-class usage with clearer assumptions, finds that a typical GPT-4o query may land around 0.3 watt-hours of electricity — materially lower than some early-2023 back-of-the-envelope figures, largely because of efficiency gains and more realistic token assumptions. In public commentary, OpenAI has pointed to a similar order of magnitude for an “average” ChatGPT interaction (a few tenths of a watt-hour per query).

Google has published methodology stating that a median text prompt to Gemini can be on the order of 0.24 Wh, with a small associated CO₂e figure for that disclosed scenario — an unusually transparent datapoint, but still one product and one definition of “median prompt,” not a universal coefficient for every LLM call your company makes.

Commentary and secondary analyses that convert watt-hours to grams often apply a grid emission factor. That step is where two honest calculators diverge: the same 0.3 Wh looks very different on a hydro-heavy grid than on a coal-heavy one. Any single “grams per query” number that does not state its factor is incomplete.
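To make that divergence concrete, here is a minimal sketch converting the same 0.3 Wh figure under different grid factors. The factor values are illustrative placeholders, not published coefficients; real reporting should use your region's or provider's disclosed gCO₂e/kWh figure.

```python
# Convert per-query electricity (Wh) to grams CO2e via a grid emission factor.
# The factors below are illustrative placeholders, NOT official coefficients.

ILLUSTRATIVE_GRID_FACTORS_G_PER_KWH = {
    "hydro_heavy_region": 30,   # hypothetical low-carbon grid
    "mid_range_region": 250,    # hypothetical mid-intensity grid
    "coal_heavy_region": 700,   # hypothetical high-carbon grid
}

def grams_co2e_per_query(wh_per_query: float, grid_g_per_kwh: float) -> float:
    """Wh -> kWh, then multiply by the grid factor (gCO2e per kWh)."""
    return (wh_per_query / 1000.0) * grid_g_per_kwh

for region, factor in ILLUSTRATIVE_GRID_FACTORS_G_PER_KWH.items():
    print(f"{region}: {grams_co2e_per_query(0.3, factor):.3f} gCO2e")
```

The same 0.3 Wh yields roughly 0.009 g on the hypothetical low-carbon grid and 0.21 g on the high-carbon one — a 20x spread from the factor alone, which is exactly why a bare “grams per query” headline is underspecified.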

Why some headlines say “a few grams” and others “negligible”

  • Inference vs everything else. Training, embodied hardware, over-provisioning, and cooling can be included or excluded. Enterprise reporting usually starts with operational electricity for inference and documents what is out of scope.
  • Workload shape. A short question-answer pair is not comparable to a coding-agent session with many tool calls, or a long-document prefill where compute scales with tokens and model architecture. Practitioner write-ups note orders-of-magnitude gaps between “median chat” and “agentic” workflows.
  • Reasoning models. Newer “reasoning” stacks can spend far more tokens (and time) per user-visible answer than classic chat, which pushes per-task energy toward heavier categories — closer to how people already think about streaming or batch jobs than about a single web search.
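To see why workload shape dominates, even a toy linear-in-tokens energy model spreads workloads across orders of magnitude. The per-1k-token figure and the token counts below are hypothetical assumptions for illustration; real energy also depends on model size, hardware, batching, and prefill-vs-decode costs.

```python
# Toy model: energy scales (to first order) with tokens processed.
# HYPOTHETICAL_WH_PER_1K_TOKENS is a placeholder assumption, not a measurement.

HYPOTHETICAL_WH_PER_1K_TOKENS = 0.15

workloads = {
    "short_chat": 500,               # brief question-answer pair
    "long_document_prefill": 50_000, # compute scales with context tokens
    "agentic_session": 200_000,      # many tool calls and reasoning tokens
}

for name, tokens in workloads.items():
    wh = tokens / 1000 * HYPOTHETICAL_WH_PER_1K_TOKENS
    print(f"{name}: {tokens:>7} tokens ~ {wh:.2f} Wh")
```

Under these assumptions the short chat sits below 0.1 Wh while the agentic session reaches tens of watt-hours — the same gap practitioner write-ups describe between “median chat” and agentic workflows.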

CSRD, ESRS E1, and where LLM use shows up

Under the EU corporate sustainability reporting frame, ESRS E1 (Climate change) expects a structured greenhouse gas picture: Scopes 1–3, targets, and transition narrative where material. Cloud and purchased AI services typically push teams into Scope 3 discussions — especially categories tied to purchased goods and services — rather than pretending generative AI is immaterial because “it is software.”

The reporting requirement is not to repeat a viral CO₂ figure; it is to show how you know what you claim: activity data, assumptions, and limitations. That is why finance and sustainability leads increasingly ask engineering for token-level usage tied to cost centers or products — the same backbone you would use for unit economics, with a documented emission factor layered on top.

For a deeper internal map, start with our posts on Scope 3 first steps for LLM usage, why per-query estimates disagree, and token → CO₂e coefficients.

A practical logging checklist (better than debating averages)

  1. Capture prompt and completion tokens (and model id) from provider APIs — the fields exist precisely so usage is auditable.
  2. Separate environments (prod vs sandbox) and major products so Scope 3 allocations do not collapse into one bucket.
  3. Version your coefficients and state whether they cover inference only; recompute when you change regions or providers.
  4. Label uncertainty where the grid location of a specific request is unknown — a common gap noted in third-party accounting methodologies.
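The checklist above can be sketched as a single auditable record per request. All field names, identifiers, and values here are illustrative assumptions, not a standard schema; map them onto whatever your provider's API actually returns.

```python
# Minimal sketch of an auditable per-request usage record implementing the
# four checklist items. Field names and values are illustrative assumptions.
from dataclasses import dataclass, asdict
import json

@dataclass
class LlmUsageRecord:
    model_id: str           # (1) model identifier from the API response
    prompt_tokens: int      # (1) provider-reported usage
    completion_tokens: int  # (1) provider-reported usage
    environment: str        # (2) "prod" vs "sandbox"
    product: str            # (2) product / cost-center allocation
    coefficient_id: str     # (3) versioned emission coefficient reference
    scope_note: str         # (3) what the coefficient covers
    grid_known: bool        # (4) whether the request's grid region is known

record = LlmUsageRecord(
    model_id="example-model-v1",                   # hypothetical
    prompt_tokens=1200,
    completion_tokens=300,
    environment="prod",
    product="support-assistant",                   # hypothetical
    coefficient_id="coef-2025-03-inference-only",  # hypothetical versioning
    scope_note="inference electricity only; training excluded",
    grid_known=False,  # (4) flag the uncertainty explicitly, don't hide it
)
print(json.dumps(asdict(record), indent=2))
```

Keeping the coefficient as a versioned reference (rather than a baked-in number) means you can recompute historical CO₂e when regions, providers, or factors change — the property auditors actually care about.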

See impact while you work

For individuals and teams who want indicative feedback at the point of use — without shipping prompt text to us — the carbon-llm Chrome extension turns provider-reported usage into approximate CO₂e, and My footprint aggregates patterns over time. Pair that with our methodology for what is in and out of the estimate.

Sources & further reading

External pages are independent; carbon-llm does not endorse or control third-party content.

Disclaimer. This article summarizes public research and vendor disclosures for education; it is not legal advice. CSRD applicability depends on your entity and timeline — verify with your advisors.