Reasoning models like o3, DeepSeek-R1, and Gemini Thinking deliver measurably better results on complex tasks. They also emit 30 to 50 times more CO₂ per call than a standard GPT-4o request. Most engineering teams find this out only when a CSRD questionnaire lands — or never.
## What the research says
A June 2025 study (arXiv:2507.11417) quantified inference emissions across model families. The finding that attracted the most attention: reasoning-enabled models averaged 543 "thinking" tokens per query, compared to 37 for concise models. Since energy consumption scales roughly linearly with tokens processed, the carbon gap is large and predictable.
A complementary benchmark (arXiv:2505.09598) put concrete Wh figures on it:
| Model | Type | Wh / short query | Relative to GPT-4o |
|---|---|---|---|
| Gemini 2.0 | Standard | 0.24 Wh | 0.7× |
| GPT-4o | Standard | 0.34 Wh | 1× (baseline) |
| o3 / o1 | Reasoning | ~10–18 Wh | ~30–53× |
| DeepSeek-R1 | Reasoning | 23.8 Wh | ~70× |
Source: arXiv:2505.09598 (Jegham et al., 2025). Figures are estimates based on benchmark inference; actual consumption varies with hardware configuration and prompt length.
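Because energy scales with tokens, converting the table's Wh figures into grams of CO₂e takes only a grid-intensity factor. A minimal sketch in Python, assuming a rough 400 gCO₂e/kWh grid average (an assumption; the real figure depends heavily on your provider's region):

```python
# Back-of-envelope conversion of the per-query Wh figures above into gCO2e.
# The grid intensity is an assumed global-average placeholder; substitute
# your cloud region's actual carbon intensity in practice.

GRID_INTENSITY_G_PER_KWH = 400  # assumed; varies widely by region

WH_PER_QUERY = {
    "gemini-2.0": 0.24,
    "gpt-4o": 0.34,
    "o3": 14.0,         # midpoint of the ~10-18 Wh range in the table
    "deepseek-r1": 23.8,
}

def gco2e_per_query(model: str) -> float:
    """Convert per-query Wh to grams of CO2-equivalent."""
    wh = WH_PER_QUERY[model]
    return wh / 1000 * GRID_INTENSITY_G_PER_KWH  # Wh -> kWh -> gCO2e

for model in WH_PER_QUERY:
    print(f"{model}: {gco2e_per_query(model):.2f} gCO2e/query")
```

Even with a conservative grid factor, the roughly 70× gap between GPT-4o and DeepSeek-R1 carries straight through from Wh to gCO₂e.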
## Why teams miss it
Token counts for reasoning models include the "thinking" tokens generated internally — tokens your users never see, but that you are billed for and that the infrastructure processes all the same. Most LLM observability tools surface latency and cost; few surface CO₂ per call, and almost none break it down by model family in a way that triggers an alert when an endpoint is quietly swapped.
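The hidden-token accounting can be made concrete. The sketch below shows how billed and visible tokens diverge; the `usage` dict shape and field names are assumptions modeled on common provider responses, not a specific API:

```python
# Illustrative accounting of hidden "thinking" tokens. The usage dict here
# mimics the kind of breakdown some provider APIs return; treat the field
# names as assumptions and adapt them to your provider's response schema.

def billable_tokens(usage: dict) -> int:
    """Total completion-side tokens you pay for, prompt included."""
    return usage["prompt_tokens"] + usage["completion_tokens"]

usage = {
    "prompt_tokens": 120,
    "completion_tokens": 580,  # on reasoning models, includes hidden tokens
    "completion_tokens_details": {"reasoning_tokens": 543},
}

visible = usage["completion_tokens"] - usage["completion_tokens_details"]["reasoning_tokens"]
print(f"billed: {billable_tokens(usage)} tokens, visible to user: {visible}")
```

With the averages from the study above (543 thinking tokens vs. 37 visible), the user-facing answer accounts for a small fraction of what the infrastructure actually processed.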
A typical migration path looks like this: a feature uses GPT-4o for six months. Someone benchmarks o3 and gets better evals. The endpoint is swapped in a one-line config change. Monthly carbon for that feature increases 40–70×. No one notices until the next CSRD export — or the next investor ESG question.
## How to simulate before you migrate
Before switching a production feature to a reasoning model, you can estimate the carbon impact in three steps:
- Get your current baseline. From your existing `/track` data, pull the monthly token volume and gCO₂e total for the feature (filter by model + `tenant_id` if needed).
- Apply the multiplier. Use the `/api/v1/estimate` endpoint (public, no auth) with the new model id and your average prompt + completion token counts. The returned gCO₂e will reflect the new coefficient; compare it against your baseline.
- Project at scale. Multiply by your monthly call volume. If the resulting annual CO₂ crosses a materiality threshold for your CSRD inventory, you have the data to justify a selective deployment (e.g. reasoning only for high-value queries, standard model elsewhere).
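The three steps above reduce to simple arithmetic once you have the two coefficients. A back-of-envelope sketch; the per-call gCO₂e figures here are illustrative placeholders, not values pulled from `/track` or `/api/v1/estimate`:

```python
# Projection sketch for a model migration. The per-call gCO2e numbers are
# illustrative stand-ins: in practice, take the baseline from your /track
# data and the new coefficient from /api/v1/estimate.

def project_annual_kg(gco2e_per_call: float, calls_per_month: int) -> float:
    """Project annual emissions in kg CO2e for a single feature."""
    return gco2e_per_call * calls_per_month * 12 / 1000

CALLS_PER_MONTH = 500_000

baseline = project_annual_kg(0.14, CALLS_PER_MONTH)  # standard-model class
migrated = project_annual_kg(7.0, CALLS_PER_MONTH)   # reasoning-model class

print(f"baseline: {baseline:.0f} kg/yr, after migration: {migrated:.0f} kg/yr "
      f"({migrated / baseline:.0f}x)")
```

At half a million calls a month, an otherwise invisible one-line endpoint swap moves the feature from under a tonne to tens of tonnes of CO₂e per year, which is exactly the kind of jump a materiality threshold exists to catch.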
## What to do with reasoning models
The answer is not to avoid reasoning models — it is to use them deliberately. Useful patterns:
- Route by task complexity. Reserve reasoning endpoints for queries that demonstrably benefit (legal drafting, code review) and use a standard model for everything else. A simple intent classifier at the gateway level can automate this.
- Set a carbon budget per feature. Track gCO₂e per feature via `tenant_id` and model in your dashboard. When a model swap doubles the line item, the dashboard catches it before the CSRD export does.
- Document the trade-off. For CSRD disclosures, being able to show that you evaluated the carbon impact of a model choice and selected accordingly is materially stronger than having no data at all.
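The complexity-routing pattern from the first bullet can be sketched at the gateway level. The model ids and the keyword heuristic below are illustrative stand-ins; a real deployment would use a trained intent classifier or a cheap classification call instead:

```python
# Gateway-level routing by task complexity (a sketch, not production code).
# Model ids and the keyword heuristic are illustrative assumptions; replace
# the heuristic with a proper intent classifier in a real deployment.

REASONING_MODEL = "o3"      # reserved for demonstrably high-value tasks
STANDARD_MODEL = "gpt-4o"   # default for everything else

HIGH_VALUE_HINTS = ("review this code", "draft a contract", "legal", "prove")

def pick_model(prompt: str) -> str:
    """Route to the reasoning model only when the prompt looks high-value."""
    lowered = prompt.lower()
    if any(hint in lowered for hint in HIGH_VALUE_HINTS):
        return REASONING_MODEL
    return STANDARD_MODEL
```

Even a crude router like this caps the expensive endpoint at the slice of traffic that benefits, which bounds the carbon budget for the feature instead of multiplying it across every call.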
## Sources & further reading
- arXiv:2507.11417 — Quantifying Energy Consumption and Carbon Emissions of LLM Inference via Simulations
- arXiv:2505.09598 — How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference (Jegham et al., 2025)
- ScienceDaily — Thinking AI models emit 50× more CO₂ (June 2025)
External pages are independent; carbon-llm does not endorse or control third-party content.