
Reasoning Models Are Quietly Destroying Your Carbon Budget

o3 and DeepSeek-R1 emit 30–50× more CO₂ per call than standard models. Here is how to simulate the carbon impact of a model migration before you deploy it.

Reasoning models like o3, DeepSeek-R1, and Gemini Thinking deliver measurably better results on complex tasks. They also emit 30 to 50 times more CO₂ per call than a standard GPT-4o request. Most engineering teams find this out only when a CSRD questionnaire lands — or never.

What the research says

A 2025 study (arXiv:2507.11417) quantified inference emissions across model families. The finding that attracted the most attention: reasoning-enabled models averaged 543 "thinking" tokens per query, compared to 37 for concise models. Since energy consumption scales roughly linearly with tokens processed, the carbon gap is large and predictable.

A complementary benchmark (arXiv:2505.09598) put concrete Wh figures on it:

| Model | Type | Wh / short query | Relative to GPT-4o |
|---|---|---|---|
| Gemini 2.0 | Standard | 0.24 Wh | 0.7× |
| GPT-4o | Standard | 0.34 Wh | 1× (baseline) |
| o3 / o1 | Reasoning | ~10–18 Wh | ~30–53× |
| DeepSeek-R1 | Reasoning | 23.8 Wh | ~70× |

Source: arXiv:2505.09598 (Jegham et al., 2025). Figures are estimates based on benchmark inference; actual consumption varies with hardware configuration and prompt length.
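To turn the Wh figures above into gCO₂e, multiply by the carbon intensity of the grid powering the inference hardware. A minimal sketch, assuming a rough world-average intensity of ~400 gCO₂/kWh (substitute your provider's region-specific figure):

```python
# Convert per-query energy (Wh) to gCO2e.
# GRID_INTENSITY is an assumption (~world average); real values range from
# under 50 gCO2/kWh (hydro-heavy grids) to over 700 (coal-heavy grids).
GRID_INTENSITY_G_PER_KWH = 400

def query_gco2e(wh_per_query: float,
                grid_g_per_kwh: float = GRID_INTENSITY_G_PER_KWH) -> float:
    """gCO2e for a single query, given its energy draw in Wh."""
    return wh_per_query / 1000 * grid_g_per_kwh

gpt4o = query_gco2e(0.34)   # ~0.14 gCO2e per query
r1 = query_gco2e(23.8)      # ~9.5 gCO2e per query
print(f"GPT-4o: {gpt4o:.2f} g, DeepSeek-R1: {r1:.1f} g, "
      f"ratio: {r1 / gpt4o:.0f}x")
```

Note that the ratio between models is independent of the grid intensity you pick; only the absolute gCO₂e figures shift.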

Why teams miss it

Token counts for reasoning models include the "thinking" tokens generated internally — tokens your users never see, but that you are billed for and that the infrastructure processes. Most LLM observability tools surface latency and cost; few surface CO₂ per call, and almost none break it down by model family in a way that triggers an alert when you quietly swap one endpoint.

A typical migration path looks like this: a feature uses GPT-4o for six months. Someone benchmarks o3 and gets better evals. The endpoint is swapped in a one-line config change. Monthly carbon for that feature increases 40–70×. No one notices until the next CSRD export — or the next investor ESG question.

How to simulate before you migrate

Before switching a production feature to a reasoning model, you can estimate the carbon impact in three steps:

  1. Get your current baseline. From your existing /track data, pull the monthly token volume and gCO₂e total for the feature (filter by model + tenant_id if needed).
  2. Apply the multiplier. Use the /api/v1/estimate endpoint (public, no auth) with the new model id and your average prompt + completion token counts. The returned gCO₂e will reflect the new coefficient — compare against your baseline.
  3. Project at scale. Multiply by your monthly call volume. If the resulting annual CO₂ crosses a materiality threshold for your CSRD inventory, you have the data to justify a selective deployment (e.g. reasoning only for high-value queries, standard model elsewhere).
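The three steps above reduce to simple arithmetic once you have the per-call figures. A minimal sketch, where the baseline and candidate gCO₂e values are illustrative placeholders standing in for what /track and /api/v1/estimate would actually return:

```python
# Sketch of steps 1-3. The per-call figures below are hypothetical
# placeholders, not real /track or /api/v1/estimate output.

def project_annual_kgco2e(gco2e_per_call: float, monthly_calls: int) -> float:
    """Step 3: project a per-call estimate to annual kgCO2e."""
    return gco2e_per_call * monthly_calls * 12 / 1000

MONTHLY_CALLS = 500_000

# Step 1: baseline pulled from /track, e.g. GPT-4o at ~0.14 g/call.
baseline = project_annual_kgco2e(0.14, MONTHLY_CALLS)    # ~840 kg/yr
# Step 2: /api/v1/estimate for the candidate model, e.g. ~5 g/call.
candidate = project_annual_kgco2e(5.0, MONTHLY_CALLS)    # ~30,000 kg/yr

print(f"baseline {baseline:,.0f} kg/yr -> candidate {candidate:,.0f} kg/yr "
      f"({candidate / baseline:.0f}x)")
```

If the candidate figure crosses your materiality threshold, the same numbers let you model a split deployment, for example routing only 10% of calls to the reasoning model.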

What to do with reasoning models

The answer is not to avoid reasoning models — it is to use them deliberately. Useful patterns:

  • Route by task complexity. Reserve reasoning endpoints for queries that demonstrably benefit (legal drafting, code review) and use a standard model for everything else. A simple intent classifier at the gateway level can automate this.
  • Set a carbon budget per feature. Track gCO₂e per feature via tenant_id and model in your dashboard. When a model swap doubles the line item, the dashboard catches it before the CSRD export does.
  • Document the trade-off. For CSRD disclosures, being able to show that you evaluated the carbon impact of a model choice — and selected accordingly — is materially stronger than having no data at all.
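The routing and budget patterns above can be sketched together. The model names, the gCO₂e figures, and the keyword-based "classifier" below are illustrative stand-ins — a production gateway would use a real intent classifier and your actual model ids:

```python
# Hypothetical gateway-level router: reasoning model only for complex
# tasks, and only while the feature's carbon budget has headroom.

COMPLEX_HINTS = ("review this code", "draft a contract", "legal", "prove")

# Illustrative per-call costs (see the benchmark table): reasoning ~9.5 g,
# standard ~0.14 g, assuming a ~400 gCO2/kWh grid.
REASONING_COST_G = 9.5

def pick_model(prompt: str, budget_left_g: float) -> str:
    """Return a model id based on task complexity and remaining budget."""
    is_complex = any(hint in prompt.lower() for hint in COMPLEX_HINTS)
    if is_complex and budget_left_g >= REASONING_COST_G:
        return "deepseek-r1"
    return "gpt-4o"

print(pick_model("Review this code for race conditions", budget_left_g=100.0))
print(pick_model("Summarize this email", budget_left_g=100.0))
print(pick_model("Draft a contract for me", budget_left_g=1.0))  # budget exhausted
```

Even this crude version changes the default: the expensive model becomes an explicit, budgeted choice rather than a silent config swap.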

Sources & further reading

External pages are independent; carbon-llm does not endorse or control third-party content.