The efficiency gap between the most frugal and most power-hungry frontier LLMs currently exceeds 200×. LLaMA-3.2-1B consumes roughly 0.07 Wh per short query. DeepSeek-R1 consumes 23.8 Wh for the same task. This is not a rounding error — it is a carbon decision that most product teams make by default, not by design.
## The benchmark landscape in 2025
arXiv:2505.09598 (Jegham et al.) published energy and carbon benchmarks for a broad set of inference endpoints in 2025. The headline finding: per-query energy consumption spans more than two orders of magnitude across the model landscape.
| Model | Size / type | Wh / short query | Best for |
|---|---|---|---|
| LLaMA-3.2-1B | 1B params, local | 0.07 Wh | Classification, routing |
| Gemini 2.0 Flash | Frontier, optimised | 0.24 Wh | General tasks, speed |
| GPT-4o | Frontier | 0.34 Wh | General quality baseline |
| LLaMA-3.1-70B | 70B params | ~0.93 Wh | Mid-range quality, open weights |
| o1 / o3 (min) | Reasoning, light | ~4–10 Wh | Complex reasoning, targeted use |
| DeepSeek-R1 | Reasoning, full | 23.8 Wh | Maximum reasoning depth |
Source: arXiv:2505.09598 (Jegham et al., 2025). Figures are model-level estimates; actual consumption varies with prompt length, hardware, and data center configuration.
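The scale of the gap in the table is easier to feel as an annualised number. The sketch below uses the per-query figures from the Jegham et al. benchmarks above; the 1M-queries-per-day volume is an illustrative assumption, and the annualised totals are my own arithmetic, not from the paper.

```python
# Per-query energy figures (Wh) from the Jegham et al. (2025) benchmarks above.
WH_PER_QUERY = {
    "llama-3.2-1b": 0.07,
    "gemini-2.0-flash": 0.24,
    "gpt-4o": 0.34,
    "llama-3.1-70b": 0.93,
    "deepseek-r1": 23.8,
}

def annual_kwh(model: str, queries_per_day: int) -> float:
    """Annualised energy for a given daily query volume, in kWh."""
    return WH_PER_QUERY[model] * queries_per_day * 365 / 1000

gap = WH_PER_QUERY["deepseek-r1"] / WH_PER_QUERY["llama-3.2-1b"]
print(f"efficiency gap: {gap:.0f}x")  # 340x on these model-level figures

# Hypothetical workload: 1M short queries per day for a year.
print(f"DeepSeek-R1:  {annual_kwh('deepseek-r1', 1_000_000):,.0f} kWh/yr")
print(f"LLaMA-3.2-1B: {annual_kwh('llama-3.2-1b', 1_000_000):,.0f} kWh/yr")
```

At that volume the same workload lands at roughly 8.7 GWh/yr versus 26 MWh/yr, which is why the mismatch described below matters at the fleet level, not just per call.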
## The task-model mismatch is where carbon hides
Most engineering teams select a model once — at architecture time — and run it for everything. The result is a systematic mismatch: frontier models doing classification tasks that a 1B parameter model would handle with equivalent accuracy; reasoning models doing summarisation that a standard model would complete at 1/50th the carbon cost.
Google reported a 33× efficiency improvement for Gemini over a twelve-month period ending in mid-2025. This suggests the efficiency frontier is moving fast — meaning teams that locked in a model selection in 2023 or 2024 may be systematically over-spending on carbon without revisiting the decision.
## How to approach task-based routing
Task-based routing means selecting the model based on what the task actually requires, rather than defaulting to the best available model. A practical framework:
- Classify your call types. Audit your production LLM calls and group them: classification / intent detection, summarisation, generation, structured extraction, complex reasoning. Each category has a different quality threshold.
- Benchmark quality for each category. Run your eval suite (or a representative sample) against a range of model tiers. For many classification and extraction tasks, a 7B–13B model matches frontier model quality.
- Measure the carbon delta. Use `/api/v1/estimate` (public, no auth) with the alternative model id and your average token counts. Compare gCO₂e per call across model candidates.
- Route in production. Implement a lightweight classifier or rules-based router at the gateway layer. Tools like LiteLLM, Portkey, or a simple middleware function can route based on request metadata without adding meaningful latency.
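The four steps above can be sketched as a minimal rules-based router. The task categories mirror the audit groups listed earlier; the model assignments are illustrative assumptions, not recommendations — your own eval results from step 2 should decide the mapping.

```python
# Minimal rules-based router: map a pre-classified task category to a model
# tier. Model IDs here are illustrative placeholders, not recommendations.
TASK_TO_MODEL = {
    "classification": "llama-3.2-1b",    # small local model is usually enough
    "extraction":     "llama-3.1-70b",
    "summarisation":  "gemini-2.0-flash",
    "generation":     "gpt-4o",
    "reasoning":      "o3-mini",         # reserve reasoning models for hard tasks
}

def route(task_type: str, default: str = "gpt-4o") -> str:
    """Pick a model for a request; fall back to a safe default tier."""
    return TASK_TO_MODEL.get(task_type, default)

print(route("classification"))  # llama-3.2-1b
print(route("unknown-task"))    # gpt-4o
```

In production the `task_type` would come from request metadata or a lightweight classifier in front of this lookup; the point is that the routing decision itself is cheap enough to sit in the request path.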
## What the 200× gap means for CSRD reporting
For companies in scope for CSRD / ESRS E1, model selection is a carbon reduction lever that can be documented and reported as part of a climate transition plan. If you can demonstrate that you evaluated model alternatives and chose a lower-intensity option where quality was equivalent, that is a concrete and verifiable reduction action — far stronger than a generic "we are working on green AI" statement.
The data requirement is straightforward: per-model emission totals over time, split by use case where possible. A dashboard that shows gCO₂e per model per month gives you both the baseline and the evidence of improvement after a routing change. See how to structure this data for ESRS E1 reporting.
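The per-model-per-month baseline described above is a simple aggregation. The sketch below assumes a flat log of per-call emission records; the record shape (`model`, `month`, `gco2e`) is my assumption, so adapt the keys to whatever your gateway or logging layer actually emits.

```python
# Sketch: roll per-call emission records up into a gCO2e-per-model-per-month
# baseline for ESRS E1 evidence. Record shape is an assumption.
from collections import defaultdict

calls = [
    {"model": "gpt-4o",       "month": "2025-06", "gco2e": 0.42},
    {"model": "gpt-4o",       "month": "2025-06", "gco2e": 0.39},
    {"model": "llama-3.2-1b", "month": "2025-06", "gco2e": 0.05},
]

totals: dict[tuple[str, str], float] = defaultdict(float)
for call in calls:
    totals[(call["model"], call["month"])] += call["gco2e"]

for (model, month), gco2e in sorted(totals.items()):
    print(f"{month}  {model:<14} {gco2e:.2f} gCO2e")
```

Run monthly, this gives both the baseline and the before/after evidence for a routing change: the same query mix shifting from the frontier row to the small-model row is the documented reduction action.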
## A note on transparency
Not all providers publish methodology for their energy and carbon figures. Claude (Anthropic) and several other frontier models do not have publicly disclosed per-token emission factors as of early 2026. Where primary data is unavailable, documented benchmarks with explicit confidence labels (Measured / Benchmarked / Estimated) are the accepted fallback under GHG Protocol and ESRS E1 guidance. The carbon-llm methodology labels every coefficient accordingly.
## Sources & further reading
- arXiv:2505.09598 — How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference (Jegham et al., 2025)
- Hannah Ritchie — What's the carbon footprint of using ChatGPT or Gemini? (August 2025)
- AI Energy Calculator — GPT-4 & Llama carbon footprint analysis
External pages are independent; carbon-llm does not endorse or control third-party content.