Interest in the environmental footprint of large language models (LLMs) has grown alongside adoption. Researchers now publish life-cycle assessments (LCAs), end-to-end modeling frameworks, and empirical studies on training, fine-tuning, and inference. This post maps a selection of influential papers and reports—without pretending the field has settled on a single number for “one ChatGPT answer.” Methodologies differ; what matters for practitioners is transparent activity data (tokens, energy, region) and documented emission factors.
Why headlines disagree
Comparisons of LLMs to human work, to cars, or to transcontinental flights use different functional units (per page, per query, per training run), different boundaries (operational only vs. embodied hardware), and different grid carbon intensities. A peer-reviewed LCA may therefore reach different conclusions than a blog post that extrapolates from a single provider's disclosure or a single experiment. The studies below are useful as orientation, not as interchangeable coefficients.
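To see how much these choices matter, here is a toy calculation of "grams CO₂eq per query" under three sets of assumptions. Every number below is illustrative, not taken from any of the cited papers:

```python
# Illustrative only: how boundary choices (PUE, embodied hardware) and
# grid carbon intensity swing a per-query figure. No value here is a
# measurement from any study.

def grams_per_query(energy_wh_per_query, grid_g_per_kwh, pue=1.0,
                    embodied_g_per_query=0.0):
    """Operational emissions (energy * PUE * grid factor) plus an
    optional amortized embodied-hardware share, in grams CO2eq."""
    operational = energy_wh_per_query / 1000 * pue * grid_g_per_kwh
    return operational + embodied_g_per_query

# The same hypothetical query, three assumption sets:
low  = grams_per_query(0.3, 50)                     # clean grid, no overheads
mid  = grams_per_query(0.3, 400, pue=1.2)           # average grid, typical PUE
high = grams_per_query(3.0, 700, pue=1.5,
                       embodied_g_per_query=0.2)    # pessimistic boundary

print(round(low, 3), round(mid, 3), round(high, 2))  # 0.015 0.144 3.35
```

A two-orders-of-magnitude spread from plausible-sounding inputs, which is exactly why headline comparisons disagree.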
Comparative LCAs and “human vs. model” framing
Reconciling the contrasting narratives on the environmental impact of large language models (Scientific Reports, 2024) applies comparative LCA across energy, carbon, water, and economic metrics, discussing models such as Llama-3-70B and Gemma-2B-it in a structured scenario. The paper illustrates how narratives can diverge depending on system boundaries and assumptions—useful background when your stakeholders quote seemingly contradictory “grams per page” figures.
Small language models and transparency
Assessing the carbon footprint of language models: Towards sustainability in AI (Belcak et al., Science of the Total Environment, 2025) emphasizes transparent, standardized reporting of energy use and compares emissions from training—particularly for smaller models such as TinyLlama and nanoGPT—with inference. The takeaway for product teams: model choice and use-case fit affect footprint as much as raw parameter counts.
End-to-end modeling: LLMCarbon
LLMCarbon: Modeling the End-To-End Carbon Footprint of Large Language Models (Faiz et al., arXiv, 2023) proposes a framework that spans training, inference, experimentation, and storage, separating operational and embodied carbon and discussing data-center PUE and the carbon intensity of electricity. Follow-on conference work refines the lifecycle stages (e.g. embodied carbon of storage); see, for example, Research on carbon footprint in the whole process of LLM based on refined modeling (ACM ADMIT 2024) for work that builds on this lineage.
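The operational/embodied split that LLMCarbon formalizes can be sketched in a few lines. This is a simplified outline of the decomposition, not the paper's actual model, and every input value is a made-up placeholder:

```python
# Simplified sketch of an operational vs. embodied carbon split in the
# spirit of LLMCarbon; the paper's model is far more detailed. All
# numeric inputs below are hypothetical.

def operational_kg(device_hours, avg_power_w, pue, grid_kg_per_kwh):
    """Accelerator energy, scaled by data-center PUE and the carbon
    intensity of the local grid, in kg CO2eq."""
    energy_kwh = device_hours * avg_power_w / 1000
    return energy_kwh * pue * grid_kg_per_kwh

def embodied_kg(device_hours, chip_embodied_kg, chip_lifetime_hours):
    """Manufacturing carbon amortized over the chip's service life."""
    return chip_embodied_kg * device_hours / chip_lifetime_hours

# Hypothetical training run: 100k GPU-hours at 300 W average draw,
# PUE 1.2, grid at 0.4 kgCO2eq/kWh, chips carrying 150 kg embodied
# carbon over a 5-year lifetime.
op = operational_kg(100_000, 300, 1.2, 0.4)      # 14400.0 kg
em = embodied_kg(100_000, 150, 5 * 365 * 24)     # ~342.5 kg
print(op, round(em, 1))
```

Even in this toy version, the framing makes clear why region (the grid factor) and hardware utilization (the amortization denominator) dominate the result.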
Landmark training studies: BLOOM and Strubell et al.
Estimating the Carbon Footprint of BLOOM, a 176B Parameter Model (Journal of Machine Learning Research, 2024) remains a reference for separating training energy, dynamic grid emissions, and embodied hardware—reporting on the order of 25 tonnes CO₂eq for training-energy-related emissions in their accounting (exact headline numbers should be read in the paper’s tables).
Carbon Emissions and Large Neural Network Training (Patterson et al., 2021) showed that location of training (grid mix) and hardware choices can change footprint by orders of magnitude—motivating today’s focus on region-aware factors and transparent reporting in ML papers.
Green AI, CodeCarbon, and classroom-scale experiments
Green AI: exploring carbon footprints, mitigation strategies, and trade offs in large language model training (Discover Artificial Intelligence, Springer, 2024) uses tools such as CodeCarbon to track CO₂ during training and fine-tuning and discusses lighter architectures (e.g. ALBERT, DistilBERT) as mitigation levers—relevant when your product can swap a generalist LLM for a smaller model on some routes.
Reviews, tools, and practitioner guides
How to estimate carbon footprint when training deep learning models? A guide and review (PMC, 2024) surveys measurement tools (CarbonTracker, MLCO₂, Green Algorithms, etc.) and stresses that large training runs can reach hundreds of tonnes CO₂eq in published estimates—again, highly scenario-dependent.
Inference and “energy cost of an answer”
Training dominates headlines, but inference scales with usage. Energy costs of communicating with AI (Zhao et al., Frontiers in Communication, 2025) evaluates multiple LLMs on MMLU-style tasks and relates accuracy, token usage, and CO₂eq—illustrating trade-offs between model size and environmental cost at query time.
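One way to make the size-vs-cost trade-off concrete is to normalize emissions by usefulness, e.g. grams CO₂eq per correct answer rather than per query. The sketch below uses invented accuracies and per-query figures purely for illustration:

```python
# Toy illustration of the accuracy-vs-emissions trade-off at query
# time. Accuracies and per-query emissions are invented, not drawn
# from Zhao et al. or any other study.

def g_per_correct_answer(accuracy, g_per_query):
    """Emissions normalized by usefulness: grams CO2eq per correct answer."""
    return g_per_query / accuracy

small = g_per_correct_answer(0.70, 0.2)   # hypothetical small model
large = g_per_correct_answer(0.85, 2.0)   # hypothetical large model

# The larger model answers more questions correctly, yet per correct
# answer it can still emit several times more than the small one.
print(round(small, 3), round(large, 3))  # 0.286 2.353
```

Whether that trade is worth it depends on the route: tasks the small model handles well are cheap wins.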
Commentary and synthesis pieces—e.g. Cutter Consortium, Columbia Climate School—help communicate scale to non-specialists but should be cross-checked against primary studies when you need audit-grade citations.
Adjacent evidence: recommender systems
Green Recommender Systems: Understanding and Minimizing the Carbon Footprint of AI-Powered Personalization (2025) is not LLM-specific but demonstrates how experimental norms and hardware choices changed emissions across a decade of RecSys papers—useful analog for teams running repeated benchmarks or A/B tests on GPUs.
What to do in your stack
- Meter tokens and model IDs from provider APIs—the same activity data these papers argue should be public in research.
- Apply documented factors (LCAs, grid regions) and version them when methodology updates.
- Separate test vs. production so sustainability reporting does not conflate R&D experiments with customer-facing inference.
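The three practices above can be sketched as a minimal metering layer. Model names and factor values here are placeholders, not published coefficients:

```python
# Minimal sketch of the practices above: meter tokens per model ID,
# apply a versioned emission factor, and keep test traffic separate
# from production. All model names and factors are placeholders.

from dataclasses import dataclass

@dataclass
class EmissionFactor:
    version: str              # bump when methodology or source updates
    g_co2e_per_1k_tokens: float

FACTORS = {
    "example-small-model": EmissionFactor("2025-01", 0.05),
    "example-large-model": EmissionFactor("2025-01", 0.4),
}

@dataclass
class UsageRecord:
    model_id: str
    tokens: int
    environment: str          # "test" or "production"

def footprint_g(record: UsageRecord):
    """Return (grams CO2eq, factor version) so reports stay auditable."""
    f = FACTORS[record.model_id]
    return record.tokens / 1000 * f.g_co2e_per_1k_tokens, f.version

records = [
    UsageRecord("example-large-model", 12_000, "production"),
    UsageRecord("example-small-model", 50_000, "test"),
]

# Report production separately from R&D experiments.
prod = sum(footprint_g(r)[0] for r in records if r.environment == "production")
print(prod)  # 4.8 grams attributed to customer-facing inference
```

Storing the factor version alongside each figure means you can restate historical reports when a factor's methodology is revised, instead of silently mixing vintages.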
Disclaimer. This overview is educational and not a substitute for LCA studies tailored to your infrastructure. Figures cited in third-party summaries (training “equal to five cars,” etc.) vary by methodology; always refer to the original paper for definitions and boundaries.
Further reading (selected links)
- Nature Scientific Reports — comparative LLM LCA (2024)
- Belcak et al. — SLM carbon footprint (2025)
- LLMCarbon (arXiv)
- Green AI — Discover AI (Springer)
- Patterson et al. — Carbon emissions and large NN training
- Refined LLM lifecycle modeling — ACM ADMIT 2024
- BLOOM carbon footprint — JMLR 2024
- PMC guide — estimating training footprint
- Zhao et al. — inference energy costs
- HDSR — Carbon emissions in the tailpipe of generative AI