Blog · 9 min read

RAG pipeline energy: retrieval, reranking, and when extra LLM passes pile on CO₂

Beyond completion tokens: embeddings, vector search, and verifier models each draw power—why workflow design dominates the carbon story for retrieval-augmented generation.

Retrieval-augmented generation (RAG) is often sold as a way to get smaller models and fresher facts. From a carbon perspective, the right question is whether your end-to-end pipeline—embeddings, retrieval, reranking, and one or more LLM calls—beats a single large-model call for the same business outcome.


Why token counts alone mislead

Standard coefficient methods (see tokens → CO₂e) attach emissions to prompt and completion tokens on the final generation call. A RAG stack also spends energy on embedding models, vector search, optional cross-encoder reranking, and sometimes extra LLM passes for grounding or verification. Recent work on domain-specific climate chatbots decomposes inference-time use into retrieval, generation, and hallucination-check components and finds that more “agentic” pipelines can materially raise energy without proportional quality gains—design matters more than the buzzword.
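The decomposition above can be sketched as a per-stage coefficient estimate. A minimal sketch, where every coefficient (Wh per 1k tokens, grid intensity) is an illustrative assumption rather than a measured value:

```python
# Hedged sketch: per-stage CO2e for one RAG request.
# All coefficients below are illustrative placeholders, not benchmarks.

WH_PER_1K_TOKENS = {          # assumed energy per 1k tokens processed
    "embedding": 0.05,        # query embedding, small model
    "rerank": 0.3,            # cross-encoder over retrieved candidates
    "generation": 2.0,        # main LLM call
    "verifier": 2.0,          # optional grounding/hallucination check
}
GRID_G_CO2E_PER_KWH = 400.0   # assumed grid carbon intensity

def request_co2e_grams(tokens_by_stage: dict[str, int]) -> float:
    """Sum CO2e (grams) across all stages of a single RAG request."""
    wh = sum(
        WH_PER_1K_TOKENS[stage] * tokens / 1000
        for stage, tokens in tokens_by_stage.items()
    )
    return wh / 1000 * GRID_G_CO2E_PER_KWH

# A verifier pass adds GPU-side tokens on top of the base pipeline.
base = request_co2e_grams(
    {"embedding": 200, "rerank": 4000, "generation": 1500}
)
with_verifier = request_co2e_grams(
    {"embedding": 200, "rerank": 4000, "generation": 1500, "verifier": 1600}
)
```

Counting only the generation entry reproduces the token-only estimate; the gap between that and the full sum is the RAG overhead the section describes.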

CPU-bound vs GPU-bound steps

Retrieval is often CPU- or network-bound; generation and verification passes sit on accelerators. That split matters for dashboards: if you only meter the chat completion, you under-count RAG. For CSRD-style narratives, document the boundary you report (API-only vs full pipeline) and point reviewers to your methodology page.
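One way to make the reporting boundary explicit is to tag each stage with where it runs and compute both totals side by side. A sketch with hypothetical per-request energy figures (all numbers are assumptions for illustration):

```python
# Hedged sketch: compare an API-only boundary (final chat completion)
# against a full-pipeline boundary. Per-request Wh values are assumed.

STAGES_WH = {
    # stage: (energy in Wh per request, where it runs)
    "vector_search": (0.4, "cpu"),
    "rerank": (1.2, "gpu"),
    "generation": (3.0, "gpu"),
    "verifier": (2.5, "gpu"),
}

api_only = STAGES_WH["generation"][0]
full_pipeline = sum(wh for wh, _ in STAGES_WH.values())
undercount = 1 - api_only / full_pipeline

print(f"API-only boundary: {api_only:.1f} Wh")
print(f"Full pipeline:     {full_pipeline:.1f} Wh")
print(f"Missed if only generation is metered: {undercount:.0%}")
```

Reporting both numbers, plus the `runs_on` split, is one concrete way to document the boundary for a methodology page.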

Practical ways to keep RAG lean

  • Tighten retrieval before widening context: better recall/precision reduces prompt tokens and wasted generation.
  • Avoid redundant verifier LLM calls unless measurement shows they change outcomes; each pass adds GPU time.
  • Reuse embeddings for stable corpora and cache hot query results where privacy allows—semantic overlap in production queries is high in many domains.
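The caching point above can be sketched with a simple exact-match cache keyed on normalized queries; production systems often use semantic (embedding-similarity) caches instead, but the accounting logic is the same. The retriever stand-in here is hypothetical:

```python
# Hedged sketch: reuse retrieval results for repeated queries so that
# cache hits skip the embedding and vector-search cost entirely.
import hashlib

class QueryCache:
    def __init__(self) -> None:
        self._store: dict[str, list[str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Cheap normalization: lowercase, collapse whitespace.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_retrieve(self, query, retrieve_fn):
        key = self._key(query)
        if key in self._store:
            self.hits += 1        # no embedding or vector-search energy spent
        else:
            self.misses += 1
            self._store[key] = retrieve_fn(query)
        return self._store[key]

cache = QueryCache()
retrieve = lambda q: [f"doc-for:{q}"]   # stand-in for the real retriever
cache.get_or_retrieve("What is CSRD?", retrieve)
cache.get_or_retrieve("  what is csrd? ", retrieve)  # normalized: cache hit
```

The hit/miss counters give you the activity data to quantify how much retrieval energy caching actually avoids, before deciding whether a fancier semantic cache is worth it.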

Connect to disclosure

If generative AI is material, Scope 3 discussions should reference activity data you can defend—often provider usage logs plus an internal note on RAG overhead. Our Scope 3 first steps article covers the reporting angle.


Disclaimer. RAG architectures vary; treat cited studies as illustrations, not guarantees for your stack.