Mission
Inference waste should be visible and stoppable.
Every model call runs on hardware. In production, that usually means a data center drawing power from a grid whose carbon intensity varies by region and time of day, with cooling systems whose water demand varies by facility. For local inference, it means a GPU on a desk or a server in a building.
Most teams can see the bill. They cannot attribute the physical footprint by feature, customer, or workflow, and they usually cannot stop the part that produces nothing.
Vetch makes that waste visible, then stops confirmed stalled loops before they compound.
The same inefficiency, two units
Wasted spend
A stalled agent loop calls the same model 80 times with near-zero output. Each call is billed. None produce useful work.
Wasted energy
Those same 80 calls draw inference compute on serving hardware. They consume electricity, carry an avoidable carbon footprint, and may carry cooling water demand. For zero output.
These are not separate problems. They are the same inefficiency measured in different units. In most waste cases, reducing one reduces the other.
The problem
The inference fleet is growing fast. Efficiency visibility is not keeping up.
A stalled agent loop looks identical to productive traffic at the billing level. A RAG pipeline sending 50 chunks looks like a legitimate call. Without tagging by feature, customer, and workflow from the start, aggregate cost and energy numbers are not actionable: you can see the total, but not which part to fix.
No call-level energy visibility
Providers report tokens. They do not report energy. Teams building on top of APIs have no direct line of sight into the compute energy their traffic consumes.
No attribution by default
Without tags by feature, customer, workflow, and environment, aggregate cost and energy numbers tell you something happened. They do not tell you what to fix.
Waste is structurally invisible
A stalled agent loop looks like normal traffic at the billing level. A RAG pipeline stuffing 50 chunks into a prompt looks like a legitimate call. The waste is real; the signal is absent.
Controls lag visibility
Even teams that detect waste often have no production path to stop it. Observability and intervention are treated as separate problems. We think they belong in the same tool.
What we measure
Cost, energy, carbon, and water. Estimated per call. Without storing prompts or completions.
Every inference event Vetch captures includes an estimated energy draw in watt-hours, a carbon estimate in grams of CO₂ equivalent, a water consumption estimate in liters, and a cost estimate from published pricing. All four are derived from metadata such as token counts, model, provider, region, finish reason, and visible output character count. No prompt or completion text is stored; output text may be counted for its length, then discarded.
Energy (Wh) and Water (L)
Per-model empirical measurements from Jegham et al. (2025): direct power observations on hardware running commercial inference. Provider-specific PUE is applied from vendor sustainability reports. Water is estimated from energy × WUE, with provider and regional fallbacks where available.
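A minimal sketch of the water arithmetic described above, not the SDK's implementation. The function name and the sample inputs are hypothetical, and it assumes WUE is expressed in liters per kilowatt-hour. Whether the energy term is taken before or after PUE is not stated here, so the sketch leaves that choice to the caller.

```python
def estimate_water_liters(energy_wh: float, wue_l_per_kwh: float) -> float:
    """Water (L) = energy (kWh) x WUE (L/kWh). Hypothetical helper."""
    if energy_wh < 0 or wue_l_per_kwh < 0:
        raise ValueError("energy and WUE must be non-negative")
    # Per-call energy is tracked in watt-hours, so convert to kWh first.
    return (energy_wh / 1000.0) * wue_l_per_kwh


# Illustrative inputs only, not measured values.
water_l = estimate_water_liters(energy_wh=2.0, wue_l_per_kwh=0.5)
```

The result is a facility-average estimate: it scales linearly with energy and says nothing about real-time cooling conditions.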
Carbon (gCO₂e)
Energy × PUE × regional grid carbon intensity. Grid intensity is fetched live from Electricity Maps where an API key is configured, falling back to a regional cache, then static averages. Each estimate carries a signal quality: live, delayed, or blind.
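The formula and fallback order above can be sketched as follows. This is an illustration under stated assumptions, not Vetch's code: the function names, the `fetch_live` callable, and the `"default"` key are hypothetical, and the mapping of the regional cache to "delayed" and static averages to "blind" is an assumption about how the three signal qualities line up with the three sources. It also assumes `energy_wh` is the pre-PUE figure.

```python
from typing import Callable, Dict, Optional, Tuple


def resolve_grid_intensity(
    region: str,
    fetch_live: Optional[Callable[[str], Optional[float]]],
    regional_cache: Dict[str, float],
    static_averages: Dict[str, float],
) -> Tuple[float, str]:
    """Return (gCO2e per kWh, signal quality), trying live, then cache, then static."""
    if fetch_live is not None:  # None stands in for "no API key configured"
        try:
            value = fetch_live(region)
            if value is not None:
                return value, "live"
        except Exception:
            pass  # a failed lookup falls through instead of failing the estimate
    if region in regional_cache:
        return regional_cache[region], "delayed"  # assumed mapping
    # static_averages must contain a "default" entry in this sketch.
    return static_averages.get(region, static_averages["default"]), "blind"


def estimate_carbon_g(energy_wh: float, pue: float, intensity_g_per_kwh: float) -> float:
    """Carbon (gCO2e) = energy (kWh) x PUE x grid intensity (gCO2e/kWh)."""
    return (energy_wh / 1000.0) * pue * intensity_g_per_kwh


# Illustrative values only: no live source, so the cached regional figure is used.
intensity, quality = resolve_grid_intensity("GB", None, {"GB": 200.0}, {"default": 450.0})
carbon_g = estimate_carbon_g(energy_wh=2.0, pue=1.2, intensity_g_per_kwh=intensity)
```

Returning the quality label alongside the number keeps every estimate honest about where its grid figure came from.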
Cost (USD)
From provider-published token pricing bundled with the SDK, optionally refreshed via a remote registry. Cost is used as the primary waste signal: cost and energy waste are correlated, and the cost signal is the sharper of the two for alerting.
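As a rough sketch of a per-call cost estimate from token pricing: the table layout, the model name, and the prices below are all hypothetical placeholders, not real provider rates or the SDK's bundled format.

```python
from typing import Dict

# Hypothetical pricing table: USD per one million tokens.
PRICING_USD_PER_MILLION: Dict[str, Dict[str, float]] = {
    "example-model": {"input": 1.00, "output": 4.00},
}


def estimate_cost_usd(
    model: str,
    input_tokens: int,
    output_tokens: int,
    pricing: Dict[str, Dict[str, float]] = PRICING_USD_PER_MILLION,
) -> float:
    """Cost = input tokens x input rate + output tokens x output rate."""
    rates = pricing[model]  # raises KeyError for a model with no known price
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Illustrative call: 1,000 input tokens and 500 output tokens.
cost = estimate_cost_usd("example-model", input_tokens=1000, output_tokens=500)
```

Because the inputs are token counts and a model name, the estimate needs no prompt or completion text.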
Attribution
All four dimensions accumulate per tag: by feature, customer, environment, workflow. Tag every call, and cost, energy, carbon, and water become attributable rather than aggregate.
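One way to picture per-tag accumulation is a small ledger keyed by tag. The class, method, and metric names here are hypothetical and chosen for illustration; they are not the SDK's API.

```python
from collections import defaultdict
from typing import Dict, Tuple

METRICS = ("cost_usd", "energy_wh", "carbon_g", "water_l")


class TagLedger:
    """Accumulates the four estimates under every (tag name, tag value) pair."""

    def __init__(self) -> None:
        self._totals: Dict[Tuple[str, str], Dict[str, float]] = defaultdict(
            lambda: {metric: 0.0 for metric in METRICS}
        )

    def record(self, tags: Dict[str, str], estimates: Dict[str, float]) -> None:
        # One call contributes to each tag it carries (feature, customer, ...).
        for name, value in tags.items():
            bucket = self._totals[(name, value)]
            for metric in METRICS:
                bucket[metric] += estimates.get(metric, 0.0)

    def totals(self, name: str, value: str) -> Dict[str, float]:
        # Read without creating an entry for tags that were never recorded.
        found = self._totals.get((name, value))
        return dict(found) if found else {metric: 0.0 for metric in METRICS}


# Illustrative events with made-up numbers.
ledger = TagLedger()
ledger.record({"feature": "search", "customer": "acme"}, {"cost_usd": 0.003, "energy_wh": 2.0})
ledger.record({"feature": "search", "customer": "globex"}, {"cost_usd": 0.001, "energy_wh": 1.0})
search_totals = ledger.totals("feature", "search")
```

With this shape, a question like "which feature is burning the most energy" becomes a lookup rather than a guess from the aggregate.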
Honest caveats
Energy, carbon, and water estimates are directional. Token counts are a proxy for compute work, not a direct power measurement. Grid carbon intensity is live where an API key is configured and a cached or static regional average where it is not. Water estimates use WUE values, which are facility averages and do not reflect real-time cooling conditions. The signal quality indicator on each event shows which source applied. These numbers suit FinOps analysis, engineering decisions, and sustainability trend reporting. They are not appropriate for standalone carbon certification, water accounting, or regulatory disclosure. The methodology is versioned and published in the SDK repository.
What we build
Vetch is the core. The rest follows from the same thesis.
Inference waste is simultaneously a cost problem, an engineering problem, and an energy problem. Making it visible is step one. Stopping the wasteful part in production is step two. Both require the same call-level metadata.
Vetch SDK
Open-source Python SDK. It instruments LLM API calls without storing content and detects waste patterns from metadata: stalled loops, RAG bloat, prompt caching gaps, excessive generation, post-completion drift, context snowballing, and invisible output burn. It stores estimated energy and carbon per call and stops confirmed stalled loops before they compound, with a choice of warn, kill, or reroute. Defaults are fail-open, so instrumentation never blocks production traffic if Vetch itself fails.
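A minimal sketch of metadata-only stall detection with a fail-open default. Everything in it is an assumption made for illustration: the `StallGuard` class, the rule that a stall means a full window of consecutive calls with near-zero output tokens, and the threshold values. The SDK's actual detection logic is not described here.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class StallGuard:
    """Flags a session once `window` consecutive calls produce almost no output."""

    def __init__(self, window: int = 5, min_output_tokens: int = 5, action: str = "warn") -> None:
        self.window = window
        self.min_output_tokens = min_output_tokens
        self.action = action  # "warn" or "kill" in this sketch
        self._history: Dict[str, Deque[int]] = defaultdict(lambda: deque(maxlen=window))

    def observe(self, session_id: str, output_tokens: int) -> str:
        """Return "allow" or the configured action. Any internal error returns "allow"."""
        try:
            count = int(output_tokens)  # bad metadata raises here, before history changes
            history = self._history[session_id]
            history.append(count)
            stalled = len(history) == self.window and all(
                tokens < self.min_output_tokens for tokens in history
            )
            return self.action if stalled else "allow"
        except Exception:
            return "allow"  # fail open: the guard must never block traffic on its own failure


# Illustrative run: three near-empty calls in a row trip the guard.
guard = StallGuard(window=3, min_output_tokens=5, action="kill")
decisions = [guard.observe("session-1", tokens) for tokens in (0, 1, 2, 50)]
```

The guard only ever sees token counts, so it can act without reading prompts or completions, and a productive call resets the pattern because the window no longer consists entirely of near-empty outputs.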
Research track
Cloud Kettle Index
A public index making data center power demand legible in human-scale units: megawatts of continuous draw expressed as kettle-boils per second. Starts with Great Britain and DC + Mid-Atlantic. The same thesis applied at grid scale.
Why now
The window where instrumentation habits form is open. Briefly.
Teams adopting LLMs in production today are making decisions about tagging, attribution, session structure, and what gets measured. Those decisions will be hard to change once systems scale. A team that instruments well now has visibility into cost and energy from the start. One that doesn’t is playing catch-up from a much harder position.
The right moment to build energy visibility into a production LLM system is when it is first being built. Not as a retrofit. Not as a compliance exercise. As a default part of how inference is operated.
That is what Vetch is designed to make easy.
There are two science fiction stories about AI infrastructure.
In one, the systems scale without limit. They consume silently. Nobody knows how much. Nobody asks. The bill is someone else’s problem, until it isn’t.
In the other, AI systems account for what they cost to run. They stop when they have nothing useful to produce. They account for themselves the way a good engineer accounts for memory: not obsessively, but honestly.
The second version is not fundamentally harder to build. It just requires the instrumentation to exist from the start.
We are at the moment where the choice is still open.
