Prismatic Labs / Vetch
Stop runaway inference.
Vetch detects stalled agents, RAG bloat, retry storms, and unattributed spend — then warns, kills, reroutes, or throttles wasteful inference before it burns budget, latency, energy, and carbon.
```bash
pip install vetch
```
The problem
Old cloud waste was idle. AI waste is active.
Provider dashboards show total spend by model. They cannot attribute cost by feature, customer, or workflow — and they cannot stop the next occurrence. Vetch can.
| Pattern | What it looks like | Status |
|---|---|---|
| Stalled agent loop | Agent iterating without meaningful output progress | ✓ Implemented |
| RAG bloat | Retrieval context overwhelming the prompt with low-signal content | ✓ Implemented |
| Prompt cache misses | Repeated prompt structures that could be cached but aren’t | ✓ Implemented |
| Unattributed spend | Inference cost not tied to a feature, customer, or workflow | ⚠ Partial |
| Retry storm | Burst of repeated failed or near-identical calls | 🔜 Planned |
| Zombie inference | Active calls past expected session completion | 🔜 Planned |
| Premium model overuse | Expensive model used for tasks a cheaper one handles | 🔜 Planned |
Get started
One import. Vetch instruments all LLM calls automatically.
Detect and stop stalls
```python
import vetch
from openai import OpenAI  # any supported provider; OpenAI shown as an example

client = OpenAI()

vetch.instrument(region="us-east-1", tags={"service": "chat-api"})
vetch.set_stall_action("kill")  # or "warn" or "reroute"

# Your existing agent loop, unchanged.
# Vetch raises StallDetected before the next wasted call.
try:
    response = client.chat.completions.create(...)
except vetch.StallDetected:
    session.clear_stall()  # human-in-the-loop, then resume
```
Attribute cost, energy, and carbon
```python
with vetch.wrap(tags={"feature": "rag-search", "customer": "acme"}) as ctx:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )

print(f"Cost: ${ctx.event['estimated_cost_usd']:.5f}")
print(f"Energy: {ctx.event['estimated_energy_wh']:.4f} Wh")
print(f"Carbon: {ctx.event['estimated_carbon_g']:.4f} gCO2e")
print(f"Quality: {ctx.event['signal_quality']}")  # "live" | "estimated"
```
Vetch does not read prompts or completions — only model, tokens, latency, region, tags, and session context.
Reroute to a cheaper model automatically
```python
vetch.set_stall_action("reroute", fallback_model="gpt-4o-mini")

# On STALL-001, Vetch silently substitutes gpt-4o-mini for the stalled call.
# Your code sees a normal response.
```
Attribution
Know exactly which feature, customer, and workflow is driving your LLM bill.
Tag every call with any keys you define. Vetch accumulates cost, energy, and carbon per tag combination and per session, turning “our LLM bill went up” into “the RAG search feature for enterprise customers is the culprit.”
Built-in tag keys
feature · customer · team · workflow · environment
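
A single call can carry built-in and custom keys together; Vetch then accrues that call's cost, energy, and carbon to every tag combination. A minimal sketch reusing the `wrap` API shown above (`client` and `query` are assumed to exist, as in the earlier examples):

```python
import vetch

with vetch.wrap(tags={
    "feature": "rag-search",      # built-in key
    "customer": "acme",           # built-in key
    "team": "search-platform",    # built-in key
    "environment": "production",  # built-in key
    "index_version": "v3",        # custom key you define
}) as ctx:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
```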
Session scope
Parent/child hierarchy, distributed propagation via W3C-compatible HTTP headers.
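
A sketch of what that could look like in code. The names `vetch.session` and `vetch.propagation_headers` are illustrative assumptions, not confirmed API:

```python
import requests
import vetch

# Hypothetical session API; names below are assumptions for illustration.
with vetch.session(name="checkout-flow"):           # parent session
    with vetch.session(name="address-validation"):  # child session
        ...  # LLM calls here roll up to both child and parent

    # Pass session context to a downstream service via
    # W3C-compatible HTTP headers (traceparent-style).
    headers = vetch.propagation_headers()  # assumed helper
    requests.post("https://orders.internal/validate", headers=headers)  # illustrative URL
```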
Required tags
vetch.require_tags(["customer"]) — flag untagged calls before they become unattributed spend.
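
A minimal sketch, assuming untagged calls are flagged as advisories rather than blocked:

```python
import vetch

vetch.require_tags(["customer"])

# This call carries a `customer` tag, so it passes.
with vetch.wrap(tags={"feature": "rag-search", "customer": "acme"}):
    ...

# A call made without a `customer` tag would be flagged
# before it shows up as unattributed spend in the audit.
```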
Observability
Works with your existing observability stack.
Fail-open
If Vetch encounters an error, your LLM call proceeds normally. Observability never blocks inference.
Fail-loud
Every log includes a signal_quality field. Stale or estimated data is clearly flagged.
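
In practice you can branch on that field before trusting a figure. A short sketch, reusing `ctx` from the attribution example above:

```python
import logging

logger = logging.getLogger("finops")

# `ctx` is the vetch.wrap() context from the attribution example.
if ctx.event["signal_quality"] != "live":
    logger.warning("Vetch figures for this call are estimated; treat as approximate")
```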
Privacy-first
Zero access to prompts or completions. Only model, tokens, latency, region, and tags are observed.
Get started
Know exactly where your inference spend is going — in 7 days.
A structured adoption path. By day 7 you’ll have attributed spend by feature, customer, and workflow — and you’ll know which waste patterns are active and what they’re costing you.
Day 1 — Instrument
One import. Every LLM call across all providers tracked automatically — no other code changes.
```python
import vetch

vetch.instrument(
    region="us-east-1",
    tags={"service": "chat-api"},
)
```
Days 1–7 — Tag and observe
Attribute spend to features, customers, and workflows. Run in warn-only mode — advisories fire without any intervention. See which patterns are active.
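
Warn-only mode uses the same action switch shown earlier:

```python
import vetch

vetch.instrument(region="us-east-1", tags={"service": "chat-api"})
vetch.set_stall_action("warn")  # advisories fire; no calls are killed or rerouted
```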
Day 7 — Audit and act
Run vetch audit for advisory events and a session token summary. Promote confident advisories to kill or reroute. Full attribution reports (spend by feature, customer, model) are in active development.
```bash
vetch audit
```
Enterprise and compliance
CSRD and SEC Scope 3 reporting
Every inference call produces per-call energy (Wh) and carbon (gCO2e) with documented uncertainty bounds — structured for EU CSRD and US SEC Scope 3 disclosure. Run vetch methodology for full provenance.
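
As a rough sketch of how per-call figures could roll up into a disclosure total (the `events` list stands in for however you collect `ctx.event` records; Vetch's own reporting output is not shown here):

```python
# Illustrative roll-up; `events` is a list of ctx.event dicts you collected.
total_wh = sum(e["estimated_energy_wh"] for e in events)
total_g = sum(e["estimated_carbon_g"] for e in events)

print(f"Scope 3 (inference): {total_g / 1000:.3f} kgCO2e over {total_wh / 1000:.3f} kWh")
```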
FinOps attribution
Attribute inference spend by feature, customer, team, workflow, and environment. Move from “the LLM bill went up” to “the RAG search feature for enterprise customers is the culprit” — in a week.
Zero prompt capture
Vetch observes metadata only — model, tokens, latency, region, and tags. No prompt or completion content ever leaves your execution environment. Air-gapped operation supported.
Ready to stop runaway inference?
Two lines of code. No prompt access. Fail-open. Start the 7-day audit or talk to us about your production setup.
