Use Case

Reduce AI application spend without weakening the product

AI apps leak margin when every request uses the strongest model, repeated context is billed as fresh input, and abusive bursts reach the vendor unchecked. UsageTap gives engineering, product, and finance one place to reduce spend while preserving customer experience.

Compress prompts before inference

Use opt-in prompt compression after call_begin to reduce input tokens. UsageTap records saved tokens and estimated dollars without storing raw prompt content.
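As a rough illustration of how saved tokens turn into estimated dollars, the sketch below multiplies the token difference by an input price. The function name, field names, and the per-million-token price are assumptions made for this example; they are not UsageTap's API or any vendor's actual rate.

```typescript
// Hypothetical helper for illustration; not part of the UsageTap SDK.
interface CompressionSavings {
  savedTokens: number;
  estimatedDollarsSaved: number;
}

function estimateCompressionSavings(
  originalInputTokens: number,
  compressedInputTokens: number,
  inputPricePerMillionTokens: number, // assumed price; take it from your vendor's rate card
): CompressionSavings {
  // A "compressed" prompt that grew should count as zero savings, not negative.
  const savedTokens = Math.max(0, originalInputTokens - compressedInputTokens);
  const estimatedDollarsSaved =
    (savedTokens / 1_000_000) * inputPricePerMillionTokens;
  return { savedTokens, estimatedDollarsSaved };
}
```

For a 12,000-token prompt compressed to 7,000 tokens at an assumed $2 per million input tokens, this estimates 5,000 tokens and about $0.01 saved on that call.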

Account for cached input tokens

Send cachedInputTokens when vendors report prompt cache hits, so UsageTap separates uncached input from discounted cache-read input in cost reporting.
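To make the split concrete, here is a minimal sketch of cache-aware cost estimation. It assumes the reported input token count already includes the cached tokens, which is how some vendors report usage; the type names and all prices are placeholders for illustration, not UsageTap's internal pricing logic.

```typescript
// Hypothetical pricing shape; rates are placeholders, not real vendor prices.
interface TokenPrices {
  inputPerMillion: number;       // full-price (uncached) input
  cachedInputPerMillion: number; // discounted cache-read input
  outputPerMillion: number;
}

interface CallUsage {
  inputTokens: number;       // assumed to include cached tokens
  cachedInputTokens: number;
  responseTokens: number;
}

function estimateCallCost(usage: CallUsage, prices: TokenPrices): number {
  // Guard against a cached count larger than the total input count.
  const cached = Math.min(usage.cachedInputTokens, usage.inputTokens);
  const uncached = usage.inputTokens - cached;
  return (
    (uncached * prices.inputPerMillion +
      cached * prices.cachedInputPerMillion +
      usage.responseTokens * prices.outputPerMillion) /
    1_000_000
  );
}
```

With 10,000 input tokens of which 8,000 are cache reads, only 2,000 tokens are billed at the full input rate in this model. Dropping cachedInputTokens from the report would price all 10,000 at the full rate and overstate the cost.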

Select the right model strength

Use entitlements and vendor hints to route routine work to standard models while reserving premium models and heavier reasoning for calls that need them.

Block abuse before it reaches the vendor

Set quotas, burst limits, and blocking policies so runaway loops, scripted abuse, or accidental spikes stop before they become LLM invoices.
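The idea behind a burst limit can be sketched with a token bucket: each customer gets a small allowance that refills at a steady rate, and calls are rejected once the allowance is empty. The class below is a hypothetical in-process illustration with made-up names and parameters. The source does not describe how UsageTap enforces limits, so treat this as a picture of the concept rather than its implementation.

```typescript
// Hypothetical token bucket; not part of the UsageTap SDK.
class TokenBucket {
  private readonly capacity: number;        // largest burst allowed
  private readonly refillPerSecond: number; // sustained call rate
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so normal traffic is not penalized
    this.lastRefillMs = nowMs;
  }

  // Returns true if the call may proceed, false if it should be blocked.
  tryTake(nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond,
    );
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per customer, created on first use.
const buckets = new Map<string, TokenBucket>();

function allowCall(customerId: string, nowMs: number): boolean {
  let bucket = buckets.get(customerId);
  if (!bucket) {
    bucket = new TokenBucket(3, 1, nowMs); // assumed: burst of 3, then 1 call/second
    buckets.set(customerId, bucket);
  }
  return bucket.tryTake(nowMs);
}
```

Passing the clock in as nowMs keeps the sketch deterministic to test. A runaway loop that fires four calls in the same millisecond gets the first three through and the fourth blocked, before any of the blocked calls would reach the vendor.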

Cost controls

Four controls, one call record

UsageTap keeps the spend-reduction levers close to the usage lifecycle. Begin the call, choose the appropriate model strength, compress when useful, report cache hits, and apply plan policy when a customer exceeds limits. Each call record then captures:

  • input tokens
  • cached input tokens
  • prompt tokens saved
  • estimated compression dollars saved
  • model used
  • customer and feature

Example flow

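// Assumes `usageTap` and `openai` are already-initialized clients.
// Open the call record and ask which model tiers this customer may use.
const begin = await usageTap.beginCall({
  customerId,
  feature: "report.generate",
  requested: { standard: true, premium: true },
});

// Reserve the premium model for entitled customers whose request needs it.
const model = begin.data.allowed.premium && needsDeepReasoning
  ? "gpt-5"
  : "gpt-5-mini";

// Opt-in compression: shrink the prompt before it is sent to the vendor.
const compressed = await usageTap.promptCompress({
  callId: begin.data.callId,
  input: reportPrompt,
});

const response = await openai.responses.create({
  model,
  input: compressed.compressedInput,
});

// Close the call record with the token counts the vendor reported.
await usageTap.endCall({
  callId: begin.data.callId,
  modelUsed: model,
  inputTokens: response.usage?.input_tokens ?? 0,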
  // The Responses API reports cache hits under input_tokens_details.
  cachedInputTokens:
    response.usage?.input_tokens_details?.cached_tokens ?? 0,
  responseTokens: response.usage?.output_tokens ?? 0,
});

What teams see

For recent calls, the dashboard shows input, cached input, and output tokens alongside model, cost, prompt tokens saved, and estimated compression dollars saved. That makes optimization visible to engineering and finance without a separate spreadsheet.

What customers feel

Customers still get the right answer. Behind the scenes, abusive spikes can be blocked, routine calls can use lower-cost models, and repeated context can benefit from cache-read pricing. Spend becomes explainable before it becomes a billing surprise.

Control AI cost before the invoice arrives

Combine prompt compression, cache-aware pricing, model strength selection, and abuse blocking in the same usage lifecycle.
