Compress prompts before inference
Use opt-in prompt compression after call_begin to reduce input tokens. UsageTap records saved tokens and estimated dollars without storing raw prompt content.
AI apps leak margin when every request uses the strongest model, repeated context is billed as fresh input, and abusive bursts reach the vendor unchecked. UsageTap gives engineering, product, and finance one place to reduce spend while preserving customer experience.
Send cachedInputTokens when vendors report prompt cache hits, so UsageTap separates uncached input from discounted cache-read input in cost reporting.
Use entitlements and vendor hints to route routine work to standard models while reserving premium models and heavier reasoning for calls that need them.
Set quotas, burst limits, and blocking policies so runaway loops, scripted abuse, or accidental spikes stop before they become LLM invoices.
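To make the burst-limit idea concrete, here is a minimal token-bucket sketch. It is an illustration only and not UsageTap's implementation or SDK: the `TokenBucket` class, the `allowCall` helper, and the limits (a burst of 5 calls, refilling at 1 call per second) are all hypothetical. The point is that a blocked call returns early and never reaches the LLM vendor.

```typescript
// Hypothetical in-process burst limiter; not part of the UsageTap SDK.
class TokenBucket {
  private readonly capacity: number;        // maximum burst size
  private readonly refillPerSecond: number; // sustained call rate
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Returns true if the call may proceed, false if it should be blocked.
  tryConsume(nowMs: number = Date.now()): boolean {
    // Refill in proportion to elapsed time, capped at the bucket capacity.
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond,
    );
    this.lastRefillMs = nowMs;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// One bucket per customer, created on first use.
const buckets = new Map<string, TokenBucket>();

function allowCall(customerId: string, nowMs: number = Date.now()): boolean {
  let bucket = buckets.get(customerId);
  if (!bucket) {
    bucket = new TokenBucket(5, 1, nowMs); // assumed limits: burst of 5, then 1 call/second
    buckets.set(customerId, bucket);
  }
  return bucket.tryConsume(nowMs);
}
```

A caller would check `allowCall(customerId)` before starting the vendor request and return a rate-limit response when it is false. A production setup would more likely enforce this in a shared store or in the usage service itself, since an in-memory map only protects a single process.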
Cost controls
UsageTap keeps the spend-reduction levers close to the usage lifecycle. Begin the call, choose the appropriate model strength, compress when useful, report cache hits, and apply plan policy when a customer exceeds limits.
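// Begin the call so UsageTap can apply entitlements for this customer.
const begin = await usageTap.beginCall({
  customerId,
  feature: "report.generate",
  requested: { standard: true, premium: true },
});

// Reserve the premium model for calls that need deeper reasoning.
const model = begin.data.allowed.premium && needsDeepReasoning
  ? "gpt-5"
  : "gpt-5-mini";

// Opt-in compression, tied to the call started above.
const compressed = await usageTap.promptCompress({
  callId: begin.data.callId,
  input: reportPrompt,
});

const response = await openai.responses.create({
  model,
  input: compressed.compressedInput,
});

// Report usage, including cache hits, so cost reporting can separate them.
await usageTap.endCall({
  callId: begin.data.callId,
  modelUsed: model,
  inputTokens: response.usage?.input_tokens ?? 0,
  // The Responses API reports cache hits under input_tokens_details.
  cachedInputTokens:
    response.usage?.input_tokens_details?.cached_tokens ?? 0,
  responseTokens: response.usage?.output_tokens ?? 0,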
});
The dashboard shows input, cached input, output, model, cost, compressed tokens saved, and estimated compression dollars saved on recent calls. That makes optimization visible to engineering and finance without a separate spreadsheet.
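The arithmetic behind a cache-aware cost figure can be sketched as follows. This is not UsageTap's pricing logic: the `ModelPrice` and `CallUsage` types, both function names, and every price are hypothetical placeholders. The sketch also assumes that `inputTokens` is the total input count including cache hits, which is how OpenAI reports `input_tokens`.

```typescript
// All names and prices here are illustrative assumptions, not UsageTap's API.
interface ModelPrice {
  inputPerMTok: number;       // USD per 1M uncached input tokens
  cachedInputPerMTok: number; // USD per 1M cache-read input tokens
  outputPerMTok: number;      // USD per 1M output tokens
}

interface CallUsage {
  inputTokens: number;       // total input tokens, including cache hits (assumed)
  cachedInputTokens: number; // portion of inputTokens served from the prompt cache
  responseTokens: number;
}

const TOKENS_PER_MILLION = 1000000;

// Bills uncached input at the full rate and cached input at the discounted rate.
function estimateCallCostUsd(usage: CallUsage, price: ModelPrice): number {
  // Clamp so a bad report can never produce a negative uncached count.
  const cached = Math.min(usage.cachedInputTokens, usage.inputTokens);
  const uncached = usage.inputTokens - cached;
  return (
    uncached * price.inputPerMTok +
    cached * price.cachedInputPerMTok +
    usage.responseTokens * price.outputPerMTok
  ) / TOKENS_PER_MILLION;
}

// Values tokens removed by compression at the uncached input rate.
function estimateCompressionSavingsUsd(tokensSaved: number, price: ModelPrice): number {
  return (tokensSaved * price.inputPerMTok) / TOKENS_PER_MILLION;
}
```

With placeholder prices of $2.00 input, $0.50 cached input, and $8.00 output per million tokens, a call with 10,000 input tokens (6,000 of them cached) and 1,000 output tokens comes to about $0.019. The same call with no cache hits would cost about $0.028.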
Customers still get the right answer, but abusive spikes can be blocked, routine calls can use lower-cost models, and repeated context can benefit from cache-read pricing. Spend becomes explainable before it becomes a billing surprise.
Combine prompt compression, cache-aware pricing, model strength selection, and abuse blocking in the same usage lifecycle.