AI API costs have fallen 95% since 2020. The cheapest capable model today costs less than 2% of what the best model cost four years ago. And yet most applications spend 3-10x more than they need to — not because models are expensive, but because the prompts are inefficient.
Here are the four techniques that produce the largest savings, in order of impact.
1. Prompt caching (40-90% reduction on repeated prompts)
If your system prompt or document context is identical across many requests, you are paying full price for it every time. Prompt caching stores that shared prefix and bills the cached tokens at roughly 90% below the normal input rate on subsequent requests that reuse it.
A 2,000-token system prompt sent 100,000 times without caching: 200 million tokens at the full input price. With caching, every request after the first reads those tokens at the discounted cache rate, for an effective cost of roughly 20 million tokens at the full input price. Enable it today.
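A minimal sketch of what enabling caching can look like. The payload shape follows Anthropic's Messages API `cache_control` field; the model name is a placeholder, and other providers expose caching differently, so treat this as illustrative rather than a definitive integration.

```python
# Sketch: marking a large, stable system prompt as cacheable.
# Field names follow Anthropic's Messages API; the model name is hypothetical.

SYSTEM_PROMPT = "You are a support assistant for Acme Corp. ..."  # ~2,000 tokens in practice

def build_request(user_message: str) -> dict:
    return {
        "model": "example-model",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Marks this block for caching; later requests that reuse
                # the identical prefix are billed at the cache read rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

request = build_request("How do I reset my password?")
```

The key point is that the cacheable prefix must be byte-identical across requests: any change to the system prompt invalidates the cache.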
2. Right-sizing model selection (50-80% reduction)
The most capable model is not always the right model. For classification tasks, entity extraction, or simple question answering, a smaller, cheaper model produces equivalent results at a fraction of the cost.
The practical test: run your 20 most common request types through a cheaper model. If quality is acceptable on 80% of them, route those types to the cheaper model and keep the rest on the larger one.
3. Output length control (20-40% reduction)
Output costs 3-5x more than input on most models. Adding “be concise” to your system prompt reduces output length by 20-40% on average, with minimal quality impact on most tasks.
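There are two levers here: the conciseness instruction and a hard cap on output tokens. A sketch of both, with a placeholder model name; `max_tokens` is a common parameter name, but the exact field varies by provider:

```python
# Sketch: controlling output cost with a conciseness instruction plus a
# hard token cap. Model name is a placeholder; parameter names vary by API.

def build_concise_request(user_message: str, max_output_tokens: int = 300) -> dict:
    return {
        "model": "example-model",  # placeholder
        "system": "Be concise. Answer in at most three sentences.",
        "max_tokens": max_output_tokens,  # hard ceiling on billed output tokens
        "messages": [{"role": "user", "content": user_message}],
    }
```

The instruction shapes the answer; the cap is the backstop that bounds worst-case spend even when the model ignores the instruction.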
4. Batch processing (50% reduction, asynchronous only)
For non-real-time workloads (document analysis, content generation, data enrichment), batch processing cuts costs in half. The tradeoff: responses arrive asynchronously, often hours later, rather than immediately.
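A sketch of preparing a batch input file, using the JSONL shape of OpenAI's Batch API (one request object per line, with a `custom_id` for matching results back to inputs). Field names follow that API; other providers' batch endpoints differ, and the model name is a placeholder:

```python
import json

# Sketch: building batch input lines in OpenAI-style JSONL format.
# One JSON object per line; custom_id ties each result to its input.

def make_batch_lines(documents: list[str]) -> list[str]:
    lines = []
    for i, doc in enumerate(documents):
        lines.append(json.dumps({
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "example-model",  # placeholder
                "messages": [
                    {"role": "system", "content": "Summarize the document."},
                    {"role": "user", "content": doc},
                ],
            },
        }))
    return lines
```

The lines are written to a file, uploaded, and submitted as one batch job; results come back asynchronously at the discounted batch rate.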
Combining techniques
These techniques compound. Prompt caching + right-sized model + concise output instructions can reduce costs by 70-80% on typical workloads.
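The compounding is easiest to see with a worked example. The savings do not simply add, because each technique acts on a different slice of the bill: caching discounts input tokens, conciseness shrinks output tokens, and the cheaper model reprices whatever remains. Every rate below is an illustrative assumption, not a quoted price:

```python
# Sketch: how the three techniques compound. All rates here are
# illustrative assumptions, not quoted prices.

def combined_savings(
    input_cost: float = 1.0,        # relative input spend before changes
    output_cost: float = 3.0,       # output priced ~3x input (per the text)
    cached_fraction: float = 0.8,   # share of input tokens served from cache
    cache_discount: float = 0.9,    # cache reads billed 90% below input price
    model_price_ratio: float = 0.4, # cheaper model at 40% of original price
    output_reduction: float = 0.3,  # "be concise" trims ~30% of output
) -> float:
    baseline = input_cost + output_cost
    # Caching discounts only the cached share of input tokens.
    new_input = input_cost * ((1 - cached_fraction)
                              + cached_fraction * (1 - cache_discount))
    # Conciseness shrinks the number of output tokens generated.
    new_output = output_cost * (1 - output_reduction)
    # The cheaper model reprices everything that remains.
    new_total = (new_input + new_output) * model_price_ratio
    return 1 - new_total / baseline

# With these assumed rates, the combined saving lands in the 70-80% range.
```

Plugging in the defaults gives a saving of about 76%, consistent with the 70-80% figure above.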
Start here: Enable prompt caching on your highest-volume use case. It requires a single configuration change and typically produces immediate cost reduction.