*Post 4 of 4 in the **Building With LLMs** series. [See all posts](blog-series-index.md). Previous post: [Tokens and Temperature, in Plain English](blog-series-3-tokens-and-temperature.md)*
---
We just finished an end-to-end audit of an LLM-powered web app — 23 prompts, 21 distinct AI calls, a mix of Anthropic and OpenAI fallbacks. It started as "let's check if we're spending too much" and turned into something more useful: a step-by-step way to think about prompts, models, and token costs without losing your mind.
This is what I'd tell a friend who's about to ship their first LLM app. None of it is theoretical. All of it came from finding actual problems in actual code — like a prompt running on the most expensive model in the catalog to do a job a 4× cheaper model handles fine, or a system prompt copy-pasted across six files that drift apart over time, or fallback API calls quietly running with no token cap because nobody set one.
If you're building anything with prompts and you care about either quality or cost, the steps below will pay back the time it takes to read them. None of them takes more than a couple of hours, and they're roughly in the order you'd want to do them.
If the vocabulary trips you up — what a token is in bytes, what temperature actually does — read [post 3 in this series](blog-series-3-tokens-and-temperature.md) first.
---
## Step 1 — Measure before you optimize
You cannot fix what you cannot see. Before you change a single prompt, write down every AI call your app makes. For each one, capture six things: the file it lives in, the model, the max-tokens limit, the temperature, the prompt file, and an estimate of how often it runs. A spreadsheet works. A markdown table works better because you can put it in your repo.
We did this and immediately found four calls in our OpenAI fallback path that had no `max_tokens` set at all. They were silently using model defaults (4,096 tokens or more), and we would never have noticed until the bill arrived. Two prompts were running on the most expensive model for tasks that didn't need it. One important call was running at the default temperature (≈1.0) for an analytical task that should have been near zero.
The audit doc became the source of truth for everything that came next. It's a couple hours of work and it's the highest-leverage thing you can do.
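If you'd rather keep the inventory next to the code, here is a minimal sketch of the same table as data. The `AiCallRecord` shape, the field names, and the two sample rows are assumptions made up for illustration, not our actual audit. The only point is that "no `max_tokens`" and "default temperature" become things a script can flag.

```typescript
// Hypothetical shape for one row of the audit table (names are illustrative).
interface AiCallRecord {
  file: string;
  model: string;
  maxTokens?: number;    // undefined means the call relies on the provider default
  temperature?: number;  // undefined means the call relies on the provider default
  promptFile: string;
  callsPerMonth: number; // a rough estimate is fine
}

// Return a human-readable list of problems for one call.
function auditIssues(call: AiCallRecord): string[] {
  const issues: string[] = [];
  if (call.maxTokens === undefined) issues.push("no max_tokens set");
  if (call.temperature === undefined) issues.push("temperature left at default");
  return issues;
}

// Illustrative sample rows, not real data.
const auditRows: AiCallRecord[] = [
  { file: "captions.ts", model: "small", maxTokens: 300, temperature: 0.5, promptFile: "caption.md", callsPerMonth: 30000 },
  { file: "fallback.ts", model: "fallback", promptFile: "extract.md", callsPerMonth: 400 },
];

// Keep only the calls that have at least one issue.
const flaggedCalls = auditRows
  .map((row) => ({ file: row.file, issues: auditIssues(row) }))
  .filter((entry) => entry.issues.length > 0);

console.log(flaggedCalls);
```

A markdown table is still the better artifact for humans. A typed list like this is only worth it if you want CI to complain when a new call shows up without its settings.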
## Step 2 — Set every dial explicitly. Never trust defaults.
Two settings matter most: `max_tokens` and `temperature`. Both have defaults. The defaults are wrong for almost every production task.
`max_tokens` defaults are usually generous (thousands of tokens). If your task only needs 500, but the model is allowed to write 4,000, you'll occasionally pay for outputs you didn't want and didn't read. Worse, fallback paths and rarely-exercised code paths inherit defaults silently. Make `max_tokens` mandatory in your codebase the same way you'd make a SQL query require a `LIMIT`.
`temperature` defaults to 1.0 on most APIs, which is great for creative writing and terrible for analysis. Classification tasks want 0.0. Validation passes want 0.2–0.3. Brand-voice generation wants around 0.4. The marketing copy your app writes can run at 0.5–0.6. The same model produces dramatically different outputs at different temperatures — and the cost of getting this wrong is invisible randomness in your product.
A simple rule: if you ever find yourself debugging a quality issue by re-running the same prompt and getting different results, the temperature is probably set too high.
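One way to make both dials mandatory is to route every call through a small settings builder. This is a sketch under assumptions: the `TaskKind` names, the `settingsFor` helper, and the model label are hypothetical, and the temperatures are simply the low end of the ranges above.

```typescript
// Hypothetical task categories, using the temperature guidance from the text.
type TaskKind = "classification" | "validation" | "brandVoice" | "marketingCopy";

const TEMPERATURE_BY_TASK: Record<TaskKind, number> = {
  classification: 0.0,
  validation: 0.2,
  brandVoice: 0.4,
  marketingCopy: 0.5,
};

// Both dials are required fields, so a call site cannot silently omit them.
interface CallSettings {
  model: string;
  maxTokens: number;
  temperature: number;
}

function settingsFor(task: TaskKind, model: string, maxTokens: number): CallSettings {
  if (!Number.isInteger(maxTokens) || maxTokens <= 0) {
    throw new Error(`maxTokens must be a positive integer, got ${maxTokens}`);
  }
  return { model, maxTokens, temperature: TEMPERATURE_BY_TASK[task] };
}

// "small-model" is a placeholder label, not a real model ID.
const csvMappingSettings = settingsFor("classification", "small-model", 500);
console.log(csvMappingSettings);
```

Because `maxTokens` is a required argument, a fallback path that forgets it fails at compile time instead of on the invoice.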
## Step 3 — Pick the cheapest model that does the job
This is where most LLM apps overspend. There's a strong instinct to reach for the smartest model "just in case." Resist it.
Modern providers offer three rough tiers. The smallest model (Haiku at Anthropic, mini variants elsewhere) is right for *classification* tasks: mapping CSV columns to fields, labeling content as on-brand or off-brand, picking which of three images to use. The middle model (Sonnet) is right for *generation* tasks where the customer will judge the quality: writing in a brand voice, extracting nuanced positioning from a website, producing a short-form post that needs to feel human. The flagship model (Opus or equivalent) is for genuinely hard reasoning — multi-document synthesis, strategic recommendations across many constraints — and you should only reach for it when you can articulate exactly what the middle model gets wrong on that task.
Three questions to ask before picking a model:
First, does weak output directly face a customer? If yes, don't go below Sonnet. A generic-sounding brand description reflects on your product, not on the model.
Second, is this classification or generation? Classification is pattern matching and the small model handles it. Generation needs the richer model.
Third, is this in a synchronous user flow, or background? Background tasks can afford a slightly slower call. Real-time UX needs the fastest model that meets the quality bar.
We found one of our prompts was running a 400-word blog generation on the flagship model. There was no quality reason for it. Downgrading to the middle model made each call roughly 3× cheaper with no measurable drop in output quality. Another prompt was using the flagship to map "Product Name" to a `name` field in a CSV import. The small model does this perfectly and costs about a fifth as much.
The rule of thumb that's stuck with me: start with the small model. Upgrade only when the quality gap is visible to a customer. Never use the flagship unless you can name the specific thing the middle model gets wrong.
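The rule of thumb is simple enough to write down as a function. This is a sketch, not our production router: `TaskProfile` and `pickTier` are hypothetical names, and the latency question (synchronous versus background) is left out because it depends on your own measurements.

```typescript
type ModelTier = "small" | "middle" | "flagship";

// Hypothetical description of a task, mirroring the questions above.
interface TaskProfile {
  customerFacing: boolean;               // does weak output directly face a customer?
  kind: "classification" | "generation";
  middleModelShortfall?: string;         // the specific thing the middle model gets wrong, if any
}

function pickTier(task: TaskProfile): ModelTier {
  // Flagship only when someone can name what the middle model gets wrong.
  if (task.middleModelShortfall) return "flagship";
  // Customer-facing or generative work stays at the middle tier or above.
  if (task.customerFacing || task.kind === "generation") return "middle";
  // Everything else starts on the small model.
  return "small";
}

const csvMappingTier = pickTier({ customerFacing: false, kind: "classification" });
const blogPostTier = pickTier({ customerFacing: true, kind: "generation" });
console.log(csvMappingTier, blogPostTier);
```

Making the shortfall a required string for the flagship tier is the useful part: the justification ends up in code review instead of in someone's head.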
## Step 4 — Cap your inputs at the source
The size of your prompt is determined by what you put into it. If you pass an entire scraped HTML page into a brand-extraction prompt, you've handed the model 30,000 tokens of mostly-irrelevant markup. The model doesn't need it. It needs the headlines, the product names, the about-page paragraphs, the calls-to-action.
We built a small data shape called `SiteBrief` that pulls out exactly the signals worth analyzing: 15 headings, 10 key sentences, 6 calls-to-action, 5 proof signals (badges, testimonials, certifications). Every prompt that consumes site data takes a `SiteBrief`, not raw HTML. The result is predictable input cost regardless of how big the customer's website is.
Apply this pattern aggressively. Anywhere a prompt receives user-generated content — a blog post body, an article, a product description — there should be an explicit cap. "First 2,000 characters." "First 30 sentences." "First N items from the list." Without caps, your worst-case cost per call is unbounded, and one viral customer with a 50-page product description will pop your budget.
The same applies to RAG retrieval. If your prompt template ends with "Here are the relevant documents: …" and the relevant-documents block is unlimited, you have no idea what a single call will cost.
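Here is a minimal sketch of the capping step. The caps are the ones quoted above; the `SiteBrief` field names and the `capBrief` and `capText` helpers are assumptions for illustration, and the extraction from HTML is not shown.

```typescript
// Field names are assumptions; only the caps come from the text.
interface SiteBrief {
  headings: string[];
  keySentences: string[];
  callsToAction: string[];
  proofSignals: string[];
}

const BRIEF_CAPS = { headings: 15, keySentences: 10, callsToAction: 6, proofSignals: 5 } as const;

// Truncate every list to its cap so input size no longer depends on the site.
function capBrief(raw: SiteBrief): SiteBrief {
  return {
    headings: raw.headings.slice(0, BRIEF_CAPS.headings),
    keySentences: raw.keySentences.slice(0, BRIEF_CAPS.keySentences),
    callsToAction: raw.callsToAction.slice(0, BRIEF_CAPS.callsToAction),
    proofSignals: raw.proofSignals.slice(0, BRIEF_CAPS.proofSignals),
  };
}

// Cap free text by character count (counts UTF-16 code units, which is close enough for budgeting).
function capText(text: string, maxChars: number = 2000): string {
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}

// Illustrative oversized input.
const rawBrief: SiteBrief = {
  headings: Array.from({ length: 40 }, (_, i) => `Heading ${i + 1}`),
  keySentences: Array.from({ length: 25 }, (_, i) => `Sentence ${i + 1}`),
  callsToAction: ["Buy now", "Start free trial"],
  proofSignals: Array.from({ length: 9 }, (_, i) => `Badge ${i + 1}`),
};

const cappedBrief = capBrief(rawBrief);
console.log(cappedBrief.headings.length, cappedBrief.proofSignals.length);
```

The same `slice` approach works for a RAG documents block: cap the number of documents and the length of each one before they reach the template.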
## Step 5 — Stop duplicating shared content across prompts
This was the most embarrassing thing we found. Six different prompts in our codebase contained the same anti-generic-phrase list — "avoid words like cutting-edge, innovative, world-class, transformative…" — and the lists had drifted apart. Some prompts banned 12 phrases, some 14, some 9. Updating the rule meant editing six files and hoping nobody missed one.
The fix is the same DRY principle you apply to code. Pull shared content into a single source of truth, then inject it into every prompt that needs it via a template variable. We use a `{{anti_generic_rules}}` placeholder that gets filled from one helper function. When marketing finds a new overused phrase, it's a one-file change.
The same pattern applied to brand-voice rendering. Six variables — tone, do-use, don't-use, style notes, guardrails, things to avoid — were being interpolated identically in five different prompts. We collapsed them into a single `renderVoiceProfile(brand)` helper and injected one `{{voice_profile_block}}` variable in every consumer prompt. Less code, less drift, slightly fewer tokens.
The token savings here are real but secondary. The maintainability win is the main prize.
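A sketch of the single-source-of-truth pattern, assuming a plain `{{name}}` placeholder syntax. The `fillTemplate` and `antiGenericRules` helpers are hypothetical stand-ins, and the phrase list is just the four examples quoted above.

```typescript
// Single source of truth for the banned-phrase list.
const ANTI_GENERIC_PHRASES = ["cutting-edge", "innovative", "world-class", "transformative"];

function antiGenericRules(): string {
  return `Avoid words like: ${ANTI_GENERIC_PHRASES.join(", ")}.`;
}

// Replace {{name}} placeholders. Unknown names throw, so a typo cannot ship silently.
function fillTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    const value = vars[key];
    if (value === undefined) throw new Error(`Missing template variable: ${key}`);
    return value;
  });
}

const captionPrompt = fillTemplate(
  "Write a caption for the brand.\n{{anti_generic_rules}}",
  { anti_generic_rules: antiGenericRules() },
);

console.log(captionPrompt);
```

Every prompt that needs the rules uses the same placeholder, so adding a phrase to `ANTI_GENERIC_PHRASES` updates all of them at once.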
## Step 6 — Use tool-use schemas, not prose JSON descriptions
If your prompt asks the model to return JSON, there are two ways to do it. The wrong way is to describe the JSON shape in prose: "Return a JSON object with fields name (string), categories (array of strings), score (number 1–5)." The right way is to use the provider's structured output feature — Anthropic calls it tool-use, OpenAI calls it function calling — and pass the schema as a formal constraint.
The wrong way fails ~5–10% of the time in subtle ways. The model emits text before the JSON. Or markdown around the JSON. Or it decides the score should be the string "high" instead of a number. Each failure means a retry, which means double the tokens and double the latency. Each near-miss means parsing logic in your code that gets brittle over time.
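The right way fails far less often. The schema reaches the API as a formal definition instead of a sentence the model has to interpret, and where the provider offers strict schema enforcement, your downstream code can trust the shape. The schema itself still counts toward input tokens, but the prose description disappears from the prompt and so do most of the retries.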
This was the single biggest robustness win in our audit. Several of our extraction prompts described tool calls in prose ("the LLM should emit a tool call to extract_brand_claims with…"). Migrating them to actual tool-use definitions cut both the input tokens and the silent retry rate.
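For concreteness, here is roughly what a tool definition looks like as a request payload, following the Anthropic Messages API shape as I understand it (`tools`, `input_schema`, `tool_choice`). The fields reuse the name, categories, and score example from above; the model ID is a placeholder and `isBrandClaims` is a hypothetical guard. No SDK call is made, so treat it as a sketch of the payload rather than a working client.

```typescript
// Tool definition: the JSON shape expressed as a schema instead of prose.
const extractBrandClaimsTool = {
  name: "extract_brand_claims",
  description: "Record the brand name, its categories, and a 1-5 score.",
  input_schema: {
    type: "object",
    properties: {
      name: { type: "string" },
      categories: { type: "array", items: { type: "string" } },
      score: { type: "number", minimum: 1, maximum: 5 },
    },
    required: ["name", "categories", "score"],
  },
};

// Request body sketch. "your-model-id" is a placeholder, not a real model name.
const toolRequestBody = {
  model: "your-model-id",
  max_tokens: 500,
  temperature: 0,
  tools: [extractBrandClaimsTool],
  tool_choice: { type: "tool", name: "extract_brand_claims" }, // force this tool
  messages: [{ role: "user", content: "Extract the claims from this brief: ..." }],
};

interface BrandClaims {
  name: string;
  categories: string[];
  score: number;
}

// Cheap runtime guard for the tool input, in case enforcement is not strict.
function isBrandClaims(value: unknown): value is BrandClaims {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const categories = record["categories"];
  const score = record["score"];
  return (
    typeof record["name"] === "string" &&
    Array.isArray(categories) &&
    categories.every((c) => typeof c === "string") &&
    typeof score === "number" &&
    score >= 1 &&
    score <= 5
  );
}

console.log(toolRequestBody.tools.length, isBrandClaims({ name: "Acme", categories: ["shoes"], score: 4 }));
```

The guard is deliberately small. With a schema in place it should almost never fire, which is exactly why it is cheap to keep.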
## Step 7 — Cache the big system prompts
Anthropic and others now offer prompt caching. The idea: when you send the same long system prompt over and over, the provider stores it server-side and reuses it on subsequent calls, charging roughly 10% of the normal input cost on the cached portion. Minimum cacheable size is around 1,024 tokens.
This is one line of code per call. Most apps don't use it. You should.
The catch is the minimum size — small prompts can't be cached. So caching is most valuable on your largest, most-repeated system prompts. In our app, the brand-extraction system prompt sits around 1,400 tokens and runs on every preview. Caching saves roughly $0.004 per call, which sounds like nothing — until you multiply by thousands of previews per month.
Don't try to cache everything. Look at your audit table from Step 1, sort by `system prompt size × call frequency`, and add caching to the top two or three. That's where the savings are.
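The arithmetic and the ranking both fit in a few lines. Two loud assumptions: the input price of $3 per million tokens is a figure I picked because it reproduces the roughly $0.004 saving above, so substitute your provider's current price, and the sketch ignores any surcharge on the call that first writes the cache. The `cache_control` marker follows the Anthropic Messages API shape as I understand it, and the helper names are hypothetical.

```typescript
const CACHED_READ_FRACTION = 0.10; // cached input billed at roughly 10% of normal (from the text)
const MIN_CACHEABLE_TOKENS = 1024; // approximate minimum size (from the text)

// Savings on one cache-hit call, in USD.
function cachingSavingsPerCall(systemPromptTokens: number, inputPricePerMillionUsd: number): number {
  const fullCost = (systemPromptTokens / 1e6) * inputPricePerMillionUsd;
  return fullCost * (1 - CACHED_READ_FRACTION);
}

// 1,400-token system prompt at an ASSUMED $3 per million input tokens.
const savingsPerCallUsd = cachingSavingsPerCall(1400, 3);
console.log(savingsPerCallUsd.toFixed(4)); // about 0.0038

interface CacheCandidate {
  promptFile: string;
  systemPromptTokens: number;
  callsPerMonth: number;
}

// Rank by system prompt size times call frequency, skipping prompts too small to cache.
function topCachingCandidates(calls: CacheCandidate[], limit: number = 3): CacheCandidate[] {
  return calls
    .filter((c) => c.systemPromptTokens >= MIN_CACHEABLE_TOKENS)
    .sort((a, b) => b.systemPromptTokens * b.callsPerMonth - a.systemPromptTokens * a.callsPerMonth)
    .slice(0, limit);
}

// The "one line": mark the system block as cacheable.
const cachedSystemBlock = [
  { type: "text", text: "...long system prompt...", cache_control: { type: "ephemeral" } },
];
console.log(cachedSystemBlock.length);
```

At a few thousand previews a month the per-call saving adds up to only a few dollars for this one prompt, which is why the ranking matters more than any single entry.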
## Step 8 — Make conditional sections actually conditional
A lot of prompts have template sections that are theoretically optional. "Active campaigns: {{campaigns_block}}." "Product catalog: {{products_block}}." When the brand has no campaigns or no products, what does your code render in those slots?
If the answer is `(none)` or an empty string left in place, you're paying tokens for nothing. The model has to read those headings and figure out they're empty. That's overhead on every call, multiplied by every brand without that data, multiplied by every call you make per brand per month.
Audit your template builders. Make sure empty sections are *dropped*, not rendered with placeholder text. The pattern you want is `if (block) { append } else { skip }`, not `append(block || "(none)")`. Small per call, large at scale.
This is also a quality win: an empty section creates noise the model has to ignore, and models aren't perfect at ignoring noise.
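A sketch of the drop-don't-render pattern, with a hypothetical `PromptSection` type and `renderSections` helper:

```typescript
// One optional block of a prompt template (hypothetical shape).
interface PromptSection {
  heading: string;
  body?: string | null;
}

// Render only the sections that have content; empty ones are skipped, heading and all.
function renderSections(sections: PromptSection[]): string {
  const rendered: string[] = [];
  for (const section of sections) {
    const body = section.body;
    if (typeof body === "string" && body.trim().length > 0) {
      rendered.push(`${section.heading}:\n${body.trim()}`);
    }
  }
  return rendered.join("\n\n");
}

const brandContext = renderSections([
  { heading: "Active campaigns", body: "" },
  { heading: "Product catalog", body: "Wool sneakers, merino socks" },
  { heading: "Upcoming events", body: null },
]);

console.log(brandContext);
```

Only the product catalog survives here. The other two headings never reach the model, so there is nothing for it to read or to ignore.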
## Step 9 — Split multi-concern prompts into focused passes
We had a website-audit prompt that did everything in one call: messaging audit, SEO audit, per-page summaries, recommendations. It was over 1,600 words long and ran on the most expensive model because one part of the work needed it.
The problem with bundled prompts is that one setting has to fit all the concerns. Messaging audit is a generative judgment task that wants temperature 0.3 and the middle model. SEO audit is structured classification that wants temperature 0 and the small model. By bundling them, we were running the SEO half on a model that was 4× more expensive than necessary.
Splitting it into two calls cut total cost by about 40% and improved quality on both halves. Each call now uses the right model and the right temperature. This is the case for breaking up your big prompts: not just to save tokens, but to use the right tool for each piece of work.
There's a counter-pull here. More calls means more latency (especially if they're sequential) and more orchestration code. Don't split for its own sake. Split when one prompt is doing two genuinely different jobs, when those jobs would prefer different settings, or when one piece of the bundle is forcing you to use a more expensive model than the rest needs.
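In code, the split mostly shows up as configuration: each pass carries its own model tier and temperature. The sketch below uses the two passes described above. The `AuditPass` shape, the `maxTokens` caps, and the instruction strings are illustrative assumptions, and the actual API call is left out.

```typescript
type PassTier = "small" | "middle" | "flagship";

// Hypothetical per-pass settings.
interface AuditPass {
  name: string;
  tier: PassTier;
  temperature: number;
  maxTokens: number; // illustrative caps, not figures from our audit
  instructions: string;
}

const websiteAuditPasses: AuditPass[] = [
  { name: "messaging_audit", tier: "middle", temperature: 0.3, maxTokens: 1200, instructions: "Judge the clarity and specificity of the site's messaging." },
  { name: "seo_audit", tier: "small", temperature: 0, maxTokens: 600, instructions: "Classify each SEO check as pass or fail." },
];

// Build one request per pass; every pass receives the same capped site summary.
function buildPassRequest(pass: AuditPass, siteSummary: string) {
  return {
    pass: pass.name,
    tier: pass.tier,
    temperature: pass.temperature,
    max_tokens: pass.maxTokens,
    prompt: `${pass.instructions}\n\n${siteSummary}`,
  };
}

const passRequests = websiteAuditPasses.map((p) => buildPassRequest(p, "(capped site brief goes here)"));
console.log(passRequests.map((r) => `${r.pass}: ${r.tier} @ ${r.temperature}`));
```

If the passes don't depend on each other, sending them concurrently keeps the latency close to that of the slower call rather than the sum of both.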
## Step 10 — Validators should diff, not regenerate
This is a subtle pattern that took us a while to see. We had a "generate then validate" structure in several places — produce a brand profile, then run a second LLM call to fix anything generic in it. Sensible design.
But our validators were re-receiving the entire input: the original site text, the generated profile, the schema description, all the rules. The validator was, in effect, re-doing the whole job from scratch with a small correction overlay.
The cleaner pattern is to have validators *diff*. The validator gets the candidate output and a brief "here's what could be wrong" rubric. It identifies weak spots and rewrites only those. It doesn't need the original input. It doesn't need the schema explanation again. The result is the same correction quality at roughly half the input tokens.
If you have any "generate then validate" flows in your app, this is one of the highest-leverage patterns to apply. Validation passes are often 30–40% of pipeline cost. Cutting their input in half is a meaningful saving.
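Here is a sketch of what "diff, don't regenerate" can look like. The patch format (`original` and `replacement` pairs) and both helper names are assumptions. The validator's reply is mocked, since the point is what goes in and how little comes back.

```typescript
// One targeted fix returned by the validator (hypothetical format).
interface Patch {
  original: string;    // exact weak phrase found in the candidate
  replacement: string; // the rewrite for that phrase only
}

// The validator sees the candidate and a short rubric. No site text, no schema.
function buildValidatorPrompt(candidate: string, rubric: string[]): string {
  const rubricLines = rubric.map((item) => `- ${item}`).join("\n");
  return [
    "Review the text below against this rubric:",
    rubricLines,
    "Return only the phrases that fail, each with a replacement.",
    "",
    candidate,
  ].join("\n");
}

// Apply the patches to the candidate; everything else stays untouched.
function applyPatches(candidate: string, patches: Patch[]): string {
  return patches.reduce(
    (text, patch) => (patch.original.length === 0 ? text : text.split(patch.original).join(patch.replacement)),
    candidate,
  );
}

const candidateProfile = "We are an innovative brand. We make wool sneakers.";
const validatorPrompt = buildValidatorPrompt(candidateProfile, ["No generic adjectives", "Claims must be specific"]);

// Mocked validator reply, for illustration only.
const mockedPatches: Patch[] = [{ original: "an innovative brand", replacement: "a footwear company" }];
const correctedProfile = applyPatches(candidateProfile, mockedPatches);

console.log(validatorPrompt.length, correctedProfile);
```

Asking for patches also shrinks the output side: the validator returns a few phrases instead of a second copy of the whole profile.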
## Step 11 — Quality moves that cost a few tokens but earn them back
Most of the steps so far are about saving tokens. A handful of small quality moves cost a few tokens and earn them back through better outputs and fewer retries.
The first is **contrastive examples**. If your prompt says "avoid generic phrases," show the model what generic looks like *and* what good looks like. "BAD: 'We're an innovative, cutting-edge brand.' BETTER: 'We've made carbon-negative wool sneakers since 2016.'" Two contrastive examples typically improve adherence by more than enough to justify the extra 50 tokens.
The second is **explicit decision criteria**. Instead of "judge whether the brand is on-tone," tell the model how to judge: "Score 1–5 on (a) specificity, (b) consistency-with-voice, (c) presence-of-banned-phrases. Reject if any score is below 3." Models do better with rubrics than with vibes.
The third is **single-line personas**. "You are a brand strategist." is enough. Multi-sentence personas with credentials and biographies don't measurably improve output. They cost tokens and add nothing.
The fourth is **negative examples for stylistic tasks**. For caption generation or voice rewriting, showing one example of bad output ("Don't write like this: …") is a stronger steer than five rules.
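The rubric from the second point also gives your own code something concrete to act on. A minimal sketch, assuming the model returns the three scores as numbers; the `RubricScores` field names and the `passesRubric` helper are hypothetical.

```typescript
// Scores the model is asked to return, each on a 1-5 scale.
interface RubricScores {
  specificity: number;
  consistencyWithVoice: number;
  bannedPhrases: number; // higher means fewer banned phrases present
}

// Reject if any single score falls below the minimum (3, per the rubric above).
function passesRubric(scores: RubricScores, minimum: number = 3): boolean {
  return (
    scores.specificity >= minimum &&
    scores.consistencyWithVoice >= minimum &&
    scores.bannedPhrases >= minimum
  );
}

const strongCaption = passesRubric({ specificity: 4, consistencyWithVoice: 5, bannedPhrases: 4 });
const genericCaption = passesRubric({ specificity: 2, consistencyWithVoice: 4, bannedPhrases: 5 });
console.log(strongCaption, genericCaption);
```

The accept or reject decision lives in code, not in the prompt, so changing the threshold doesn't require touching the wording the model sees.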
## Step 12 — Make this a habit, not a one-off
The audit document we built lives in our repo. It gets reviewed when we add a new AI call. It's the place we record decisions like "we tried Sonnet for X, the quality wasn't worth the cost, we kept Haiku." It's the place a new engineer learns why a particular prompt is on a particular model.
This matters because LLM costs are easy to forget. Every new feature adds prompts. Every prompt drifts. Every change in model pricing moves the math. If you don't have a place where this stuff is written down, you'll re-learn the same lessons every quarter and your costs will quietly creep up.
Once a quarter or so, re-run the audit. New models will have come out. Old prompts will have changed. Your traffic mix will have shifted. The numbers will be different. Update the doc. Make the changes the doc tells you to make.
### The same-day re-audit
There's also a faster rhythm worth running, separate from the quarterly cadence. **When you ship a batch of changes from the audit — even just three or four items in a sprint — re-audit immediately, not weeks later.**
We learned this on this exact work. The first audit pass listed 12 items. The team closed four of them in two days. We ran the second audit the same day the cleanup finished. By the time the dust settled, the doc was already current — no decay, no re-discovery, no "wait, did we actually ship that or just talk about it?"
The principle is simple: memory decay is the enemy of follow-through. The same-day re-audit catches things while everyone still remembers why they did them, what tradeoffs they made, and what they meant to come back to. The quarterly cadence is for *drift detection*. The same-day cadence is for *closing what you just shipped*. Run both — they catch different things.
The mechanic itself is small. After any sprint of audit-driven work, take 30 minutes to:
1. Walk the open items list against the diff that just shipped.
2. Flip what's done from open to done in the doc.
3. Note any "almost done" partials — be honest about them — so they don't get forgotten.
4. Add anything *new* the sprint exposed (often the most valuable part — the work surfaces issues you didn't know existed).
Thirty minutes of bookkeeping makes the difference between "we have an audit doc" and "we have an audit doc that's actually useful." The doc only earns its keep when someone trusts it enough to act on it. A doc that lags reality stops being trusted within a quarter or two; a doc that stays current becomes the place people check first.
---
## A short word on what this all costs you not to do
We did the math for our app at modest scale — about 1,000 active brands, 30 captions per brand per month — which works out to 30,000 generation calls per month for that one workflow alone. The difference between running those captions on the right model and the wrong model was roughly $300 per month. Multiplied by ten workflows in a real app, multiplied by 12 months, that's a meaningful number.
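Spelled out, the arithmetic behind that paragraph looks like this. Every input comes from the figures above. The only assumption is that all ten workflows show a gap of the same size, which is a simplification.

```typescript
// Inputs from the text.
const activeBrands = 1000;
const captionsPerBrandPerMonth = 30;
const monthlyDifferenceUsd = 300; // right model vs wrong model, one workflow
const workflowCount = 10;         // ASSUMES every workflow has a similar gap
const monthsPerYear = 12;

// Derived figures.
const captionCallsPerMonth = activeBrands * captionsPerBrandPerMonth;
const differencePerCallUsd = monthlyDifferenceUsd / captionCallsPerMonth;
const annualDifferenceUsd = monthlyDifferenceUsd * workflowCount * monthsPerYear;

console.log(captionCallsPerMonth);  // 30000 calls per month
console.log(differencePerCallUsd);  // 0.01 USD per call
console.log(annualDifferenceUsd);   // 36000 USD per year
```

A cent per call is invisible in any single request, which is why these gaps survive until someone multiplies them out.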
But the cost isn't only in dollars. It's in latency (the wrong model is often slower), in variance (high temperature on analytical tasks produces inconsistent UX), in maintainability (six copies of the same banned-phrase list will drift), and in the long tail of weird customer reports you can't reproduce because the model gave that one user a different answer.
Spending a couple of hours on the audit and a week on the highest-leverage fixes pays for itself within the first month at any non-trivial scale.
---
## A condensed checklist you can hand to your team
If you skimmed this post and want the short version, here it is in one place. None of these items takes more than a couple of hours to act on, and most are immediate.

- Write down every AI call your app makes today.
- Set `max_tokens` on every call.
- Set `temperature` on every call.
- Pick the cheapest model that gets the job done.
- Cap your inputs at the source.
- Pull shared content into single sources of truth.
- Use tool-use schemas instead of prose JSON descriptions.
- Cache the system prompts that are big enough and called often enough to matter.
- Make sure empty template sections are actually dropped, not rendered as `(none)`.
- Split multi-concern prompts when one piece is forcing you to over-pay on model cost.
- Make your validators diff rather than regenerate.
- Add contrastive examples to high-stakes prompts.
- Keep a living audit document in your repo and review it when you add anything new.
Each item on its own is small. Done together, they routinely cut LLM costs in half and improve quality at the same time. The main reason most teams don't do them is that nobody assigned the audit, not that the work is hard.
Pick one to do this week. Start with the audit document. The rest will follow from it.
---
## End of the series
That's the end of the **Building With LLMs** series. If you want to revisit any of it:
- [Post 1: Should You Build a SaaS in 2026?](blog-series-1-should-you-build.md) — the strategic question, plus eight habits for using AI tools well
- [Post 2: Buy vs. Build After AI](blog-series-2-buy-vs-build.md) — what counts as commodity now, what still defends a SaaS
- [Post 3: Tokens and Temperature, in Plain English](blog-series-3-tokens-and-temperature.md) — the two dials you need to understand
- [Post 4: This post] — the twelve-step playbook
The series exists because the audit it's based on was useful enough to me that I wanted other teams to have the same reference. Take what you need, leave what you don't.
← Previous: [Tokens and Temperature, in Plain English](blog-series-3-tokens-and-temperature.md) · [Series index](blog-series-index.md)
May 2, 2026
Building With LLMs — A 4-Post Series · Part 4
Twelve Steps to a Cheaper, Better LLM App