May 17, 2026

From $2,400 to $680: Real Patterns for Claude Cost Control



If you've spent any time with Claude Code or the Anthropic API in 2026, you've probably had this experience: you start what feels like a simple task, the agent starts "exploring," twenty minutes later it's compacting context for the third time, and your monthly subscription is gone before you've shipped a single feature. Or, on the API side, you look at the invoice at end of month and the number is two or three times what you modeled.

Token waste in agentic Claude usage isn't a mystery. It comes from two distinct problems, and they have different fixes.

Structural waste is what every call costs in steady state: large system prompts re-sent on every request, oversized tool outputs round-tripped through the model, the most expensive model running tasks the cheapest could handle.

Behavioral waste is what happens when the agent loses the plot: loops, re-exploration, vague autonomy, lossy context compaction, twenty minutes of unbounded thinking before someone hits Escape.

Most teams overspend on both at once. The good news is that the fixes are well-understood at this point — production teams routinely report 60–80% reductions in spend with no quality loss when they take this seriously. One six-person dev team went from $2,400/month to $680/month — a 72% drop through caching, budgets, and model switching.

Here's what actually moves the needle, in rough order of impact.


1. Prompt caching — the single biggest lever

This is the optimization that pays for itself on day one. You mark the stable prefix of your prompt with cache_control, and subsequent calls hitting the same prefix pay roughly 10% of the normal input cost for cache reads. Production workloads regularly see 60–90% reductions in input cost when caching is implemented well — or nothing at all when it's implemented poorly.

What to cache: your system prompt, tool definitions, large reference documents, few-shot examples — anything that doesn't change between calls. Put variable content (the user's actual question, dynamic context) after the cache breakpoint.

The killer gotcha: cache hits require the prefix to be byte-identical. A timestamp at the top of your system prompt — even just the current date — will silently kill your hit rate. Same with any kind of session ID, user ID, or "personalization token" injected at the top of the prompt.
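One way to guard against this, sketched under assumptions: keep volatile values in the user turn, after the cache breakpoint, and log a fingerprint of the cached text so drift shows up quickly. The helper names `build_request_parts` and `prefix_fingerprint` are hypothetical, not part of the Anthropic SDK.

```python
import hashlib
from datetime import date


def build_request_parts(stable_prompt: str, user_query: str, today: date) -> dict:
    """Keep volatile values out of the cached prefix.

    The stable prompt goes into the cached system block untouched; the date
    travels with the user turn, which sits after the cache breakpoint.
    """
    system = [
        {
            "type": "text",
            "text": stable_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ]
    user_text = f"Today's date: {today.isoformat()}\n\n{user_query}"
    return {
        "system": system,
        "messages": [{"role": "user", "content": user_text}],
    }


def prefix_fingerprint(system_blocks: list[dict]) -> str:
    """Hash the cached text so a drifting prefix is visible in logs."""
    joined = "\x00".join(block["text"] for block in system_blocks)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]
```

If the fingerprint changes between requests that should share a prefix, something dynamic has leaked into the cached block.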

Measure it. Every API response carries cache_read_input_tokens and cache_creation_input_tokens. Your cache hit rate is:

cache_read_input_tokens / (cache_read_input_tokens + input_tokens)

Below 60% on a production workload means there's headroom. Below 30% means you probably have a stability problem in your prefix and should audit what's going in.

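# Bare-bones example with the Anthropic SDK
# LARGE_STABLE_SYSTEM_PROMPT and user_query are placeholders you define elsewhere.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
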
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_STABLE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)

usage = response.usage
hit_rate = usage.cache_read_input_tokens / (
    usage.cache_read_input_tokens + usage.input_tokens
)

Caches expire silently after the TTL window (5 minutes by default; a 1-hour TTL is available at a higher cache-write price). For bursty traffic, monitor cache_creation_input_tokens: if it's spiking, your bursts are landing outside the window and you're paying the write premium repeatedly.
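A single response tells you little, so it helps to aggregate. The sketch below accumulates the usage fields across many responses and applies the 60% and 30% thresholds from this section. `CacheStats` and its verdict strings are hypothetical names for illustration; the hit-rate formula is the one given above.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running totals of the usage fields reported on each API response."""

    cache_read: int = 0
    cache_write: int = 0
    uncached_input: int = 0

    def record(
        self,
        cache_read_input_tokens: int,
        cache_creation_input_tokens: int,
        input_tokens: int,
    ) -> None:
        self.cache_read += cache_read_input_tokens
        self.cache_write += cache_creation_input_tokens
        self.uncached_input += input_tokens

    @property
    def hit_rate(self) -> float:
        # Same formula as above: reads / (reads + uncached input).
        denominator = self.cache_read + self.uncached_input
        return self.cache_read / denominator if denominator else 0.0

    def verdict(self) -> str:
        rate = self.hit_rate
        if rate < 0.30:
            return "audit prefix stability"
        if rate < 0.60:
            return "headroom"
        return "healthy"
```

In production you would call `record(...)` with the three fields from `response.usage` and chart `cache_write` over time to spot bursts that keep landing outside the TTL window.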


2. Model routing — stop running everything on Opus

The default of "Opus for everything" is the most expensive setup possible and rarely justified. A practical split:

  • Haiku for code search, summarization, terminal output curation, formatting, simple extraction, and "is this thing X or Y" classification

  • Sonnet for the bulk of real work — it delivers most of Opus's quality at a fraction of the cost

  • Opus for architecture decisions, hard debugging, and orchestration of subagents

In Claude Code, you can specify the model per subagent, so the main agent on Opus can delegate implementation work to Sonnet and grep-style searches to Haiku. The pattern that consistently wins:

Opus (orchestrator)
  ├── Sonnet (implementation)
  ├── Sonnet (test writing)
  └── Haiku (code search, file inspection, lint fixes)

If you're running pure Opus for everything in Claude Code, you're probably overspending by 3–5x for marginal quality gains on the easy tasks.
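On the API side, the same split can be expressed as a lookup table. This is a minimal sketch: the task labels are hypothetical, and it returns tier names rather than model IDs because the exact IDs depend on which models your account has access to.

```python
# Task labels are illustrative; map them to whatever categories your app uses.
TIER_BY_TASK = {
    "code_search": "haiku",
    "summarization": "haiku",
    "classification": "haiku",
    "formatting": "haiku",
    "extraction": "haiku",
    "implementation": "sonnet",
    "test_writing": "sonnet",
    "architecture": "opus",
    "hard_debugging": "opus",
    "orchestration": "opus",
}


def pick_tier(task_type: str) -> str:
    """Return the cheapest tier expected to handle the task.

    Unknown task types fall back to the middle tier, on the assumption
    that it covers the bulk of real work.
    """
    return TIER_BY_TASK.get(task_type, "sonnet")
```

The point of the explicit table is that escalating to the most expensive tier becomes a deliberate choice rather than the default.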


3. The agent loop problem

This is the one most people feel viscerally. You ask the agent to do a thing. It "researches." It tries. It fails. It compacts the conversation. It re-reads the same files. It tries again. Twenty minutes and 70,000 tokens later, nothing has happened and you're hitting Escape.

A widely reported failure mode involves unbounded thinking loops after compaction: 21 minutes, 72,900 tokens, zero output, no tool calls. The user has to manually interrupt.

The root cause is that the tool is stateless and eager to please — when those two traits collide with lossy compression of accumulated context, you get the loop.

Concrete things that prevent it:

Session hygiene

Treat sessions as disposable — one bug or feature per session. The 30-minute rule: if a session has run longer than 30 minutes, you're carrying garbage context from failed attempts and error logs. /clear or restart. The useful heuristic: if you'd open a new document for this work, you should /clear in Claude Code.

Plan-then-execute, not "go solve it"

Vague autonomy is where loops happen. Front-load the spec, then have the agent write a plan to a file, then execute task-by-task.

Phase 1:  Create a plan in .ai/plan.md and task files in .ai/tasks/.
          Do not implement anything yet. Stop after the plan.

Phase 2:  Implement tasks from .ai/tasks/ sequentially, one at a time.
          Run smoke tests after each. Stop if a test fails twice.

This converts "long reasoning loop with vague goal" into "series of short focused operations with explicit checkpoints." The token savings are dramatic — usually 3–5x on multi-file features.
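If you drive the agent from your own harness rather than by hand, the Phase 2 checkpoint logic can be sketched as a small loop. Everything here is an assumption for illustration: `implement` stands in for "ask the agent to do one task" and `smoke_test` for "run the smoke suite and return whether it passed".

```python
from typing import Callable, Iterable


def run_tasks(
    tasks: Iterable[str],
    implement: Callable[[str], None],
    smoke_test: Callable[[], bool],
    max_failures: int = 2,
) -> dict:
    """Run tasks one at a time, stopping when a task fails its smoke test
    max_failures times instead of retrying indefinitely."""
    completed: list[str] = []
    for task in tasks:
        failures = 0
        while True:
            implement(task)
            if smoke_test():
                completed.append(task)
                break
            failures += 1
            if failures >= max_failures:
                # Hand control back to a human rather than burning tokens.
                return {"completed": completed, "stopped_at": task}
    return {"completed": completed, "stopped_at": None}
```

The hard stop is the part that matters: a bounded number of attempts per task puts a ceiling on how much a single bad task can cost.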

Tight subagent briefs

❌ "Explore the codebase and find anything auth-related."

✅ "Read only src/api/auth/ and summarize the auth endpoints."

The first burns tokens on open-ended exploration. The second is bounded, cheap, and returns something useful. No-op subagent spawns — agents that run, find nothing, and report back — still cost you the spawn-up overhead and the wasted exploration. Vague scope means expensive exploration, every time.

Constraints over processes

Don't tell the agent how to think. Don't prescribe ReAct, chain-of-thought, or "use this framework." Instead, tell it what not to do:

  • "Verify the file exists before editing"

  • "Don't guess imports — read them from the file"

  • "If a test fails twice with the same error, stop and report"

  • "Don't refactor unrelated code while you're there"

Constraints compose. Processes fight the model's training.

CLAUDE.md and .claudeignore

These two files alone fix a huge fraction of behavioral waste. CLAUDE.md loads once per session and gives the agent the project conventions it would otherwise have to be told (and re-told) over and over. .claudeignore stops the agent from reading bloat — node_modules/, build artifacts, lock files, generated code — that has been shown to burn 55,000 tokens before you've typed a word.

Full working examples of both are in the bundle at the end of this post.


4. Tool output is silently expensive

This one's underrated. Every tool result goes back into the model's context — and you pay for it. One developer watched Claude Code feed 108,894 bytes of seq 1 20000 back into its own context window — 20,000 integers, no errors, no signal, just counting. The system still had to tokenize all of it, send it back to the model, and bill for it.

The absurd version makes the headline. The realistic version is what quietly drains your budget.

A real one, from our own logs

Building the SEO audit feature for Narratr, we needed the agent to analyze meta titles, descriptions, and image alt text across a merchant's product catalog. The first implementation looked reasonable on paper: query the Shopify GraphQL API for products, score each one against the rubric, return a report.

The query was the problem. The agent — left to its own devices — wrote a "let's see what's there" query that pulled every field on the product type. Full HTML body content, every variant with its pricing and inventory, every image with metadata, all the metafields, the complete SEO object, publication status across channels. On the test merchant with ~200 products, one fetch was about 80,000 tokens of JSON.

Then the cycle started:

| Step | What happened | Tokens |
|------|---------------|--------|
| 1 | Initial fetch: all fields, 200 products | ~80K |
| 2 | TypeScript error on the scorer → re-fetch "to verify the response shape" | ~80K |
| 3 | Pagination cursor handled wrong → fetch the next page | ~60K |
| 4 | Scoring rubric needed reworking → re-test against fresh data | ~80K |
| 5 | Image altText wasn't matching the expected structure → fetch again to inspect | ~60K |
| 6 | One more round after a Prisma schema tweak | ~80K |

About 18 minutes and ~440,000 tokens later, we had a working scorer. Roughly 80% of those tokens were the same product data, in slightly different shapes, that the agent had already seen two or three times.

The fix was a four-line change to the GraphQL query:

{
  products(first: 50) {
    edges {
      node {
        id handle title
        seo { title description }
        featuredImage { altText }
      }
    }
  }
}

Per-product payload dropped from ~2KB to ~200 bytes. 10x reduction, immediately. And — this is the part that mattered more than the raw math — the agent stopped re-fetching, because the smaller payload sat comfortably in context and there was no longer any reason to "re-verify the shape." The loop ended because the loop's fuel ran out.

The principle

Logs, test runs, ps listings, build spam, progress bars, repeated warnings, full API responses — a depressing amount of tool output is decorative confetti that costs premium model rates to process. When you build a tool or function the agent calls, default to minimal projection. Return what's actually needed for the immediate task. Give the agent the option to ask for more detail on specific items, but never return the full firehose by default.

For us, the meta-lesson was that the tool definition itself is a prompt-engineering surface. We now treat every internal tool wrapper the same way we treat the system prompt: every field returned is a deliberate choice, and the default is to return less.
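When you can't narrow the upstream query itself, the same idea can be applied in the tool wrapper. Here is a minimal sketch of a projection step, assuming product records arrive as dicts shaped like the GraphQL response; `project_product` is a hypothetical helper, not our production code.

```python
import json


def project_product(product: dict) -> dict:
    """Keep only the fields the SEO scorer needs; drop everything else."""
    seo = product.get("seo") or {}
    image = product.get("featuredImage") or {}
    return {
        "id": product.get("id"),
        "handle": product.get("handle"),
        "title": product.get("title"),
        "seo": {
            "title": seo.get("title"),
            "description": seo.get("description"),
        },
        "featuredImage": {"altText": image.get("altText")},
    }


def tool_result(products: list[dict]) -> str:
    """Serialize the projected records compactly for the model's context."""
    slim = [project_product(p) for p in products]
    return json.dumps(slim, separators=(",", ":"))
```

Compact separators are a small extra win: whitespace in JSON is still tokenized and billed.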

Patterns that help:

  • Truncate or summarize verbose output before returning. A test runner that returns "47 passed, 0 failed, 28s" is 10 tokens. The same runner returning the full output is 2,000+.

  • Return IDs and metadata, let the model ask for details. If your tool fetches a list of 100 products, return IDs and titles, not full product objects with descriptions, reviews, and image URLs.

  • The curator pattern. Use a cheaper model (Haiku) as a pre-processor for tool output before it reaches the main model. Especially valuable for log analysis, build output, and anything else where the signal is buried in noise.

If you're building an MCP server or wrapping an API for Claude to use, think hard about what your tools return. The default of "echo the entire API response back" is almost always wrong.
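The truncation pattern is the easiest of these to adopt. A minimal sketch, assuming the useful signal in command output usually sits at the start and the end; `truncate_output` and its default line counts are hypothetical choices you would tune per tool.

```python
def truncate_output(text: str, head: int = 5, tail: int = 15) -> str:
    """Keep the first `head` and last `tail` lines of verbose output.

    Short output passes through unchanged. Longer output gets a marker
    line stating how many lines were dropped, so the model knows the
    result is partial and can ask for more.
    """
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    omitted = len(lines) - head - tail
    kept_head = lines[:head]
    kept_tail = lines[len(lines) - tail:]  # safe even when tail == 0
    marker = f"[... {omitted} lines omitted ...]"
    return "\n".join(kept_head + [marker] + kept_tail)
```

Keeping the tail longer than the head is deliberate: test summaries and error messages tend to appear at the end of a run.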


5. Habits worth building

Smaller things that add up:

  • Run /statusline in Claude Code to keep cost and context usage visible at all times. Without it, you have no feedback loop and no pain signal until the monthly bill arrives.

  • Set max_tokens deliberately. A 4096 cap on a yes/no answer leaves room for thousands of tokens of rambling that you pay for. To warm a cache cheaply, send the cached prefix with max_tokens set to 1 so you pay for almost no output.

  • Use stop sequences. Prevents the model from rambling past the useful answer.

  • Log usage on every response in production. Build a per-feature cost dashboard. You cannot optimize what you do not measure.

  • Use the Message Batches API for non-real-time work. 50% off for anything that doesn't need a synchronous response — overnight content generation, bulk analysis, batch SEO audits.

  • Disable plugins you don't use. Five idle plugins can burn 55,000 tokens before your first message.
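For the per-feature dashboard, the core is a cost estimate per response plus a running total per feature. The sketch below is an assumption-heavy starting point: prices are parameters you fill in from the current price list, the 10% cache-read and 50% batch multipliers are the approximate figures quoted in this post, and cache-write premiums are left out for brevity. `Pricing`, `estimate_cost`, and `CostLedger` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices; supply real values from the price list."""

    input_per_mtok: float
    output_per_mtok: float
    cache_read_multiplier: float = 0.10  # roughly 10% of input price
    batch_multiplier: float = 0.50       # Message Batches discount


def estimate_cost(
    pricing: Pricing,
    input_tokens: int,
    output_tokens: int,
    cache_read_tokens: int = 0,
    batch: bool = False,
) -> float:
    """Estimate the dollar cost of one response (cache writes ignored)."""
    cost = (
        input_tokens * pricing.input_per_mtok
        + cache_read_tokens * pricing.input_per_mtok * pricing.cache_read_multiplier
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000
    return cost * pricing.batch_multiplier if batch else cost


class CostLedger:
    """Accumulate estimated spend per feature for a dashboard."""

    def __init__(self) -> None:
        self.by_feature: dict[str, float] = defaultdict(float)

    def log(self, feature: str, cost: float) -> None:
        self.by_feature[feature] += cost
```

Tagging each call with a feature name at the call site is what makes the totals actionable; an aggregate monthly number tells you that you overspent, not where.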


A working starter bundle

Here's a complete set of context-engineering files that put the above into practice. Drop them into the root of any project and adapt.

.
├── CLAUDE.md                          # Project context loaded into every session
├── .claudeignore                      # Paths the agent never reads
├── .claude/
│   └── skills/
│       ├── test-runner/SKILL.md       # Smoke-test discipline
│       └── git-commit/SKILL.md        # Commit conventions
└── .ai/
    └── prompts/
        └── plan-then-execute.md       # The two-phase workflow

CLAUDE.md

The single most valuable file in your repo from a token-economics standpoint. It loads into every session as a stable prefix → maximum cache value. Keep it under ~150 lines. Anything longer is sucking budget that should be going to your actual task.

# Project: <your-project>

This file loads into every Claude Code session. Keep it tight — every line
here is paid for once per session, but it saves you from re-explaining the
same things turn after turn.

## Stack

- Next.js 14 (App Router, server components by default)
- TypeScript with `strict: true`
- Tailwind CSS — **no separate CSS files**, no CSS-in-JS
- Prisma + PostgreSQL
- pnpm (not npm, not yarn)

## Conventions

- Server components by default; add `"use client"` only when state, effects,
  or browser APIs are actually needed
- Imports ordered: external → internal aliases (`@/...`) → relative
- No `any`. Use `unknown` and narrow with type guards
- Functions over classes; pure functions where possible

## Token-saving rules (read these first)

- **Do not explore the codebase.** If you need a file, ask which one, or use
  the path I gave you. No `find`, no recursive `grep`, no `ls` of `src/`.
- **Do not read** `package.json`, lock files, `node_modules/`, `.next/`, or
  anything in `.claudeignore` unless explicitly asked.
- **Do not add dependencies** without asking. Our deps are curated.
- **Verify before acting.** Check that a file exists before editing it.
  Check that an import resolves before adding it.
- **Two-strike rule.** If a test or build fails twice with the same error,
  STOP and report. Do not keep trying variations.
- **No speculative refactoring.** Change only what the task requires.

## Workflow

For anything touching more than one file, use plan-then-execute:

1. Read `CLAUDE.md` and any skills referenced by the task
2. Write a plan to `.ai/plan.md` (goal, affected files, task list, risks)
3. Create task files in `.ai/tasks/NN-task-name.md`
4. STOP. Wait for my approval before implementing.
5. Implement tasks sequentially. One task = one focused change + tests.
6. Move completed task files to `.ai/tasks/done/`.

## Testing

- Run `pnpm test:smoke` after every code change (~30s). Non-negotiable.
- Run `pnpm typecheck` if you touched types or interfaces.
- Run `pnpm test` (full suite) before suggesting a commit.
- If smoke fails, fix the cause. Do not skip or `.skip` tests.

## Commits

- Conventional commits (`feat:`, `fix:`, `refactor:` etc.)
- Imperative mood, ≤ 60 char subject
- **No AI signatures, no co-author lines, no emoji**

## Things I never want

- Code comments that restate what the code does
- `console.log` left in committed code
- Defensive checks for cases that cannot happen
- `// TODO` without a ticket reference
- Auto-formatting of files unrelated to the current task

.claudeignore

Be aggressive. The agent can always ask for something you've excluded; it can't unsee something you let it read.

# === Dependencies ===
node_modules/
.pnpm-store/
vendor/

# === Build artifacts ===
.next/
.turbo/
dist/
build/
out/
.cache/
coverage/

# === Generated code ===
prisma/migrations/
*.generated.ts
src/lib/types/graphql.ts

# === Lock files ===
package-lock.json
yarn.lock
pnpm-lock.yaml

# === Large data ===
*.csv
*.sql
*.sqlite
seed-data/
**/*.snap

# === Logs and env ===
*.log
.env
.env.*
!.env.example

# === Media ===
*.png
*.jpg
*.mp4
*.pdf

# === Archived docs ===
docs/archive/
CHANGELOG.md

.claude/skills/test-runner/SKILL.md

Skills are focused, on-demand capability modules. Each one has a strong description field — that description is what triggers loading, so write it like an instruction.

---
name: test-runner
description: Use this skill before declaring any code change complete. Runs smoke tests, interprets failures, and surfaces regressions. Trigger after ANY code modification — new feature, bug fix, refactor, or "trivial" rename. Never skip on the assumption that a change is too small to break things.
---

# Test Runner

The cheapest token-save
