Cost & Latency Tracking — What the Token Counts Were Telling Me All Along

Senior engineer with deep roots in React, TypeScript, and production-scale UI — including 4 years at Apple. Now focused on applied AI engineering: working hands-on with LLM APIs, RAG pipelines, agents, and the full stack of tooling that modern AI products are built on. Writing here to document the build, share what's non-obvious, and connect with teams working on hard AI problems.

Every module up to this point consumed the same streaming events without ever reading the usage data that two of them carried the entire time. message_start and message_delta report token usage on every single stream: input tokens, output tokens, stop reason. The only reason this went unnoticed is that earlier modules only cared about text_delta. The cost and latency data was never missing. It just wasn't being read.

In a demo, that's invisible. One call costs fractions of a cent and nobody's watching the clock. In production serving real traffic, ignoring this data is the difference between a system with predictable unit economics and one that quietly bleeds money on patterns nobody's measuring.

Where the Data Actually Comes From

// message_start — input tokens
{ type: "message_start", message: { usage: { input_tokens: 342 } } }

// message_delta — output tokens + stop reason
{ type: "message_delta", delta: { stop_reason: "end_turn" }, usage: { output_tokens: 156 } }

Token counts combined with model pricing give cost. Timestamp deltas between stream start and stream close give latency. That's the entire data source — no separate API call, no additional request, just reading fields that were already streaming past.
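As a minimal sketch of reading both sources at once, the helper below drains a stream, picks the usage fields out of the two events, and times the stream with wall-clock timestamps. The event types are trimmed to the fields shown above (an assumption, not the SDK's full types), and summarizeStream and fakeStream are hypothetical names used only for illustration. It also assumes the helper is called right when the request starts, so the first timestamp approximates stream start.

```typescript
// Minimal event shapes covering only the fields read below. This mirrors the
// two events shown above and is an assumption, not the SDK's full typings.
type UsageEvent =
  | { type: "message_start"; message: { usage: { input_tokens: number } } }
  | { type: "message_delta"; delta: { stop_reason: string | null }; usage: { output_tokens: number } }
  | { type: "content_block_delta" };

interface StreamSummary {
  inputTokens: number;
  outputTokens: number;
  stopReason: string;
  latencyMs: number;
}

// Hypothetical helper: drains a stream and reports usage plus total latency.
async function summarizeStream(events: AsyncIterable<UsageEvent>): Promise<StreamSummary> {
  const startedAt = Date.now(); // stream start
  let inputTokens = 0;
  let outputTokens = 0;
  let stopReason = "";

  for await (const event of events) {
    if (event.type === "message_start") {
      inputTokens = event.message.usage.input_tokens;
    } else if (event.type === "message_delta") {
      outputTokens = event.usage.output_tokens;
      stopReason = event.delta.stop_reason ?? stopReason;
    }
  }

  // The loop ending is the stream closing, so this delta is the total latency.
  return { inputTokens, outputTokens, stopReason, latencyMs: Date.now() - startedAt };
}

// A fake stream standing in for a real API response.
async function* fakeStream(): AsyncGenerator<UsageEvent> {
  yield { type: "message_start", message: { usage: { input_tokens: 342 } } };
  yield { type: "content_block_delta" };
  yield { type: "message_delta", delta: { stop_reason: "end_turn" }, usage: { output_tokens: 156 } };
}
```

Calling summarizeStream(fakeStream()) resolves to the token counts and stop reason from the two events, plus a latencyMs that depends on how long the stream took to drain.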

The Wrapper Pattern

The design goal was tracking usage without touching any calling code in the earlier modules. The solution is a generator that wraps the raw stream, intercepts the fields it needs, and yields every chunk through unchanged:

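// Minimal event shape: only the fields this wrapper reads (not the SDK's full types).
type StreamEvent = {
  type: string;
  message?: { model: string; usage: { input_tokens: number } };
  delta?: { stop_reason?: string | null };
  usage?: { output_tokens: number };
};

async function* wrappedStream(rawStream: AsyncIterable<StreamEvent>) {
  let model = "";
  let inputTokens = 0;
  let outputTokens = 0;
  let stopReason = "";

  for await (const chunk of rawStream) {
    if (chunk.type === "message_start" && chunk.message) {
      model = chunk.message.model;
      inputTokens = chunk.message.usage.input_tokens;
    }
    if (chunk.type === "message_delta") {
      outputTokens = chunk.usage?.output_tokens ?? outputTokens;
      stopReason = chunk.delta?.stop_reason ?? "";
    }
    yield chunk; // transparent: caller sees the exact same stream
  }
  // stream done: compute cost from model + token counts, then persist
}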

The consumer of wrappedStream() has no idea tracking is happening. It receives the identical sequence of chunks it would have gotten from the raw Anthropic stream. This pattern generalizes well beyond usage tracking — any time you want to observe a stream without changing what downstream code does with it, wrap and re-yield is the shape to reach for.
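To illustrate how the shape generalizes, here is a small sketch of the same wrap-and-re-yield idea with the usage-specific logic pulled out into callbacks. The names tap, countToThree, and tapDemo are hypothetical; nothing here comes from the Anthropic SDK.

```typescript
// Hypothetical generic wrapper: observes every item, then re-yields it unchanged.
async function* tap<T>(
  source: AsyncIterable<T>,
  onItem: (item: T) => void,
  onDone: () => void = () => {},
): AsyncGenerator<T> {
  try {
    for await (const item of source) {
      onItem(item); // observe
      yield item;   // pass through untouched
    }
  } finally {
    // Runs when the source ends, when it throws, and when the consumer
    // breaks out of its for-await loop early.
    onDone();
  }
}

// Stand-in source for the demo.
async function* countToThree(): AsyncGenerator<number> {
  yield 1;
  yield 2;
  yield 3;
}

async function tapDemo(): Promise<{ seen: number[]; received: number[]; done: boolean }> {
  const seen: number[] = [];
  const received: number[] = [];
  let done = false;
  for await (const n of tap(countToThree(), (x) => seen.push(x), () => { done = true; })) {
    received.push(n); // the consumer gets exactly what the source produced
  }
  return { seen, received, done };
}
```

One design choice worth noting: code placed after the loop only runs when the stream is consumed to the end. Putting the "compute and persist" step in a finally block means a consumer that stops reading early still gets tracked.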

Cost Calculation — and the Asymmetry That Changes How You Think About It

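// MODEL_PRICING maps a model id to USD per million tokens, e.g. { input: 3, output: 15 }.
export function calculateCost(model: string, inputTokens: number, outputTokens: number): number {
  const pricing = MODEL_PRICING[model];
  if (!pricing) {
    throw new Error(`No pricing configured for model: ${model}`);
  }
  const inputCost = (inputTokens / 1_000_000) * pricing.input;
  const outputCost = (outputTokens / 1_000_000) * pricing.output;
  return inputCost + outputCost;
}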

The detail that matters most here: input and output tokens are not priced the same. On Sonnet, output tokens cost 5x as much as input tokens ($15 vs $3 per million). This inverts the intuition from naive token counting, where every token costs the same: a long, detailed prompt is comparatively cheap, and a long, verbose response is expensive. Optimizing prompt brevity to save cost is usually optimizing the wrong side of the ledger.
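A quick way to see the asymmetry is to price two calls with the same total token count and opposite splits. This is a sketch under stated assumptions: the prices are the Sonnet figures quoted above, the "sonnet" key is a placeholder rather than a real model id, and the token counts are made up for illustration.

```typescript
// USD per million tokens. The key is a placeholder, not a real model id.
const PRICE_PER_MILLION: Record<string, { input: number; output: number }> = {
  sonnet: { input: 3, output: 15 },
};

function costUsd(model: string, inputTokens: number, outputTokens: number): number {
  const pricing = PRICE_PER_MILLION[model];
  if (!pricing) throw new Error(`No pricing configured for model: ${model}`);
  return (inputTokens * pricing.input + outputTokens * pricing.output) / 1_000_000;
}

// Same 2,200 total tokens, opposite splits (hypothetical numbers).
const longPromptCost = costUsd("sonnet", 2000, 200); // 0.006 input + 0.003 output = 0.009
const longAnswerCost = costUsd("sonnet", 200, 2000); // 0.0006 input + 0.03 output = 0.0306

console.log({ longPromptCost, longAnswerCost });
```

Identical token totals, yet the verbose-answer call costs more than three times as much as the verbose-prompt call.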

Three Calls, Three Different Lessons

Running three representative query types and logging the results surfaced patterns that weren't visible from token counts alone:

Label                    In    Out   Total   Cost        Latency   Stop
usage-demo/structured   722    110     832   $0.003816    1993ms   tool_use
usage-demo/long          40   1024    1064   $0.0155     22408ms   max_tokens
usage-demo/short         16     64      80   $0.001008    2079ms   end_turn

The Schema Tax

722 input tokens for a prompt that was a single sentence. That gap is almost entirely tool overhead: the schema's property names, descriptions, enums, and required array, plus the tool-use system prompt the API adds whenever tools are supplied. Every field definition in a tool's input_schema gets tokenized and sent on every single call, not just the first one. For any system making heavy use of forced tool use (the structured output pattern from 04-structured-output), the schema itself is a meaningful and recurring cost line item that's invisible if you're only watching output tokens.
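To put a rough number on "recurring", here is a back-of-the-envelope sketch. The 700 and 400 tokens-per-call figures and the 100,000 call volume are hypothetical, chosen only to show the arithmetic; the $3 per million input price is the Sonnet figure quoted earlier.

```typescript
// Cost of input tokens that are resent on every call, such as a tool definition.
function recurringInputCostUsd(
  tokensPerCall: number,
  calls: number,
  inputPricePerMillion: number,
): number {
  return (tokensPerCall * calls * inputPricePerMillion) / 1_000_000;
}

// Hypothetical workload: 100,000 calls at $3 per million input tokens.
const currentSchemaCost = recurringInputCostUsd(700, 100_000, 3); // 210
const trimmedSchemaCost = recurringInputCostUsd(400, 100_000, 3); // 120
const savedUsd = currentSchemaCost - trimmedSchemaCost;           // 90

console.log({ currentSchemaCost, trimmedSchemaCost, savedUsd });
```

The per-call difference is a fraction of a cent, which is exactly why it goes unnoticed until it is multiplied by call volume.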

The Most Expensive Call Didn't Even Finish

stop_reason: max_tokens on the long query — this is the truncation failure mode from the error handling module showing up in cost data. The response was cut off mid-generation. $0.0155 and 22 seconds spent on an answer the user never actually got to read in full.

This is the sharpest lesson in the whole dataset: an incomplete response can be the most expensive call in the system. It's not just a UX failure; it's spend on an answer that was never delivered whole. Any production cost monitoring needs to cross-reference cost against stop_reason, not just against token count, or this failure mode is invisible in the aggregate numbers.
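One way to make that cross-reference concrete is to total spend per stop reason and look at the share that went to truncated calls. This is a sketch, not the module's actual reporting code: CallRecord and spendByStopReason are hypothetical names, and the records are the three rows from the table above.

```typescript
interface CallRecord {
  label: string;
  costUsd: number;
  stopReason: string;
}

// Hypothetical report: total spend grouped by stop reason.
function spendByStopReason(calls: CallRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const call of calls) {
    totals.set(call.stopReason, (totals.get(call.stopReason) ?? 0) + call.costUsd);
  }
  return totals;
}

// The three calls from the table above.
const demoCalls: CallRecord[] = [
  { label: "usage-demo/structured", costUsd: 0.003816, stopReason: "tool_use" },
  { label: "usage-demo/long", costUsd: 0.0155, stopReason: "max_tokens" },
  { label: "usage-demo/short", costUsd: 0.001008, stopReason: "end_turn" },
];

const spendTotals = spendByStopReason(demoCalls);
const totalSpend = demoCalls.reduce((sum, call) => sum + call.costUsd, 0);
// Share of all spend that went to responses cut off by max_tokens.
const truncatedShare = (spendTotals.get("max_tokens") ?? 0) / totalSpend;

console.log(`truncated share of spend: ${(truncatedShare * 100).toFixed(1)}%`);
```

On this tiny sample, roughly three quarters of the spend went to the one response that was cut off, a fact that a plain cost total hides.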

The Efficient Baseline Isn't as Cheap as It Looks

16 input tokens, 64 output tokens, $0.001, 2 seconds. In raw token terms this used about 10x fewer tokens than the structured call (80 vs 832). In cost terms it was less than 4x cheaper ($0.001 vs $0.0038), because 64 of its 80 tokens are output tokens, which carry 5x the cost weight of input tokens. Token count and dollar cost diverge in exactly the direction the input/output pricing asymmetry predicts.
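The divergence can be checked directly from the table. The sketch below recomputes the token ratio, the cost ratio, and the fraction of each call's cost that comes from output tokens, using the Sonnet prices quoted earlier; the helper names are made up for illustration.

```typescript
// USD per million tokens (the Sonnet figures quoted earlier).
const INPUT_PRICE = 3;
const OUTPUT_PRICE = 15;

interface CallTokens {
  input: number;
  output: number;
}

// Token counts from the table above.
const structuredCall: CallTokens = { input: 722, output: 110 };
const shortCall: CallTokens = { input: 16, output: 64 };

const totalTokens = (c: CallTokens): number => c.input + c.output;
const callCostUsd = (c: CallTokens): number =>
  (c.input * INPUT_PRICE + c.output * OUTPUT_PRICE) / 1_000_000;
// Fraction of a call's cost that comes from its output tokens.
const outputCostShare = (c: CallTokens): number =>
  (c.output * OUTPUT_PRICE) / (c.input * INPUT_PRICE + c.output * OUTPUT_PRICE);

const tokenRatio = totalTokens(structuredCall) / totalTokens(shortCall); // 832 / 80 = 10.4
const costRatio = callCostUsd(structuredCall) / callCostUsd(shortCall);  // about 3.79
const shortOutputShare = outputCostShare(shortCall);                     // about 0.95
const structuredOutputShare = outputCostShare(structuredCall);           // about 0.43

console.log({ tokenRatio, costRatio, shortOutputShare, structuredOutputShare });
```

About 95% of the short call's cost is output, against about 43% for the structured call, which is why a 10x gap in tokens shrinks to under 4x in dollars.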

What the Numbers Actually Tell You

  • Token count alone doesn't tell you cost. You have to look at the input/output split. structured moved 10x more total tokens than short but cost less than 4x more, because the bulk of its tokens were cheap input tokens (the schema), not expensive output tokens.

  • An incomplete response can be the most expensive call you make. max_tokens truncation is a cost problem, not just a correctness problem.

  • The structured output tax is the schema, not the generation. If the goal is reducing cost on tool-heavy calls, the lever is shortening tool descriptions and property definitions — not reducing how much the model generates.

  • Always check stop_reason alongside cost. A cheap call that didn't finish is strictly worse than a slightly more expensive call that did — and that comparison is invisible if you're only tracking dollars.

Why This Belongs in the Same System as Evals

This module pairs naturally with 07-evals. Evals answer "is the agent's output good enough?" Cost and latency tracking answers "is the agent's output good enough to justify what it costs?" The earlier evals work already surfaced that the agent runs ~3.7 seconds slower than basic RAG per query (12.3s vs 8.6s) because of the extra tool-execution round trip. Layering cost data on top of that completes the picture: is the accuracy gain from live tool data worth the additional dollars and latency, for this specific use case?

That's a question everyone in a forward-deployed engineer (FDE) or applied AI role eventually has to answer for a real customer, and it's impossible to answer without instrumentation like this in place from the start, not bolted on after a surprise bill.

David Hahn | Applied AI Engineering

Real implementations, real bugs, and the mental models behind building on LLMs. I'm a fullstack engineer with 10+ years of experience, including 4 years at Apple, now going deep on applied AI: streaming, RAG, tool use, agents, and the full stack of tooling modern AI products are built on.