Cost & Latency Tracking — What the Token Counts Were Telling Me All Along

Every module up to this point consumed the same streaming events without ever reading the usage data that was there the entire time. message_start and message_delta carry input tokens, output tokens, and the stop reason on every single stream. The only reason this went unnoticed is that earlier modules only cared about text_delta. The cost and latency data was never missing. It just wasn't being read.

In a demo, that's invisible. One call costs fractions of a cent and nobody's watching the clock. In production serving real traffic, ignoring this data is the difference between a system with predictable unit economics and one that quietly bleeds money on patterns nobody's measuring.

Where the Data Actually Comes From

// message_start — input tokens
{ type: "message_start", message: { usage: { input_tokens: 342 } } }

// message_delta — output tokens + stop reason
{ type: "message_delta", delta: { stop_reason: "end_turn" }, usage: { output_tokens: 156 } }

Token counts combined with model pricing give cost. Timestamp deltas between stream start and stream close give latency. That's the entire data source — no separate API call, no additional request, just reading fields that were already streaming past.

The Wrapper Pattern

The design goal was tracking usage without touching any calling code in the earlier modules. The solution is a generator that wraps the raw stream, intercepts the fields it needs, and yields every chunk through unchanged:

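// rawStream is the async-iterable event stream returned by the streaming call
async function* wrappedStream(rawStream: AsyncIterable<any>) {
  let model = "";
  let inputTokens = 0;
  let outputTokens = 0;
  let stopReason = "";
  const startedAt = Date.now();
  try {
    for await (const chunk of rawStream) {
      if (chunk.type === "message_start") {
        model = chunk.message.model;
        inputTokens = chunk.message.usage.input_tokens;
      } else if (chunk.type === "message_delta") {
        outputTokens = chunk.usage.output_tokens;
        stopReason = chunk.delta.stop_reason ?? "";
      }
      yield chunk; // transparent: caller sees the exact same stream
    }
  } finally {
    // stream done, or abandoned early by the consumer:
    // compute cost and latency (Date.now() - startedAt), then persist
  }
}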

The consumer of wrappedStream() has no idea tracking is happening. It receives the identical sequence of chunks it would have gotten from the raw Anthropic stream. This pattern generalizes well beyond usage tracking — any time you want to observe a stream without changing what downstream code does with it, wrap and re-yield is the shape to reach for.
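To see the transparency claim concretely, here is a minimal sketch that runs the same wrap-and-re-yield shape over a fake stream. The StreamEvent type, trackUsage, fakeStream, and the onDone callback are hypothetical names for illustration, the model id is a placeholder, and the event shapes cover only the fields shown above rather than the SDK's full types.

```typescript
// Minimal event shape: only the fields this post reads (not the SDK's full types).
type StreamEvent =
  | { type: "message_start"; message: { model: string; usage: { input_tokens: number } } }
  | { type: "message_delta"; delta: { stop_reason: string | null }; usage: { output_tokens: number } }
  | { type: "content_block_delta"; delta: { type: "text_delta"; text: string } };

interface UsageRecord {
  model: string;
  inputTokens: number;
  outputTokens: number;
  stopReason: string;
  latencyMs: number;
}

// Hypothetical wrapper: observes usage fields and re-yields every event untouched.
async function* trackUsage(
  raw: AsyncIterable<StreamEvent>,
  onDone: (record: UsageRecord) => void,
): AsyncGenerator<StreamEvent> {
  const record: UsageRecord = { model: "", inputTokens: 0, outputTokens: 0, stopReason: "", latencyMs: 0 };
  const startedAt = Date.now();
  try {
    for await (const event of raw) {
      if (event.type === "message_start") {
        record.model = event.message.model;
        record.inputTokens = event.message.usage.input_tokens;
      } else if (event.type === "message_delta") {
        record.outputTokens = event.usage.output_tokens;
        record.stopReason = event.delta.stop_reason ?? "";
      }
      yield event;
    }
  } finally {
    // Runs on normal completion and when the consumer stops iterating early.
    record.latencyMs = Date.now() - startedAt;
    onDone(record);
  }
}

// Stand-in for the real API stream, using the token counts from the examples above.
async function* fakeStream(): AsyncGenerator<StreamEvent> {
  yield { type: "message_start", message: { model: "example-model", usage: { input_tokens: 342 } } };
  yield { type: "content_block_delta", delta: { type: "text_delta", text: "Hello" } };
  yield { type: "message_delta", delta: { stop_reason: "end_turn" }, usage: { output_tokens: 156 } };
}

// Consume the wrapped stream exactly as a caller would consume the raw one.
async function demo(): Promise<{ seen: string[]; records: UsageRecord[] }> {
  const seen: string[] = [];
  const records: UsageRecord[] = [];
  for await (const event of trackUsage(fakeStream(), (r) => records.push(r))) {
    seen.push(event.type);
  }
  return { seen, records };
}
```

The try/finally is a deliberate choice in this sketch: code placed after the loop would be skipped if the consumer breaks out early, while a finally block still runs, so the record is reported either way.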

Cost Calculation — and the Asymmetry That Changes How You Think About It

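export function calculateCost(model: string, inputTokens: number, outputTokens: number): number {
  // MODEL_PRICING maps a model id to { input, output } prices in USD per million tokens
  const pricing = MODEL_PRICING[model];
  if (!pricing) {
    // fail loudly instead of with a TypeError on pricing.input
    throw new Error(`No pricing configured for model: ${model}`);
  }
  const inputCost = (inputTokens / 1_000_000) * pricing.input;
  const outputCost = (outputTokens / 1_000_000) * pricing.output;
  return inputCost + outputCost;
}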

The detail that matters most here: input and output tokens are not priced the same. On Sonnet, output tokens cost roughly 5x more than input tokens ($15 vs $3 per million). This inverts an intuition that's easy to carry over from naive token counting — a long, detailed prompt is cheap. A long, verbose response is expensive. Optimizing prompt brevity to save cost is usually optimizing the wrong side of the ledger.
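A small worked example makes the asymmetry concrete. The sketch below assumes one plausible shape for MODEL_PRICING, keyed by a placeholder name ("sonnet" here is a hypothetical key, not a real model id) with the $3 / $15 per-million rates quoted above, and repeats calculateCost so the block runs on its own. It prices the same 1,000 tokens on each side of the ledger.

```typescript
type Pricing = { input: number; output: number }; // USD per million tokens

// Assumed shape. The "sonnet" key is a placeholder; real model ids differ.
const MODEL_PRICING: Record<string, Pricing> = {
  sonnet: { input: 3, output: 15 },
};

function calculateCost(model: string, inputTokens: number, outputTokens: number): number {
  const pricing = MODEL_PRICING[model];
  if (!pricing) throw new Error(`No pricing configured for model: ${model}`);
  return (inputTokens / 1_000_000) * pricing.input + (outputTokens / 1_000_000) * pricing.output;
}

const promptHeavy = calculateCost("sonnet", 1000, 0);   // 1,000 tokens, all input
const responseHeavy = calculateCost("sonnet", 0, 1000); // 1,000 tokens, all output
const ratio = responseHeavy / promptHeavy;              // output-to-input price ratio

console.log({ promptHeavy, responseHeavy, ratio });
```

Same token count, roughly five times the cost when those tokens are generated rather than sent.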

Three Calls, Three Different Lessons

Running three representative query types and logging the results surfaced patterns that weren't visible from token counts alone:

Label                    In   Out  Total  Cost       Latency  Stop
usage-demo/structured   722   110    832  $0.003816   1993ms  tool_use
usage-demo/long          40  1024   1064  $0.0155    22408ms  max_tokens
usage-demo/short         16    64     80  $0.001008   2079ms  end_turn

The Schema Tax

722 input tokens for a prompt that was a single sentence. That gap is almost entirely the tool schema — property names, descriptions, enums, the required array. Every field definition in a tool's input_schema gets tokenized and sent on every single call, not just the first one. For any system making heavy use of forced tool use (the structured output pattern from 04-structured-output), the schema itself is a meaningful and recurring cost line item that's invisible if you're only watching output tokens.
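To get a feel for how a fixed schema overhead compounds, here is a back-of-the-envelope sketch. The 700-token overhead and the 100,000 calls per day are illustrative assumptions, not measurements from this project; only the $3 per million input rate comes from the pricing used in this post. recurringOverheadCost is a hypothetical helper name.

```typescript
// Estimated spend on tokens that are re-sent unchanged on every call (e.g. a tool schema).
function recurringOverheadCost(
  overheadTokens: number,       // tokens re-sent per call (assumed)
  calls: number,                // number of calls in the period (assumed)
  inputPricePerMillion: number, // USD per million input tokens
): number {
  return ((overheadTokens * calls) / 1_000_000) * inputPricePerMillion;
}

// Assumed: ~700 schema tokens per call, 100,000 calls per day, $3 per million input tokens.
const dailySchemaCost = recurringOverheadCost(700, 100_000, 3);
console.log(dailySchemaCost.toFixed(2)); // prints "210.00"
```

Under those assumptions the schema alone is a $210 per day line item before the model has generated a single token.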

The Most Expensive Call Didn't Even Finish

stop_reason: max_tokens on the long query — this is the truncation failure mode from the error handling module showing up in cost data. The response was cut off mid-generation. $0.0155 and 22 seconds spent on an answer the user never actually got to read in full.

This is the sharpest lesson in the whole dataset: an incomplete response can be the most expensive call in the system. It's not just a UX failure. It's spend on an answer that was cut off before it was usable. Any production cost monitoring needs to cross-reference cost against stop_reason, not just against token count, or this failure mode is invisible in the aggregate numbers.
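One way to make that cross-reference routine is to group logged calls by stop_reason before summing cost. The sketch below uses the three rows from the table above; the CallRecord shape and spendByStopReason are hypothetical names, not part of the tracking module itself.

```typescript
interface CallRecord {
  label: string;
  costUsd: number;
  stopReason: string;
}

// Sum cost per stop_reason so truncated calls show up as their own line item.
function spendByStopReason(records: CallRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r.stopReason, (totals.get(r.stopReason) ?? 0) + r.costUsd);
  }
  return totals;
}

// The three logged calls from the table.
const calls: CallRecord[] = [
  { label: "usage-demo/structured", costUsd: 0.003816, stopReason: "tool_use" },
  { label: "usage-demo/long", costUsd: 0.0155, stopReason: "max_tokens" },
  { label: "usage-demo/short", costUsd: 0.001008, stopReason: "end_turn" },
];

const totals = spendByStopReason(calls);
const totalSpend = calls.reduce((sum, r) => sum + r.costUsd, 0);
const truncatedShare = (totals.get("max_tokens") ?? 0) / totalSpend;

console.log(`truncated share of spend: ${(truncatedShare * 100).toFixed(1)}%`);
```

On this tiny sample, roughly three quarters of the total spend went to the one call that hit max_tokens, which is exactly the signal a dollars-only dashboard would hide.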

The Efficient Baseline Isn't as Cheap as It Looks

16 input tokens, 64 output tokens, $0.001, 2 seconds. In raw token terms this used about 10x fewer tokens than the structured call (80 vs 832). In cost terms it was less than 4x cheaper ($0.001 vs $0.0038), because 64 of its 80 tokens were output tokens, which carry 5x the cost weight of input tokens, while the structured call's tokens were mostly input. Token count and dollar cost diverge in exactly the direction the input/output pricing asymmetry predicts.
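The divergence is easy to check from the logged numbers. This sketch recomputes both ratios from the table rows, assuming the $3 / $15 per-million rates quoted earlier; the helper and variable names are illustrative.

```typescript
const INPUT_PER_MILLION = 3;   // USD per million input tokens, rate quoted in this post
const OUTPUT_PER_MILLION = 15; // USD per million output tokens, rate quoted in this post

function cost(inputTokens: number, outputTokens: number): number {
  return (inputTokens * INPUT_PER_MILLION + outputTokens * OUTPUT_PER_MILLION) / 1_000_000;
}

// Token counts taken from the table rows.
const structuredCall = { input: 722, output: 110 };
const shortCall = { input: 16, output: 64 };

const tokenRatio =
  (structuredCall.input + structuredCall.output) / (shortCall.input + shortCall.output);
const costRatio =
  cost(structuredCall.input, structuredCall.output) / cost(shortCall.input, shortCall.output);

console.log(tokenRatio.toFixed(1), costRatio.toFixed(1)); // prints "10.4 3.8"
```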

What the Numbers Actually Tell You

Token count alone doesn't tell you cost. You have to look at the input/output split. structured moved 10x more total tokens than short but cost less than 4x more, because the bulk of its tokens were cheap input tokens (the schema), not expensive output tokens.
An incomplete response can be the most expensive call you make. max_tokens truncation is a cost problem, not just a correctness problem.
The structured output tax is the schema, not the generation. If the goal is reducing cost on tool-heavy calls, the lever is shortening tool descriptions and property definitions — not reducing how much the model generates.
Always check stop_reason alongside cost. A cheap call that didn't finish is strictly worse than a slightly more expensive call that did — and that comparison is invisible if you're only tracking dollars.

Why This Belongs in the Same System as Evals

This module pairs naturally with 07-evals. Evals answer "is the agent's output good enough?" Cost and latency tracking answers "is the agent's output good enough to justify what it costs?" The earlier evals work already surfaced that the agent runs ~3.7 seconds slower than basic RAG per query (12.3s vs 8.6s) because of the extra tool-execution round trip. Layering cost data on top of that completes the picture: is the accuracy gain from live tool data worth the additional dollars and latency, for this specific use case?

That's a question every forward-deployed engineer (FDE) and applied AI role eventually has to answer for a real customer, and it's impossible to answer without instrumentation like this in place from the start, rather than bolted on after a surprise bill.
