David Hahn | Applied AI Engineering

Prompt Engineering Lessons from Building a Problem Generator

David Hahn — Sat, 06 Jun 2026 01:38:58 GMT

Prompt engineering only becomes interesting when you're prompting for structured, specific output that has to be reliable enough to feed downstream systems. A chatbot that gives a slightly different answer each time is fine. A problem generator that occasionally returns vague prompts, missing constraints, or malformed JSON breaks the entire study session.

This post documents what I learned building the problem generation component of my study system, where a prompt had to produce interview-caliber problems with consistent structure on every run.

What the Component Needs to Produce

The output isn't a free form answer: it's a structured problem that can be immediately worked on:

{
  "topic": "React state management",
  "difficulty": "medium",
  "prompt": "Build a shopping cart component that...",
  "constraints": ["Must handle concurrent updates", "No external state library"],
  "examples": [
    { "input": "addItem({ id: 1, name: 'Widget', price: 9.99 })", "output": "Cart: 1 item, $9.99 total" }
  ],
  "setup_code": "const initialCart = { items: [], total: 0 };"
}

If any field is missing or vague, the downstream session breaks. That constraint forced me to treat output reliability as a first-class requirement from the start.

Iteration 1: The Prompt Was Too Rigid

The first version of the generation prompt was fully hardcoded: topic, difficulty, and output structure baked in as static text. It worked but couldn't adapt. As the system matured to accept history, difficulty calibration, and topic filtering, a rigid prompt became a bottleneck.

The lesson: prompts for production components should be built as functions, not strings. Dynamic inputs (topic, difficulty, past problems, constraint filters) compose into the prompt at call time. The static portions are the instructions and schema. The dynamic portions are the context.

The Model Sometimes Ignores Formatting Instructions

I explicitly told the model not to wrap the response in markdown code blocks. It ignored this instruction often enough to be a problem — not on every call, but on enough that I couldn't rely on clean JSON coming back.

The fix is defensive stripping in addition to the instruction, not instead of it:

raw = response.content[0].text
cleaned = raw.strip()
if cleaned.startswith("```"):
    cleaned = "\n".join(cleaned.split("\n")[1:])
if cleaned.endswith("```"):
    cleaned = "\n".join(cleaned.split("\n")[:-1])
result = json.loads(cleaned)

The instruction reduces frequency; the stripping handles the remainder. Both are necessary. This is a general pattern: for any structured output that feeds application code, you need both a clear prompt instruction and a parsing layer that handles model non-compliance gracefully.

Scope Decisions That Saved Time

Two scope decisions I explicitly deferred kept Phase 1 on track:

setup_code as a string, not an array. The ideal design for multi-file problems (HTML + CSS, for example) would be an array of file objects. But for a CLI-based tool in the current scope, a single string is sufficient and eliminates the complexity of file-aware rendering. I documented this as a known limitation to revisit.

Problem formatting as plain text, not HTML. The model naturally wanted to format problems with rich structure. Useful eventually, but not useful when the output surface is a terminal. Deferring this prevented UI work from blocking the core prompt engineering work.

Both decisions share a pattern: identify the simplification that keeps the current phase moving without closing doors, and write down what you're deferring and why.

When Past Problems Scope Outgrew the Component

One feature I initially planned to include in the generator (passing in past problems to avoid repetition) revealed a scope issue mid-build. Avoiding repetition is a scheduling concern, not a generation concern. The generator should generate; a higher-level orchestrator should decide whether to generate a new problem or surface a review problem from history.

Keeping the generator's interface narrow (topic + difficulty → structured problem) made it more composable and easier to test in isolation. The past problem logic belongs in the component that calls the generator, not inside it.

This is a prompt engineering lesson as much as a software design one: if your prompt is growing to handle multiple concerns, it's often a sign the abstraction is wrong, not that the prompt needs more instructions.

What This Looks Like in Practice

After several iterations, the generator produces problems specific enough to start immediately and structured enough to parse reliably. More importantly, the prompt is a function — inputs compose cleanly, the output schema is stable, and failures are handled defensively at the parsing layer.

The study system is built on top of this component's reliability. If the generator is flaky, every downstream session is degraded. Treating prompt reliability as a first-class engineering concern — not just "get the words right" but "design a prompt that fails gracefully and consistently" — is the lesson that transferred most directly to how I think about building on LLMs in general.

Structured Output — When to Use Prompting vs. Forced Tool Use

David Hahn — Sat, 06 Jun 2026 01:34:17 GMT

Most LLM features eventually need the model to return structured data like a JSON object your application can parse and act on. Connecting LLM output to a database, a UI component, or a downstream API requires structure you can depend on.

There are two approaches. Choosing the right one changes how reliable your system is.

Approach 1: Prompt-Based

Tell the model in the system prompt to return JSON matching a specific shape.

const systemPrompt = `
You are a grading assistant. Return your evaluation as a JSON object with this structure:
{
  "score": number (0-10),
  "passed": boolean,
  "feedback": string
}
Return only the JSON object. No markdown, no explanation, no code blocks.
`;

This works most of the time. The model is good at following formatting instructions for common shapes. The problem is "most of the time". Occasionally, it wraps the response in markdown code blocks, adds a preamble sentence, or returns a subtly malformed object. You end up parsing defensively:

const text = response.content[0].text;
const cleaned = text.replace(/```json|```/g, '').trim();
const parsed = JSON.parse(cleaned);

For low-stakes or high-volume use cases where occasional failures are acceptable, prompt-based is fine. It's also simpler to iterate on — just edit the prompt.

Approach 2: Forced Tool Use

Define a tool whose input schema is exactly the shape you want, then force the model to call it:

tools: [{
  name: "submit_evaluation",
  description: "Submit the structured evaluation result",
  input_schema: {
    type: "object",
    properties: {
      score: {
        type: "number",
        description: "Score from 0-10 where 0 is completely wrong and 10 is perfectly accurate"
      },
      passed: { type: "boolean" },
      feedback: { type: "string", description: "One to two sentences of specific, actionable feedback for the learner" }
    },
    required: ["score", "passed", "feedback"]
  }
}],
tool_choice: { type: "tool", name: "submit_evaluation" }

With tool_choice forcing the model to call this tool, the response is always valid JSON conforming to the schema. The model's tool-calling pathway is specifically trained for schema compliance. You extract the input directly:

const toolUse = response.content.find(b => b.type === "tool_use");
return toolUse.input as EvaluationResult;

The tool never "runs" anything. Its only purpose is to give the model a schema to conform to.

When Property Descriptions Carry All the Weight

With tool_choice forcing the call, the top-level tool description matters less — the model has no choice but to use it. What matters are the property descriptions, because those guide what value the model generates for each field.

// Vague — model guesses what's expected
score: { type: "number" }

// Specific — model knows exactly what scale to use and what the extremes mean
score: {
  type: "number",
  description: "Score from 0-10 where 0 is completely wrong and 10 is perfectly accurate. Use the full range — a partial correct answer should score 4-6, not cluster near the top."
}

This is a direct parallel to how description quality drives tool selection reliability in multi-tool setups. Whether the model is choosing which tool to call or filling in a field value, the description is the only lever you have to influence the output toward what you actually want.

How to Debug When the Model Picks Wrong

When you're not using tool_choice and the model selects the wrong tool or generates an unexpected value, there's no stack trace. The decision is opaque. The places to look:

Tool and property descriptions: is there enough differentiation between tools? Are the property descriptions specific enough about valid values?
System prompt: can you add explicit ordering instructions ("always call X before Y")?
tool_choice override: if a specific call must happen, force it rather than nudging

You're always nudging when you're not using tool_choice. Evals matter here because you can't inspect the decision — you can only measure outcomes across many runs and catch regressions.

The Decision Framework

Situation	Approach
Simple, well-defined schema, failures tolerable	Prompt-based
Complex schema, downstream systems depend on it	Forced tool use
You need to guarantee a specific call happens	Forced tool use with `tool_choice`
Multiple structured output types in one call	Define multiple tools, let model choose
High iteration speed matters most	Prompt-based (faster to edit)

The tradeoff is always reliability vs. flexibility. Forced tool use is more reliable but locks you into a schema. Prompt-based is easier to iterate but requires defensive parsing. In production systems where structured output feeds other components, the reliability is usually worth it.

The LLM-as-Judge Problem — Making Automated Evaluation Reliable

David Hahn — Thu, 04 Jun 2026 21:51:25 GMT

Automated evaluation using an LLM sounds like an elegant solution until you understand its failure modes. The model playing the role of a teacher grading work has four well-documented ways to get it wrong. And production systems that ignore them produce grades that are either systematically lenient or systematically inconsistent.

This post documents what I learned building the grading component of my personal study system, where I had to solve this problem for real.

The Four Failure Modes

Position bias. If you ask a judge to compare two answers (A vs. B), it tends to favor whichever comes first. Swap the order and you can get the opposite verdict. For pairwise evaluation, always test with swapped order and check consistency.

Verbosity bias. Longer, more confident-sounding answers score higher even when they're less accurate. The judge rewards the appearance of thoroughness. This is particularly dangerous in code review — a long but wrong implementation can outscore a short but correct one.

Self-preference bias. A model tends to rate outputs from itself (or similar models) more favorably than outputs from other models. If the same model generates the problem and grades the solution, this bias is active.

Sycophancy. If the prompt gives any hint of what answer you want, the model leans that way. "Confirm this is correct" versus "evaluate this objectively" produces meaningfully different results even when grading identical content.

The Design Decisions I Made

Rubric Decomposition Over Holistic Judgment

Instead of asking "is this a good solution?", every criterion is an atomic yes/no check with a specific description:

{
  "label": "Edge cases handled",
  "points": 2,
  "description": "Handles empty input, null/None, zero, or boundary values without crashing or returning wrong output.",
  "evaluation_type": "cascading",
  "evaluation_dependency": "correct_output"
}

Two things to note here. First, the description gives the model a concrete test to apply, not a judgment call to make. Second, the evaluation_type and evaluation_dependency fields encode a relationship I discovered was being missed: edge case handling is meaningless to evaluate if the primary output is wrong. Adding cascading dependencies prevents the judge from awarding points for edge cases in a solution that doesn't produce correct output for the happy path.

This dependency modeling took several iterations to get right. Initially I omitted it, and the grader was being too lenient on incorrect solutions because it evaluated each criterion independently.

Two-Pass Evaluation With Framing

To combat sycophancy and improve consistency, I run the grading prompt twice with different framings:

Strengths framing: "Focus on what the learner is doing well"
Gaps framing: "Focus on what the learner is missing or could fail on"

The gaps framing is dominant — it's the one that drives the final score. If the difference between the two passes exceeds 5% of the total point value, the result is flagged as uncertain. The threshold is arbitrary but functions as an early signal that the prompt needs tuning, since high discrepancy usually means the criteria descriptions are ambiguous.

The tradeoff I accepted: this approach isn't maximally accurate, but it's calibrated for the right direction. For a personal learning tool, being harder on gaps than strengths is a reasonable bias — I want the system to catch things I missed, not validate things I did right.

Criterion Descriptions Take Most of the Work

The most time-consuming part of building this wasn't the code — it was iterating on the criterion descriptions until the grader produced sensible results. One specific example: the time complexity criterion was initially docking points for an O(2n) solution, claiming a more efficient approach existed, when no such approach was possible. Two fixes resolved this:

Added explicit language to the description: "If a simpler approach has the same Big-O complexity class, prefer that. Only flag if a significantly better class is straightforward to achieve."
Added explicit language about O(n) and O(2n) being the same complexity class.

The underlying issue was the model hallucinating an O(n) approach that didn't exist and penalizing the solution accordingly. Clear criterion descriptions prevent this by constraining the model's judgment to what's actually being measured.

Chain-of-Thought Before the Verdict

The prompt explicitly asks for reasoning before the score:

## Instructions
1. For each rubric criterion, reason through whether the learner's answer satisfies it.
2. Assign points: full points if satisfied, 0 if not.
3. Return your response as a JSON object in exactly this format, with no other text.

Asking the model to reason first measurably reduces snap judgments. The reasoning also surfaces when the model is confused — if the reasoning field contains contradictory logic but is_satisfied is true, that's a signal to revisit the criterion description.

The Broader Point for Production Systems

Every production LLM evaluation system faces this problem in some form. Whether you're evaluating RAG retrieval quality, agent decision quality, or generated content — you need automated scoring you can trust.

The patterns I used here (rubric decomposition, multi-pass with framing, explicit reasoning before verdict, dependency modeling) all generalize. They're not study-buddy-specific. They're the standard toolkit for making LLM-as-judge reliable enough to act on.

Designing an LLM System That Actually Solves a Real Problem

David Hahn — Thu, 04 Jun 2026 20:33:38 GMT

Most LLM project ideas start from the technology. "What can I build with agents?" or "Let me try a RAG pipeline." That approach produces demos that are interesting for a day and abandoned by the weekend.

This series documents a different starting point: I had a real, recurring problem in my daily workflow, and I decided to build an LLM-powered system to solve it. The problem forced me into every major skill category that matters for applied AI engineering: prompt engineering, eval frameworks, agent architecture, structured output, and spaced repetition scheduling. Not as checkboxes, but as genuine design decisions with real tradeoffs.

The Problem

My daily technical practice sessions were almost entirely manual. Each morning I had to: decide what to work on, search through past notes to find problems I'd struggled with, prompt an AI step by step through a structured exercise, grade my own output, and log the session somewhere. This took meaningful time and was inconsistent. On high-friction mornings, the prep overhead itself became an excuse to skip.

The other problem was invisible: I had no spaced repetition. Problems I struggled with didn't resurface at the right time. I'd nail something in a session and not see it again for weeks, or hit the same failure mode repeatedly because I had no system tracking it.

What "actually works" looks like: wake up, run one command, get a study plan and a problem to work on. At the end of the session, submit the solution and get a graded report that tells me what I missed. Have that logged automatically so the system knows what to surface tomorrow.

Why Not Just Use an Existing Tool

LeetCode has no model of how I study. Anki handles repetition but not problem generation or grading. ChatGPT can do pieces of this but has no memory or state across sessions. The combination I needed (personalized problem generation, structured grading against my own rubric, and spaced repetition driven by actual performance data) didn't exist off the shelf. Building it was also the point: every component maps directly to applied AI skills.

The System's Six Jobs

When I broke down what the system actually needs to do, it decomposed into six functional components:

Problem generation: generate a relevant problem based on current skill gaps and history
Grading + feedback: evaluate a submitted solution against both objective and qualitative criteria
Progress tracking: automatically log sessions, scores, and patterns over time
Spaced repetition: resurface problems I struggled with at the right interval
Topic suggestion: recommend what to focus on next based on patterns in what I'm getting wrong
Schedule generation: produce a daily study plan that incorporates all of the above

The build order matters. Problem generation has the fewest unknowns and can be built with current knowledge. Grading is blocked on a design decision about evaluation reliability. Everything downstream depends on progress tracking. I sequenced the build to unblock design decisions as quickly as possible rather than building in order of appearance.

The Hardest Design Problem: Who Grades the Grader?

Before writing a line of code, I identified the blocking design decision: if the same model generates a problem AND grades the solution against it, it's evaluating its own output. Lenient generation leads to easy grades. The system becomes self-congratulatory.

This isn't just a personal project concern. It's one of the central problems in production LLM evaluation systems: how do you prevent model-generated rubrics from being too accommodating of model-generated solutions?

I researched three patterns before making a design decision:

Multi-model evaluation: use a different model to grade than the one that generated the problem. Breaks the self-preference loop.
Rubric decomposition: instead of asking "was this good?", break the evaluation into atomic yes/no checks. Harder to be lenient when each criterion has a specific description.
Adversarial test generation: prompt a model specifically to find edge cases the solution might miss, rather than just verifying the happy path.

The decision I landed on, and why, is the subject of the next post in this series.

What This Looks Like as an Architecture Decision

The FDE and applied AI roles I'm targeting care about this kind of thinking. Not "I used the Anthropic API" but "I identified a reliability problem at the system design level, researched the known solutions, and made an explicit decision with documented tradeoffs." That's the work — not the code.

The component build plan, the sequencing logic, the blocking research question — these are the artifacts of that thinking. The code comes after.

Building RAG from Scratch — Embeddings, pgvector, and a Bug Worth Knowing

David Hahn — Thu, 04 Jun 2026 20:27:52 GMT

RAG sounds complex until you break it into its actual steps:

Query → embed query → search vector store → retrieve top N chunks → prompt + chunks → generate

At its core, it's a retrieval problem with a generation step at the end. The model doesn't have access to your data — it reasons over whatever you include in the prompt. RAG is the mechanism for deciding what to include.

What Embeddings Actually Are

An embedding is a numerical representation of text — a list of floats (a vector) that captures semantic meaning. Text with similar meaning produces vectors that are close together in high-dimensional space.

This is what makes semantic search work. When you embed a query and search for the stored vectors nearest to it, "nearest" means semantically similar — not lexically similar. The query "how do I cancel my subscription" will find documents about "account cancellation" and "ending a membership" even if neither phrase appears in the query.

Traditional keyword search matches words. Embedding-based search matches meaning. That distinction matters a lot when users phrase things differently than your documentation does.

The Stack

For a basic RAG implementation in TypeScript/Node.js:

Anthropic API for the generation step
OpenAI embeddings for creating and querying vectors (Anthropic's SDK doesn't expose an embeddings API, so OpenAI fills that gap)
pgvector on PostgreSQL for the vector store
NDJSON streaming to push results to the client incrementally

The pgvector setup is straightforward. It's a Postgres extension that adds a vector column type and similarity search operators. You store your document chunks with their embeddings, then query for the closest matches at retrieval time.

Balancing the Similarity Threshold

Every RAG implementation needs a similarity threshold — a cutoff below which retrieved chunks are considered too dissimilar to be relevant.

Setting this wrong in either direction causes real problems:

Too high: You filter out chunks that are relevant but not close to an exact phrasing match. The model gets less context than it should and either makes things up or says it doesn't know.

Too low: You retrieve chunks that aren't genuinely relevant, adding noise that degrades the quality of the generated response. And it costs tokens.

There's no universal right answer here. The threshold needs to be tuned against real queries from your specific use case. Start conservative (higher threshold, fewer results) and loosen it as you observe misses.

The pgvector Bug That Trips You Up

Here's the production debugging story that makes this post worth reading.

When I was building the RAG module, I hit a case where the similarity search was returning empty results even for queries that clearly matched stored documents. The data was there. The embeddings were correct. The query looked right.

The culprit: a known pgvector behavior where referencing the same parameterized vector expression more than once in a single query causes it to return nothing.

This query fails silently:

SELECT id, content
FROM documents
WHERE 1 - (embedding <=> \(1::vector) > \)2
ORDER BY embedding <=> $1::vector

The vector $1::vector is referenced twice — once in the WHERE clause and once in the ORDER BY. pgvector evaluates it twice, and the second evaluation returns empty.

The fix is a subquery that evaluates the expression once and references the result:

SELECT id, content, similarity
FROM (
  SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
  FROM documents
) AS ranked
WHERE similarity > $2
ORDER BY similarity DESC

This pattern evaluates the vector expression a single time in the inner query, then filters and sorts against the pre-computed similarity score in the outer query. The results come back correctly.

The practical rule: never reference the same parameterized vector more than once in a single pgvector query.

NDJSON for Streaming Mixed Content

Basic streaming pushes raw text strings to the client. RAG adds a retrieval step before generation — and the client needs to know about both. What got retrieved? When does generation start?

The answer is NDJSON (newline-delimited JSON): each chunk pushed through the stream is a JSON object with a type field:

// Retrieval result
controller.enqueue(encoder.encode(JSON.stringify({ type: "sources", data: retrievedChunks }) + "\n"));

// Generated text
controller.enqueue(encoder.encode(JSON.stringify({ type: "text", delta: chunk.delta.text }) + "\n"));

The client splits incoming data on newlines and parses each line independently. A partial reader.read() result gets buffered until the next \n arrives. This is also why TextEncoder becomes necessary here — ReadableStream expects Uint8Array, and NDJSON requires explicit encoding rather than relying on runtime tolerance for plain strings.

The Broader Pattern

The pgvector bug is a good example of a class of problems that's common in applied AI work: the integration layer between the model and your data infrastructure has its own failure modes that have nothing to do with the model. Debugging them requires treating each layer (the embedding generation, the vector store query, the retrieval pipeline) as independently testable components.

In production RAG systems, most failures happen in retrieval, not generation. The model does a reasonable job if given good context. The hard part is reliably getting it that context.

Tool Use — How the Model Calls Your Code (And What It Never Sees)

David Hahn — Wed, 03 Jun 2026 00:51:10 GMT

One of the most important things to internalize about LLM tool use is what the model actually does (and doesn't do) when it "calls" a function.

Anthropic never executes your code. The model reads the description and schema you provide, decides this is the right tool for the job, generates a valid set of arguments, and stops. You run the function. You send the result back. The model then continues generating based on that result.

Understanding this distinction changes how you think about building reliable tool-based systems.

The Conversation Flow

Tool use turns a single API call into a multi-turn exchange:

You send a message + your tool definitions
The model responds with a tool_use block containing a name and arguments
You run the actual function with those arguments
You send the result back as a tool_result message
The model generates its final response using that result

This is fundamentally different from a normal completion. The model is mid-sentence when it decides to call a tool — it needs you to act on that, then resumes where it left off.

What This Looks Like in Code

In a streaming context, adding tool use changes the stream in two ways. First, instead of only seeing text_delta events, you'll now see input_json_delta events — the model streaming the JSON arguments for a tool call in chunks. Second, content_block_start becomes load-bearing, because it tells you what kind of block is opening:

// content_block_start for a tool call looks like:
{
  type: 'content_block_start',
  content_block: {
    type: 'tool_use',
    id: 'toolu_01A09q90qw90lq917835lq9',
    name: 'get_weather',
    input: {}
  }
}

When you see a tool_use block starting, you capture the name and id, then accumulate the incoming input_json_delta chunks into a string. You don't parse that string until content_block_stop fires — that's your signal that the block is complete and safe to process.

Why wait for content_block_stop? Because a single message can contain multiple content blocks arriving in sequence. Parsing at message end would concatenate them incorrectly. Parsing at block stop means each tool call is handled as its own logical unit.

The Conversation State Problem

With basic streaming, the messages array was static — one request, one response. With tool use, it grows:

// After the model calls a tool:
messages = [
  { role: "user", content: "What's the weather in Chicago?" },
  { role: "assistant", content: [{ type: "tool_use", id: "...", name: "get_weather", input: { city: "Chicago" } }] },
  { role: "user", content: [{ type: "tool_result", tool_use_id: "...", content: "72°F, partly cloudy" }] }
]

If you don't append the tool calls and results to the conversation history before the next request, the model has no memory of what it called or what came back. The full conversation state must be sent with every turn.

This is also why the routing layer becomes a loop, not a single pass. Each iteration sends the current conversation state, streams the response, and either breaks (model is done) or continues (model called a tool and needs the result before it can finish).

Writing Tool Descriptions That Actually Work

Since the model selects tools based on their descriptions — not their implementations — description quality is where reliability lives. A useful mental model:

Code concept	Tool concept
Function signature	Tool name + schema
JSDoc / comments	Tool description + property descriptions
Function body	Your implementation (model never sees this)
Return value	The tool result you send back

The key difference from a function: the output is not deterministic. Given the same inputs, there's no guarantee the model selects the same tool or generates the same arguments every time. This is why description quality matters more than implementation quality for reliability.

When you have multiple tools and the model needs to pick between them, descriptions need to clearly differentiate. A good test: if a new engineer read only the description, would they know when to use this tool vs. the others?

// Too vague — model will guess when to use this
{
  name: "search",
  description: "Searches for information"
}

// Specific enough to be reliable
{
  name: "search_products",
  description: "Searches the product catalog by name, category, or SKU. Use this when the user is looking for a specific product or browsing a category. Do not use for order status, shipping, or account questions."
}

Why This Matters Beyond Demos

The tool use pattern shows up in almost every production LLM system worth building: agents that can query databases, orchestrators that delegate to specialist models, copilots that can take actions in your application. The underlying mechanism is always the same — the model generates intent, your code executes, the result flows back.

The engineering challenge isn't the API call. It's building the conversation state management, the result routing, and the error handling that makes this loop reliable at scale. Those are the parts that look easy in tutorials and break in production.

Why Streaming Changes How You Build LLM-Powered Interfaces

David Hahn — Wed, 03 Jun 2026 00:46:31 GMT

When I started building on the Anthropic API, the first thing I had to stop treating as a detail was streaming. It's easy to prototype an LLM feature with a standard request-response cycle — send a message, wait, render the result. It works fine until a real user is on the other end staring at a blank screen for three seconds.

Streaming is the difference between a chatbot that feels alive and one that feels like a form submission.

What Streaming Actually Is

Streaming generates and delivers text incrementally as the model produces it, rather than waiting for the full response to be ready. The transport mechanism is Server-Sent Events (SSE) — a one-way channel where the server pushes data to the client as it's available.

From a product perspective, this matters for two reasons:

Perceived latency drops dramatically. Users see the first token in under a second instead of waiting for the full response. Even if the total generation time is the same, the experience feels faster because something is happening immediately.

You can cancel early. If the model starts going in the wrong direction, the user (or your system) can cut the stream and save tokens. In agentic workflows where model calls chain together, this adds up.

When should you not use streaming? Backend pipelines where a machine is consuming the output, data extraction tasks where you need the complete structured response before you can do anything, and automated agent workflows where partial state is worse than no state. Streaming is a mechanism for incrementally delivering content — it's not always the right one.

The Bridge Problem

Here's where it gets interesting at the implementation level. The Anthropic SDK exposes streaming as an async iterator — it gives you chunks as they arrive through a for await loop. That's great for server-side code. But a Response object in a Next.js API route expects a ReadableStream, not an async iterator. They're not the same thing.

The solution is a ReadableStream wrapper that bridges the two:

const readable = new ReadableStream({
  async start(controller) {
    for await (const chunk of stream) {
      if (
        chunk.type === 'content_block_delta' &&
        chunk.delta.type === 'text_delta'
      ) {
        controller.enqueue(chunk.delta.text);
      }
    }
    controller.close();
  },
});

What this does: ReadableStream maintains an internal queue. As chunks arrive from the Anthropic stream, we filter for the ones we care about (content_block_delta events with text_delta type) and push them into that queue via controller.enqueue(). The client reads from the queue via reader.read(), which suspends when the queue is empty and resumes when a new chunk arrives. When controller.close() is called, read() returns done: true on the next call.

It's a producer/consumer pattern — the Anthropic stream is the fast producer, the client is the slower consumer, and ReadableStream is the buffer between them.

Understanding the Stream Structure

Every Anthropic streaming response follows the same envelope pattern:

message_start
  content_block_start
    content_block_delta
    content_block_delta
    ...
  content_block_stop
  content_block_start
    content_block_delta
    content_block_delta
  content_block_stop
message_stop

message_start carries the outer metadata — model, token usage, stop reason. The actual content lives in content_block_delta events. For basic text responses, you only need to track content_block_delta with type: 'text_delta'. Once you add tool use to the mix (covered in the next post), content_block_start becomes critical because it tells you what type of block is opening — text or a tool call.

Why This Matters for Production Systems

The ReadableStream wrapping pattern isn't just a TypeScript quirk — it's a concrete example of a problem that comes up constantly in applied AI work: the model's output format doesn't match what your application layer expects, and you need an adapter layer between them.

That adapter layer — whether it's a stream bridge, a JSON parser, or a response transformer — is often where the real engineering work happens in LLM-powered products. The model call itself is straightforward. Making its output integrate cleanly with the rest of your stack is not.