Error Handling in LLM Systems — Three Categories, One Decision Tree

Senior engineer with deep roots in React, TypeScript, and production-scale UI — including 4 years at Apple. Now focused on applied AI engineering: working hands-on with LLM APIs, RAG pipelines, agents, and the full stack of tooling that modern AI products are built on. Writing here to document the build, share what's non-obvious, and connect with teams working on hard AI problems.

Everything in the earlier modules assumed the happy path: the API responds, the model generates valid output, tools return results, the stream closes cleanly. In production, none of that is guaranteed. The API times out. The model wraps JSON in a greeting. A tool throws. Rate limits hit mid-session.

Without explicit error handling, any of these silently breaks the user experience or crashes the server. The goal isn't to prevent errors — they're unavoidable — it's to build a system that degrades gracefully, retries intelligently, and gives the user something useful when things go wrong.

The Three Categories

Most engineers handle one category of LLM errors and miss the other two. All three are distinct and require different responses.

Transient errors are temporary infrastructure failures: API timeouts, rate limits, momentary network drops. The characteristic of a transient error is that the same request, retried after a short wait, has a reasonable chance of succeeding. Examples: HTTP 429, 503, 502, connection reset.

Permanent errors are request or auth failures that won't resolve on retry. Retrying a 401 wastes API budget and adds latency to a failure that's already final. Examples: invalid API key (401), forbidden (403), malformed request payload (400).

Output errors are the category most tutorials omit entirely. The API call succeeded — HTTP 200, stream closed cleanly — but the output is wrong. Truncated responses from hitting max_tokens. Malformed JSON when using prompt-based structured output. Tool execution failures. These require different handling than transport errors because there's nothing to retry at the transport layer; the problem is in the content.

The Retry Utility

The core principle: the retry decision is determined by error type, not retry count. Count limits how many times you try. Type determines whether you should try at all.

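export async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {}
): Promise<T> {
  // Assumes RetryOptions has an optional maxAttempts field; the default of 3 is illustrative
  const { maxAttempts = 3 } = options;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const isLastAttempt = attempt === maxAttempts;
      if (isLastAttempt || !shouldRetry(error, attempt)) {
        throw error; // permanent error or exhausted attempts: stop
      }
      await delay(exponentialBackoff(attempt)); // transient: wait and retry
    }
  }
  // Only reachable if maxAttempts < 1; also gives every code path a return or throw
  throw new Error("withRetry requires maxAttempts >= 1");
}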

shouldRetry checks isTransientError first. A 401 doesn't match any transient pattern, so shouldRetry returns false and withRetry rethrows the error immediately. A 429 matches, so shouldRetry returns true and the call is retried after a delay. Separating "should I retry at all" from "how many times have I retried" keeps both concerns independently testable.
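One way the helpers referenced above might fit together is sketched below. The original does not show `RetryOptions`, `shouldRetry`, `delay`, or the defaults, so the field names, default values, and the `sleep` test hook are all assumptions made for illustration, and `isTransientError` is a simplified stand-in.

```typescript
// Hypothetical option shape; field names and defaults are assumptions.
interface RetryOptions {
  maxAttempts?: number;
  initialDelayMs?: number;
  maxDelayMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Simplified stand-in for a message-based transient check.
function isTransientError(error: unknown): boolean {
  const message = (error instanceof Error ? error.message : String(error)).toLowerCase();
  return ["429", "rate limit", "timeout", "502", "503"].some((p) => message.includes(p));
}

// Error type decides whether to retry at all; the loop enforces the count.
function shouldRetry(error: unknown, _attempt: number): boolean {
  return isTransientError(error);
}

async function withRetry<T>(fn: () => Promise<T>, options: RetryOptions = {}): Promise<T> {
  const { maxAttempts = 3, initialDelayMs = 1000, maxDelayMs = 30000, sleep = delay } = options;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts || !shouldRetry(error, attempt)) throw error;
      // Exponential backoff with up to 30% additive jitter.
      const base = Math.min(initialDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
      await sleep(Math.round(base + Math.random() * 0.3 * base));
    }
  }
  throw new Error("withRetry requires maxAttempts >= 1");
}
```

With this shape, a function that fails twice with a rate-limit error and then succeeds is called three times, while one that fails with a 401 is called exactly once.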

Exponential Backoff With Jitter

Each retry waits longer than the last:

const baseDelay = Math.min(initialDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
const jitter = Math.random() * 0.3 * baseDelay;
const delayMs = Math.round(baseDelay + jitter);

Jitter is easy to skip and worth keeping. If multiple clients hit a rate limit simultaneously and retry at identical intervals, they hammer the server again in lockstep. This synchronized spike is called a thundering herd, and it can trigger the same rate limit that caused the original failure, creating a feedback loop. Adding a small random offset (here, up to 30% of the base delay) spreads retries across a window and breaks the synchronization.

For a single user this doesn't matter much. For any system running parallel requests or serving multiple users — which applies to every production LLM product — jitter is the difference between exponential backoff that actually relieves pressure and exponential backoff that just delays the next spike.
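To make the schedule concrete, the three lines above can be wrapped in a function. This is a minimal sketch: the signature, the default values, and the injectable `random` parameter (there only so the output can be made deterministic) are assumptions, not part of the original utility.

```typescript
// Hypothetical signature; parameter names and defaults are assumptions.
function exponentialBackoff(
  attempt: number, // 1-based attempt number
  initialDelayMs = 1000,
  maxDelayMs = 30000,
  random: () => number = Math.random // injectable for deterministic tests
): number {
  const baseDelay = Math.min(initialDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
  const jitter = random() * 0.3 * baseDelay; // up to 30% on top of the base
  return Math.round(baseDelay + jitter);
}

// Base delays with jitter disabled: 1000, 2000, 4000, 8000, 16000, then capped at 30000.
const schedule = [1, 2, 3, 4, 5, 6].map((n) => exponentialBackoff(n, 1000, 30000, () => 0));
```

Because the jitter here is only ever added, the base delay acts as a lower bound on the wait and the cap can be exceeded by up to 30%. If a hard ceiling matters, applying the cap after adding jitter is a straightforward variation.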

Detecting Permanent vs. Transient Errors

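function isTransientError(error: unknown): boolean {
  // Thrown values are not always Error instances, so read the message defensively
  const message = (error instanceof Error ? error.message : String(error)).toLowerCase();
  return (
    message.includes("429") ||
    message.includes("rate limit") ||
    message.includes("timeout") ||
    message.includes("503") ||
    message.includes("502") ||
    message.includes("econnreset") // connection reset
  );
}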

This is intentionally conservative: it only matches patterns that are clearly transient. Anything that doesn't match (including 401s, 403s, and any unknown error) falls through to non-retriable. The safer default in error classification is to not retry, because retrying a permanent error is strictly worse than throwing immediately: it adds latency and burns API budget on a request that had already failed.
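String matching on messages is simple, but it can misfire when a message happens to contain those digits for another reason. Where the HTTP client attaches a status code to the thrown error, classifying by status is a stricter alternative. The sketch below assumes a numeric `status` property on the error, which is an assumption about the client library; `ErrorClass`, `getStatus`, and `classifyByStatus` are hypothetical names.

```typescript
type ErrorClass = "transient" | "permanent" | "unknown";

// Assumption: the thrown value may carry a numeric `status` (HTTP status code).
function getStatus(error: unknown): number | undefined {
  if (typeof error === "object" && error !== null && "status" in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

function classifyByStatus(error: unknown): ErrorClass {
  const status = getStatus(error);
  if (status === undefined) return "unknown"; // no status: do not guess
  if (status === 408 || status === 429 || status >= 500) return "transient";
  if (status >= 400) return "permanent"; // 400, 401, 403 and other client errors
  return "unknown";
}

// Same conservative default as the message-based check: only "transient" retries.
const isRetriable = (error: unknown): boolean => classifyByStatus(error) === "transient";
```

Treating every 5xx as transient follows the decision tree later in this post; a stricter list (for example only 502, 503, and 504) is a reasonable variation.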

Handling Output Errors

Truncated responses: check stop_reason

stop_reason === "max_tokens" means the model ran out of tokens mid-response. The HTTP call succeeded, the stream closed, but the content is incomplete. Without checking this field the truncated response gets returned silently.

if (response.stop_reason === "max_tokens") {
  // retry with a higher max_tokens budget
}

This is one of the more common silent failures in LLM systems. Requests that call for long outputs are particularly susceptible: the model can hit the token limit well into the response, with no visible indication other than the output ending abruptly mid-sentence.
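The "retry with a higher budget" comment can be expanded into a bounded loop. This is a sketch under assumptions: `call` is a hypothetical wrapper that issues the request with a given `max_tokens`, `ModelResponse` is a pared-down response shape, and the doubling strategy and the 8192 ceiling are illustrative choices.

```typescript
// Minimal response shape for this sketch; real SDK responses carry more fields.
interface ModelResponse {
  stop_reason: string;
  text: string;
}

// `call` is a hypothetical wrapper around the API request that takes the token budget.
async function completeWithBudget(
  call: (maxTokens: number) => Promise<ModelResponse>,
  initialMaxTokens = 1024,
  ceiling = 8192
): Promise<ModelResponse> {
  let maxTokens = initialMaxTokens;
  for (;;) {
    const response = await call(maxTokens);
    if (response.stop_reason !== "max_tokens") return response; // finished normally
    if (maxTokens >= ceiling) {
      // Surface the truncation instead of returning partial content silently.
      throw new Error(`Response still truncated at max_tokens=${maxTokens}`);
    }
    maxTokens = Math.min(maxTokens * 2, ceiling); // double the budget, bounded
  }
}
```

The ceiling matters because each retry regenerates the whole response. Without a bound, a prompt that always overruns would keep spending tokens.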

Malformed JSON: extract before giving up

When using prompt-based structured output (not forced tool use), the model occasionally wraps JSON in a greeting, explanation, or markdown block. Before failing, attempt extraction:

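try {
  return JSON.parse(text);
} catch {
  // Direct parse failed: look for the outermost {...} span inside the surrounding prose
  const jsonMatch = text.match(/\{[\s\S]*\}/);
  if (jsonMatch) {
    try {
      return JSON.parse(jsonMatch[0]);
    } catch {
      // The extracted span was not valid JSON either; fall through to the error below
    }
  }
  throw new Error("Could not extract valid JSON from response");
}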

This is also the argument for the forced tool use pattern from 04-structured-output. The model's tool-calling pathway is specifically trained for schema compliance — it doesn't produce mixed content. Prompt-based JSON requires this kind of defensive extraction. If structured output is feeding downstream application logic, the reliability difference is significant enough to prefer forced tool use.
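For cases where prompt-based JSON is unavoidable, the extraction step can be made a little more thorough. The sketch below is one possible approach rather than the post's implementation: `extractJson` is a hypothetical helper that tries the raw text, then the contents of a markdown code fence, then the outermost object or array span, and only throws once every candidate has failed to parse.

```typescript
function extractJson(text: string): unknown {
  const candidates: string[] = [text.trim()];

  // Contents of a markdown code fence, if the model wrapped its output in one.
  const inner = text.match(/`{3}(?:json)?\s*([\s\S]*?)`{3}/)?.[1];
  if (inner !== undefined) candidates.push(inner.trim());

  // Outermost {...} span, then outermost [...] span.
  for (const [open, close] of [["{", "}"], ["[", "]"]] as const) {
    const start = text.indexOf(open);
    const end = text.lastIndexOf(close);
    if (start !== -1 && end > start) candidates.push(text.slice(start, end + 1));
  }

  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate);
    } catch {
      // Not valid JSON; try the next candidate.
    }
  }
  throw new Error("Could not extract valid JSON from response");
}
```

This only confirms that the text parses as JSON. It says nothing about whether the result matches the expected schema, so a validation step is still needed before the value reaches application logic.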

Tool failures: pass the error through tool_result

When a tool execution fails, the instinct is to catch the error and handle it in application code. The better approach is to send the error back to the model through the normal tool_result channel:

let toolResultContent: string;
try {
  toolResultContent = await executeTool(name, input);
} catch (err) {
  // Report the failure to the model instead of swallowing it
  toolResultContent = JSON.stringify({
    error: err instanceof Error ? err.message : String(err),
    retry: true
  });
}

messages.push({
  role: "user",
  content: [{ type: "tool_result", tool_use_id: id, content: toolResultContent }]
});

The model now has the error in its conversation context. It can explain to the user what failed, suggest alternatives, or ask to retry with different parameters — rather than silently returning a degraded response while the application pretends the tool succeeded.

In a test with a deliberately flaky tool that returned intermittent failures, the model received two consecutive errors and accurately summarized both in its final response: "I tried to look that up twice and hit an error each time — you may want to try again in a moment." It had full context for the failure history because both error responses were in the messages array. The model handled the degraded path better than most application code would.
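The same pattern can be packaged as a helper that always yields a well-formed tool_result message, whether the tool succeeded or threw. This is a sketch: `runToolForModel` and the interface names are hypothetical, and the `is_error` flag is an assumption about the target API (Anthropic's Messages API accepts it on tool_result blocks; omit it if yours does not).

```typescript
interface ToolResultBlock {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean; // assumption: the target API accepts this flag
}

interface UserMessage {
  role: "user";
  content: ToolResultBlock[];
}

// `execute` is a hypothetical closure that runs the requested tool.
async function runToolForModel(
  toolUseId: string,
  execute: () => Promise<unknown>
): Promise<UserMessage> {
  let block: ToolResultBlock;
  try {
    const result = await execute();
    // `?? null` keeps JSON.stringify from returning undefined for void tools.
    block = { type: "tool_result", tool_use_id: toolUseId, content: JSON.stringify(result ?? null) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    block = {
      type: "tool_result",
      tool_use_id: toolUseId,
      content: JSON.stringify({ error: message, retry: true }),
      is_error: true,
    };
  }
  return { role: "user", content: [block] };
}
```

Every tool call then produces exactly one message to push onto the conversation, so a failed call still leaves a record in the model's context.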

The Decision Tree

When something goes wrong in an LLM system, these are the questions to answer, in order:

Error occurred
├── Is it transient? (timeout, rate limit, 5xx)
│   └── Yes → retry with exponential backoff + jitter
└── No — is it permanent? (401, 403, malformed request)
    ├── Yes → throw immediately, don't retry
    └── No — is it an output error?
        ├── Truncated → retry with higher max_tokens
        ├── Malformed JSON → extract or retry with better prompt / switch to forced tool use
        └── Tool failure → pass error to model via tool_result, let model reason over it

The structure matters because collapsing these into a single retry loop — the most common mistake — means permanent errors get retried (wasting budget), output errors get treated as transport failures (wrong fix), and tool failures get silently swallowed (broken UX with no explanation).
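One way to keep the branches from collapsing into a single loop is to encode the tree as a type, so each failure kind has to be mapped to its own response. The names below (`Failure`, `Action`, `nextAction`) are hypothetical and the sketch only covers the dispatch, not the handlers themselves.

```typescript
type Failure =
  | { kind: "transient" }
  | { kind: "permanent" }
  | { kind: "truncated" }
  | { kind: "malformed_json" }
  | { kind: "tool_failure" };

type Action =
  | "retry_with_backoff"
  | "throw_immediately"
  | "retry_with_higher_max_tokens"
  | "extract_or_reprompt"
  | "send_error_as_tool_result";

// Exhaustive switch: adding a new Failure kind without handling it is a compile error.
function nextAction(failure: Failure): Action {
  switch (failure.kind) {
    case "transient":
      return "retry_with_backoff";
    case "permanent":
      return "throw_immediately";
    case "truncated":
      return "retry_with_higher_max_tokens";
    case "malformed_json":
      return "extract_or_reprompt";
    case "tool_failure":
      return "send_error_as_tool_result";
  }
}
```

The mapping itself is trivial. The benefit is that the compiler enforces the separation the tree describes.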


This is part of a series on building applied AI systems. Start from the beginning with streaming →

David Hahn | Applied AI Engineering

Real implementations, real bugs, and the mental models behind building on LLMs. I'm a fullstack engineer with 10+ years of experience, including 4 years at Apple, now going deep on applied AI: streaming, RAG, tool use, agents, and the full stack of tooling modern AI products are built on.