Structured Output — When to Use Prompting vs. Forced Tool Use

Most LLM features eventually need the model to return structured data like a JSON object your application can parse and act on. Connecting LLM output to a database, a UI component, or a downstream API requires structure you can depend on.

There are two approaches. Choosing the right one changes how reliable your system is.

Approach 1: Prompt-Based

Tell the model in the system prompt to return JSON matching a specific shape.

const systemPrompt = `
You are a grading assistant. Return your evaluation as a JSON object with this structure:
{
  "score": number (0-10),
  "passed": boolean,
  "feedback": string
}
Return only the JSON object. No markdown, no explanation, no code blocks.
`;

This works most of the time. The model is good at following formatting instructions for common shapes. The problem is "most of the time". Occasionally, it wraps the response in markdown code blocks, adds a preamble sentence, or returns a subtly malformed object. You end up parsing defensively:

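const block = response.content[0];
const text = block.type === "text" ? block.text : "";
// Strip the markdown code fences the model sometimes adds despite instructions
const cleaned = text.replace(/```json|```/g, '').trim();
const parsed = JSON.parse(cleaned); // still throws if the JSON itself is malformed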
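The cleanup above handles code fences but not a preamble sentence or a wrong-shaped object. A slightly more defensive version might look like the sketch below. `EvaluationResult` and `parseEvaluation` are hypothetical names for illustration, not part of any SDK, and the brace-matching heuristic assumes the reply contains a single top-level JSON object.

```typescript
// Hypothetical result type matching the shape requested in the prompt.
interface EvaluationResult {
  score: number;
  passed: boolean;
  feedback: string;
}

// Pulls the text from the first "{" through the last "}" out of a model
// reply, parses it, and checks the shape. Returns null instead of throwing
// so the caller can decide whether to retry or fall back.
function parseEvaluation(text: string): EvaluationResult | null {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let value: unknown;
  try {
    value = JSON.parse(text.slice(start, end + 1));
  } catch {
    return null; // malformed JSON
  }

  if (typeof value !== "object" || value === null) return null;
  const v = value as Record<string, unknown>;
  if (typeof v.score !== "number" || v.score < 0 || v.score > 10) return null;
  if (typeof v.passed !== "boolean" || typeof v.feedback !== "string") return null;

  return { score: v.score, passed: v.passed, feedback: v.feedback };
}
```

Returning null rather than throwing keeps the failure path explicit: the caller has to decide what an unparseable reply means for the feature.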

For low-stakes or high-volume use cases where occasional failures are acceptable, prompt-based is fine. It's also simpler to iterate on — just edit the prompt.

Approach 2: Forced Tool Use

Define a tool whose input schema is exactly the shape you want, then force the model to call it:

tools: [{
  name: "submit_evaluation",
  description: "Submit the structured evaluation result",
  input_schema: {
    type: "object",
    properties: {
      score: {
        type: "number",
        description: "Score from 0-10 where 0 is completely wrong and 10 is perfectly accurate"
      },
      passed: { type: "boolean" },
      feedback: { type: "string", description: "One to two sentences of specific, actionable feedback for the learner" }
    },
    required: ["score", "passed", "feedback"]
  }
}],
tool_choice: { type: "tool", name: "submit_evaluation" }

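With tool_choice forcing the model to call this tool, the response contains a tool_use block whose input arrives as an already-parsed object, so there is no text to clean up and no JSON.parse step. In practice the input follows the schema far more reliably than prompted JSON, because models are trained on tool calling. Treat the schema as strong guidance rather than a hard guarantee unless the API enforces it: a response cut off by the token limit, for example, can leave required fields missing. You extract the input directly: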

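const toolUse = response.content.find(b => b.type === "tool_use");
if (!toolUse || toolUse.type !== "tool_use") throw new Error("Expected a tool_use block");
return toolUse.input as EvaluationResult; // compile-time cast only, not a runtime check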

The tool never "runs" anything. Its only purpose is to give the model a schema to conform to.
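To show where the tools and tool_choice fragment above sits in a whole request, here is a minimal sketch that builds the request body as a plain object. The local interfaces and `buildEvaluationRequest` are hypothetical stand-ins so the example compiles without the SDK installed, and the model ID is passed in rather than assumed.

```typescript
// Minimal local types so the sketch compiles without the SDK installed.
// Field names follow the request shape used in this post.
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
}

interface EvaluationRequest {
  model: string;
  max_tokens: number;
  messages: { role: "user" | "assistant"; content: string }[];
  tools: ToolDefinition[];
  tool_choice: { type: "tool"; name: string };
}

const submitEvaluation: ToolDefinition = {
  name: "submit_evaluation",
  description: "Submit the structured evaluation result",
  input_schema: {
    type: "object",
    properties: {
      score: { type: "number", description: "Score from 0-10" },
      passed: { type: "boolean" },
      feedback: { type: "string" },
    },
    required: ["score", "passed", "feedback"],
  },
};

// Hypothetical helper: assembles the request for one learner answer.
function buildEvaluationRequest(model: string, learnerAnswer: string): EvaluationRequest {
  return {
    model,
    max_tokens: 1024,
    messages: [{ role: "user", content: learnerAnswer }],
    tools: [submitEvaluation],
    // Deriving the name from the tool object keeps the two from drifting apart.
    tool_choice: { type: "tool", name: submitEvaluation.name },
  };
}
```

Taking the forced tool name from the tool definition itself avoids a mismatch between tool_choice and tools when the tool is renamed.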

When Property Descriptions Carry All the Weight

With tool_choice forcing the call, the top-level tool description matters less — the model has no choice but to use it. What matters are the property descriptions, because those guide what value the model generates for each field.

// Vague — model guesses what's expected
score: { type: "number" }

// Specific — model knows exactly what scale to use and what the extremes mean
score: {
  type: "number",
  description: "Score from 0-10 where 0 is completely wrong and 10 is perfectly accurate. Use the full range — a partial correct answer should score 4-6, not cluster near the top."
}

This is a direct parallel to how description quality drives tool selection reliability in multi-tool setups. Whether the model is choosing which tool to call or filling in a field value, the description is the only lever you have to influence the output toward what you actually want.

How to Debug When the Model Picks Wrong

When you're not using tool_choice and the model selects the wrong tool or generates an unexpected value, there's no stack trace. The decision is opaque. The places to look:

Tool and property descriptions: is there enough differentiation between tools? Are the property descriptions specific enough about valid values?
System prompt: can you add explicit ordering instructions ("always call X before Y")?
tool_choice override: if a specific call must happen, force it rather than nudging.

You're always nudging when you're not using tool_choice. Evals matter here because you can't inspect the decision — you can only measure outcomes across many runs and catch regressions.
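Measuring outcomes across many runs can be as simple as recording which tool the model picked for each input and scoring the records afterwards. The sketch below assumes you collect those records yourself from your own API wrapper. `ToolSelectionRun`, `summarizeToolSelection`, and `findRegressions` are hypothetical names, and the 0.05 tolerance is an arbitrary starting point.

```typescript
// One recorded run: which tool we expected, and which one the model picked.
interface ToolSelectionRun {
  input: string;
  expectedTool: string;
  actualTool: string | null; // null when the model answered in plain text
}

interface ToolStats {
  runs: number;
  correct: number;
  accuracy: number;
}

// Groups recorded runs by expected tool and computes per-tool accuracy.
function summarizeToolSelection(runs: ToolSelectionRun[]): Record<string, ToolStats> {
  // Prototype-free object so tool names never collide with built-in keys.
  const stats: Record<string, ToolStats> = Object.create(null);
  for (const run of runs) {
    const s = stats[run.expectedTool] ?? { runs: 0, correct: 0, accuracy: 0 };
    s.runs += 1;
    if (run.actualTool === run.expectedTool) s.correct += 1;
    s.accuracy = s.correct / s.runs;
    stats[run.expectedTool] = s;
  }
  return stats;
}

// Lists tools whose accuracy fell by more than `tolerance` against a baseline.
function findRegressions(
  baseline: Record<string, ToolStats>,
  current: Record<string, ToolStats>,
  tolerance = 0.05,
): string[] {
  return Object.keys(baseline).filter((tool) => {
    const before = baseline[tool];
    const after = current[tool];
    // A tool missing from the current run counts as a regression.
    if (after === undefined) return true;
    return before !== undefined && before.accuracy - after.accuracy > tolerance;
  });
}
```

Comparing per-tool accuracy before and after a description change is what turns an opaque decision into something you can track.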

The Decision Framework

Situation	Approach
Simple, well-defined schema, failures tolerable	Prompt-based
Complex schema, downstream systems depend on it	Forced tool use
You need to guarantee a specific call happens	Forced tool use with `tool_choice`
Multiple structured output types in one call	Define multiple tools, let model choose
High iteration speed matters most	Prompt-based (faster to edit)
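
For the "define multiple tools, let model choose" row, the application has to branch on which tool came back. A minimal sketch of that dispatch follows. The `ContentBlock` type is a local stand-in for the response content blocks, `ask_clarification` is a hypothetical second tool invented for this example, and `readStructuredOutput` is not an SDK function.

```typescript
// Local stand-in for the content blocks in a response.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

// One variant per tool the model may choose.
type StructuredOutput =
  | { kind: "evaluation"; score: number; passed: boolean }
  | { kind: "clarification"; question: string };

// Reads the first tool_use block and maps it to a typed variant.
// Returns null for plain-text replies, unknown tools, or malformed input.
function readStructuredOutput(content: ContentBlock[]): StructuredOutput | null {
  for (const block of content) {
    if (block.type !== "tool_use") continue;
    if (typeof block.input !== "object" || block.input === null) return null;
    const input = block.input as Record<string, unknown>;

    switch (block.name) {
      case "submit_evaluation":
        if (typeof input.score === "number" && typeof input.passed === "boolean") {
          return { kind: "evaluation", score: input.score, passed: input.passed };
        }
        return null;
      case "ask_clarification": // hypothetical second tool
        if (typeof input.question === "string") {
          return { kind: "clarification", question: input.question };
        }
        return null;
      default:
        return null; // a tool this code does not know about
    }
  }
  return null; // no tool_use block at all
}
```

The discriminated union means the compiler forces every caller to handle each output type explicitly.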

The tradeoff is always reliability vs. flexibility. Forced tool use is more reliable but locks you into a schema. Prompt-based is easier to iterate but requires defensive parsing. In production systems where structured output feeds other components, the reliability is usually worth it.
