Structured Output — When to Use Prompting vs. Forced Tool Use

Senior engineer with deep roots in React, TypeScript, and production-scale UI — including 4 years at Apple. Now focused on applied AI engineering: working hands-on with LLM APIs, RAG pipelines, agents, and the full stack of tooling that modern AI products are built on. Writing here to document the build, share what's non-obvious, and connect with teams working on hard AI problems.

Most LLM features eventually need the model to return structured data like a JSON object your application can parse and act on. Connecting LLM output to a database, a UI component, or a downstream API requires structure you can depend on.

There are two approaches. Choosing the right one changes how reliable your system is.

Approach 1: Prompt-Based

Tell the model in the system prompt to return JSON matching a specific shape.

const systemPrompt = `
You are a grading assistant. Return your evaluation as a JSON object with this structure:
{
  "score": number (0-10),
  "passed": boolean,
  "feedback": string
}
Return only the JSON object. No markdown, no explanation, no code blocks.
`;

This works most of the time. The model is good at following formatting instructions for common shapes. The problem is "most of the time". Occasionally, it wraps the response in markdown code blocks, adds a preamble sentence, or returns a subtly malformed object. You end up parsing defensively:

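const block = response.content[0];
const text = block?.type === 'text' ? block.text : '';
// Strip the markdown fences the model sometimes adds despite the instructions
const cleaned = text.replace(/```(?:json)?/g, '').trim();
const parsed = JSON.parse(cleaned); // still throws on a preamble or malformed JSON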

For low-stakes or high-volume use cases where occasional failures are acceptable, prompt-based is fine. It's also simpler to iterate on — just edit the prompt.
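
If you stay with the prompt-based approach, the defensive parsing can be made a little sturdier than a single regex. The following is a minimal sketch, not a complete solution: `extractJsonObject` is a hypothetical helper name, and it assumes the model returns exactly one top-level JSON object somewhere in its text.

```typescript
// Hypothetical helper: pull a single JSON object out of free-form model text.
function extractJsonObject(text: string): unknown {
  // Remove markdown code fences if the model added them.
  const unfenced = text.replace(/```(?:json)?/gi, "").trim();

  // Keep only the span from the first "{" to the last "}" so a preamble
  // or trailing sentence does not break JSON.parse.
  const start = unfenced.indexOf("{");
  const end = unfenced.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in model output");
  }

  // JSON.parse still throws a SyntaxError on malformed JSON; let it propagate
  // so the caller can retry or log the raw text.
  return JSON.parse(unfenced.slice(start, end + 1));
}
```

The return type is deliberately `unknown`: parsing succeeds or fails on syntax alone, so the caller still has to check that `score`, `passed`, and `feedback` are present with the right types.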

Approach 2: Forced Tool Use

Define a tool whose input schema is exactly the shape you want, then force the model to call it:

tools: [{
  name: "submit_evaluation",
  description: "Submit the structured evaluation result",
  input_schema: {
    type: "object",
    properties: {
      score: {
        type: "number",
        description: "Score from 0-10 where 0 is completely wrong and 10 is perfectly accurate"
      },
      passed: { type: "boolean" },
      feedback: { type: "string", description: "One to two sentences of specific, actionable feedback for the learner" }
    },
    required: ["score", "passed", "feedback"]
  }
}],
tool_choice: { type: "tool", name: "submit_evaluation" }

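With tool_choice forcing the model to call this tool, the response contains a tool_use block whose input arrives as an already-parsed object, so there are no markdown fences or preambles to strip. In practice the model follows a tool schema far more consistently than free-text formatting instructions, but this is not a hard guarantee: a response cut off by max_tokens, or an occasional missing or mistyped field, is still possible, so validate the input before passing it downstream. You extract the input directly: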

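const toolUse = response.content.find(b => b.type === "tool_use");
if (toolUse?.type !== "tool_use") throw new Error("No tool_use block in response");
// The cast is a compile-time assertion only; it does not validate the input at runtime
return toolUse.input as EvaluationResult;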

The tool never "runs" anything. Its only purpose is to give the model a schema to conform to.
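
Because a type cast does not check anything at runtime, it can help to pair the extraction with a small type guard. This is a sketch under stated assumptions: `ContentBlock` is a simplified local stand-in for the SDK's block types, `readEvaluation` and `isEvaluationResult` are hypothetical names, and the 0-10 range check is taken from the property description rather than enforced by the schema shown above.

```typescript
// Simplified local stand-in for the SDK's content block types (assumption).
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

interface EvaluationResult {
  score: number;
  passed: boolean;
  feedback: string;
}

// Runtime check that mirrors the tool's input_schema.
function isEvaluationResult(value: unknown): value is EvaluationResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.score === "number" &&
    v.score >= 0 && // range comes from the description, not the schema
    v.score <= 10 &&
    typeof v.passed === "boolean" &&
    typeof v.feedback === "string"
  );
}

// Find the forced tool call and return its input only if it has the expected shape.
function readEvaluation(content: ContentBlock[]): EvaluationResult {
  const block = content.find(
    (b): b is Extract<ContentBlock, { type: "tool_use" }> =>
      b.type === "tool_use" && b.name === "submit_evaluation"
  );
  if (!block) throw new Error("No submit_evaluation tool_use block found");
  if (!isEvaluationResult(block.input)) {
    throw new Error("Tool input does not match the expected shape");
  }
  return block.input;
}
```

A schema library would do the same job with less hand-written code; the hand-rolled guard is used here only to keep the example dependency-free.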

When Property Descriptions Carry All the Weight

With tool_choice forcing the call, the top-level tool description matters less — the model has no choice but to use it. What matters are the property descriptions, because those guide what value the model generates for each field.

// Vague — model guesses what's expected
score: { type: "number" }

// Specific — model knows exactly what scale to use and what the extremes mean
score: {
  type: "number",
  description: "Score from 0-10 where 0 is completely wrong and 10 is perfectly accurate. Use the full range: a partially correct answer should score 4-6, not cluster near the top."
}

This is a direct parallel to how description quality drives tool selection reliability in multi-tool setups. Whether the model is choosing which tool to call or filling in a field value, the description is the most direct lever you have to steer the output toward what you actually want.

How to Debug When the Model Picks Wrong

When you're not using tool_choice and the model selects the wrong tool or generates an unexpected value, there's no stack trace. The decision is opaque. The places to look:

  1. Tool and property descriptions: is there enough differentiation between tools? Are the property descriptions specific enough about valid values?

  2. System prompt: can you add explicit ordering instructions ("always call X before Y")?

  3. tool_choice override: if a specific call must happen, force it rather than nudging.

You're always nudging when you're not using tool_choice. Evals matter here because you can't inspect the decision — you can only measure outcomes across many runs and catch regressions.
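
Measuring outcomes across runs can start very small. The sketch below assumes you already log, for each eval run, which tool you expected and which tool the model actually called; `RunRecord`, `scoreRuns`, `findRegressions`, and the 0.05 tolerance are all hypothetical choices for illustration, not part of any SDK.

```typescript
// One recorded run: the tool we expected and the tool the model called
// (null when it answered in plain text instead).
interface RunRecord {
  caseId: string;
  expectedTool: string;
  actualTool: string | null;
}

interface CaseScore {
  caseId: string;
  runs: number;
  passRate: number;
}

// Aggregate repeated runs into a pass rate per eval case.
function scoreRuns(records: RunRecord[]): CaseScore[] {
  const byCase = new Map<string, { runs: number; passes: number }>();
  for (const r of records) {
    const entry = byCase.get(r.caseId) ?? { runs: 0, passes: 0 };
    entry.runs += 1;
    if (r.actualTool === r.expectedTool) entry.passes += 1;
    byCase.set(r.caseId, entry);
  }
  return Array.from(byCase.entries()).map(([caseId, { runs, passes }]) => ({
    caseId,
    runs,
    passRate: passes / runs,
  }));
}

// Flag cases whose pass rate fell below a previously recorded baseline.
function findRegressions(
  scores: CaseScore[],
  baseline: Record<string, number>,
  tolerance = 0.05
): CaseScore[] {
  return scores.filter((s) => {
    const previous = baseline[s.caseId];
    return previous !== undefined && s.passRate < previous - tolerance;
  });
}
```

Running each case several times matters because a single run tells you little about a non-deterministic decision; comparing against a stored baseline is what turns the numbers into a regression check after a prompt or description change.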

The Decision Framework

| Situation | Approach |
| --- | --- |
| Simple, well-defined schema, failures tolerable | Prompt-based |
| Complex schema, downstream systems depend on it | Forced tool use |
| You need to guarantee a specific call happens | Forced tool use with tool_choice |
| Multiple structured output types in one call | Define multiple tools, let model choose |
| High iteration speed matters most | Prompt-based (faster to edit) |
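
For the "multiple tools, let model choose" row, the application has to branch on which tool came back. One way to sketch that is a discriminated union keyed on the tool name. Everything here is illustrative: `request_clarification` is a hypothetical second tool, `RawToolUse` is a simplified stand-in for the SDK's tool_use block, and `toToolCall` and `handle` are made-up helper names. If the model must call some tool but may pick which one, `tool_choice: { type: "any" }` is the setting to look at.

```typescript
// The structured outputs the application understands, keyed by tool name.
type ToolCall =
  | { name: "submit_evaluation"; input: { score: number; passed: boolean; feedback: string } }
  | { name: "request_clarification"; input: { question: string } }; // hypothetical tool

// Simplified stand-in for a tool_use block as it arrives from the API.
interface RawToolUse {
  type: "tool_use";
  name: string;
  input: unknown;
}

// Narrow an untyped tool_use block into one of the known shapes, or throw.
function toToolCall(block: RawToolUse): ToolCall {
  const input = block.input as Record<string, unknown> | null;
  switch (block.name) {
    case "submit_evaluation":
      if (
        input &&
        typeof input.score === "number" &&
        typeof input.passed === "boolean" &&
        typeof input.feedback === "string"
      ) {
        return {
          name: "submit_evaluation",
          input: { score: input.score, passed: input.passed, feedback: input.feedback },
        };
      }
      break;
    case "request_clarification":
      if (input && typeof input.question === "string") {
        return { name: "request_clarification", input: { question: input.question } };
      }
      break;
  }
  throw new Error(`Unexpected or malformed tool call: ${block.name}`);
}

// The compiler checks that every tool in the union is handled here.
function handle(call: ToolCall): string {
  switch (call.name) {
    case "submit_evaluation":
      return `score=${call.input.score}`;
    case "request_clarification":
      return `ask: ${call.input.question}`;
  }
}
```

Adding a third tool then means adding one union member, and the compiler points at every switch that has not been updated.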

The tradeoff is always reliability vs. flexibility. Forced tool use is more reliable but locks you into a schema. Prompt-based is easier to iterate but requires defensive parsing. In production systems where structured output feeds other components, the reliability is usually worth it.
