Prompt Engineering Lessons from Building a Problem Generator
Prompt engineering only becomes interesting when you're prompting for structured, specific output that has to be reliable enough to feed downstream systems. A chatbot that gives a slightly different answer each time is fine. A problem generator that occasionally returns vague prompts, missing constraints, or malformed JSON breaks the entire study session.
This post documents what I learned building the problem generation component of my study system, where a prompt had to produce interview-caliber problems with consistent structure on every run.
What the Component Needs to Produce
The output isn't a free-form answer. It's a structured problem that can be worked on immediately:
{
  "topic": "React state management",
  "difficulty": "medium",
  "prompt": "Build a shopping cart component that...",
  "constraints": ["Must handle concurrent updates", "No external state library"],
  "examples": [
    { "input": "addItem({ id: 1, name: 'Widget', price: 9.99 })", "output": "Cart: 1 item, $9.99 total" }
  ],
  "setup_code": "const initialCart = { items: [], total: 0 };"
}
If any field is missing or vague, the downstream session breaks. That constraint forced me to treat output reliability as a first-class requirement from the start.
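One way to enforce that requirement is a small check that runs before a problem reaches the session. The sketch below is a minimal illustration, not the validation the system actually uses: the function name `validate_problem` and the rule that empty fields count as failures are assumptions, and "vague" is not something a type check can detect.

```python
# Field names come from the example above; the expected types are inferred from it.
REQUIRED_FIELDS = {
    "topic": str,
    "difficulty": str,
    "prompt": str,
    "constraints": list,
    "examples": list,
    "setup_code": str,
}


def validate_problem(problem: dict) -> list[str]:
    """Return a list of issues; an empty list means every field is present and non-empty."""
    issues = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in problem:
            issues.append(f"missing field: {field}")
        elif not isinstance(problem[field], expected_type):
            issues.append(f"wrong type for {field}: expected {expected_type.__name__}")
        elif not problem[field]:
            # Assumption: an empty string or empty list is treated as unusable.
            issues.append(f"empty field: {field}")

    # Each example needs both an input and an output to be workable.
    examples = problem.get("examples")
    if isinstance(examples, list):
        for i, example in enumerate(examples):
            if not isinstance(example, dict) or not all(k in example for k in ("input", "output")):
                issues.append(f"example {i} needs 'input' and 'output'")
    return issues
```

Returning a list of issues rather than raising on the first one makes it easier to log everything that was wrong with a bad generation in one pass.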
Iteration 1: The Prompt Was Too Rigid
The first version of the generation prompt was fully hardcoded: topic, difficulty, and output structure baked in as static text. It worked but couldn't adapt. As the system matured to accept history, difficulty calibration, and topic filtering, a rigid prompt became a bottleneck.
The lesson: prompts for production components should be built as functions, not strings. Dynamic inputs (topic, difficulty, past problems, constraint filters) compose into the prompt at call time. The static portions are the instructions and schema. The dynamic portions are the context.
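As a rough sketch of that split, the function below keeps the instructions and schema as module-level constants and composes the per-call context around them. The names `build_generation_prompt`, `INSTRUCTIONS`, and `SCHEMA_EXAMPLE`, the exact instruction wording, and the `easy | medium | hard` difficulty values are all assumptions made for illustration.

```python
import json

# Static portion: instructions and output schema (wording is illustrative).
INSTRUCTIONS = (
    "You write interview-caliber practice problems.\n"
    "Respond with a single JSON object and nothing else. "
    "Do not wrap the response in markdown code blocks."
)

SCHEMA_EXAMPLE = {
    "topic": "string",
    "difficulty": "easy | medium | hard",  # assumed set of levels
    "prompt": "string",
    "constraints": ["string"],
    "examples": [{"input": "string", "output": "string"}],
    "setup_code": "string",
}


def build_generation_prompt(topic: str, difficulty: str, constraint_filters=None) -> str:
    """Compose the static instructions and schema with the per-call context."""
    sections = [
        INSTRUCTIONS,
        "Output schema:\n" + json.dumps(SCHEMA_EXAMPLE, indent=2),
        f"Topic: {topic}",
        f"Difficulty: {difficulty}",
    ]
    # Dynamic portion: only added when the caller supplies filters.
    if constraint_filters:
        bullet_list = "\n".join(f"- {item}" for item in constraint_filters)
        sections.append("The problem must respect these constraints:\n" + bullet_list)
    return "\n\n".join(sections)
```

Because the result is an ordinary string, the prompt itself can be unit tested without calling a model, for example by asserting that a given topic and filter appear in the output.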
The Model Sometimes Ignores Formatting Instructions
I explicitly told the model not to wrap the response in markdown code blocks. It ignored this instruction often enough to be a problem: not on every call, but on enough of them that I couldn't rely on clean JSON coming back.
The fix is defensive stripping in addition to the instruction, not instead of it:
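import json

# `response` is the reply object returned by the model API call
raw = response.content[0].text
cleaned = raw.strip()
# Drop the opening fence line (it may carry a language tag such as "json")
if cleaned.startswith("```"):
    cleaned = "\n".join(cleaned.split("\n")[1:])
# Drop the closing fence, even when it shares a line with the last brace
if cleaned.endswith("```"):
    cleaned = cleaned[:-3].rstrip()
result = json.loads(cleaned)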
The instruction reduces frequency; the stripping handles the remainder. Both are necessary. This is a general pattern: for any structured output that feeds application code, you need both a clear prompt instruction and a parsing layer that handles model non-compliance gracefully.
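The stripping logic above can also be packaged as a function so that the failure path is explicit and testable. This is a sketch under assumptions: `parse_model_json` and `ProblemParseError` are hypothetical names, and rejecting any reply that is not a JSON object is a choice made here, not something the original code does.

```python
import json

FENCE = "`" * 3  # a markdown code fence marker


class ProblemParseError(ValueError):
    """Raised when the model reply cannot be turned into a JSON object."""


def parse_model_json(raw: str) -> dict:
    """Strip an optional markdown fence, then parse the remainder as JSON."""
    cleaned = raw.strip()
    if cleaned.startswith(FENCE):
        # Discard the whole opening fence line, including any language tag.
        _, _, cleaned = cleaned.partition("\n")
    cleaned = cleaned.rstrip()
    if cleaned.endswith(FENCE):
        cleaned = cleaned[: -len(FENCE)]
    try:
        result = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        # Surface one error type so callers can retry or skip in one place.
        raise ProblemParseError(f"reply was not valid JSON: {exc}") from exc
    if not isinstance(result, dict):
        raise ProblemParseError("reply parsed, but was not a JSON object")
    return result
```

Raising a single, named error type gives the calling code one place to decide what "graceful" means, whether that is a retry, a fallback problem, or a logged skip.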
Scope Decisions That Saved Time
Two features I explicitly deferred kept Phase 1 on track:
setup_code as a string, not an array. The ideal design for multi-file problems (HTML + CSS, for example) would be an array of file objects. But for a CLI-based tool in the current scope, a single string is sufficient and eliminates the complexity of file-aware rendering. I documented this as a known limitation to revisit.
Problem formatting as plain text, not HTML. The model naturally wanted to format problems with rich structure. Useful eventually, but not useful when the output surface is a terminal. Deferring this prevented UI work from blocking the core prompt engineering work.
Both decisions share a pattern: identify the simplification that keeps the current phase moving without closing doors, and write down what you're deferring and why.
When Past-Problem Handling Outgrew the Component
One feature I initially planned to include in the generator (passing in past problems to avoid repetition) revealed a scope issue mid-build. Avoiding repetition is a scheduling concern, not a generation concern. The generator should generate; a higher-level orchestrator should decide whether to generate a new problem or surface a review problem from history.
Keeping the generator's interface narrow (topic + difficulty → structured problem) made it more composable and easier to test in isolation. The past problem logic belongs in the component that calls the generator, not inside it.
This is a prompt engineering lesson as much as a software design one: if your prompt is growing to handle multiple concerns, it's often a sign the abstraction is wrong, not that the prompt needs more instructions.
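A minimal sketch of that separation, assuming a hypothetical `needs_review` flag on history entries and hypothetical function names: the orchestrator owns the scheduling rule and receives the generator as a plain callable, so the generator never sees history.

```python
from typing import Callable

Problem = dict  # the structured problem shape the generator returns


def choose_next_problem(
    topic: str,
    difficulty: str,
    history: list[Problem],
    generate: Callable[[str, str], Problem],
) -> Problem:
    """Scheduling lives here: surface a review problem, or ask for a new one."""
    # Hypothetical rule: anything flagged for review on this topic comes first.
    for past in history:
        if past.get("topic") == topic and past.get("needs_review"):
            return past
    # Otherwise call the generator through its narrow interface.
    return generate(topic, difficulty)


def fake_generate(topic: str, difficulty: str) -> Problem:
    """Stand-in for the model-backed generator so the orchestrator can be tested offline."""
    return {"topic": topic, "difficulty": difficulty, "prompt": f"Practice problem on {topic}"}
```

Passing the generator in as an argument is what makes isolated testing cheap: the scheduling rule can be exercised with `fake_generate`, and the real generator can be tested with nothing but a topic and a difficulty.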
What This Looks Like in Practice
After several iterations, the generator produces problems specific enough to start immediately and structured enough to parse reliably. More importantly, the prompt is a function — inputs compose cleanly, the output schema is stable, and failures are handled defensively at the parsing layer.
The study system is built on top of this component's reliability. If the generator is flaky, every downstream session is degraded. Treating prompt reliability as a first-class engineering concern — not just "get the words right" but "design a prompt that fails gracefully and consistently" — is the lesson that transferred most directly to how I think about building on LLMs in general.

