The "prompt engineering is dying" take is wrong
Every six months, someone declares prompt engineering dead because models are smarter. The reality is the opposite: as models get more capable, the cost of a bad prompt increases — because the model will confidently do the wrong thing at scale. What's changed is the discipline. Prompt engineering in 2026 is less "magic incantations" and more "software engineering with natural-language interfaces".
Prompts are code, treat them as code
- Prompts live in version control, not in some cloud spreadsheet
- Each prompt has a name, a purpose, an input schema, and an output schema
- Changes go through review like code does
- The history of changes is preserved — a regression in production gets debugged against the prompt's git log, not reconstructed from memory
The "we'll iterate the prompt in the UI" approach works for a demo and breaks for a product.
Evals are the test suite
A prompt without an eval is a prompt you don't trust. The eval is a curated set of input examples with known-good (or known-acceptable) outputs that runs every time the prompt or the underlying model changes.
The pattern (a minimal runner is sketched after the list):
- Build an eval set per prompt, ~50-200 examples covering normal cases, edge cases, and known failure modes
- Score outputs against expected — exact match where appropriate, semantic similarity / LLM-as-judge for open-ended outputs
- Run the eval on every prompt change and every model change
- Track eval scores over time
- Block deploys that regress on the eval
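A minimal sketch of such a runner, assuming a JSONL eval file with input/expected fields and a call_model placeholder standing in for your LLM client; the exact-match scorer would be replaced by an LLM-as-judge for open-ended outputs:

```python
# eval_runner.py -- minimal per-prompt eval loop (file layout and names are illustrative).
import json

def exact_match(expected: str, actual: str) -> float:
    return 1.0 if expected.strip() == actual.strip() else 0.0

def run_eval(eval_path: str, call_model, threshold: float = 0.95) -> bool:
    # eval_path points at a JSONL file of ~50-200 cases: {"input": ..., "expected": ...}
    with open(eval_path) as f:
        cases = [json.loads(line) for line in f]
    scores = [exact_match(c["expected"], call_model(c["input"])) for c in cases]
    mean = sum(scores) / len(scores)
    print(f"{eval_path}: {mean:.2%} over {len(cases)} cases")
    return mean >= threshold  # gate deploys on this in CI; track the score over time

# CI usage: fail the pipeline if a prompt or model change regresses the eval.
# if not run_eval("evals/summarize_ticket.jsonl", call_model):
#     raise SystemExit(1)
```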
We've shipped enough LLM features now that the eval discipline is non-negotiable. The teams that skip it ship fast and discover regressions after customers do.
Output structure — the contract
A prompt that returns "free-form text" is harder to use, harder to validate, and harder to test. The patterns that scale:
- JSON output with a strict schema. Modern models (Claude, GPT-4-class) can be reliably constrained to schemas via tool use / structured outputs.
- Validate the output against the schema before consuming it. Reject malformed outputs with a retry.
- Define the schema in code, not just in the prompt. Keep them in sync.
The "let me parse this paragraph the model returned" code is the code that breaks first when the model upgrades.
Few-shot vs zero-shot vs CoT — when each helps
- Zero-shot — for tasks the model handles well by default. Less brittle than few-shot when the model is capable.
- Few-shot — when format precision matters, when the task has subtle conventions, when you're testing edge cases.
- Chain-of-thought — for tasks involving reasoning. The recent generation of "thinking" models has internalised a lot of CoT, but explicit CoT still helps for non-thinking models.
- Tool use — when the task requires structured operations beyond text generation (calculations, retrievals, side-effects).
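For concreteness, here is the same invented classification task as a zero-shot and a few-shot message list (provider-agnostic shape; the task and examples are illustrative):

```python
# Zero-shot vs few-shot, expressed as message lists.
# For format-sensitive tasks, the few-shot pairs do most of the work
# of pinning down the output convention.
zero_shot = [
    {"role": "user", "content": "Classify the sentiment of: 'Great docs, slow support.'"},
]

few_shot = [
    {"role": "user", "content": "Classify the sentiment of: 'Love it.'"},
    {"role": "assistant", "content": '{"sentiment": "positive", "confidence": "high"}'},
    {"role": "user", "content": "Classify the sentiment of: 'It crashes daily.'"},
    {"role": "assistant", "content": '{"sentiment": "negative", "confidence": "high"}'},
    {"role": "user", "content": "Classify the sentiment of: 'Great docs, slow support.'"},
]
```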
The thing that breaks at scale
Prompts that work on 100 examples but fail on 10 000. The 1-2 % failure rate that was invisible during development becomes 200 broken outputs in production. Always test on a representative sample of real production inputs before considering a prompt done.
Versioning the model itself
Prompt + model + parameters are the deployable unit. A prompt that worked on Claude Opus 4.7 might subtly drift on a future version. The discipline (a config sketch follows the list):
- Pin model versions explicitly in production
- Run evals on the new version before upgrading
- Stage the rollout — small % first
- Track per-version eval scores
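One way to make that unit explicit, sketched with illustrative model IDs, thresholds, and field names:

```python
# deploy_config.py -- prompt, model, and parameters as one pinned, versioned unit.
# Everything here is illustrative; the point is that nothing says "latest",
# and any change to this file triggers the eval suite before rollout.
DEPLOYMENT = {
    "prompt": "summarize_ticket_v3",
    "model": "claude-opus-4-20250514",  # pinned snapshot, never an alias
    "params": {"temperature": 0.0, "max_tokens": 1024},
    "rollout": {"canary_percent": 5},   # stage the upgrade: small % first
    "eval_baseline": 0.97,              # block deploys that score below this
}
```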
One pattern that always pays off
A regression test of "production inputs that produced a customer complaint". Every reported issue gets added to the eval set. The eval grows over time and becomes the most valuable test asset.
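A sketch of that workflow, assuming the eval set is a JSONL file in git; the field names and ticket reference are illustrative:

```python
# Append a customer-reported failure to the prompt's eval set (JSONL, in git).
import json

def add_regression_case(eval_path: str, prod_input: str, expected: str, ticket: str):
    case = {"input": prod_input, "expected": expected, "source": f"complaint:{ticket}"}
    with open(eval_path, "a") as f:
        f.write(json.dumps(case) + "\n")

# add_regression_case("evals/summarize_ticket.jsonl",
#                     prod_input="<the exact input that failed in production>",
#                     expected="<the corrected, reviewed output>",
#                     ticket="SUP-1234")  # hypothetical ticket reference
```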
"Prompt golf" — making the prompt as short as possible. Clarity beats length. A 2 000-token prompt that's clear is better than a 200-token prompt that's clever.What's your prompt management stack? Promptfoo, LangSmith, in-house, or just git?