Prompt engineering as a software discipline: versioning, evals, a

Aior · May 1, 2026

The "prompt engineering is dying" take is wrong

Every six months, someone declares prompt engineering dead because models have gotten smarter. The reality is the opposite: as models get more capable, the cost of a bad prompt goes up, because the model will confidently do the wrong thing at scale. What has changed is the discipline. Prompt engineering in 2026 is less "magic incantations" and more "software engineering with natural-language interfaces".

Prompts are code, so treat them as code

Prompts live in version control, not in some cloud spreadsheet.
Each prompt has a name, a purpose, an input schema, and an output schema.
Changes go through review like code does.
The history of changes is preserved: a regression in production is debugged against the prompt's git log, not by reconstructing from memory what the prompt used to say.
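
To make the list above concrete, here is a minimal sketch of a prompt stored as a reviewed source file. `PromptSpec`, its fields, and the `summarize_ticket` prompt are all made up for illustration; the only idea taken from the post is that a prompt carries a name, a purpose, an input schema, and an output schema.

```python
from dataclasses import dataclass
from string import Formatter


@dataclass(frozen=True)
class PromptSpec:
    """One prompt, kept as a file in the repository and changed via review."""
    name: str
    purpose: str
    template: str          # prompt text with {placeholders}
    input_fields: tuple    # names the caller must supply
    output_fields: tuple   # keys the model's JSON reply is expected to contain

    def placeholders(self) -> set:
        # Collect every {field} name that appears in the template.
        return {field for _, field, _, _ in Formatter().parse(self.template) if field}

    def render(self, **inputs: str) -> str:
        # Fail loudly when the call site and the declared input schema disagree.
        if set(inputs) != set(self.input_fields):
            raise ValueError(
                f"{self.name}: expected inputs {sorted(self.input_fields)}, got {sorted(inputs)}"
            )
        return self.template.format(**inputs)


# Hypothetical example prompt.
SUMMARIZE_TICKET = PromptSpec(
    name="summarize_ticket",
    purpose="Summarize a support ticket in one sentence.",
    template="Summarize this ticket in one sentence as JSON.\n\nTicket:\n{ticket_text}",
    input_fields=("ticket_text",),
    output_fields=("summary",),
)

# The template and the declared input schema should never drift apart.
assert SUMMARIZE_TICKET.placeholders() == set(SUMMARIZE_TICKET.input_fields)
print(SUMMARIZE_TICKET.render(ticket_text="Login fails after password reset."))
```

Because the spec is ordinary source code, a change to the template shows up as a diff in review and in `git log` like any other change.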

The "we'll iterate on the prompt in the UI" approach works for a demo and breaks for a product.

Evals are the test suite

A prompt without an eval is a prompt you can't trust. An eval is a curated set of input examples with known-good (or known-acceptable) outputs that runs every time the prompt or the underlying model changes.

The pattern:

Build an eval set per prompt of roughly 50-200 examples covering normal cases, edge cases, and known failure modes.
Score outputs against the expected ones: exact match where appropriate, semantic similarity or LLM-as-judge for open-ended outputs.
Run the eval on every prompt change and every model change.
Track eval scores over time.
Block deploys that regress on the eval.
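
A rough sketch of that pattern, with exact-match scoring and a deploy gate. Everything here is an assumption for illustration: `fake_model` stands in for a real model call, the four cases stand in for a 50-200 example set, and the 0.90 baseline is an invented number.

```python
from typing import Callable, NamedTuple


class EvalCase(NamedTuple):
    input_text: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    # Simplest scorer; open-ended outputs would need similarity or a judge instead.
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(model: Callable[[str], str], cases: list, scorer=exact_match) -> float:
    """Return the mean score of `model` over the eval set."""
    scores = [scorer(model(case.input_text), case.expected) for case in cases]
    return sum(scores) / len(scores)


def deploy_allowed(new_score: float, baseline_score: float, tolerance: float = 0.0) -> bool:
    # Block the deploy when the candidate scores below the recorded baseline.
    return new_score >= baseline_score - tolerance


# Stand-in for a real model call: a keyword rule pretending to classify sentiment.
def fake_model(text: str) -> str:
    return "negative" if "refund" in text.lower() else "positive"


CASES = [
    EvalCase("I love this product", "positive"),
    EvalCase("I want a refund", "negative"),
    EvalCase("Refund me now!", "negative"),
    EvalCase("Terrible, never again", "negative"),  # known failure mode
]

score = run_eval(fake_model, CASES)
print(f"score={score:.2f} deploy_allowed={deploy_allowed(score, baseline_score=0.90)}")
# prints: score=0.75 deploy_allowed=False
```

In CI, the same check would run on every prompt or model change and fail the pipeline when `deploy_allowed` returns False.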

We have now shipped enough LLM features that eval discipline is non-negotiable. Teams that choose to skip it ship fast and discover regressions after their customers do.

Output structure: the contract

A prompt that returns "free-form text" is harder to use, harder to validate, and harder to test. The patterns that scale:

JSON output with a strict schema. Modern models (Claude, GPT-4 class) can be reliably constrained to schemas via tool use / structured outputs.
Validate the output against the schema before consuming it. Reject malformed outputs and retry.
Define the schema in code, not just in the prompt. Keep the two in sync.
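
A minimal validate-and-retry sketch using only the standard library. The schema format (a dict of key to Python type), `TICKET_SCHEMA`, and `flaky_model` are assumptions made up for this example; a real setup would more likely lean on the provider's structured-output feature plus a proper schema library.

```python
import json
from typing import Callable

# The schema lives in code; the prompt text should be checked against it.
TICKET_SCHEMA = {"summary": str, "priority": int}


def validate(raw: str, schema: dict) -> dict:
    """Parse `raw` as JSON and check it has exactly the schema's keys and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict) or set(data) != set(schema):
        raise ValueError(f"unexpected shape: {data!r}")
    for key, expected_type in schema.items():
        if not isinstance(data[key], expected_type):
            raise ValueError(f"{key!r} should be {expected_type.__name__}")
    return data


def call_with_retry(model: Callable[[str], str], prompt: str, schema: dict,
                    max_attempts: int = 3) -> dict:
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate(model(prompt), schema)
        except ValueError as err:  # malformed JSON or schema mismatch: try again
            last_error = err
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# Stand-in model: answers with prose first, then with valid JSON.
_replies = iter(["Sure! Here is the summary...", '{"summary": "Login fails", "priority": 2}'])


def flaky_model(prompt: str) -> str:
    return next(_replies)


print(call_with_retry(flaky_model, "Summarize the ticket as JSON.", TICKET_SCHEMA))
# prints: {'summary': 'Login fails', 'priority': 2}
```

The consuming code only ever sees a dict that already passed validation, which is the contract the list above describes.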

"Let me just parse the paragraph the model returned" code is the first code to break when the model is upgraded.

Few-shot vs zero-shot vs CoT: where each one helps

Zero-shot: for tasks the model handles well by default. When the model is capable, it is less brittle than few-shot.
Few-shot: when format precision matters, when the task has subtle conventions, when you are testing edge cases.
Chain-of-thought: for tasks involving reasoning. Newer "thinking" models have internalized much of CoT, but explicit CoT still helps with non-thinking models.
Tool use: when the task requires structured operations beyond text generation (calculations, retrievals, side effects).
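
For the first two options, the difference at the prompt level can be as small as this. A sketch only: `build_prompt` and the `Input:` / `Output:` layout are invented for illustration, not a recommended format.

```python
def build_prompt(instruction: str, query: str, examples=()) -> str:
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        # Each worked example shows the model the exact output format we expect.
        parts += [f"Input: {example_input}", f"Output: {example_output}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)


zero_shot = build_prompt("Classify the ticket as billing or access.", "I was charged twice")
few_shot = build_prompt(
    "Classify the ticket as billing or access.",
    "I was charged twice",
    examples=[("Cannot log in", "access"), ("Refund not received", "billing")],
)
print(few_shot)
```

Keeping both variants behind one function makes it cheap to run the same eval set against each and let the scores decide.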

The thing that breaks at scale

Prompts that work on 100 examples but fail on 10,000. A 1-2% failure rate that was invisible during development becomes up to 200 broken outputs in production. Before calling a prompt done, always test it on a representative sample of real production inputs.
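
The arithmetic behind that, as a quick sketch. It assumes failures are independent and the rate is the same for every input, which real traffic will not exactly satisfy; the function names are ours.

```python
def expected_failures(failure_rate: float, n_inputs: int) -> float:
    # Average number of broken outputs at a given volume.
    return failure_rate * n_inputs


def chance_of_seeing_none(failure_rate: float, n_samples: int) -> float:
    # Probability that n independent samples all succeed.
    return (1.0 - failure_rate) ** n_samples


for rate in (0.01, 0.02):
    print(
        f"rate={rate:.0%}  "
        f"failures in 10,000: {expected_failures(rate, 10_000):.0f}  "
        f"P(no failure in 100 samples): {chance_of_seeing_none(rate, 100):.2f}"
    )
# prints:
# rate=1%  failures in 10,000: 100  P(no failure in 100 samples): 0.37
# rate=2%  failures in 10,000: 200  P(no failure in 100 samples): 0.13
```

So with a 1% failure rate, a 100-example dev set shows zero failures more than a third of the time, which is why a clean dev run says little about production.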

Versioning the model itself

Prompt + model + parameters are the deployable unit. A prompt that works on Claude Opus 4.7 may drift subtly on a future version.

Pin model versions explicitly in production.
Run evals on the new version before upgrading.
Stage the rollout: small % first.
Track per-version eval scores.
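
One way to sketch the "deployable unit" and a staged rollout. The `Deployment` type, the model IDs (deliberately fake placeholders), the prompt version strings, and the hash-based bucketing are all assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    prompt_version: str  # e.g. a git tag or commit of the prompt file
    model_id: str        # pinned model version (placeholder values below)
    temperature: float
    max_tokens: int


STABLE = Deployment("summarize_ticket@v12", "example-model-2026-01-15", 0.0, 512)
CANDIDATE = Deployment("summarize_ticket@v12", "example-model-2026-04-30", 0.0, 512)


def bucket(user_id: str) -> int:
    # Stable 0-99 bucket per user, so a user stays on one side of the rollout.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_deployment(user_id: str, candidate_percent: int) -> Deployment:
    return CANDIDATE if bucket(user_id) < candidate_percent else STABLE


on_candidate = sum(pick_deployment(f"user-{i}", 5) is CANDIDATE for i in range(10_000))
print(f"{on_candidate} of 10,000 users routed to the candidate at 5%")
```

Eval scores would then be recorded per `Deployment`, so a score always refers to one exact combination of prompt, model, and parameters.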

One pattern that always pays off

A regression test built from "production inputs that produced a customer complaint". Every reported issue gets added to the eval set.

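A small sketch of turning a complaint into a permanent eval case. The record layout (`ticket_id`, `input`, `corrected_output`) and the function name are hypothetical; the point is only that each reported issue becomes a case that is never removed.

```python
import json


def add_complaint_to_evals(eval_cases: list, complaint: dict) -> bool:
    """Append the complaint's input as a regression case; return False if already present."""
    if any(case["input"] == complaint["input"] for case in eval_cases):
        return False
    eval_cases.append({
        "input": complaint["input"],
        "expected": complaint["corrected_output"],  # what a reviewer says it should have been
        "source": f"complaint:{complaint['ticket_id']}",
    })
    return True


eval_cases = [{"input": "Reset my password", "expected": "account_access", "source": "seed"}]
complaint = {
    "ticket_id": "T-1042",
    "input": "I was charged twice",
    "corrected_output": "billing",
}
added = add_complaint_to_evals(eval_cases, complaint)
print(added, json.dumps(eval_cases[-1]))
```

Keeping the `source` field makes it easy to see later which eval cases came from real incidents.
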
One pattern we'd warn about

"Prompt golf": making the prompt as short as possible. Clarity beats brevity. A clear 2,000-token prompt beats a clever 200-token one.

What's your prompt management stack? Promptfoo, LangSmith, in-house, or just git?

