SRE in practice: incidents, on-call, error budgets, and the discipline th

Aior · Apr 30, 2026

SRE is a discipline, not a job title

"Site Reliability Engineering" became a buzzword years ago. The practice behind it (committed reliability targets, error budgets, post-incident reviews, on-call discipline) is genuinely valuable.

SLOs, not "uptime"

"99.9% uptime" is meaningless without context. A useful SLO specifies:

The user journey it measures.
The success criterion.
The target: 99.5%, 99.9%, or 99.95% over a 28-day window.
The measurement methodology.

Error budgets: the negotiation tool

Error budget = (1 - SLO) × time. A 99.9% SLO over 28 days allows about 40 minutes of "permitted unreliability".

While the budget is healthy → ship features fast and take controlled risks.
Once the budget is exhausted → freeze features and focus on reliability work.
The decision is mechanical, not political.

On-call: discipline matters more than rotation

Primary + secondary.
One-week rotations.
Compensated.
Time off after a bad on-call shift.
On-call handoffs that cover open incidents and ongoing investigations.

The incident lifecycle

Detection: an alert fires or a user reports a problem.
Triage: the primary on-call assesses within 5 minutes.
Mobilisation: for high-severity incidents, an incident channel is opened and an Incident Commander (IC) is assigned.
Mitigation: restore service, even if the cause is not yet fully understood.
Communication: to users and internal stakeholders.
Post-incident review: within 5 business days, blameless.

Post-incident reviews: do them right

Blameless: the goal is to improve the system.
They produce concrete action items with owners and deadlines.
Action items are tracked to completion.
Patterns across incidents are identified and addressed.

Toil: the work that quietly kills the team

Toil = manual, repetitive work with no lasting value. The SRE discipline tracks toil. When toil exceeds ~50% of team time, the team has turned into an operations team.

Chaos engineering: when it is worth it

The team has solid observability.
The team has SLOs.
The system is mature enough to assume failures.

One pattern we'd warn about

Hiring SREs without giving them authority.

One pattern that always pays off

The on-call review meeting. It is weekly, and the on-call engineer walks through what happened during their shift.

What's your incident process?

SRE is a discipline, not a job title

"Site Reliability Engineering" became a buzzword years ago. The actual practice — committed reliability targets, error budgets, post-incident reviews, on-call discipline — is genuinely valuable.

SLOs, not "uptime"

"99.9% uptime" is meaningless without context. A useful SLO specifies:

The user journey it measures.
The success criterion.
The target over a defined window.
The measurement methodology.
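
The four pieces of context above can be written down as a small record, which makes a vague "uptime" number hard to smuggle in. This is only a sketch: the `SLO` class, its field names, and the `checkout` values are hypothetical, not something the post prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """One SLO, spelled out with the four pieces of context from the list."""
    user_journey: str       # what the user is trying to do
    success_criterion: str  # what counts as a "good" event
    target: float           # fraction of good events, e.g. 0.999
    window_days: int        # window the target applies to
    methodology: str        # where and how events are counted

    def is_met(self, good_events: int, total_events: int) -> bool:
        """True if the measured success ratio meets the target."""
        if total_events == 0:
            return True  # assumption: no traffic means nothing was violated
        return good_events / total_events >= self.target


# All values below are illustrative assumptions.
checkout = SLO(
    user_journey="checkout",
    success_criterion="HTTP 2xx within 2 s",
    target=0.999,
    window_days=28,
    methodology="counted at the load balancer",
)

print(checkout.is_met(good_events=999_100, total_events=1_000_000))  # True
print(checkout.is_met(good_events=998_900, total_events=1_000_000))  # False
```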

Error budgets — the negotiation tool

Error budget = (1 - SLO) × time. 99.9% over 28 days = ~40 minutes.

Budget healthy → ship features fast.
Budget exhausted → freeze features.
The decision is mechanical, not political.
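
The arithmetic behind the ~40 minutes, and the "mechanical" ship-or-freeze rule, can be sketched in a few lines. The function names and the sample downtime figures are made up for illustration; a real policy would likely have more states than two.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Error budget = (1 - SLO) × time, expressed in minutes."""
    return (1.0 - slo) * window_days * 24 * 60


def release_policy(budget_minutes: float, downtime_minutes: float) -> str:
    """Mechanical rule: ship while budget remains, freeze once it is spent."""
    remaining = budget_minutes - downtime_minutes
    return "ship" if remaining > 0 else "freeze"


budget = error_budget_minutes(slo=0.999, window_days=28)
print(round(budget, 2))              # 40.32
print(release_policy(budget, 12.0))  # ship
print(release_policy(budget, 45.0))  # freeze
```

Because the rule takes only two numbers, nobody has to argue about whether a freeze is deserved; the argument moves to choosing the SLO in the first place.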

On-call: discipline matters more than rotation

Primary + secondary.
One-week rotations.
Compensated.
Time off after a bad on-call.
On-call handoffs.
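
As a rough illustration of "primary + secondary" on one-week rotations, the sketch below walks a team list week by week. Making next week's primary this week's secondary is an assumption of this example, not something the post specifies, and the names are hypothetical.

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], start: date, weeks: int):
    """Yield (week_start, primary, secondary) for each one-week shift."""
    n = len(engineers)
    if n < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    for week in range(weeks):
        primary = engineers[week % n]
        # Assumption: the secondary is whoever becomes primary next week.
        secondary = engineers[(week + 1) % n]
        yield start + timedelta(weeks=week), primary, secondary


team = ["ana", "bo", "chen"]  # hypothetical names
schedule = list(weekly_rotation(team, date(2026, 5, 4), weeks=3))
for week_start, primary, secondary in schedule:
    print(week_start.isoformat(), primary, secondary)
```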

The incident lifecycle

Detection.
Triage — within 5 minutes.
Mobilisation — Incident Commander assigned.
Mitigation — restore service first.
Communication.
Post-incident review — within 5 business days, blameless.
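
The two time limits in this lifecycle (triage within 5 minutes, review within 5 business days) are easy to check mechanically. The sketch below assumes both clocks start at detection and that business days are Monday to Friday with no holiday calendar; the post does not say either, so treat them as placeholders.

```python
from datetime import datetime, timedelta

TRIAGE_LIMIT = timedelta(minutes=5)
REVIEW_LIMIT_BUSINESS_DAYS = 5


def add_business_days(start: datetime, days: int) -> datetime:
    """Step forward one day at a time, counting only Monday to Friday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


def triage_on_time(detected: datetime, triaged: datetime) -> bool:
    """True if triage happened within the 5-minute limit."""
    return triaged - detected <= TRIAGE_LIMIT


detected = datetime(2026, 4, 30, 14, 0)  # a Thursday
print(triage_on_time(detected, datetime(2026, 4, 30, 14, 4)))   # True
print(add_business_days(detected, REVIEW_LIMIT_BUSINESS_DAYS))  # 2026-05-07 14:00:00
```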

Post-incident reviews

Blameless.
Concrete action items with owners.
Action items tracked to completion.
Patterns across incidents identified.
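
Tracking action items and spotting cross-incident patterns need very little tooling to start. The following is a minimal sketch; the `ActionItem` fields (including the `due` date), the cause labels, and the sample data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items past their due date, the ones to chase first."""
    return [item for item in items if not item.done and item.due < today]


def recurring_causes(incident_causes: list[list[str]], min_count: int = 2) -> list[str]:
    """Contributing causes that show up in at least min_count incidents."""
    counts = Counter(cause for causes in incident_causes for cause in set(causes))
    return sorted(cause for cause, n in counts.items() if n >= min_count)


items = [
    ActionItem("add canary stage", "ana", date(2026, 4, 20)),
    ActionItem("alert on queue depth", "bo", date(2026, 4, 10), done=True),
]
late = overdue(items, today=date(2026, 4, 30))
patterns = recurring_causes([["config push", "no canary"], ["no canary"], ["dns"]])
print([item.title for item in late])  # ['add canary stage']
print(patterns)                       # ['no canary']
```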

Toil

Toil = manual, repetitive, no-lasting-value work. SRE tracks toil. When toil exceeds ~50% of team time, the team has become operations.
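
Tracking toil can start as a weekly tally of hours by category, compared against the ~50% line. The category names and hours below are invented for illustration, and deciding which categories count as toil is left to the team.

```python
TOIL_THRESHOLD = 0.5  # the ~50% line from the text


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Share of logged hours spent in categories the team labels as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for cat, h in hours_by_category.items() if cat in toil_categories)
    return toil / total


# Hypothetical week of logged hours.
week = {"manual deploys": 14, "ticket triage": 10, "automation": 10, "design": 6}
fraction = toil_fraction(week, {"manual deploys", "ticket triage"})
print(f"{fraction:.0%}")          # 60%
print(fraction > TOIL_THRESHOLD)  # True
```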

Chaos engineering — when worth it

Solid observability.
SLOs exist.
System mature enough to handle failures.

One pattern we'd warn about

Hiring SREs without giving them authority.

One pattern that always pays off

The on-call review meeting.

What's your incident process?
