Observability with Prometheus, Loki, Tempo, Grafana: building a stack that survives growth

Aior · Administrator · Staff member · Thread owner

Three signals, one stack

Modern observability is metrics + logs + traces. Each tells you something the others can't, and the value compounds when they're cross-linked. The Prometheus + Loki + Tempo + Grafana ("PLTG") stack has emerged as the open-source default in 2026, and for good reason — it's mature, integrated, and the pricing model (you operate it) is honest.

Below is what we ship.

Metrics: Prometheus + the questions to ask

Prometheus is the de facto metrics tool for cloud-native infrastructure. The cardinal questions:
  • What's your retention requirement? Local Prometheus defaults to 15 days. Long-term retention (months or years) needs Thanos, Mimir, or Cortex. Pick early; migration is non-trivial.
  • What's your cardinality budget? Each unique label combination is a separate time series; high-cardinality labels (user_id, request_id) blow up storage. Discipline at instrumentation time pays off forever (see the sketch after this list).
  • Pull or push? Prometheus is pull-based by default. For ephemeral jobs that don't live long enough to be scraped, the Pushgateway is the workaround.
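
To make the cardinality point concrete, here's a minimal sketch using the official Python client (prometheus_client). The metric names and the status-class bucketing are our illustration, not anything Prometheus mandates:

Code:
from prometheus_client import Counter, Histogram, start_http_server

# Good: bounded label values -- a handful of methods x a handful of
# status classes gives a small, fixed series count.
REQUESTS = Counter(
    "http_requests_total",
    "HTTP requests handled",
    ["method", "status_class"],
)

LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method"],
)

def handle_request(method: str, status: int, duration_s: float) -> None:
    # Collapse raw status codes into classes so the label stays bounded.
    REQUESTS.labels(method=method, status_class=f"{status // 100}xx").inc()
    LATENCY.labels(method=method).observe(duration_s)
    # Anti-pattern (don't do this): .labels(user_id=..., request_id=...)
    # -- every unique value becomes a new time series.

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    handle_request("GET", 200, 0.042)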

Logs: Loki — when to use it, when not to

Loki is the log-aggregation companion. It indexes labels but not log content. The trade-off:
  • Where it wins — when you query by a small set of labels (service, environment, level) and grep within. Cost-effective at scale.
  • Where it hurts — full-text search across years of logs is slower than purpose-built log search (ELK, Splunk).

For most infrastructure logs, Loki is right. For full-text-search-heavy use cases (security forensics on years of logs), Elasticsearch / OpenSearch is still preferred.
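
Here's what the label-first query pattern looks like in practice: a small sketch against Loki's query_range HTTP API, assuming a Loki instance at its default localhost:3100. The service and label names are made up:

Code:
import time
import requests

LOKI_URL = "http://localhost:3100/loki/api/v1/query_range"

# LogQL: label selectors do the cheap, indexed narrowing;
# the |= filter does the brute-force grep within that stream.
query = '{service="checkout", env="prod", level="error"} |= "timeout"'

now_ns = time.time_ns()
resp = requests.get(LOKI_URL, params={
    "query": query,
    "start": now_ns - 3600 * 10**9,  # last hour, in nanoseconds
    "end": now_ns,
    "limit": 100,
})
resp.raise_for_status()

for stream in resp.json()["data"]["result"]:
    for ts, line in stream["values"]:
        print(ts, line)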

Traces: Tempo — for distributed systems

Tempo (or Jaeger, or OpenTelemetry-compatible alternatives) gives you per-request latency breakdowns across service boundaries. The questions:
  • Sampling — keeping every trace gets expensive. Head sampling (decide at the start) misses interesting traces; tail sampling (decide after) is more useful but harder to operate.
  • Instrumentation — auto-instrumentation libraries cover most frameworks; add manual spans for the operations that matter operationally.
  • Cross-service correlation — propagate trace context (W3C Trace Context) at every service boundary (see the sketch after this list). Without it, traces stay per-service and lose the cross-service value.
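
A minimal propagation sketch with the OpenTelemetry Python API follows. The service and span names are illustrative, and a real deployment would also wire up an OTLP exporter to Tempo:

Code:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.propagate import inject, extract

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("checkout-service")

# Caller side: inject the current trace context into outgoing headers.
def call_downstream():
    with tracer.start_as_current_span("charge-card"):
        headers = {}
        inject(headers)  # adds the W3C "traceparent" header
        # requests.post("http://payments/api/charge", headers=headers, ...)
        return headers

# Callee side: extract the context so the new span joins the same trace.
def handle_incoming(headers):
    ctx = extract(headers)
    with tracer.start_as_current_span("process-charge", context=ctx) as span:
        return span.get_span_context().trace_id

if __name__ == "__main__":
    hdrs = call_downstream()
    print(hdrs)  # contains "traceparent: 00-<trace_id>-<span_id>-01"
    print(hex(handle_incoming(hdrs)))  # same trace_id as in the header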

The dashboard discipline​

A dashboard wall is a sign of an organisation that doesn't know what it's looking at. The structure that works:
  • Per-service operational dashboards — RED metrics (Rate, Errors, Duration) for the service, owned by the team (PromQL sketch below).
  • Per-feature dashboards — end-to-end health of a critical user-facing feature.
  • SLO / error budget dashboards — for the SLOs the team has committed to.
  • Org-level overviews — for leadership, the few KPIs that matter at company level.

Each dashboard has a documented audience and a documented "if you see X, do Y" runbook reference.
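
For the per-service RED dashboard, the three panel queries are straightforward PromQL. A sketch, assuming the request counter and latency histogram from the instrumentation example above (metric and label names are ours, not a standard):

Code:
# The three RED panels for one service, as PromQL strings.
RED_PANELS = {
    # Rate: requests per second, per method
    "rate": 'sum by (method) (rate(http_requests_total[5m]))',
    # Errors: fraction of requests that returned 5xx
    "errors": (
        'sum(rate(http_requests_total{status_class="5xx"}[5m]))'
        ' / sum(rate(http_requests_total[5m]))'
    ),
    # Duration: p99 latency derived from the histogram buckets
    "duration": (
        'histogram_quantile(0.99,'
        ' sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
    ),
}

for panel, expr in RED_PANELS.items():
    print(f"{panel}: {expr}")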

Alerting that doesn't burn the team out​

The alert philosophy that works:
  • Alert on symptoms, not causes: "user-facing latency exceeded SLO", not "CPU is at 80%" (see the rule sketch after this list).
  • Each alert points to a runbook.
  • Each alert has an owner.
  • Each alert has tested escalation paths.
  • Quarterly alert review — anything that fires often without action gets tuned or deleted.
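
Here's what a symptom-first rule looks like as a Prometheus alerting rule, sketched as a Python dict dumped to the rules-file YAML. The SLO threshold, metric names, and runbook URL are placeholders:

Code:
import yaml  # pip install pyyaml

rule_group = {
    "groups": [{
        "name": "checkout-slo",
        "rules": [{
            "alert": "CheckoutLatencySLOBreach",
            # Symptom: user-facing p99 latency, not CPU or memory.
            "expr": (
                "histogram_quantile(0.99, sum by (le) "
                "(rate(http_request_duration_seconds_bucket[5m]))) > 0.5"
            ),
            "for": "10m",  # sustained breach, not a blip
            "labels": {"severity": "page", "owner": "checkout-team"},
            "annotations": {
                "summary": "Checkout p99 latency above 500ms SLO",
                "runbook_url": "https://runbooks.example.internal/checkout-latency",
            },
        }],
    }]
}

print(yaml.safe_dump(rule_group, sort_keys=False))

The "for: 10m" is what keeps a transient spike from paging anyone, and the runbook_url annotation is what makes the "each alert points to a runbook" rule enforceable in review.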

The single biggest predictor of on-call quality of life: alert hygiene. Teams with quiet, actionable on-call have done this work; teams with constantly-paging on-call haven't.

The cost of observability​

At scale, observability becomes a meaningful cost line. The patterns:
  • Sample or drop high-cardinality metrics that aren't actionable.
  • Drop debug-level logs from production unless an investigation is open (see the sketch below).
  • Tail-sample traces, keeping the slow ones and a baseline of normal ones.
  • Tier storage — hot for recent, cold for old. An S3-backed cold tier is cheap.
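
The debug-log point is a one-filter change in most codebases. A sketch with the Python stdlib logger; the INVESTIGATION_OPEN flag is our invention for illustration:

Code:
import logging
import os

class DropDebugInProd(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True  # always keep INFO and above
        # Keep DEBUG only while an investigation flag is set.
        return os.environ.get("INVESTIGATION_OPEN") == "1"

handler = logging.StreamHandler()
handler.addFilter(DropDebugInProd())
logging.basicConfig(level=logging.DEBUG, handlers=[handler])

logging.debug("dropped unless INVESTIGATION_OPEN=1")
logging.info("always shipped")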

Without intentional management, observability cost grows faster than the systems it observes.

The cross-link that makes the stack valuable​

The PLTG stack's compound value comes from cross-linking. From a metric anomaly, click to logs from the same service in the same window. From a slow trace, click to the underlying logs. From an error log, click to the trace it's part of. Build the dashboards and the labels to support this; otherwise the three signals stay siloed.
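
The metrics-to-traces hop is typically done with exemplars: attach the current trace ID to an observation, and Grafana can link a latency panel straight to the trace in Tempo. A sketch, assuming the OTel tracer setup from the earlier example; note that prometheus_client only exposes exemplars in the OpenMetrics format:

Code:
from opentelemetry import trace
from prometheus_client import Histogram

LATENCY = Histogram("http_request_duration_seconds", "HTTP request latency")

def record_latency(duration_s: float) -> None:
    span = trace.get_current_span()
    ctx = span.get_span_context()
    if ctx.is_valid:
        # The trace_id exemplar label is what Grafana's Tempo
        # datasource link keys on.
        LATENCY.observe(duration_s, exemplar={"trace_id": f"{ctx.trace_id:032x}"})
    else:
        LATENCY.observe(duration_s)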

One pattern we'd warn about​

Adopting OpenTelemetry without committing to its full stack. The OTel collector, instrumentation libraries, and protocol are excellent and they're the future. Half-adopting (instrumenting with OTel but routing to a non-OTel backend with a brittle bridge) is worse than picking either side fully.

What's your stack? And — for the on-prem folks — has anyone fully replaced commercial APM (Datadog, New Relic) with self-hosted open-source observability at industrial scale?
 
