Observability with Prometheus, Loki, Tempo, Grafana: building a stack that survives growth

Aior · Administrator · Staff member · Thread owner

Three signals, one stack

Modern observability is metrics + logs + traces. Each tells you something the others can't, and the value compounds when they're cross-linked. The Prometheus + Loki + Tempo + Grafana ("PLTG") stack has emerged as the open-source default in 2026, and for good reason — it's mature, integrated, and the pricing model (you operate it) is honest.

Below is what we ship.

Metrics: Prometheus + the questions to ask

Prometheus is the de facto metrics tool for cloud-native infrastructure. The cardinal questions:
  • What's your retention requirement? Local Prometheus defaults to 15 days. Long-term retention (months or years) needs Thanos, Mimir, or Cortex. Pick early; migration is non-trivial.
  • What's your cardinality budget? Each unique label combination is a separate time series; high-cardinality labels (user_id, request_id) blow up storage. Discipline at instrumentation time pays off forever (see the sketch after this list).
  • Pull or push? Prometheus is pull-based by default. For ephemeral jobs that don't live long enough to be scraped, the Pushgateway is the workaround.
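
To make the cardinality point concrete, here's a minimal sketch using the official Python client (prometheus_client). The metric names and the status-class bucketing are our illustration, not anything Prometheus mandates:

Code:
from prometheus_client import Counter, Histogram, start_http_server

# Good: bounded label values -- a handful of methods x a handful of
# status classes gives a small, fixed series count.
REQUESTS = Counter(
    "http_requests_total",
    "HTTP requests handled",
    ["method", "status_class"],
)

LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method"],
)

def handle_request(method: str, status: int, duration_s: float) -> None:
    # Collapse raw status codes into classes so the label stays bounded.
    REQUESTS.labels(method=method, status_class=f"{status // 100}xx").inc()
    LATENCY.labels(method=method).observe(duration_s)
    # Anti-pattern (don't do this): .labels(user_id=..., request_id=...)
    # -- every unique value becomes a new time series.

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    handle_request("GET", 200, 0.042)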

Logs: Loki — when to use it, when not to

Loki is the log-aggregation companion. It indexes labels but not log content. The trade-off:
  • Where it wins — when you query by a small set of labels (service, environment, level) and grep within. Cost-effective at scale.
  • Where it hurts — full-text search across years of logs is slower than purpose-built log search (ELK, Splunk).

For most infrastructure logs, Loki is right. For full-text-search-heavy use cases (security forensics on years of logs), Elasticsearch / OpenSearch is still preferred.
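
Here's what the label-first query pattern looks like in practice: a small sketch against Loki's query_range HTTP API, assuming a Loki instance at its default localhost:3100. The service and label names are made up:

Code:
import time
import requests

LOKI_URL = "http://localhost:3100/loki/api/v1/query_range"

# LogQL: label selectors do the cheap, indexed narrowing;
# the |= filter does the brute-force grep within that stream.
query = '{service="checkout", env="prod", level="error"} |= "timeout"'

now_ns = time.time_ns()
resp = requests.get(LOKI_URL, params={
    "query": query,
    "start": now_ns - 3600 * 10**9,  # last hour, in nanoseconds
    "end": now_ns,
    "limit": 100,
})
resp.raise_for_status()

for stream in resp.json()["data"]["result"]:
    for ts, line in stream["values"]:
        print(ts, line)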

Traces: Tempo — for distributed systems

Tempo (or Jaeger, or OpenTelemetry-compatible alternatives) gives you per-request latency breakdowns across service boundaries. The questions:
  • Sampling — keeping every trace gets expensive. Head sampling (decide at the start) misses interesting traces; tail sampling (decide after) is more useful but harder to operate.
  • Instrumentation — auto-instrumentation libraries cover most frameworks; add manual spans for the operations that matter operationally.
  • Cross-service correlation — propagate trace context (W3C Trace Context) at every service boundary (see the sketch after this list). Without it, traces stay per-service and lose the cross-service value.
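
A minimal propagation sketch with the OpenTelemetry Python API follows. The service and span names are illustrative, and a real deployment would also wire up an OTLP exporter to Tempo:

Code:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.propagate import inject, extract

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("checkout-service")

# Caller side: inject the current trace context into outgoing headers.
def call_downstream():
    with tracer.start_as_current_span("charge-card"):
        headers = {}
        inject(headers)  # adds the W3C "traceparent" header
        # requests.post("http://payments/api/charge", headers=headers, ...)
        return headers

# Callee side: extract the context so the new span joins the same trace.
def handle_incoming(headers):
    ctx = extract(headers)
    with tracer.start_as_current_span("process-charge", context=ctx) as span:
        return span.get_span_context().trace_id

if __name__ == "__main__":
    hdrs = call_downstream()
    print(hdrs)  # contains "traceparent: 00-<trace_id>-<span_id>-01"
    print(hex(handle_incoming(hdrs)))  # same trace_id as in the header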

The dashboard discipline​

A dashboard wall is a sign of an organisation that doesn't know what it's looking at. The structure that works:
  • Per-service operational dashboards — RED metrics (Rate, Errors, Duration) for the service, owned by the team (PromQL sketch below).
  • Per-feature dashboards — end-to-end health of a critical user-facing feature.
  • SLO / error budget dashboards — for the SLOs the team has committed to.
  • Org-level overviews — for leadership, the few KPIs that matter at company level.

Each dashboard has a documented audience and a documented "if you see X, do Y" runbook reference.
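
For the per-service RED dashboard, the three panel queries are straightforward PromQL. A sketch, assuming the request counter and latency histogram from the instrumentation example above (metric and label names are ours, not a standard):

Code:
# The three RED panels for one service, as PromQL strings.
RED_PANELS = {
    # Rate: requests per second, per method
    "rate": 'sum by (method) (rate(http_requests_total[5m]))',
    # Errors: fraction of requests that returned 5xx
    "errors": (
        'sum(rate(http_requests_total{status_class="5xx"}[5m]))'
        ' / sum(rate(http_requests_total[5m]))'
    ),
    # Duration: p99 latency derived from the histogram buckets
    "duration": (
        'histogram_quantile(0.99,'
        ' sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
    ),
}

for panel, expr in RED_PANELS.items():
    print(f"{panel}: {expr}")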

Alerting that doesn't burn the team out​

The alert philosophy that works:
  • Alert on symptoms, not causes: "user-facing latency exceeded SLO", not "CPU is at 80%" (see the rule sketch after this list).
  • Each alert points to a runbook.
  • Each alert has an owner.
  • Each alert has tested escalation paths.
  • Quarterly alert review — anything that fires often without action gets tuned or deleted.
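
Here's what a symptom-first rule looks like as a Prometheus alerting rule, sketched as a Python dict dumped to the rules-file YAML. The SLO threshold, metric names, and runbook URL are placeholders:

Code:
import yaml  # pip install pyyaml

rule_group = {
    "groups": [{
        "name": "checkout-slo",
        "rules": [{
            "alert": "CheckoutLatencySLOBreach",
            # Symptom: user-facing p99 latency, not CPU or memory.
            "expr": (
                "histogram_quantile(0.99, sum by (le) "
                "(rate(http_request_duration_seconds_bucket[5m]))) > 0.5"
            ),
            "for": "10m",  # sustained breach, not a blip
            "labels": {"severity": "page", "owner": "checkout-team"},
            "annotations": {
                "summary": "Checkout p99 latency above 500ms SLO",
                "runbook_url": "https://runbooks.example.internal/checkout-latency",
            },
        }],
    }]
}

print(yaml.safe_dump(rule_group, sort_keys=False))

The "for: 10m" is what keeps a transient spike from paging anyone, and the runbook_url annotation is what makes the "each alert points to a runbook" rule enforceable in review.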

The single biggest predictor of on-call quality of life: alert hygiene. Teams with quiet, actionable on-call have done this work; teams with constantly-paging on-call haven't.

The cost of observability​

At scale, observability becomes a meaningful cost line. The patterns:
  • Sample or drop high-cardinality metrics that aren't actionable.
  • Drop debug-level logs from production unless an investigation is open (see the sketch below).
  • Tail-sample traces, keeping the slow ones and a baseline of normal ones.
  • Tier storage — hot for recent, cold for old. An S3-backed cold tier is cheap.
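
The debug-log point is a one-filter change in most codebases. A sketch with the Python stdlib logger; the INVESTIGATION_OPEN flag is our invention for illustration:

Code:
import logging
import os

class DropDebugInProd(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno > logging.DEBUG:
            return True  # always keep INFO and above
        # Keep DEBUG only while an investigation flag is set.
        return os.environ.get("INVESTIGATION_OPEN") == "1"

handler = logging.StreamHandler()
handler.addFilter(DropDebugInProd())
logging.basicConfig(level=logging.DEBUG, handlers=[handler])

logging.debug("dropped unless INVESTIGATION_OPEN=1")
logging.info("always shipped")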

Without intentional management, observability cost grows faster than the systems it observes.

The cross-link that makes the stack valuable​

The PLTG stack's compound value comes from cross-linking. From a metric anomaly, click to logs from the same service in the same window. From a slow trace, click to the underlying logs. From an error log, click to the trace it's part of. Build the dashboards and the labels to support this; otherwise the three signals stay siloed.
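
The metrics-to-traces hop is typically done with exemplars: attach the current trace ID to an observation, and Grafana can link a latency panel straight to the trace in Tempo. A sketch, assuming the OTel tracer setup from the earlier example; note that prometheus_client only exposes exemplars in the OpenMetrics format:

Code:
from opentelemetry import trace
from prometheus_client import Histogram

LATENCY = Histogram("http_request_duration_seconds", "HTTP request latency")

def record_latency(duration_s: float) -> None:
    span = trace.get_current_span()
    ctx = span.get_span_context()
    if ctx.is_valid:
        # The trace_id exemplar label is what Grafana's Tempo
        # datasource link keys on.
        LATENCY.observe(duration_s, exemplar={"trace_id": f"{ctx.trace_id:032x}"})
    else:
        LATENCY.observe(duration_s)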

One pattern we'd warn about​

Adopting OpenTelemetry without committing to its full stack. The OTel collector, instrumentation libraries, and protocol are excellent and they're the future. Half-adopting (instrumenting with OTel but routing to a non-OTel backend with a brittle bridge) is worse than picking either side fully.

What's your stack? And — for the on-prem folks — has anyone fully replaced commercial APM (Datadog, New Relic) with self-hosted open-source observability at industrial scale?
 
