# LLMOps is just ops, with new failure modes
The LLM in your application is a dependency. Like any dependency, it can change unexpectedly, fail, get expensive, drift. The patterns below are what we apply to keep LLM-powered features reliable in production.
## Cost: the metric that surprises everyone
LLM costs are usage-driven and easy to lose track of. The discipline:
- Per-feature attribution — every API call tagged with the feature/team that triggered it
- Daily / weekly cost dashboards — visible to engineering, not just finance
- Alert thresholds — paging on unexpected spikes
- Cost regression in CI — a prompt change that doubles token usage should be caught before deploy
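A minimal sketch of per-feature attribution, assuming you already wrap the provider SDK in your own call path; the model names and prices are illustrative, not anyone's actual rate card:

```python
from dataclasses import dataclass

# Illustrative prices in USD per million tokens -- substitute your provider's current rates.
PRICE_PER_MTOK = {
    "small-model": (0.25, 1.25),   # (input, output)
    "large-model": (3.00, 15.00),
}

@dataclass
class CallRecord:
    feature: str        # which feature/team triggered the call
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

def record_call(feature: str, model: str, input_tokens: int, output_tokens: int) -> CallRecord:
    """Tag every LLM call with its feature and an estimated cost."""
    in_price, out_price = PRICE_PER_MTOK[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    record = CallRecord(feature, model, input_tokens, output_tokens, cost)
    # Emit to whatever metrics pipeline you already run (StatsD, OTel, a warehouse table).
    return record
```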
Cost optimisation patterns:
- Cache identical queries (see the sketch after this list)
- Use smaller models for tasks they can handle (Haiku vs Sonnet vs Opus, GPT-4o-mini vs the full model)
- Trim context — most production calls don't need maximum context
- Streaming saves nothing on cost; it only helps perceived latency
- Prompt caching (where the API supports it) — meaningful savings on repeated context
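For "Cache identical queries", the simplest version is an exact-match cache keyed on a hash of model, prompt, and parameters. The sketch below assumes your real provider call is passed in as a plain function, and it is only appropriate where serving a repeated identical answer is acceptable (effectively temperature 0):

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(model: str, prompt: str, params: dict) -> str:
    """Identical (model, prompt, params) triples produce the same key."""
    payload = json.dumps({"model": model, "prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model: str, prompt: str, params: dict, call_fn) -> str:
    """call_fn is your real provider call; it is only invoked on a cache miss."""
    key = cache_key(model, prompt, params)
    if key not in _cache:
        _cache[key] = call_fn(model, prompt, params)
    return _cache[key]
```

In production this would live in Redis or similar with a TTL rather than an in-process dict.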
## Drift — the slow leak
LLMs drift in two ways:
- Provider-driven — the model gets updated, behaviour changes subtly. Pinning to a specific version controls this; not pinning is rolling the dice.
- Distribution-driven — the inputs your system sees in production change over time, and the model's responses follow.
Both manifest as "the feature that worked last month is producing odd outputs this week". The detection:
- Run the eval set against the production model regularly (daily or weekly)
- Track output statistics on production traffic — average output length, schema validation rate, refusal rate, sentiment distribution (a rolling-window sketch below)
- Sample user feedback — when you have a thumbs-up/down, watch the trend
Drift caught at week 1 is fixable. Drift caught at month 4 is a regression with months of degraded user experience.
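One lightweight way to track those output statistics: fold every production response into a rolling window and compare today's numbers against a stored baseline. The refusal heuristic and window size below are placeholders, not recommendations.

```python
import json
from collections import deque

class OutputDriftTracker:
    """Rolling window of output stats: length, schema validity, refusal rate."""

    def __init__(self, window: int = 1000):
        self.lengths = deque(maxlen=window)
        self.schema_ok = deque(maxlen=window)
        self.refusals = deque(maxlen=window)

    def observe(self, output: str) -> None:
        self.lengths.append(len(output))
        self.schema_ok.append(self._is_valid_json(output))
        self.refusals.append("I can't help with" in output)  # crude refusal heuristic

    def stats(self) -> dict:
        n = max(len(self.lengths), 1)
        return {
            "avg_length": sum(self.lengths) / n,
            "schema_valid_rate": sum(self.schema_ok) / n,
            "refusal_rate": sum(self.refusals) / n,
        }

    @staticmethod
    def _is_valid_json(text: str) -> bool:
        try:
            json.loads(text)
            return True
        except ValueError:
            return False
```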
## Monitoring — what to log
Per-call:
- Model version, parameters (temperature, max tokens, etc.)
- Full input prompt (or hash, for privacy)
- Full output
- Latency
- Token counts (input + output)
- Cost
- Error / refusal status
- Tool calls made
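A sketch of what one such per-call log line can look like, with the prompt stored as a hash rather than raw text; the field names are ours, not any standard schema:

```python
import hashlib
import json
import logging

logger = logging.getLogger("llm_calls")

def log_llm_call(*, model: str, params: dict, prompt: str, output: str,
                 latency_s: float, input_tokens: int, output_tokens: int,
                 cost_usd: float, error: str | None, tool_calls: list[str]) -> None:
    """One structured log line per call; the prompt is hashed for privacy."""
    logger.info(json.dumps({
        "model": model,
        "params": params,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output": output,
        "latency_s": round(latency_s, 3),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": cost_usd,
        "error": error,
        "tool_calls": tool_calls,
    }))
```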
Per-feature:
- Daily call volume
- Daily / hourly latency p50 / p95 / p99
- Eval score (rolling)
- Cost
- User-feedback signal (thumbs, ratings, completion rate)
## Rollback — assume it
A new prompt or model version goes wrong. You need to roll back fast. Patterns:
- Feature-flag the model + prompt selection — change without redeploy (sketched after this list)
- A/B test new prompts before fully rolling out — small % first
- Canary deploys for prompt changes
- Documented rollback path — "to revert, change flag X to Y"
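A sketch of the flag-driven selection, assuming a key-value flag store (LaunchDarkly, Unleash, or a config table would all work); the flag names and prompt versions are invented for illustration:

```python
# Hypothetical flag store -- swap in whatever flag system you already run.
FLAGS = {
    "summarise.model": "model-v1",
    "summarise.prompt_version": "2024-05-01",
}

PROMPTS = {
    "2024-05-01": "Summarise the following text in three bullet points:\n\n{text}",
    "2024-06-10": "Summarise the following text in three bullet points, plain English:\n\n{text}",
}

def build_request(feature: str, text: str) -> dict:
    """Model and prompt come from flags, so rollback is a flag change, not a deploy."""
    model = FLAGS[f"{feature}.model"]
    prompt = PROMPTS[FLAGS[f"{feature}.prompt_version"]].format(text=text)
    return {"model": model, "prompt": prompt}
```

The documented rollback path then reads as a flag change, for example "set summarise.prompt_version back to 2024-05-01", not a redeploy.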
## Latency budgets
LLM calls are slow relative to typical web responses. Budget accordingly:
- Expect 2-3 s for a typical Sonnet / GPT-4-class single-turn response
- Streaming makes the UX bearable for longer responses; time to first token is the metric users actually perceive (measured in the sketch below)
- Multi-step agent flows compound: 5 calls × 2 s = 10 s of perceived latency. Plan the UX around it, or restructure the flow.
- Cache + serve fast for repeat queries
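Because time to first token is what users actually feel, it is worth measuring separately from total latency. A small generator wrapper like this works with whatever token iterator your streaming client returns; the print stands in for a real metrics emit.

```python
import time
from typing import Iterable, Iterator

def stream_with_first_token_metric(token_stream: Iterable[str]) -> Iterator[str]:
    """Wrap a streaming response and record time-to-first-token plus total latency."""
    start = time.monotonic()
    first_token_s = None
    for token in token_stream:
        if first_token_s is None:
            first_token_s = time.monotonic() - start
        yield token
    total_s = time.monotonic() - start
    if first_token_s is None:
        first_token_s = total_s  # empty stream: no tokens ever arrived
    print(f"first_token={first_token_s:.2f}s total={total_s:.2f}s")
```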
## Privacy and PII
LLM API providers log calls. Some commit to no-training-on-customer-data; some don't. The discipline:
- Read the provider's data handling policy
- Strip PII before sending where possible (a redaction sketch below)
- Encrypt sensitive context server-side; only send the minimum
- Privacy review of LLM features as part of standard security review
For sensitive use cases (healthcare, financial), self-hosted open-weight models bypass the data-handling concern at the cost of operational overhead.
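"Strip PII before sending" can start as regex redaction applied just before the provider call. The patterns below are deliberately crude; treat them as a floor, and use a proper PII detection library plus review for anything genuinely sensitive.

```python
import re

# Illustrative patterns only -- real redaction needs a dedicated PII library and review.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace obvious PII with placeholders before the text leaves your servers."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```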
## Multi-provider fallback
The provider has an outage. The provider rate-limits you. The provider deprecates a model with 30 days' notice. Patterns:
- Abstract the provider call behind your own interface
- Test against multiple providers periodically
- Have a fallback configured for "if provider A is down, try provider B"
- Accept that fallback quality may differ — fail open with a degraded but functional response
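Once the provider call is behind your own interface, the fallback itself can be an ordered list of adapters. The sketch below assumes each adapter already normalises its provider's SDK into a plain prompt-in, text-out function.

```python
from typing import Callable

# Each entry is (provider_name, call_fn); call_fn takes a prompt and returns text.
# The adapters themselves are whatever thin wrappers you already wrote per provider.
ProviderCall = Callable[[str], str]

def complete_with_fallback(prompt: str, providers: list[tuple[str, ProviderCall]]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first that succeeds."""
    last_error: Exception | None = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:   # outage, rate limit, timeout...
            last_error = exc       # log it, then try the next provider
    raise RuntimeError("all providers failed") from last_error
```

The caller can then decide whether a fallback response is good enough to serve as-is or should be marked as degraded.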
## One pattern we'd warn about
"We'll figure out monitoring later". The team that ships an LLM feature without observability is the team debugging it on Slack screenshots. Build the logging on day one.
## One pattern that always pays off
A weekly LLMOps review meeting. Cost trends, eval drift, top error patterns, top user-flagged outputs. The boring meeting is the one that catches problems early.
What's your LLM observability stack? And — has anyone successfully run multi-provider failover at production scale without quality degradation?