The shape of an LLM application that survives a year
There's a very specific kind of LLM application that ships, gets used, and is still running a year later — the one where the model is a component in a system, not the system itself. Below are the patterns we've seen consistently produce that outcome.
RAG (retrieval-augmented generation): when it's right and when it isn't
The default architecture for "use the LLM to answer questions about my documents". Fundamentals:
- Chunk the corpus into retrievable units (paragraphs, sections, document fragments)
- Embed the chunks; store in a vector database
- On a query, embed the query, retrieve top-k chunks
- Pass query + retrieved context to the LLM, ask it to answer with citations
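A minimal sketch of that loop, assuming an OpenAI-style client and a chromadb collection as the vector store; the model names, the collection name, and `k` are illustrative choices, not recommendations:

```python
from openai import OpenAI
import chromadb

llm = OpenAI()
collection = chromadb.Client().create_collection("docs")

def index(chunks: list[str]) -> None:
    # Embed each chunk once and store it alongside the raw text.
    embeddings = [
        e.embedding
        for e in llm.embeddings.create(model="text-embedding-3-small", input=chunks).data
    ]
    collection.add(ids=[str(i) for i in range(len(chunks))],
                   documents=chunks, embeddings=embeddings)

def answer(query: str, k: int = 5) -> str:
    # Embed the query, pull the top-k chunks, ask the model to answer with citations.
    q_emb = llm.embeddings.create(model="text-embedding-3-small", input=[query]).data[0].embedding
    hits = collection.query(query_embeddings=[q_emb], n_results=k)["documents"][0]
    context = "\n\n".join(f"[{i+1}] {chunk}" for i, chunk in enumerate(hits))
    prompt = (f"Answer using only the context below. Cite sources as [n].\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    resp = llm.chat.completions.create(model="gpt-4o-mini",
                                       messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```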
Where it works: factual Q&A over a defined corpus, especially when the corpus is large and changes often.
Where it doesn't:
- Tasks requiring synthesis across many chunks — naive RAG retrieves locally relevant chunks, misses the global picture
- Tasks where the answer requires reasoning about the corpus structure (e.g. "how often does this term appear")
- Cases where the user's query doesn't lexically match the documents — retrieval misses the right chunks
Mitigations: hybrid retrieval (lexical + semantic), reranking, query rewriting, hierarchical / graph-based RAG. Each adds engineering cost.
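To make the first mitigation concrete, here's a sketch of hybrid retrieval's merge step using reciprocal rank fusion; the lexical and vector retrievers are assumed to exist elsewhere, only the fusion is shown:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each result list is a ranking of chunk ids, best first.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([lexical_search(q), vector_search(q)])[:5]
```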
Tool use is the more durable pattern
Letting the model call tools (functions) is increasingly the architecture that holds up. The model becomes the orchestration layer; the tools do the actual work. The pattern:
- Define a small, well-named set of tools — search, fetch, compute, write
- Each tool has a strict input schema and a structured output
- The model is given the toolset and instructed to use them when relevant
- The application validates and executes the tool calls; results go back to the model
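A sketch of that validate-and-execute loop, assuming an OpenAI-style tool-calling API; the tool name, its schema, and the model are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_orders",
        "description": "Search orders by customer email.",
        "parameters": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
    },
}]

def search_orders(email: str) -> list[dict]:
    # Stand-in: the real tool queries your system of record.
    return []

def run(messages: list[dict]) -> str:
    while True:
        resp = client.chat.completions.create(model="gpt-4o-mini",
                                              messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content                    # no more tool work requested
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)  # validate before executing
            result = search_orders(**args)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
```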
Tool use survives model upgrades better than RAG-only architectures. The application's value is in the tools (which you own) and the orchestration (which the model handles).
Agents — the careful version
Multi-step agents (model plans, executes, evaluates, replans) are useful for narrow domains where the cost of incorrect autonomy is bounded:
- Code generation with test feedback (write code, run tests, fix failures)
- Data analysis with iterative queries (ask, examine, refine)
- Customer support triage (gather information, classify, route)
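A sketch of the bounded loop for the first example (code generation with test feedback); `generate_patch`, the output file, and the step limit are stand-ins for whatever your agent actually uses:

```python
import subprocess

MAX_STEPS = 5  # bound the autonomy: hand back to a human after this

def generate_patch(task: str, feedback: str) -> str:
    # Stand-in: call the model with the task plus prior test output.
    return ""

def run_agent(task: str) -> bool:
    feedback = ""
    for _ in range(MAX_STEPS):
        patch = generate_patch(task, feedback)
        with open("solution.py", "w") as f:
            f.write(patch)
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return True                            # intermediate signal: tests pass
        feedback = result.stdout + result.stderr   # feed failures back in
    return False                                   # stuck: escalate, stop burning tokens
```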
Where they fail:
- Long-horizon tasks without good intermediate signals — the agent drifts off-task
- Tasks where every step has irreversible side effects — the cost of a bad step compounds
- Tasks where the user expects determinism — agents are inherently non-deterministic
The moat — what's actually defensible
The model is not your moat. The model upgrades for everyone simultaneously. What is defensible:
- Proprietary data and the right to use it — the corpus and its rights structure
- Domain-specific evaluation — the eval set that lets you ship reliably in your domain
- Workflow integration — the user's existing tools, processes, deployments
- Trust and accountability — being the company that takes responsibility when the model is wrong
- Enterprise-grade plumbing — auth, audit, compliance, multi-tenancy
The LLM application that competes on "we have the best prompt" is the one that loses next quarter to someone with the same prompt and a better business.
The cost conversation
LLM costs are real and they scale with usage:
- Cache aggressively — same query, same answer, no API call
- Right-size the model — don't use the most powerful model for tasks the cheaper model handles
- Limit the context window — most production calls don't need 200k tokens
- Stream where the user is waiting; batch where they're not
- Track per-feature spend — cost attribution is essential
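For the first item, a sketch of what "same query, same answer, no API call" looks like; a real deployment would back this with Redis or similar, the in-process dict just shows the shape:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cached_completion(client, model: str, messages: list[dict]) -> str:
    # Key on the exact model + messages so only identical calls are reused.
    key = hashlib.sha256(json.dumps({"model": model, "messages": messages},
                                    sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        resp = client.chat.completions.create(model=model, messages=messages)
        _cache[key] = resp.choices[0].message.content
    return _cache[key]
```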
One pattern we'd warn about
The "wrap everything in an LLM" temptation. If a deterministic algorithm can do the job, use it. LLMs for the parts that genuinely need natural language understanding; SQL / code / regex for everything else.
One pattern that always pays off
Logging the full conversation (input, intermediate steps, output, model version, latency, cost) for every production call. Enables eval set construction, regression debugging, and cost optimisation. Storage is cheap; replayability is gold.
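A sketch of what that record can look like; the field names are illustrative, the point is that every production call emits one:

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class LLMCallLog:
    request_id: str
    model_version: str
    messages: list[dict]            # full input, including system prompt
    intermediate_steps: list[dict]  # tool calls, retrieved chunks, retries
    output: str
    latency_ms: float
    cost_usd: float
    timestamp: float = field(default_factory=time.time)

def log_call(record: LLMCallLog) -> None:
    # Append-only JSONL is enough to start: it replays and it diffs.
    with open("llm_calls.jsonl", "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```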
What's your LLM stack? And — for the agent folks — what's the longest-horizon task you've had reliably autonomous in production?