The decision that wastes the most money
"Should we fine-tune?" is one of the most-asked, most-overdone decisions in LLM work. Fine-tuning is expensive, locks you into a snapshot of a model, and is often the wrong answer to a problem that prompting + RAG could have solved cheaper. Below is the framing we use.The hierarchy: cheaper to more expensive[/HEADING>
- Better prompting — almost always tried first. Cheap, reversible, fast.
- Few-shot examples — adds capability via context. Limited by context window cost.
- RAG — adds knowledge via retrieval. Decouples knowledge from model.
- Tool use — adds capability via external functions. Composes with all of the above.
- Fine-tuning — adapts the model. Expensive, slower to iterate, locks in a snapshot.
- Pretraining — almost never the right call for a product team.
Move down the list only when the previous step demonstrably can't solve the problem. Most projects we audit could have stayed at steps 1-3.
When fine-tuning actually helps
Fine-tuning earns its place when:
- Style / format consistency — the model needs to produce a specific output style across millions of calls, and prompting doesn't quite achieve it.
- Latency / cost on a smaller model — a fine-tuned smaller model can match a larger model on a narrow task at lower cost.
- Domain-specific behaviour — the model needs to handle terminology / patterns rare enough that prompting doesn't reliably produce them.
- Closed-domain classification / extraction — for narrow tasks with abundant labelled data, a fine-tuned model often outperforms prompting.
When fine-tuning is the wrong answer
- "Adding knowledge" — fine-tuning is poor at injecting facts. RAG is the better tool. The model "knows" about the document at training time but doesn't reliably retrieve specifics.
- "Making the model better at reasoning" — pretrained models' reasoning capability is mostly fixed at the foundation level. Fine-tuning rarely improves it.
- "Customising the personality" — system prompt + few-shot does this 95 % of the way for less cost.
- "Fixing hallucinations" — fine-tuning doesn't reliably reduce hallucination. Validation, tool use, and RAG do.
The fine-tuning pipeline that works
If you've decided fine-tuning is the right answer:
- Build a clean dataset of input-output pairs in the format the production system will use (see the data-prep sketch after this list)
- Hold out a test set — never train on it
- Run a baseline (prompted) eval on the test set to know what you're trying to beat
- Fine-tune (LoRA / QLoRA for cost-effective parameter-efficient tuning, full fine-tune for the rare cases where it's needed)
- Evaluate the fine-tuned model on the test set; compare to baseline
- Run on production-distribution data, not just the curated test set
- Plan for re-tuning when the base model upgrades
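To make the first two steps concrete, here is a minimal data-prep sketch. It assumes OpenAI-style chat-message JSONL; the file names, system prompt, and 90/10 split are placeholders, not recommendations.

```python
import json
import random

def load_pairs(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

# Hypothetical raw file: [{"input": ..., "output": ...}, ...]
pairs = load_pairs("raw_pairs.jsonl")

# Wrap each pair in the exact chat format production will use.
records = [
    {
        "messages": [
            {"role": "system", "content": "You are the production assistant."},
            {"role": "user", "content": p["input"]},
            {"role": "assistant", "content": p["output"]},
        ]
    }
    for p in pairs
]

random.seed(42)  # reproducible split
random.shuffle(records)
split = int(len(records) * 0.9)
train, test = records[:split], records[split:]  # never train on `test`

for name, rows in [("train.jsonl", train), ("test.jsonl", test)]:
    with open(name, "w") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")
```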
LoRA / QLoRA dominates parameter-efficient fine-tuning in 2026. Full fine-tunes are rare for product use cases.
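As a rough illustration of what a QLoRA setup can look like with Hugging Face transformers and peft. The base model name, rank, and target modules below are placeholder assumptions, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load the base model with 4-bit quantised weights, then attach
# low-rank adapters. Only the adapters train; the base stays frozen.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder; use your base model
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(
    r=16,                    # adapter rank: the capacity-vs-cost knob
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of base weights
# ...then train with your usual Trainer / SFTTrainer loop on train.jsonl
```

One reason adapters fit the re-tuning step above: the trained artifact is small enough to version, swap, and re-train when the base model upgrades.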
Open-weight vs closed
- Closed (Anthropic, OpenAI, Google) — fine-tuning available on some models, less control, no offline deployment, simpler ops.
- Open weights (Llama, Qwen, Mistral, etc.) — full control, can run anywhere, more engineering investment, more predictable long-term cost.
If the use case justifies the engineering investment, open-weight + LoRA fine-tuning is dramatically cheaper at scale. If you don't have GPU operations capability and the use case is moderate-volume, the closed-model fine-tuning APIs are the path of least resistance.
Data quality dominates everything
A fine-tune's quality is bounded by its dataset: a clean 1,000-example dataset typically beats a noisy 10,000-example one. Dataset curation is the highest-leverage activity in any fine-tuning project.
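Below is a sketch of what a first curation pass might look like: exact dedup on inputs plus crude quality filters, assuming the JSONL layout from the earlier data-prep sketch. The thresholds are arbitrary placeholders; real curation also needs human review of a sample.

```python
import json
import hashlib

def fingerprint(text: str) -> str:
    # Exact-match dedup key: normalised hash of the user input.
    return hashlib.sha256(text.strip().lower().encode()).hexdigest()

seen: set[str] = set()
kept = []
with open("train.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        user = rec["messages"][1]["content"]
        answer = rec["messages"][2]["content"]
        if not answer.strip():        # drop empty targets
            continue
        if len(answer) > 8_000:       # drop suspiciously long outputs
            continue
        fp = fingerprint(user)
        if fp in seen:                # drop duplicate inputs
            continue
        seen.add(fp)
        kept.append(rec)

with open("train.curated.jsonl", "w") as f:
    for rec in kept:
        f.write(json.dumps(rec) + "\n")
```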
One pattern we'd warn about
Fine-tuning early in a project before evals exist. Without evals, you can't tell whether the fine-tune helped, hurt, or made the model brittle in unmeasured dimensions. Evals first, fine-tune second.
One pattern that always pays off
Comparing a fine-tuned small model against a prompted larger model on cost / latency / quality across the test set. Often the prompted larger model wins. When the fine-tune wins, you have a real reason for it.
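A sketch of what that bake-off harness can look like. The call_* stubs, the score metric, and the per-token prices are hypothetical stand-ins for your real inference clients and eval:

```python
import time
import statistics

def call_finetuned_small(prompt: str) -> tuple[str, int]:
    return "stub output", 50  # replace with your real inference client

def call_prompted_large(prompt: str) -> tuple[str, int]:
    return "stub output", 50  # replace with your real inference client

def score(output: str, reference: str) -> float:
    # Placeholder metric: exact match. Swap in your real eval.
    return float(output.strip() == reference.strip())

def run(call, test_set: list[dict], usd_per_1k_tokens: float) -> dict:
    latencies, scores, total_tokens = [], [], 0
    for ex in test_set:
        t0 = time.perf_counter()
        out, n_tokens = call(ex["input"])
        latencies.append(time.perf_counter() - t0)
        scores.append(score(out, ex["output"]))
        total_tokens += n_tokens
    return {
        "quality": statistics.mean(scores),
        "p50_latency_s": statistics.median(latencies),
        "cost_usd": total_tokens / 1000 * usd_per_1k_tokens,
    }

# Replace with your held-out test set (e.g. loaded from test.jsonl).
test_set = [{"input": "example question", "output": "example answer"}]
print("small:", run(call_finetuned_small, test_set, usd_per_1k_tokens=0.0002))
print("large:", run(call_prompted_large, test_set, usd_per_1k_tokens=0.003))
```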
What's your decision process for when to fine-tune? And — controversial — has anyone fine-tuned a model that produced a result they couldn't have achieved with better prompting?