
Industry community — for your questions, experiences, and announcements.

Fine-tuning vs RAG vs prompting: when each is the right tool

Aior

Administrator
Staff member

The decision that wastes the most money

"Should we fine-tune?" is one of the most-asked, most-overdone decisions in LLM work. Fine-tuning is expensive, locks you into a snapshot of a model, and is often the wrong answer to a problem that prompting + RAG could have solved cheaper. Below is the framing we use.

The hierarchy: cheapest to most expensive
  1. Better prompting — almost always tried first. Cheap, reversible, fast.
  2. Few-shot examples — adds capability via context. Limited by context window cost.
  3. RAG — adds knowledge via retrieval. Decouples knowledge from model.
  4. Tool use — adds capability via external functions. Composes with all of the above.
  5. Fine-tuning — adapts the model. Expensive, slower to iterate, locks in a snapshot.
  6. Pretraining — almost never the right call for a product team.

Move down the list only when the previous step demonstrably can't solve the problem. Most projects we audit could have stayed at steps 1-3.
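The escalation rule above can be encoded as a tiny helper. This is purely illustrative — the step names and the `next_step` function are ours, not any library's API:

```python
# Ordered from cheapest to most expensive intervention.
HIERARCHY = [
    "better prompting",
    "few-shot examples",
    "RAG",
    "tool use",
    "fine-tuning",
    "pretraining",
]

def next_step(ruled_out: set[str]) -> str:
    """Recommend the cheapest intervention not yet shown to be insufficient."""
    for step in HIERARCHY:
        if step not in ruled_out:
            return step
    raise ValueError("every step ruled out -- re-examine the problem framing")
```

The point is the invariant: you never reach "fine-tuning" until the four cheaper steps have each been tried and demonstrably failed.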

When fine-tuning actually helps
Fine-tuning earns its place when:
  • Style / format consistency — the model needs to produce a specific output style across millions of calls, and prompting doesn't quite achieve it.
  • Latency / cost on a smaller model — a fine-tuned smaller model can match a larger model on a narrow task at lower cost.
  • Domain-specific behaviour — the model needs to handle terminology / patterns rare enough that prompting doesn't reliably produce them.
  • Closed-domain classification / extraction — for narrow tasks with abundant labelled data, a fine-tuned model often outperforms prompting.

When fine-tuning is the wrong answer
  • "Adding knowledge" — fine-tuning is poor at injecting facts. RAG is the better tool. The model "knows" about the document at training time but doesn't reliably retrieve specifics.
  • "Making the model better at reasoning" — pretrained models' reasoning capability is mostly fixed at the foundation level. Fine-tuning rarely improves it.
  • "Customising the personality" — system prompt + few-shot does this 95 % of the way for less cost.
  • "Fixing hallucinations" — fine-tuning doesn't reliably reduce hallucination. Validation, tool use, and RAG do.

The fine-tuning pipeline that works
If you've decided fine-tuning is the right answer:
  • Build a clean dataset of input-output pairs in the format the production system will use
  • Hold out a test set — never train on it
  • Run a baseline (prompted) eval on the test set to know what you're trying to beat
  • Fine-tune (LoRA / QLoRA for cost-effective parameter-efficient tuning, full fine-tune for the rare cases where it's needed)
  • Evaluate the fine-tuned model on the test set; compare to baseline
  • Run on production-distribution data, not just the curated test set
  • Plan for re-tuning when the base model upgrades
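The held-out-test-set discipline from the steps above can be sketched with stdlib Python. The exact-match metric and the model-as-a-callable convention are simplifying assumptions for illustration, not a real eval harness:

```python
import random

def split_dataset(pairs, test_frac=0.2, seed=0):
    """Shuffle once, carve off a held-out test set, and never train on it."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def accuracy(model_fn, test_set):
    """Exact-match score of a model callable over (input, expected) pairs.
    Swap in whatever metric your task actually needs."""
    hits = sum(1 for x, y in test_set if model_fn(x) == y)
    return hits / len(test_set)

# The comparison the pipeline calls for, on the SAME held-out set:
#   baseline = accuracy(prompted_model, test)
#   tuned    = accuracy(finetuned_model, test)
```

The baseline eval is what makes the fine-tune's value measurable: without it, "the model got better" is a feeling, not a number.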

LoRA / QLoRA dominates parameter-efficient fine-tuning in 2026. Full fine-tunes are rare for product use cases.
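A quick way to see why LoRA is parameter-efficient is to count trainable parameters for a single weight matrix. LoRA trains a low-rank update W + BA instead of W itself; the dimensions below are illustrative:

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Trainable parameters for one weight matrix: full fine-tune vs a
    rank-r LoRA update W + B @ A, where B is (d_out x r) and A is (r x d_in)."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora

# e.g. a 4096x4096 projection matrix at rank 16:
full, lora = lora_param_counts(4096, 4096, 16)
# full = 16_777_216, lora = 131_072 -> LoRA trains ~0.8% of the parameters
```

That two-orders-of-magnitude reduction in trainable parameters (repeated across every adapted matrix) is why LoRA fits on far cheaper hardware, and QLoRA pushes further by quantising the frozen base weights.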

Open-weight vs closed

  • Closed (Anthropic, OpenAI, Google) — fine-tuning available on some models, less control, no offline deployment, simpler ops.
  • Open weights (Llama, Qwen, Mistral, etc.) — full control, can run anywhere, more engineering investment, and longer-term costs are more predictable.

If the use case justifies the engineering investment, open-weight + LoRA fine-tuning is dramatically cheaper at scale. If you don't have GPU operations capability and the use case is moderate-volume, the closed-model fine-tuning APIs are the path of least resistance.
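The "cheaper at scale" claim can be made concrete with a break-even calculation. Every number you feed it is your own assumption — the figures in the comment are hypothetical, not vendor pricing:

```python
def breakeven_mtok_per_month(api_cost_per_mtok: float,
                             gpu_cost_per_month: float,
                             self_host_cost_per_mtok: float) -> float:
    """Monthly volume (in millions of tokens) at which self-hosting a
    fine-tuned open-weight model becomes cheaper than a closed-model API.
    Below this volume, the fixed GPU cost makes the API the cheaper path."""
    margin = api_cost_per_mtok - self_host_cost_per_mtok
    if margin <= 0:
        return float("inf")  # the API is never beaten on marginal cost
    return gpu_cost_per_month / margin

# Hypothetical inputs: $10/Mtok API, $2,000/month GPU, $0.50/Mtok marginal:
# breakeven_mtok_per_month(10.0, 2000.0, 0.5) -> ~210.5 Mtok/month
```

The shape of the answer matters more than the numbers: below the break-even volume, the fixed GPU cost dominates and the API wins; above it, the per-token margin compounds in favour of open weights.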

Data quality dominates everything
A fine-tune's quality is bounded by its dataset. A clean 1,000-example dataset typically beats a noisy 10,000-example one. Investing in dataset curation is the highest-leverage activity in any fine-tuning project.
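A minimal mechanical curation pass, assuming (input, output) string pairs. This only catches duplicates and empty examples — label errors, the more damaging kind of noise, still need human review:

```python
def curate(pairs):
    """Drop empty and exact-duplicate (input, output) examples.
    Order of first occurrence is preserved."""
    seen = set()
    clean = []
    for x, y in pairs:
        x, y = x.strip(), y.strip()
        if not x or not y:
            continue  # empty side: unusable example
        if (x, y) in seen:
            continue  # exact duplicate: inflates apparent dataset size
        seen.add((x, y))
        clean.append((x, y))
    return clean
```

Duplicates are worth catching early because they also leak between train and test splits, silently inflating your eval numbers.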

One pattern we'd warn about
Fine-tuning early in a project before evals exist. Without evals, you can't tell whether the fine-tune helped, hurt, or made the model brittle in unmeasured dimensions. Evals first, fine-tune second.

One pattern that always pays off
Comparing fine-tune-of-small-model against prompted-larger-model for cost / latency / quality across the test set. Often the latter wins. When the former wins, you have a real reason for the fine-tune.
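The head-to-head comparison is easy to script once the held-out set exists. The convention of a model callable returning (answer, cost, latency) is invented for this sketch — adapt it to however your harness records those numbers:

```python
def compare(candidates, test_set):
    """Run each candidate model over the same held-out (input, expected)
    pairs and report exact-match quality, total cost, and mean latency.
    `candidates` maps a name to a callable returning (answer, cost, latency)."""
    report = {}
    for name, model_fn in candidates.items():
        hits, cost, latency = 0, 0.0, 0.0
        for x, expected in test_set:
            answer, call_cost, call_latency = model_fn(x)
            hits += answer == expected
            cost += call_cost
            latency += call_latency
        n = len(test_set)
        report[name] = {
            "quality": hits / n,
            "total_cost": cost,
            "mean_latency": latency / n,
        }
    return report
```

Running this with entries like "small, fine-tuned" and "large, prompted" gives you the three-axis comparison in one table, instead of arguing each axis separately.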

What's your decision process for when to fine-tune? And — controversial — has anyone fine-tuned a model that produced a result they couldn't have achieved with better prompting?

 
