RAG Explained: Retrieval-Augmented Generation Without the Hype

author: emre bener read time: 14 min about: retrieval-augmented generation, large language model
mentions: word embedding, vector database, cosine similarity, nearest neighbor search, fine-tuning, hallucination

1. What RAG is, and the problem it actually solves

Retrieval-augmented generation (RAG) is a simple idea wearing an intimidating name: before the model answers, you find text relevant to the question and paste it into the prompt. The model then answers from that text instead of from memory alone. That’s the whole mechanism. Everything else (embeddings, vector databases, chunking strategies) is plumbing in service of that one move.

To see why the move is worth making, look at what a language model alone cannot do. Its knowledge is frozen at training time, so it knows nothing about events, releases, or documents that came after its cutoff. It has never seen your private data, your company wiki, your support tickets, or last Tuesday’s incident report. And when it doesn’t know something, it rarely says so. It produces a fluent, confident answer that happens to be wrong. Think of it as a reasoning engine with a fixed, public, slightly stale memory.

RAG addresses exactly one of those problems: access. It gives the model the right facts at the moment it needs them, pulled from a source you control and can update whenever you like. Want the model to answer questions about a document written this morning? Add the document to the retrieval store. No retraining, no fine-tuning, and nothing to wait for from the next model release.

It helps to be precise about what RAG does not do, because this is where the hype outruns the mechanism. RAG does not make the model smarter. It does not improve reasoning, math, or the model’s ability to follow instructions. It does not guarantee a correct answer. It changes what the model has in front of it, not what the model is capable of. If the model can reason well but lacks a fact, RAG closes that gap cleanly. If the model is bad at the underlying task, handing it more text just gives it more material to be wrong about.

There’s also an underrated benefit that has nothing to do with knowledge: attribution. Because the retrieved text comes from documents you can identify, a RAG system can cite its sources. The user gets an answer and a link to the paragraph it came from. For anything where trust matters (legal, medical, internal documentation, customer support) that traceability is often the real reason teams adopt RAG, more than the freshness.

2. The pipeline: indexing offline, retrieving at query time

A RAG system has two halves that run at completely different times. The first half, indexing, runs offline, ahead of any user question. The second half, retrieval and generation, runs online, once per query. Keeping these two phases separate in your head does most of the work of understanding the rest.

The indexing phase turns your corpus into something searchable. You take your documents, split each one into smaller passages (chunking, the subject of section 4), and pass every passage through an embedding model. The embedding model converts each passage into a vector, a list of numbers that encodes its meaning. You store those vectors, alongside the original text, in a vector database. This phase is slow and potentially expensive, but you pay the cost once, and re-pay only for documents that change.

The query phase reuses that index. When a user asks a question, you embed the question with the same embedding model, producing a query vector. You search the vector database for the stored passages whose vectors are closest to the query vector, and take the top few (the “top-k” results). Those passages are the retrieved context. You assemble a prompt that contains the user’s question plus the retrieved passages, hand it to the language model, and the model generates an answer grounded in that context.

Two details in that description matter more than they look. First, you have to use the same embedding model for both indexing and querying. Vectors from two different embedding models live in incompatible coordinate systems, and comparing them produces noise. Change your embedding model and you must re-index the entire corpus. Second, the language model never searches anything. It only ever sees the final assembled prompt. From the model’s point of view there is no “retrieval”; there is just a prompt that happens to contain some relevant text. The retrieval is something your system does before the model is involved at all.
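The two phases can be sketched in a few lines. This is a toy under stated assumptions, not a production design: `toy_embed` is a hypothetical stand-in that counts words, so it only matches on shared vocabulary and does not capture meaning the way a real embedding model does. The function names and the prompt layout are illustrative too. What the sketch does show faithfully is the shape: index once, embed the question with the same function, and hand the model nothing but an assembled string.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector.
    It matches shared words only; a real model encodes meaning."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(passages, embed):
    """Offline phase: embed every passage once, store vector and text together."""
    return [(embed(p), p) for p in passages]

def retrieve(index, question, embed, k=2):
    """Online phase, step 1: embed the question with the SAME function, keep the top k."""
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, passages):
    """Online phase, step 2: the language model only ever sees this string."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

passages = [
    "To reset your password, open Settings and choose Reset password.",
    "The office lunch menu changes every Monday.",
    "Incident reports are filed in the on-call wiki.",
]
index = build_index(passages, toy_embed)          # runs ahead of any question
question = "How do I reset my password?"
top = retrieve(index, question, toy_embed, k=1)   # runs once per query
prompt = build_prompt(question, top)              # this is all the model receives
```

Passing `embed` into both `build_index` and `retrieve` makes the same-model constraint visible: swap the function in one place and not the other, and the scores stop meaning anything.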

That separation is also where most of a RAG system’s quality is won or lost. The model can only work with what retrieval hands it. If retrieval surfaces the wrong passages, the model has no way to know, and no way to recover. Good generation cannot rescue bad retrieval. This is why most of the rest of this post is about the retrieval half.

3. Embeddings and similarity search

Retrieval works by turning meaning into geometry. An embedding model maps a piece of text to a point in a high-dimensional space (typically a few hundred to a few thousand dimensions) such that texts with similar meaning land near each other. “How do I reset my password” and “I forgot my login credentials” share almost no words, but a good embedding model places them close together, because it was trained to encode meaning rather than vocabulary. That property is the entire reason RAG retrieval works on questions phrased differently from the documents that answer them.

“Close together” needs a precise definition, and the usual one is cosine similarity: the cosine of the angle between two vectors. It ranges from 1 (pointing the same direction, very similar) through 0 (perpendicular, unrelated) to -1 (opposite). For two vectors $A$ and $B$:

$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$$

Cosine similarity is preferred over plain straight-line distance because it ignores vector length and compares only direction. That matters because the direction of an embedding encodes its meaning, while its magnitude tends to track incidental things like passage length. Two passages about the same topic should count as similar whether one is a sentence and the other a paragraph.
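A small worked example makes the length-invariance concrete. The vectors here are made up for illustration and have two dimensions so the arithmetic is easy to follow; real embeddings have hundreds or thousands.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot product over the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # undefined for a zero vector

def euclidean_distance(a, b):
    """Plain straight-line distance, for comparison."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short = [1.0, 2.0]
long_ = [10.0, 20.0]   # same direction as `short`, ten times the magnitude
other = [2.0, -1.0]    # perpendicular to `short`

same_direction = cosine_similarity(short, long_)   # approximately 1.0
perpendicular = cosine_similarity(short, other)    # 0.0

# Straight-line distance gets the ranking backwards: it puts the perpendicular
# vector closer to `short` than the one pointing the same way.
dist_long = euclidean_distance(short, long_)    # sqrt(405), about 20.1
dist_other = euclidean_distance(short, other)   # sqrt(10), about 3.2
```

Cosine similarity ranks `long_` as the closest match to `short`, while straight-line distance prefers `other` purely because of magnitude.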

[Figure: Meaning as geometry. The query "reset my password" sits at a small angle from "I forgot my login credentials" and at a wide angle from "office lunch menu". Cosine similarity = cosine of the angle. Smaller angle, closer meaning.]

Retrieval, then, is a nearest-neighbor search: embed the query, find the stored vectors with the highest cosine similarity to it, return the passages they came from. With a few thousand passages you could compare against every one of them directly. At millions of passages that’s too slow, so vector databases use approximate nearest-neighbor (ANN) indexes, which trade a small, usually unnoticeable amount of recall for an enormous speedup. That tradeoff is what makes RAG over a large corpus practical at all.
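The exact version of that search, the baseline an ANN index approximates, is short enough to write out. This is a sketch with hand-made three-dimensional vectors and hypothetical passage ids, not how any particular vector database implements it. It scores every stored vector on every query, which is the cost that stops scaling.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, stored, k):
    """Exact nearest-neighbor search: score every stored vector, keep the k best.
    `stored` maps passage id -> unit-length vector."""
    q = normalize(query)
    scores = (
        (sum(a * b for a, b in zip(q, vec)), passage_id)
        for passage_id, vec in stored.items()
    )
    return heapq.nlargest(k, scores)  # list of (score, passage_id), best first

# Made-up vectors standing in for embedded passages.
stored = {
    "password-reset": normalize([0.9, 0.1, 0.0]),
    "login-help": normalize([0.8, 0.3, 0.1]),
    "lunch-menu": normalize([0.0, 0.1, 0.9]),
}
results = top_k([1.0, 0.2, 0.0], stored, k=2)
ranked_ids = [passage_id for _, passage_id in results]
# ranked_ids == ["password-reset", "login-help"]
```

The work per query grows with the number of stored vectors times their dimension. That is fine at a few thousand passages and is the reason approximate indexes exist at a few million.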

The embedding model is doing the heavy lifting here, and that makes it a real choice rather than a default. A model trained mostly on general web text has only a fuzzy sense of specialized vocabulary. Feed it dense legal contracts, clinical notes, or code, and “similar” passages may not actually be the ones that answer the question, because the model never learned what closeness means in that domain. The embedding model defines the meaning of “relevant” for your entire system. Picking it casually is picking your retrieval quality casually.

4. Chunking: the decision that quietly decides everything

You cannot embed a whole document as one vector and expect good retrieval. A single vector has to summarize everything the document says, and a long document says too much for one point in space to represent. Squeeze a 40-page manual into one embedding and you get a blurry average of every topic it covers, close to everything and precise about nothing. So you split documents into smaller passages, called chunks, and embed each chunk separately. How you do that splitting is chunking, and it shapes retrieval quality more than any other single decision in the pipeline.

The core tension is chunk size. Small chunks of a sentence or two produce sharp, focused embeddings that match narrow questions precisely, but each chunk carries little surrounding context, so an answer that depends on the paragraph around it gets retrieved without that paragraph. Large chunks of a full section keep context intact but dilute the embedding, because the chunk now spans several subtopics and its vector is again an average. Push chunks larger still and you spend your prompt budget on retrieved text that is mostly irrelevant to the actual question.

Two failure modes follow directly from getting this wrong, and both are worth recognizing on sight. The first is the split answer: the information needed to answer the question is spread across two adjacent chunks, and retrieval surfaces only one. The chunk that mentions a configuration flag retrieves; the chunk three sentences later that explains what the flag does to production traffic does not. The model answers from half the story and has no idea the other half existed.

[Figure: The split-answer failure. A source document is cut into Chunks 1 to 4. One answer spans the boundary between Chunk 2 (names the config flag) and Chunk 3 (explains its effect). In the top-k results, Chunk 2 is retrieved and Chunk 3 is missed, below the cutoff. The model answers from half the story, and never knows the rest existed.]

The second is the orphaned chunk: a passage that made perfect sense in the document loses its meaning once cut out. A chunk that reads “This is not supported in versions prior to 3.0” is useless if the surrounding text that says what “this” refers to landed in a different chunk. The embedding of an orphaned chunk is misleading too, because the embedding model encoded text whose meaning depended on context that is no longer attached.

The common mitigations all trade simplicity for context. Overlapping chunks, where each chunk repeats the last sentence or two of the previous one, reduce split answers at the cost of some duplication. Splitting on document structure rather than a fixed character count keeps semantically whole units together, since paragraphs and sections are already coherent. Attaching metadata to each chunk, such as the document title and the section heading it came from, gives an orphaned passage some of its context back. None of these is a default that always wins. The right chunking strategy depends on your documents, and the only reliable way to choose is to measure retrieval quality on real questions, not to reason about it in the abstract.
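Two of those mitigations, overlap and metadata, fit in a short sketch. It assumes a naive sentence splitter (a regex on end punctuation, which real text will break), and the flag name `enable_drain`, the document title, and the section name are all hypothetical. Treat it as an illustration of the idea rather than a recommended chunker.

```python
import re

def chunk_sentences(text, size=3, overlap=1):
    """Group sentences into chunks of `size`, repeating the last `overlap`
    sentences of each chunk at the start of the next one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    # Naive splitter: breaks after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the last sentence is covered; avoid a redundant tail chunk
    return chunks

def with_metadata(chunks, title, section):
    """Attach document title and section heading so a chunk keeps some context."""
    return [{"title": title, "section": section, "text": c} for c in chunks]

doc = (
    "The service reads its settings at startup. "
    "Set enable_drain to turn on drain mode. "
    "Drain mode stops the node from accepting production traffic. "
    "Restart the node to apply the change."
)

no_overlap = chunk_sentences(doc, size=2, overlap=0)    # flag and effect land in different chunks
with_overlap = chunk_sentences(doc, size=2, overlap=1)  # one chunk holds both
records = with_metadata(with_overlap, title="Operations manual", section="Draining a node")
```

Without overlap, the sentence naming the flag and the sentence explaining its effect end up in separate chunks, which is the split-answer setup. With one sentence of overlap, the middle chunk carries both, at the price of three chunks instead of two.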

5. Where RAG breaks

RAG fails quietly. A broken RAG system does not throw an error; it produces a confident, fluent, well-formatted answer that is wrong, and it looks exactly like a working one. Knowing the specific failure modes is the only way to recognize trouble, because the output alone will not tell you. Here are the ones that recur.

5.1. Retrieval misses

The passage that answers the question exists in your corpus, but it isn’t in the top-k results. Maybe the question was phrased in a way the embedding model didn’t connect to the document, maybe k was set too low, maybe a near-duplicate chunk crowded it out. Whatever the cause, the model now answers without the one passage it needed, and because it doesn’t know the passage exists, it answers anyway, from general knowledge or from whatever weaker passages did retrieve.

5.2. Stale embeddings

A document changes, but its vector in the index still reflects the old text. Until you re-index that document, retrieval is searching yesterday’s corpus. This is easy to forget precisely because indexing is the offline half of the pipeline. Nothing about a normal query reminds you the index has drifted from the source, so the system happily retrieves and cites a passage that no longer says what the index thinks it says.
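One way to notice the drift is to store a content hash next to each indexed document and compare it against the source on a schedule. The article does not prescribe this approach; it is an assumption added here for illustration, and the document ids and texts are invented.

```python
import hashlib

def fingerprint(text):
    """Stable content hash of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(source_docs, indexed_fingerprints):
    """Ids whose text changed since indexing, or that were never indexed."""
    return sorted(
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_fingerprints.get(doc_id) != fingerprint(text)
    )

def find_deleted(source_docs, indexed_fingerprints):
    """Ids still in the index whose source document no longer exists."""
    return sorted(set(indexed_fingerprints) - set(source_docs))

# Fingerprints recorded when the index was last built (hypothetical documents).
indexed = {
    "runbook": fingerprint("Restart the service with flag A."),
    "faq": fingerprint("Passwords expire after 90 days."),
    "old-policy": fingerprint("Retired policy text."),
}
# The source today: the runbook was edited, a changelog appeared, a policy was removed.
source = {
    "runbook": "Restart the service with flag B.",
    "faq": "Passwords expire after 90 days.",
    "changelog": "Version 3.0 released.",
}

stale = find_stale(source, indexed)      # ['changelog', 'runbook'] need (re-)indexing
deleted = find_deleted(source, indexed)  # ['old-policy'] should leave the index
```

The unchanged `faq` is skipped, which is the "re-pay only for documents that change" property from the indexing phase.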

5.3. The model ignores the context

You retrieve the correct passage, paste it into the prompt, and the model answers from its own training data anyway, sometimes contradicting the passage you just handed it. This happens more when the retrieved text conflicts with something the model “knows” strongly, or when the prompt doesn’t clearly instruct the model to ground its answer in the provided context and say so when the context is insufficient. That last part is a prompting problem more than a retrieval one. Retrieval succeeded; generation undercut it.
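A prompt that states the grounding rule explicitly might look like the sketch below. The instruction wording is an assumption, not a tested recipe, and no phrasing guarantees the model will comply. It shows where the instruction sits in the prompt and nothing more.

```python
def build_grounded_prompt(question, passages):
    """Assemble a prompt that tells the model to answer from the passages
    and to say so when they are insufficient. Wording is illustrative."""
    numbered = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered context passages below.\n"
        "Cite the passage number for each claim.\n"
        "If the context does not contain the answer, reply exactly: "
        '"The provided context is insufficient."\n\n'
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

grounded = build_grounded_prompt(
    "Which versions support this feature?",
    ["This is not supported in versions prior to 3.0."],
)
```

Numbering the passages also gives the model something to cite, which supports the attribution benefit from section 1.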

5.4. Lost in the middle

Language models do not attend evenly across a long prompt. Information at the very start and very end of the context gets used reliably; information buried in the middle of a long block of retrieved passages is more likely to be skimmed over. Stuff twenty chunks into a prompt and the genuinely relevant one, sitting at position eleven, can effectively go unread. More retrieved context is not automatically better, and past a point it actively hurts.

5.5. Domain mismatch

As section 3 noted, a general-purpose embedding model has a weak grasp of specialized vocabulary. In a specialized corpus this surfaces as retrieval that looks plausible and is subtly wrong: the returned passages are about the right general area but not the specific point the question asked about. This one is especially insidious because the system never looks broken. It looks like it’s working and merely being unhelpful.

The thread connecting all of these: failures concentrate in the retrieval half, they’re invisible in the output, and the fix is almost always to measure retrieval directly rather than to judge the system by its answers. If you only ever look at final answers, you are debugging the last step of a pipeline whose problems are upstream.
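Measuring retrieval directly can start very small: a list of real questions, each paired with the id of the passage that should come back, and a count of how often it appears in the top k. The sketch below assumes a `retrieve(question, k)` callable that returns passage ids. The canned rankings stand in for a real retriever and are invented for the example.

```python
def hit_rate_at_k(eval_set, retrieve, k):
    """Fraction of questions whose expected passage id appears in the top-k results.
    `eval_set` is a list of (question, expected_passage_id) pairs."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = sum(
        1 for question, expected_id in eval_set
        if expected_id in retrieve(question, k)
    )
    return hits / len(eval_set)

# Canned rankings standing in for a real retriever (hypothetical data).
canned_rankings = {
    "How do I reset my password?": ["password-reset", "login-help", "lunch-menu"],
    "What does the drain flag do?": ["lunch-menu", "login-help", "drain-mode"],
}

def fake_retrieve(question, k):
    return canned_rankings[question][:k]

eval_set = [
    ("How do I reset my password?", "password-reset"),
    ("What does the drain flag do?", "drain-mode"),
]

at_1 = hit_rate_at_k(eval_set, fake_retrieve, k=1)  # 0.5: the second question misses
at_3 = hit_rate_at_k(eval_set, fake_retrieve, k=3)  # 1.0: the passage is there, ranked low
```

A gap between the two numbers is itself diagnostic: the right passage exists and is reachable, but it ranks too low for a small k.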

6. RAG vs fine-tuning vs long context

RAG is one of three ways to get a model to handle knowledge or behavior it didn’t have out of the box, and they are routinely confused. They solve different problems, and the cleanest way to choose is to ask what kind of gap you’re closing.

| Approach | Best for | Cost model | Updating |
| --- | --- | --- | --- |
| RAG | Large or changing knowledge; answers that need source attribution | Indexing + per-query retrieval | Add or re-index a document |
| Fine-tuning | Behavior, tone, format, task-specific style | Up-front training run | Retrain the model |
| Long context | A small, stable corpus that fits in the prompt | Per-query tokens, every query | Edit the text you paste in |

Fine-tuning changes the model’s weights by training it further on your examples. It is the right tool for behavior: a consistent tone, a specific output format, a specialized task the base model performs awkwardly. It is the wrong tool for knowledge. Teaching a model new facts by fine-tuning is slow, expensive, hard to update, and prone to the same confident-hallucination problem, because the facts get blended into weights rather than kept as retrievable text. If the thing you need to add is knowledge that changes, fine-tuning is fighting the wrong fight.

Long context means skipping retrieval entirely and pasting the whole corpus into the prompt every time. Modern models have context windows large enough that for a small, stable body of text (a single handbook, one contract, a product’s documentation) this is genuinely the simplest correct choice. No vector database, no chunking, and nothing that can fail in retrieval because there is no retrieval. It stops working when the corpus is too big to fit, or large enough that paying for those tokens on every single query gets expensive, or when “lost in the middle” starts eating your accuracy. RAG is the answer to “the corpus no longer fits, or no longer fits cheaply.”
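The "fits cheaply" part is simple arithmetic, sketched below. Every number is a hypothetical placeholder, and the comparison covers input tokens only: it ignores output tokens, the one-time indexing cost, and the cost of embedding each query.

```python
def long_context_tokens(corpus_tokens, question_tokens):
    """Input tokens per query when the whole corpus is pasted into the prompt."""
    return corpus_tokens + question_tokens

def rag_tokens(k, chunk_tokens, question_tokens):
    """Input tokens per query when only the top-k chunks are pasted in."""
    return k * chunk_tokens + question_tokens

# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
corpus_tokens = 200_000
question_tokens = 50
k, chunk_tokens = 5, 400
queries = 10_000

per_query_long = long_context_tokens(corpus_tokens, question_tokens)  # 200_050
per_query_rag = rag_tokens(k, chunk_tokens, question_tokens)          # 2_050

total_long = per_query_long * queries  # 2_000_500_000 input tokens
total_rag = per_query_rag * queries    # 20_500_000 input tokens
```

With a corpus of 2,000 tokens instead, the long-context prompt would be about the same size as the RAG one, which is why long context wins for small, stable text.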

In practice these compose rather than compete. A mature system might fine-tune a model so it reliably answers in the right format and grounds itself in provided context, then use RAG to feed it current, attributable facts at query time. “RAG vs fine-tuning” is a false choice; “knowledge gap vs behavior gap” is the real distinction, and a system can have both.

7. When to reach for RAG, and where to go next

Reach for RAG when the gap is knowledge: the model needs facts it doesn’t have, those facts are too large or too fresh to bake into the model, and ideally you want answers traceable back to a source. That is the shape of problem RAG fits. Private document Q&A, support systems grounded in a knowledge base, anything answering questions about a corpus that changes faster than model releases do.

Don’t reach for RAG when the problem isn’t actually about missing knowledge. If the model reasons poorly on your task, RAG won’t help; that’s a model-capability gap. If you need a particular tone or output format, that’s fine-tuning’s job. If your corpus is small and stable, long context is simpler and has fewer failure modes. And if the model already knows the material well (general programming questions, common knowledge) RAG just adds latency, infrastructure, and new ways to fail in exchange for nothing. RAG is not free. It’s a retrieval pipeline you now own, operate, and debug.

This post stopped at the core pipeline on purpose, because the core pipeline is where most real systems live and most real problems start. But it is worth knowing what’s past it, if only by name. Reranking runs a second, more expensive model over the top-k retrieved passages to reorder them by relevance, catching cases where pure vector similarity ranked the wrong passage first. Hybrid search combines vector similarity with old-fashioned keyword search, which is still better at exact matches like error codes, names, and identifiers that embeddings tend to blur. Evaluation is the practice of measuring retrieval and answer quality systematically instead of by spot-checking, and given how quietly RAG fails (section 5), it is less optional than it sounds. GraphRAG and related approaches retrieve over a structured knowledge graph rather than a flat pile of chunks, which helps with questions that need to connect facts across multiple documents.

Each of those is a refinement of the same two-phase mechanism, not a replacement for it. The idea underneath stays the one from section 1: find the relevant text, put it in front of the model, let the model answer from it. Get retrieval right, measure it honestly, and be clear-eyed about the gap RAG actually closes. The hype around RAG oversells it as a way to make models smarter, which it isn’t. What it does is make sure a capable model is looking at the right page, and that turns out to be worth a great deal on its own.