Deeper into RAG

In the previous post we walked through creating a RAG example, line by line. Let’s take a closer conceptual look at the steps involved in creating a RAG:

Load your source documents, being careful to keep the metadata
Split your documents into semantically meaningful chunks
Convert your text to vector representations using an embedding model
Store your vector representations in a vector database
Retrieve the data you need

The first step is to load your document. Here you are taking the raw content and its metadata (such as the creation date) and loading it into your application. Next, it is time to split the text, which is called chunking. It turns out that this is critical to creating a useful and fast RAG. Chunks are the atomic units used for retrieval, so to chunk well you’ll want to break your large documents into smaller, semantically meaningful pieces.

Here the Goldilocks approach is critical. You don’t want your chunk to be so small that it has no context, nor do you want it to be so big that it covers more than one concept.

There are a few strategies for creating chunks. A simple one is to split your document by paragraphs. If that does not work (for example, because the paragraphs are too long), you split by sentences, and failing that by lines.
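The fallback strategy above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular library does it: the function name split_text, the max_chars limit, and the regular expressions used to detect paragraph and sentence boundaries are all assumptions made for this example.

```python
import re


def split_text(text: str, max_chars: int = 500) -> list[str]:
    """Split by paragraphs, falling back to sentences, then to lines."""
    # Separators from coarsest to finest: blank line, sentence end, newline.
    patterns = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\n"]

    def _split(piece: str, level: int) -> list[str]:
        piece = piece.strip()
        if not piece:
            return []
        # Small enough, or nothing finer to split on: keep the piece as-is.
        # (A piece with no separators at all can still exceed max_chars.)
        if len(piece) <= max_chars or level >= len(patterns):
            return [piece]
        chunks = []
        for part in re.split(patterns[level], piece):
            chunks.extend(_split(part, level + 1))
        return chunks

    return _split(text, 0)


if __name__ == "__main__":
    sample = "Short para.\n\nFirst sentence is here. Second sentence is here."
    for chunk in split_text(sample, max_chars=30):
        print(repr(chunk))
```

A real splitter would usually also merge small neighbouring pieces back together and add some overlap between chunks, which this sketch leaves out to keep the fallback order easy to see.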

If you are lucky enough to have structured data (headings, a hierarchy, an outline), then the splitter can use that structure to create chunks. If you are working with code, you might have the splitter chunk by classes or functions.
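As a rough sketch of structure-aware splitting, the example below assumes the source is Markdown and treats each heading as the start of a new chunk. The function name split_by_headings and the shape of the returned dictionaries are made up for this illustration.

```python
import re

# Matches Markdown headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def _make_chunk(heading, body_lines):
    """Build a chunk dict, or return None if there is nothing to keep."""
    text = "\n".join(body_lines).strip()
    if heading is None and not text:
        return None
    return {"heading": heading, "text": text}


def split_by_headings(markdown: str) -> list[dict]:
    """Create one chunk per heading, keeping the heading as metadata."""
    chunks = []
    heading, body = None, []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the chunk collected so far.
            chunk = _make_chunk(heading, body)
            if chunk is not None:
                chunks.append(chunk)
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    # Close the final chunk at the end of the document.
    chunk = _make_chunk(heading, body)
    if chunk is not None:
        chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    doc = "# Intro\nHello.\n\n## Setup\nInstall it.\nRun it."
    for chunk in split_by_headings(doc):
        print(chunk)
```

Keeping the heading next to the text means each chunk carries a little of its own context, which helps when a chunk is later retrieved on its own.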

The next step is embedding. This is the most conceptually challenging step, though the tools will do the work for you. In essence, you are creating a multi-dimensional map in which the distance between chunks reflects how similar they are: the closer two chunks sit, the closer their meaning. This enables searching by semantic meaning rather than just keywords.

In such a map, for example, “speaking” and “speech” sit close to each other, while “dog” and “keyboard” are further apart.
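To make the idea of "close" and "far" concrete, here is a small sketch that compares vectors with cosine similarity, one common way to measure closeness between embeddings. The three-dimensional vectors below are invented by hand for illustration; a real embedding model would produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, NOT the output of a real embedding model.
toy_embeddings = {
    "speaking": [0.90, 0.80, 0.10],
    "speech": [0.85, 0.75, 0.20],
    "dog": [0.10, 0.20, 0.90],
    "keyboard": [0.70, 0.10, 0.05],
}

if __name__ == "__main__":
    pairs = [("speaking", "speech"), ("dog", "keyboard")]
    for left, right in pairs:
        score = cosine_similarity(toy_embeddings[left], toy_embeddings[right])
        print(f"{left} vs {right}: {score:.3f}")
```

With these toy values the first pair scores much higher than the second, which is the behaviour you would hope to see from a real model as well.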

There are a number of embedding tools, some proprietary and some open source. The open-source models may be harder to set up, but they are free and can run locally.

Finally, we need to store the vectors in a vector database. The vectors are indexed, and searching typically uses Approximate Nearest Neighbor (ANN) search, taking advantage of the multi-dimensional map we created above.
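The sketch below shows what a vector store does at its core, using a tiny in-memory class. Note that it performs an exact, brute-force search over every stored vector; a real vector database builds an index so it can return approximately the same neighbors without scanning everything. The class name TinyVectorStore and its methods are hypothetical and do not correspond to any real product's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store: exact search, no index, no persistence."""

    def __init__(self):
        self._items = []  # Each item is (vector, text, metadata).

    def add(self, vector, text, metadata=None):
        self._items.append((list(vector), text, metadata or {}))

    def search(self, query, k=3):
        """Return the k most similar items as (score, text, metadata)."""
        scored = [
            (cosine_similarity(query, vector), text, metadata)
            for vector, text, metadata in self._items
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    store = TinyVectorStore()
    # Toy two-dimensional vectors, invented for this example.
    store.add([1.0, 0.0], "about cats", {"source": "cats.txt"})
    store.add([0.0, 1.0], "about cars", {"source": "cars.txt"})
    store.add([0.9, 0.1], "about kittens", {"source": "kittens.txt"})
    for score, text, metadata in store.search([1.0, 0.0], k=2):
        print(f"{score:.3f}  {text}  {metadata}")
```

Storing the metadata next to each vector is what lets you trace a retrieved chunk back to its source document later.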

That covers the ingestion workflow. Next is retrieval, which I’ll leave for the next blog post.