“RAG” stands for

retrieval-augmented generation

which is a fancy way of saying

a technique for giving a large language model new information, without retraining

The “without retraining” part is essential, as it allows one to avoid a time-consuming and expensive process, one that often requires access to many high-quality GPUs.

Privacy & Running Locally

While the technique of RAG can be applied by anyone, I’m primarily interested in using it in a private and local way. This means one can leverage their data in an environment that they entirely trust. We’ve touched on running LLMs locally in a few previous posts in this series.

Use Cases

What excites me the most about RAG is the ability to leverage it for both personal and professional reasons.

I personally use Obsidian as my note-taking application, which I’ve written about a fair bit before on this website. My favorite part about this application is that it simply stores your notes in local markdown files. Since you have all the data locally, this opens up a world of possibilities for custom integrations. In the spirit of this post, one of those possibilities is using RAG so your LLM has access to all of your notes and can draw on that information accordingly.

Professionally, I could see a use case where a company self-hosts their own LLM and pipes in their company, department, and team documents, from Confluence pages to source code. As mentioned above, the private, self-hosted nature of this approach would ensure that secrets are not exposed. Imagine a web server where teammates could individually add documents, or even chat with the LLM via a Slack bot.

Theory

There are numerous ways to implement RAG; however, in the popular approach I’ve seen, data ingestion looks like this in pseudo code:

for each file f
  break f into chunks
  for each chunk c
    create an embedding for c
    store the embedding in a vector database

Which, visually, would look like this:

graph LR
    A[User] -->|Step 1. Passes relevant documents| B[REST API]
    B -->|Step 2. Breaks each file into chunks| B
    B -->|Step 3. Creates embeddings for each chunk| C[Embedding model]
    B -->|Step 4. Stores each embedding| D[Vector Database]

Embeddings are a technique for representing data as a series of numbers, which we interpret as vectors. They have many uses in the natural language processing (NLP) space, but for now, just think of them as a way to encode our information in an easily retrievable format.
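To make this concrete, here’s a minimal sketch of that ingestion loop in Python. It assumes the sentence-transformers and chromadb libraries, a folder of markdown notes at ./notes, and a naive fixed-size chunker, using the same defaults the Ollama example below uses (all-MiniLM-L6-v2 for embeddings, Chroma as the vector database).

from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer

CHUNK_SIZE = 500  # characters per chunk; a knob worth tuning (see "Why RAG Is Hard")

embedder = SentenceTransformer("all-MiniLM-L6-v2")    # the embedding model
client = chromadb.PersistentClient(path="./rag-db")   # the vector database
collection = client.get_or_create_collection("notes")

# for each file f
for f in Path("./notes").rglob("*.md"):
    text = f.read_text(encoding="utf-8")
    # break f into chunks (naive fixed-size split)
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    if not chunks:
        continue
    # create an embedding for each chunk
    embeddings = embedder.encode(chunks).tolist()
    # store the embeddings (and the original text) in the vector database
    collection.add(
        ids=[f"{f}-{i}" for i in range(len(chunks))],
        embeddings=embeddings,
        documents=chunks,
    )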

The approach for querying the LLM alongside the new data then looks like this in pseudo code:

for each prompt p
  query the vector database for results r
  pass r and p into the LLM to get response rr
  return rr

Which, visually, would look like this:

graph LR
    A[User] -->|Step 1. Makes a query| B[REST API]
    B -->|Step 2. Searches for relevant documents| D[Vector Database]
    D -..->|Step 3. Returns relevant documents| B
    B -->|Step 4. Passes query and relevant documents| E[LLM]
    E -..->|Step 5. Returns response| B
    B -..->|Step 6. Returns response| A

This essentially works by searching our documents for anything relevant and injecting those relevant portions into the LLM’s context window. Recall that an LLM’s context window is the amount of text, or more formally the number of tokens, that the model can take in as input in a single query.
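To make the query side concrete as well, here’s a minimal sketch in Python. It assumes the same Chroma collection built in the ingestion sketch above, plus the ollama Python client talking to a locally running model; the model name is just a placeholder for whatever you have pulled.

import chromadb
import ollama
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./rag-db")
collection = client.get_or_create_collection("notes")

def ask(prompt: str, n_results: int = 3) -> str:
    # query the vector database for results r
    query_embedding = embedder.encode([prompt]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=n_results)
    context = "\n\n".join(results["documents"][0])
    # pass r and p into the LLM to get response rr
    response = ollama.chat(
        model="llama3.1",  # placeholder; use whichever local model you have pulled
        messages=[{
            "role": "user",
            "content": f"Use the following notes to answer.\n\nNotes:\n{context}\n\nQuestion: {prompt}",
        }],
    )
    # return rr
    return response["message"]["content"]

print(ask("What did I write about RAG?"))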

Alternatives

If you already know which document you want to ask an LLM about, it’s much simpler to provide that document directly in your query. Say you have a small snippet of code or a small dataset; pasting it into your query is the more straightforward approach. RAG really begins to shine when you’re not entirely certain which document you’re asking about.

Another approach, often described as harder to implement but superior to RAG, is using a “knowledge graph”. A knowledge graph uses a graph data structure to store not only your data, but also the relations between pieces of it. If we look at the pseudo code in the Theory section above, we’d simply replace the vector database, and the querying of it, with a knowledge graph and traversals of that graph instead. This sounds simple on paper, but will prove to be more work, as you have to encode the relations between your documents.
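As a rough illustration of the difference, and not a full implementation, here’s a tiny sketch using the networkx library; the node names and relation labels are entirely made up.

import networkx as nx

# Nodes are documents or entities; edges carry the relation between them.
kg = nx.DiGraph()
kg.add_edge("Payments Service", "Retry Policy Doc", relation="documented_by")
kg.add_edge("Payments Service", "billing/handler.go", relation="implemented_in")
kg.add_edge("Retry Policy Doc", "Incident 42 Postmortem", relation="references")

# "Querying" becomes graph traversal rather than nearest-neighbour vector search.
for _, neighbour, data in kg.edges("Payments Service", data=True):
    print(f"Payments Service --{data['relation']}--> {neighbour}")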

Why RAG Is Hard

RAG is often considered difficult because it’s highly tunable, and a generic implementation will not work well for all data. You first have to decide which documents to add, then decide how to chunk them. This chunking process is important, and it is a delicate balance between performance and relevance. Too small a chunk size and you end up with many more chunks to embed and search, slowing things down, and each retrieved chunk may not carry enough context for your searches to yield relevant information. Too large a chunk size and you risk including irrelevant information in your queries, as well as spending more time processing unrelated data.

One problem with chunking one’s documents is the possible damage done to semantics: a sentence or idea can be split across a chunk boundary. One common solution is to ensure that the chunks overlap. This increases your storage needs, but should improve retrieval.
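For illustration, a sliding-window chunker with overlap might look like this; the sizes are arbitrary and worth tuning for your data.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    # Each new chunk starts (chunk_size - overlap) characters after the previous one,
    # so neighbouring chunks share `overlap` characters and ideas are less likely
    # to be cut in half at a boundary.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("some long document text...", chunk_size=20, overlap=5)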

Ollama Example

The best and simplest example I’ve stumbled upon for setting up your own RAG is from the Ollama project, in their examples directory.

How it works is that you point the application at a directory containing your data. The example then iterates over your files and handles each one according to its extension (a simplified sketch of this dispatch follows the list below). The default supported extensions are:

  • .csv: CSV
  • .docx: Word Document
  • .doc: Word Document
  • .enex: EverNote
  • .eml: Email
  • .epub: EPub
  • .html: HTML File
  • .md: Markdown
  • .msg: Outlook Message
  • .odt: Open Document Text
  • .pdf: Portable Document Format (PDF)
  • .pptx: PowerPoint Document
  • .ppt: PowerPoint Document
  • .txt: Text file (UTF-8)
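A heavily simplified version of that dispatch-by-extension idea might look like the following; this is not the example’s actual code, and only a couple of loaders are shown. Real loaders for formats like .pdf or .docx would use dedicated libraries rather than reading raw text.

from pathlib import Path

def load_plain_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")

# Map each supported extension to a loader function.
LOADERS = {
    ".md": load_plain_text,
    ".txt": load_plain_text,
}

def load_document(path: Path) -> str:
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported extension: {path.suffix}")
    return loader(path)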

As mentioned above, the files are broken into chunks and vectorized using an embedding model, which by default is all-MiniLM-L6-v2. These embeddings are then stored in a vector database, which by default is Chroma.

Moving Forward

As mentioned above, another approach for giving an LLM data of interest is to use what is called a “knowledge graph”. I may explore this in a future blog post.

Additionally, I may investigate refactoring the Ollama project’s example, as their code base is released under the MIT license. That, combined with, say, an open-source embedding model and an LLM that can both be self-hosted, would interest me a lot. I may code this up and blog about it in the future.

References

As referenced above, this blog post was inspired by this lovely Ollama example.

There are numerous projects that support RAG out of the box. Recall that there are several levers and knobs one can play with when working with RAG. Having not played with all of these projects, I’m unsure what degree of customization each affords.