Retrieval Augmented Generation

Oct 22, 2023

How do LLMs incorporate private or real-time data? One strategy is retrieval augmented generation (RAG).

The idea: given a user query, first search for relevant context, then combine that context with the query to generate an answer.
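A minimal sketch of that flow, where `search` and `generate` are hypothetical placeholders for a retriever and an LLM call (any vector store or model API could fill those roles):

```python
def search(query: str, k: int = 3) -> list[str]:
    """Return the k most relevant documents for the query (placeholder)."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Call an LLM with the prompt and return its completion (placeholder)."""
    raise NotImplementedError

def answer(query: str) -> str:
    # Retrieve context, then stuff it into the prompt alongside the question.
    context = "\n\n".join(search(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return generate(prompt)
```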

The problem: LLMs are limited by the size of the context window they can process. Most models today can only accept around 4,000 tokens of context (about 3,000 words). Some models, like Anthropic’s Claude, can handle up to 100,000 (but that comes at the cost of quality, compute, and time).
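That budget is easy to check before sending a prompt. A rough sketch using OpenAI's tiktoken tokenizer (the `cl100k_base` encoding used by GPT-3.5/GPT-4-era models; the exact token count and budget vary by model):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str, budget: int = 4000, reserve: int = 500) -> bool:
    """Check whether the prompt fits, leaving `reserve` tokens for the answer."""
    return len(enc.encode(prompt)) <= budget - reserve

print(fits_in_context("Summarize the attached report..."))  # True for short prompts
```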

There are a variety of methods for RAG, and most of them center on similarity search over text embeddings. The idea behind similarity search is that documents semantically similar to the query are likely to be relevant. In practice, most retrieval pipelines end up using hybrid search (semantic search combined with traditional keyword filtering).
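Here is a sketch of the semantic-search half, assuming a hypothetical `embed` function standing in for any text-embedding model (a sentence-transformer, a hosted embeddings API, etc.) that returns fixed-size vectors:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per text (placeholder)."""
    raise NotImplementedError

def top_k(query: str, docs: list[str], k: int = 3) -> list[str]:
    doc_vecs = embed(docs)        # shape: (n_docs, dim)
    q_vec = embed([query])[0]     # shape: (dim,)
    # Cosine similarity between the query and every document.
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    ranked = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in ranked]
```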

Splitting documents into chunks, modeling unstructured data, and deciding exactly how to perform the search are all opinionated design decisions, so plenty of frameworks have emerged to formalize them (LangChain, LlamaIndex, etc.).
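Chunking is the simplest of those decisions to illustrate. A naive fixed-size splitter with overlap (real frameworks offer smarter splitters that respect sentences, headings, or token counts; this is just the idea):

```python
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping chunks of roughly `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```

Each chunk then gets embedded and indexed separately, so retrieval can pull in just the relevant passage instead of a whole document.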

What are the alternatives to RAG? Longer context windows mean that more of the search problem for relevant documents can be offloaded to the LLM itself. Fine-tuning is another alternative, but only for offline processes (and a fine-tuned model might still require RAG).