
Retrieval-Augmented Generation

2026-04-21 · AI · RAG · search


TL;DR

Retrieval-Augmented Generation (RAG) is a pattern that gives a language model access to a knowledge base at query time — the model retrieves relevant documents, then generates its answer using those documents as context. Use RAG when your LLM needs to answer questions about private, proprietary, or frequently updated information that was not part of its training data.

What Is RAG?

Standard LLMs know only what was in their training data. Ask Claude or GPT-4 about your internal product docs, last week's support tickets, or a PDF you uploaded yesterday, and they either hallucinate or admit they do not know. RAG fixes this by turning the model into a search-powered answering engine.

The pipeline has two phases:

Indexing (offline): Split your documents into chunks, convert each chunk into a vector embedding using an embedding model, and store those vectors in a vector database. This runs once when you load new data.

Querying (online): When a user asks a question, embed the question using the same embedding model, search the vector database for the most similar chunks (top-k retrieval), inject those chunks into the LLM prompt as context, and let the model generate an answer grounded in the retrieved text.

The model never "remembers" the documents — it reads them freshly on every query. That means updates are instant: re-index a document and the next query sees the new version.
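The two phases above can be sketched end to end in a few lines. This is a toy, self-contained version: the `embed` function is a bag-of-words stand-in for a real embedding model, and the in-memory list stands in for a vector database, so the shape of the pipeline is visible without any infrastructure.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a term-frequency bag of words.
    A real pipeline would call an embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing (offline): embed each chunk and store the vectors.
chunks = [
    "Invoices are processed within 3 business days.",
    "Refunds require a receipt and the original card.",
    "Support is available Monday through Friday.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Querying (online): embed the question with the SAME model,
# retrieve the top-k most similar chunks, and hand them to the
# LLM as prompt context (the LLM call itself is omitted here).
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

context = retrieve("Are invoices processed within 3 days?")
```

Swapping the toy `embed` for a real embedding model and the list for a vector database changes nothing about the control flow, which is the point: RAG is retrieval glued in front of generation.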


Vector Database Comparison

| Database | Hosting | Scale | Hybrid Search | Best For |
|----------|---------|-------|---------------|----------|
| Pinecone | Managed cloud | Billions of vectors | Yes | Production SaaS, minimal ops |
| Weaviate | Self-hosted / cloud | Large | Yes | Complex schemas, multi-modal |
| Qdrant | Self-hosted / cloud | Large | Yes | High-performance, Rust-based |
| Chroma | Local / self-hosted | Small–medium | No | Local dev, prototyping |
| pgvector | PostgreSQL extension | Medium | Yes | Teams already running Postgres |
| LlamaIndex | Library (any backend) | Depends on backend | Depends | Rapid prototyping, framework abstraction |

Recommendation: Start with Chroma locally for prototyping — zero setup, Python-native. Move to Pinecone or pgvector for production. If your stack already runs PostgreSQL (like this platform), pgvector adds RAG with no new infrastructure.


When to Use RAG vs. Other Approaches

| Scenario | Recommended Approach |
|----------|----------------------|
| Answer questions about a static PDF | RAG with a vector store |
| Summarize yesterday's news articles | RAG with a freshness filter |
| Improve model tone or output format | Fine-tuning |
| Answer questions about public facts the model knows | Plain LLM prompt (no RAG needed) |
| Chat over a 10-page document | Stuff the whole doc into context |
| Chat over 10,000 support tickets | RAG (too large for context window) |
| Private codebase Q&A | RAG with code-aware chunking |
| Real-time sensor data | Streaming + LLM, not RAG |


FAQ

How is RAG different from fine-tuning? Fine-tuning bakes knowledge into the model's weights — it requires training compute and cannot be updated without re-training. RAG keeps knowledge external and updatable at any time. Fine-tune for style and behavior; use RAG for factual knowledge that changes.

How large should my chunks be? A common starting point is 256–512 tokens with a 10–20% overlap between adjacent chunks. Smaller chunks improve retrieval precision but lose surrounding context; larger chunks preserve context but reduce precision. Experiment with your data — there is no universal answer.
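A minimal sliding-window chunker shows how size and overlap interact. The function and its parameter names are illustrative, not from any particular library; it operates on a pre-tokenized list, so you can plug in whatever tokenizer matches your embedding model.

```python
def chunk_tokens(tokens: list[str], size: int = 256, overlap: int = 32) -> list[list[str]]:
    """Split a token list into fixed-size chunks, with `overlap`
    tokens shared between adjacent chunks so sentences that straddle
    a boundary appear in both. Assumes overlap < size."""
    if not tokens:
        return []
    step = size - overlap  # how far the window advances each chunk
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

words = "retrieval augmented generation grounds answers in retrieved text".split()
pieces = chunk_tokens(words, size=4, overlap=1)
# Each piece shares its first token with the previous piece's last token.
```

Note the trade-off in code form: a larger `overlap` duplicates more text in the index (more storage, more embedding calls) in exchange for fewer context breaks at chunk boundaries.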

How many chunks should I retrieve (top-k)? Start with k=3 to k=5. More chunks provide more context but consume more tokens and can dilute the answer if irrelevant chunks are included. Use a relevance score threshold to drop chunks below a minimum similarity score.
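Combining top-k with a score threshold is a small post-processing step over whatever (chunk, similarity) pairs your vector store returns. A sketch, with hypothetical parameter names:

```python
def filter_hits(hits: list[tuple[str, float]], k: int = 5, min_score: float = 0.7) -> list[str]:
    """Keep at most k retrieved chunks, dropping any whose similarity
    score falls below min_score. Assumes higher score = more similar
    (true for cosine similarity; invert for distance metrics)."""
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    return [chunk for chunk, score in ranked[:k] if score >= min_score]

hits = [("billing policy", 0.91), ("office hours", 0.42), ("refund rules", 0.83)]
relevant = filter_hits(hits, k=2, min_score=0.7)
```

Returning fewer than k chunks when nothing clears the threshold is deliberate: an empty context that triggers an "I don't know" is better than padding the prompt with noise.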

Can RAG hallucinate? Yes. The LLM generates its answer from retrieved text, but it can still misread, mis-summarize, or extrapolate beyond what the retrieved chunks actually say. Mitigate this by instructing the model to cite sources and answer only from the provided context.
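Both mitigations, citing sources and answering only from context, come down to how you assemble the prompt. A sketch of one such template (the wording and helper name are illustrative, not a standard):

```python
def grounded_prompt(question: str, chunks: list[str]) -> str:
    """Build a prompt that numbers the retrieved chunks as sources,
    asks for [n]-style citations, and gives the model an explicit
    out when the sources do not contain the answer."""
    sources = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using ONLY the sources below. Cite each claim as [n]. "
        "If the sources do not contain the answer, say you do not know.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )

prompt = grounded_prompt(
    "What is the refund policy?",
    ["Refunds require a receipt.", "Refunds are issued to the original card."],
)
```

Numbered sources also make answers auditable: a reader can check each [n] citation against the chunk it points to.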

Does RAG work with agents? Perfectly. RAG is typically implemented as one tool in an agent's tool set — the agent calls a search_knowledge_base function and receives retrieved chunks as a tool result. This lets the agent decide when retrieval is needed rather than retrieving on every turn. See AI Agents Explained for the agent loop pattern.
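As a sketch, exposing retrieval as an agent tool usually means two pieces: a tool definition the model sees, and a dispatcher the agent loop calls when the model requests the tool. The JSON-schema shape below follows the style most tool-use APIs accept, but the exact envelope varies by provider, and `fake_retrieve` is a hypothetical stand-in for a real vector-store query.

```python
# Tool definition shown to the model (schema style varies by provider).
search_tool = {
    "name": "search_knowledge_base",
    "description": "Search internal docs and return the most relevant chunks.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Natural-language search query."},
            "top_k": {"type": "integer", "default": 5},
        },
        "required": ["query"],
    },
}

def fake_retrieve(query: str, k: int) -> list[str]:
    """Stand-in for the real vector-store query."""
    return [f"chunk about {query}"][:k]

def handle_tool_call(name: str, args: dict) -> str:
    """Agent-side dispatcher: run retrieval only when the model asks
    for it, and return the chunks as the tool result string."""
    if name == "search_knowledge_base":
        return "\n".join(fake_retrieve(args["query"], args.get("top_k", 5)))
    raise ValueError(f"unknown tool: {name}")
```

Because retrieval is now a tool the model invokes, the agent can skip it for small talk and call it multiple times with refined queries for hard questions.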


Further Reading

If you are new to language models, read Understanding Large Language Models first — RAG only makes sense once you understand context windows and token limits. To implement RAG as part of a larger system, Building AI-Powered Applications covers the API patterns (tool use, streaming) you will use to wire retrieval into a production pipeline. For automating multi-step research tasks, AI Agents Explained shows how RAG fits as a tool within an agent loop.