I'll be honest—RAG (Retrieval-Augmented Generation) sounds simple on paper. Grab relevant documents, feed them to an LLM, get better answers. Easy, right? Well, after building a dozen or so production RAG systems, I can tell you there's a lot more nuance than the tutorials let on. Here's what actually works.
Why RAG Matters (and Why It's Harder Than You Think)
The promise of RAG is compelling: give your LLM access to your private data without retraining or fine-tuning. Point it at your documentation, policies, research papers—whatever—and suddenly ChatGPT knows about your specific business context. In practice though, getting this to work well requires careful attention to details that aren't obvious upfront.
I've seen plenty of teams rush into RAG thinking it'll solve all their problems, only to end up with a system that returns irrelevant documents 40% of the time. The fundamentals matter here more than in almost any other LLM application.
The Core Components (What You're Actually Building)
Every RAG system has the same basic architecture, though implementations vary wildly:
The RAG Pipeline
- Document Ingestion → Load and preprocess your source documents
- Chunking → Split documents into manageable pieces
- Embedding → Convert text chunks to vector representations
- Storage → Index vectors in a database (Pinecone, Weaviate, etc.)
- Retrieval → Find relevant chunks based on user query
- Generation → Feed retrieved context to LLM for final answer
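Strung together, the whole loop is surprisingly little code. Here's a minimal, illustrative sketch; every helper in it (`load_documents`, `chunk`, `embed`, the `vector_db` client, and `generate_answer`) is a hypothetical stand-in for whatever you actually use:

```python
# Minimal end-to-end RAG sketch with hypothetical helpers
def build_index(paths):
    for document in load_documents(paths):       # 1. ingestion
        for piece in chunk(document):            # 2. chunking
            vector = embed(piece.text)           # 3. embedding
            vector_db.upsert(vector, piece)      # 4. storage

def answer(question, top_k=5):
    hits = vector_db.search(embed(question), limit=top_k)   # 5. retrieval
    context = "\n\n".join(hit.text for hit in hits)
    return generate_answer(question, context)                # 6. generation
```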
Sounds straightforward. But each of these steps has gotchas that'll bite you in production if you're not careful.
Document Chunking: The Foundation That Everyone Gets Wrong
This is where most RAG systems fail. Chunking determines what information the LLM sees, and bad chunking means even the best retrieval won't save you.
Chunk Size: The Goldilocks Problem
Too small (128-256 tokens) and you lose context. A chunk might say "the policy changed in 2024" but not mention which policy or what changed. Too large (2000+ tokens) and you're stuffing irrelevant information into your prompt, wasting tokens and confusing the model.
After extensive testing, we've landed on these guidelines:
- General documents: 400-600 tokens with 50-100 token overlap
- Technical documentation: 600-800 tokens (needs more context)
- FAQ/Q&A content: 200-300 tokens (more focused chunks work better)
- Legal/contracts: 800-1000 tokens (can't break context mid-clause)
That overlap is crucial, by the way. Without it, important information that spans chunk boundaries gets lost. I've seen this cause weird edge cases where the system can't answer questions that bridge two chunks.
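To make the overlap concrete, here's a minimal token-window splitter. It's a sketch of the baseline approach, using tiktoken only for token counting; the numbers match the guidelines above:

```python
import tiktoken

def sliding_window_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size token windows, overlapping neighbors by `overlap` tokens."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(encoding.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```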
Smart Chunking Strategies
Don't just split on character count. Use structure when you have it:
```python
# Better chunking respects document structure
chunks = []

# For markdown documents: walk section by section so headings stay intact
for section in document.sections:
    if len(section.text) < 600:  # rough size check; swap in a token count for real use
        pieces = [section.text]  # Keep small sections whole
    else:
        # Split larger sections at paragraph boundaries
        pieces = split_on_paragraphs(section.text, target=500)

    # Attach metadata while we still know which section each chunk came from
    for text in pieces:
        chunks.append({
            "text": text,
            "metadata": {
                "source": document.title,
                "section": section.heading,
                "page": get_page_number(text),
            },
        })
```

The metadata is super important. When you retrieve chunks later, you want to know where they came from. Plus, you can use metadata for filtering (only search within certain document types, dates, etc.).
Choosing Your Embedding Model
This is less critical than chunking, but still matters. The embedding model converts your text into vectors that capture semantic meaning.
| Model | Dimensions | Cost | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02/1M tokens | General purpose, best value |
| OpenAI text-embedding-3-large | 3072 | $0.13/1M tokens | Higher accuracy needs |
| Cohere embed-english-v3 | 1024 | $0.10/1M tokens | Strong semantic search |
| sentence-transformers | 384-768 | Free (self-hosted) | Budget constraints |
Honestly? For most use cases, OpenAI's text-embedding-3-small is perfectly fine. It's cheap, fast, and performs well. We only reach for the larger models when clients have very specific accuracy requirements or unusual domain language.
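For reference, generating embeddings with the OpenAI Python client looks roughly like this (batching chunks into one call keeps it fast and cheap):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Embed a batch of chunk texts with text-embedding-3-small."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]
```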
One thing to watch out for: whichever embedding model you choose, you're stuck with it unless you want to re-embed everything. Switching models means rebuilding your entire vector database. Plan accordingly.
Retrieval Strategies That Actually Work
Simple similarity search works okay, but you can do way better with a few tweaks.
Hybrid Search: Best of Both Worlds
Pure vector search misses exact matches sometimes. If someone asks "What's the policy on remote work?" and your document says "remote work policy," keyword search will catch that instantly. But vector search handles "What's our WFH policy?" better because it understands the semantic similarity.
Combine both:
```python
# Hybrid search approach
def retrieve_chunks(query, top_k=5):
    # Get semantic matches
    vector_results = vector_db.search(
        embedding=embed(query),
        limit=top_k * 2  # Get more than we need
    )

    # Get keyword matches
    keyword_results = bm25_search(
        query=query,
        limit=top_k * 2
    )

    # Combine and rerank
    combined = merge_and_rerank(
        vector_results,
        keyword_results,
        weights=[0.7, 0.3]  # Favor semantic
    )
    return combined[:top_k]
```

The exact weights depend on your use case, but starting with 70% vector / 30% keyword works well for most applications. Adjust based on what you see in production.
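The `merge_and_rerank` helper is where the real decision happens. Raw BM25 and cosine scores aren't on the same scale, so a simple and robust option is weighted reciprocal rank fusion, which combines the two lists by rank rather than by score. A sketch, assuming each result exposes a stable `id`:

```python
def merge_and_rerank(vector_results, keyword_results, weights=(0.7, 0.3), k=60):
    """Weighted reciprocal rank fusion over two ranked result lists."""
    scores, by_id = {}, {}
    for weight, results in zip(weights, (vector_results, keyword_results)):
        for rank, result in enumerate(results):
            by_id[result.id] = result
            scores[result.id] = scores.get(result.id, 0.0) + weight / (k + rank + 1)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [by_id[result_id] for result_id in ranked_ids]
```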
Metadata Filtering: The Underrated Feature
This is huge and often overlooked. Let's say you have documents from different departments, time periods, or document types. You don't want to search everything every time.
```python
# Filter by metadata before searching
results = vector_db.search(
    embedding=embed(query),
    filter={
        "department": "Engineering",
        "date": {"$gte": "2024-01-01"},
        "document_type": {"$in": ["policy", "guideline"]}
    },
    limit=5
)
```

This makes results way more relevant and speeds up queries. Plus, it helps with data governance—you can enforce who sees what at the retrieval layer.
The LLM Generation Step: Prompt Engineering for RAG
You've retrieved relevant chunks. Now what? The prompt you feed to the LLM determines whether users get good answers or hallucinated nonsense.
A Production-Ready RAG Prompt Template
```python
system_prompt = """You are a helpful assistant that answers questions
based on the provided context. Follow these rules strictly:

1. Only use information from the context below to answer questions
2. If the context doesn't contain the answer, say "I don't have enough
   information to answer that" - do NOT make up information
3. Cite which document each piece of information comes from
4. If the context is contradictory, mention both perspectives

Context:
{retrieved_chunks}

Remember: Only answer based on the context above."""

user_prompt = """Question: {user_question}

Please provide a comprehensive answer based on the context provided."""
```

A few things I've learned the hard way about RAG prompts:
- Be explicit about not hallucinating. LLMs want to be helpful and will make stuff up if you let them
- Include source citation. Users need to verify answers, especially in regulated industries
- Handle contradictions gracefully. Your document set might have conflicting information
- Format matters. Clearly delineate where context ends and the question begins
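On that last point, it also pays to render `{retrieved_chunks}` with numbered sources so the model can cite them. A small formatting helper, assuming chunks carry the metadata attached during ingestion:

```python
def format_context(chunks) -> str:
    """Render retrieved chunks with their sources so the model can cite them."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        source = chunk["metadata"].get("source", "unknown")
        section = chunk["metadata"].get("section", "")
        blocks.append(f"[{i}] {source} - {section}\n{chunk['text']}")
    return "\n\n".join(blocks)
```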
Common RAG Pitfalls (and How to Avoid Them)
1. The "Lost in the Middle" Problem
LLMs pay more attention to information at the start and end of the context. Stuff in the middle gets ignored. If you're feeding 10 chunks to the model, consider reranking them so the most relevant appears first and last.
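In practice that just means reordering the ranked list so the strongest chunks sit at the edges of the context and the weakest end up in the middle. A sketch:

```python
def reorder_for_attention(chunks_by_relevance):
    """Alternate ranked chunks to the front and back so the best ones
    land at the start and end of the prompt, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```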
2. Chunk Retrieval Without Context
You retrieve a chunk that says "This replaces the previous policy." Great, but which policy? The original chunk didn't include that context.
Solutions:
- Include surrounding chunks in the retrieval (grab chunk N-1 and N+1; see the sketch after this list)
- Add document/section headers to each chunk
- Use larger chunk sizes for documents where context matters
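Grabbing the neighbors is straightforward if each chunk's metadata records its position within its source document. A sketch, assuming a `chunk_index` that maps each document to its ordered chunks and a hypothetical `position` field:

```python
def expand_with_neighbors(hit, chunk_index, window=1):
    """Return the retrieved chunk together with its neighbors from the same document."""
    doc_chunks = chunk_index[hit["metadata"]["source"]]
    position = hit["metadata"]["position"]  # chunk's ordinal within its document (assumed field)
    start = max(0, position - window)
    return doc_chunks[start:position + window + 1]
```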
3. Outdated Information
Your vector database doesn't automatically update when source documents change. Build a refresh pipeline and version your chunks so you can track what's current.
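A content hash per document is usually enough to drive that refresh pipeline: re-chunk and re-embed only what actually changed. A minimal sketch:

```python
import hashlib

def needs_reindex(document, stored_hashes: dict[str, str]) -> bool:
    """Re-embed a document only when its content hash has changed since the last run."""
    digest = hashlib.sha256(document.text.encode("utf-8")).hexdigest()
    if stored_hashes.get(document.id) == digest:
        return False
    stored_hashes[document.id] = digest
    return True
```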
4. Query-Document Mismatch
Sometimes user questions and document phrasing are too different for good retrieval. "What's our PTO policy?" might not match a document titled "Paid Time Off Guidelines."
Solution: Query expansion. Rephrase the user's question multiple ways and search with all variations:
```python
original_query = "What's our PTO policy?"

# Use an LLM to generate paraphrased variations of the question,
# e.g. "paid time off policy", "vacation and sick leave guidelines",
# "time off request procedures"
variations = llm.generate(
    f"Rewrite this question three different ways: {original_query}"
)

# Search with the original plus all variations and combine results
all_results = []
for query in [original_query, *variations]:
    all_results.extend(vector_search(query))

top_chunks = rerank(all_results)[:5]
```

Evaluation: How to Know If Your RAG System Works
You can't improve what you don't measure. Track these metrics:
Key RAG Metrics
Retrieval Metrics:
- Recall@K: Of all relevant chunks, what % did you retrieve? (aim for 90%+)
- Precision@K: Of retrieved chunks, what % were actually relevant? (aim for 70%+)
- MRR (Mean Reciprocal Rank): How quickly do relevant results appear? (higher = better)
Generation Metrics:
- Answer Relevance: Does the answer actually address the question?
- Faithfulness: Is the answer grounded in the retrieved context?
- User Satisfaction: Thumbs up/down on answers (most important)
Build a test set of question-answer pairs from your domain. Run your RAG system against it regularly. When you change chunk size, embedding model, or retrieval strategy, you'll see immediately if things got better or worse.
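A small evaluation harness over that test set is enough to catch regressions. A sketch that computes Recall@K and MRR, assuming each test case lists the chunk IDs a good retrieval should return:

```python
def evaluate_retrieval(test_set, retrieve, k=5):
    """test_set: list of (question, relevant_chunk_ids) pairs.
    retrieve: function mapping a question to a ranked list of chunk IDs."""
    recalls, reciprocal_ranks = [], []
    for question, relevant_ids in test_set:
        retrieved = retrieve(question)[:k]
        hits = set(retrieved) & set(relevant_ids)
        recalls.append(len(hits) / len(relevant_ids))
        first_hit = next((i + 1 for i, cid in enumerate(retrieved) if cid in relevant_ids), None)
        reciprocal_ranks.append(1.0 / first_hit if first_hit else 0.0)
    return {
        "recall@k": sum(recalls) / len(recalls),
        "mrr": sum(reciprocal_ranks) / len(reciprocal_ranks),
    }
```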
Advanced Techniques (When Basic RAG Isn't Enough)
Re-ranking with Cross-Encoders
After initial retrieval, run chunks through a cross-encoder that scores query-chunk pairs. This is slower but much more accurate than pure vector search. Use it as a second stage:
- Vector search retrieves 50 candidates (fast)
- Cross-encoder reranks top 50 down to top 5 (slower, more accurate)
- Feed top 5 to LLM
This two-stage approach gives you 90% of the accuracy benefit at a fraction of the computational cost.
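With the sentence-transformers library, the second stage is only a few lines. A sketch using a widely used public reranking model:

```python
from sentence_transformers import CrossEncoder

# A commonly used open reranking model; swap in whatever suits your domain
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_candidates(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, chunk) pair jointly and keep the top_k highest-scoring chunks."""
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]
```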
Self-Querying: Let the LLM Handle Metadata
Instead of manually parsing metadata filters, let the LLM generate them:
```python
user_query = "What changed in our remote work policy after 2024?"

# LLM turns the question into a semantic query plus structured filters
structured_query = llm.extract(user_query)
# Expected output:
# {
#     "semantic_query": "remote work policy changes",
#     "filters": {
#         "date": {"$gt": "2024-01-01"},
#         "document_type": "policy"
#     }
# }

# Now you can search with both semantic and structured components
results = hybrid_search(structured_query)
```

This is particularly powerful when users ask complex questions with multiple constraints.
Iterative Retrieval (Multi-Hop RAG)
Sometimes the first retrieval isn't enough. The LLM might need to ask follow-up questions to get complete information. This is where agentic RAG comes in—letting the LLM decide what to retrieve and when.
Frameworks like LangChain make this easier, though it adds complexity and latency. Use it sparingly, for cases where a single retrieval genuinely isn't sufficient.
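Stripped of any framework, the control loop is simple enough to own yourself. A sketch with a hypothetical `llm.decide` call that either answers or asks for another search:

```python
def multi_hop_answer(question, retrieve, llm, max_hops=3):
    """Let the model request additional retrievals until it can answer (or we hit the hop limit)."""
    context, query = [], question
    for _ in range(max_hops):
        context.extend(retrieve(query))
        # decision is assumed to be a dict: either {"answer": ...} or {"search": "follow-up query"}
        decision = llm.decide(question=question, context=context)
        if "answer" in decision:
            return decision["answer"]
        query = decision["search"]
    return llm.answer(question=question, context=context)
```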
Production Deployment Considerations
Latency Budgets
RAG is inherently slower than pure LLM calls. You're adding:
- Query embedding: ~50-100ms
- Vector search: ~100-500ms (depends on database and index size)
- LLM generation: 2-5 seconds
Total latency is typically 3-6 seconds for a single query. If that's too slow, consider:
- Caching frequent queries (see the sketch after this list)
- Using faster embedding models
- Streaming responses so users see progress
- Pre-computing embeddings for common questions
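Caching is the cheapest win of the four. Even an in-memory dictionary keyed on the normalized question goes a long way before you reach for Redis or semantic caching. A minimal sketch:

```python
import hashlib

answer_cache: dict[str, str] = {}  # swap for Redis or similar in production

def cached_answer(query: str, rag_pipeline) -> str:
    """Serve repeated questions from the cache instead of re-running retrieval and generation."""
    key = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
    if key not in answer_cache:
        answer_cache[key] = rag_pipeline(query)
    return answer_cache[key]
```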
Cost Management
RAG costs add up fast at scale:
- Embedding costs: Every query needs embedding, every document needs chunking and embedding
- Vector database costs: Monthly fees based on # of vectors and queries
- LLM costs: Now you're using way more tokens per query (all that context)
For a system doing 10K queries/day with 5 retrieved chunks averaging 500 tokens each:
- Embedding: ~$0.60/day
- Vector DB: ~$50-200/month (varies by provider)
- LLM (GPT-4): ~$150/day just for the context tokens
This is where Claude 3.5 Sonnet's better pricing shines. Same quality, 70% cost reduction on the LLM side.
Real-World Example: What Good RAG Looks Like
We built a RAG system for a healthcare client's internal policy documentation. 15,000 pages of guidelines, procedures, and regulations. Here's what worked:
- Chunk size: 600 tokens with 100 token overlap
- Embedding: OpenAI text-embedding-3-small
- Vector DB: Pinecone (managed service, less ops overhead)
- Retrieval: Hybrid search with metadata filtering by department and document date
- LLM: Claude 3.5 Sonnet (lower hallucination rate critical for healthcare)
- Re-ranking: Cross-encoder as second stage
Results after 3 months in production:
- 92% retrieval recall on test set
- 87% user satisfaction (thumbs up/down)
- Average response time: 4.2 seconds
- Cost: ~$0.08 per query (acceptable for their use case)
The key was iterative improvement. We started simple, measured everything, and added complexity only where it moved the metrics.
Wrapping Up: Start Simple, Iterate Based on Data
Don't try to build the perfect RAG system from day one. Start with:
- Basic chunking (500 tokens, 50 overlap)
- Simple vector search
- Standard prompt template
- Measurement framework
Get that working, then optimize based on what the metrics tell you. Maybe you need better chunking. Maybe hybrid search. Maybe re-ranking. But you won't know until you have data.
RAG is incredibly powerful when done right. It lets you build LLM applications that are grounded in truth, specific to your domain, and updateable without retraining. Just respect the complexity and build iteratively.
Need Help Building Production RAG?
We've built RAG systems for healthcare, finance, and manufacturing clients. Happy to discuss your specific requirements and help you avoid the common pitfalls.
Let's Talk

About the Author: Glenn Anderson has implemented RAG systems across multiple industries, with hands-on experience in document chunking strategies, vector database optimization, and production deployment patterns.