I'll be honest—RAG (Retrieval-Augmented Generation) sounds simple on paper. Grab relevant documents, feed them to an LLM, get better answers. Easy, right? Well, after building a dozen or so production RAG systems, I can tell you there's a lot more nuance than the tutorials let on. Here's what actually works.
Why RAG Matters (and Why It's Harder Than You Think)
The promise of RAG is compelling: give your LLM access to your private data without retraining or fine-tuning. Point it at your documentation, policies, research papers—whatever—and suddenly ChatGPT knows about your specific business context. In practice though, getting this to work well requires careful attention to details that aren't obvious upfront.
I've seen plenty of teams rush into RAG thinking it'll solve all their problems, only to end up with a system that returns irrelevant documents 40% of the time. The fundamentals matter here more than in almost any other LLM application.
The Core Components (What You're Actually Building)
Every RAG system has the same basic architecture, though implementations vary wildly:
The RAG Pipeline
- Document Ingestion → Load and preprocess your source documents
- Chunking → Split documents into manageable pieces
- Embedding → Convert text chunks to vector representations
- Storage → Index vectors in a database (Pinecone, Weaviate, etc.)
- Retrieval → Find relevant chunks based on user query
- Generation → Feed retrieved context to LLM for final answer
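Strung together, the whole loop is surprisingly little code. Here's a minimal, illustrative sketch; every helper in it (`load_documents`, `chunk`, `embed`, the `vector_db` client, and `generate_answer`) is a hypothetical stand-in for whatever you actually use:

```python
# Minimal end-to-end RAG sketch with hypothetical helpers
def build_index(paths):
    for document in load_documents(paths):       # 1. ingestion
        for piece in chunk(document):            # 2. chunking
            vector = embed(piece.text)           # 3. embedding
            vector_db.upsert(vector, piece)      # 4. storage

def answer(question, top_k=5):
    hits = vector_db.search(embed(question), limit=top_k)   # 5. retrieval
    context = "\n\n".join(hit.text for hit in hits)
    return generate_answer(question, context)                # 6. generation
```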
Sounds straightforward. But each of these steps has gotchas that'll bite you in production if you're not careful.
Document Chunking: The Foundation That Everyone Gets Wrong
This is where most RAG systems fail. Chunking determines what information the LLM sees, and bad chunking means even the best retrieval won't save you.
Chunk Size: The Goldilocks Problem
Too small (128-256 tokens) and you lose context. A chunk might say "the policy changed in 2024" but not mention which policy or what changed. Too large (2000+ tokens) and you're stuffing irrelevant information into your prompt, wasting tokens and confusing the model.
After extensive testing, we've landed on these guidelines:
- General documents: 400-600 tokens with 50-100 token overlap
- Technical documentation: 600-800 tokens (needs more context)
- FAQ/Q&A content: 200-300 tokens (more focused chunks work better)
- Legal/contracts: 800-1000 tokens (can't break context mid-clause)
That overlap is crucial, by the way. Without it, important information that spans chunk boundaries gets lost. I've seen this cause weird edge cases where the system can't answer questions that bridge two chunks.
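To make the overlap concrete, here's a minimal token-window splitter. It's a sketch of the baseline approach, using tiktoken only for token counting; the numbers match the guidelines above:

```python
import tiktoken

def sliding_window_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size token windows, overlapping neighbors by `overlap` tokens."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(encoding.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```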
Smart Chunking Strategies
Don't just split on character count. Use structure when you have it:
```python
# Better chunking respects document structure
chunks = []

# For markdown documents: walk section by section so headings stay intact
for section in document.sections:
    if len(section.text) < 600:  # rough size check; swap in a token count for real use
        pieces = [section.text]  # Keep small sections whole
    else:
        # Split larger sections at paragraph boundaries
        pieces = split_on_paragraphs(section.text, target=500)

    # Attach metadata while we still know which section each chunk came from
    for text in pieces:
        chunks.append({
            "text": text,
            "metadata": {
                "source": document.title,
                "section": section.heading,
                "page": get_page_number(text),
            },
        })
```

The metadata is super important. When you retrieve chunks later, you want to know where they came from. Plus, you can use metadata for filtering (only search within certain document types, dates, etc.).
Choosing Your Embedding Model
This is less critical than chunking, but still matters. The embedding model converts your text into vectors that capture semantic meaning.
| Model | Dimensions | Cost | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02/1M tokens | General purpose, best value |
| OpenAI text-embedding-3-large | 3072 | $0.13/1M tokens | Higher accuracy needs |
| Cohere embed-english-v3 | 1024 | $0.10/1M tokens | Strong semantic search |
| sentence-transformers | 384-768 | Free (self-hosted) | Budget constraints |
Honestly? For most use cases, OpenAI's text-embedding-3-small is perfectly fine. It's cheap, fast, and performs well. We only reach for the larger models when clients have very specific accuracy requirements or unusual domain language.
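For reference, generating embeddings with the OpenAI Python client looks roughly like this (batching chunks into one call keeps it fast and cheap):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Embed a batch of chunk texts with text-embedding-3-small."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]
```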
One thing to watch out for: whichever embedding model you choose, you're stuck with it unless you want to re-embed everything. Switching models means rebuilding your entire vector database. Plan accordingly.
Retrieval Strategies That Actually Work
Simple similarity search works okay, but you can do way better with a few tweaks.
Hybrid Search: Best of Both Worlds
Pure vector search misses exact matches sometimes. If someone asks "What's the policy on remote work?" and your document says "remote work policy," keyword search will catch that instantly. But vector search handles "What's our WFH policy?" better because it understands the semantic similarity.
Combine both:
```python
# Hybrid search approach
def retrieve_chunks(query, top_k=5):
    # Get semantic matches
    vector_results = vector_db.search(
        embedding=embed(query),
        limit=top_k * 2  # Get more than we need
    )

    # Get keyword matches
    keyword_results = bm25_search(
        query=query,
        limit=top_k * 2
    )

    # Combine and rerank
    combined = merge_and_rerank(
        vector_results,
        keyword_results,
        weights=[0.7, 0.3]  # Favor semantic
    )
    return combined[:top_k]
```

The exact weights depend on your use case, but starting with 70% vector / 30% keyword works well for most applications. Adjust based on what you see in production.
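The `merge_and_rerank` helper is where the real decision happens. Raw BM25 and cosine scores aren't on the same scale, so a simple and robust option is weighted reciprocal rank fusion, which combines the two lists by rank rather than by score. A sketch, assuming each result exposes a stable `id`:

```python
def merge_and_rerank(vector_results, keyword_results, weights=(0.7, 0.3), k=60):
    """Weighted reciprocal rank fusion over two ranked result lists."""
    scores, by_id = {}, {}
    for weight, results in zip(weights, (vector_results, keyword_results)):
        for rank, result in enumerate(results):
            by_id[result.id] = result
            scores[result.id] = scores.get(result.id, 0.0) + weight / (k + rank + 1)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [by_id[result_id] for result_id in ranked_ids]
```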
Metadata Filtering: The Underrated Feature
This is huge and often overlooked. Let's say you have documents from different departments, time periods, or document types. You don't want to search everything every time.
```python
# Filter by metadata before searching
results = vector_db.search(
    embedding=embed(query),
    filter={
        "department": "Engineering",
        "date": {"$gte": "2024-01-01"},
        "document_type": {"$in": ["policy", "guideline"]}
    },
    limit=5
)
```

This makes results way more relevant and speeds up queries. Plus, it helps with data governance—you can enforce who sees what at the retrieval layer.
The LLM Generation Step: Prompt Engineering for RAG
You've retrieved relevant chunks. Now what? The prompt you feed to the LLM determines whether users get good answers or hallucinated nonsense.
A Production-Ready RAG Prompt Template
```python
system_prompt = """You are a helpful assistant that answers questions
based on the provided context. Follow these rules strictly:

1. Only use information from the context below to answer questions
2. If the context doesn't contain the answer, say "I don't have enough
   information to answer that" - do NOT make up information
3. Cite which document each piece of information comes from
4. If the context is contradictory, mention both perspectives

Context:
{retrieved_chunks}

Remember: Only answer based on the context above."""

user_prompt = """Question: {user_question}

Please provide a comprehensive answer based on the context provided."""
```

A few things I've learned the hard way about RAG prompts:
- Be explicit about not hallucinating. LLMs want to be helpful and will make stuff up if you let them
- Include source citation. Users need to verify answers, especially in regulated industries
- Handle contradictions gracefully. Your document set might have conflicting information
- Format matters. Clearly delineate where context ends and the question begins
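On that last point, it also pays to render `{retrieved_chunks}` with numbered sources so the model can cite them. A small formatting helper, assuming chunks carry the metadata attached during ingestion:

```python
def format_context(chunks) -> str:
    """Render retrieved chunks with their sources so the model can cite them."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        source = chunk["metadata"].get("source", "unknown")
        section = chunk["metadata"].get("section", "")
        blocks.append(f"[{i}] {source} - {section}\n{chunk['text']}")
    return "\n\n".join(blocks)
```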
Common RAG Pitfalls (and How to Avoid Them)
1. The "Lost in the Middle" Problem
LLMs pay more attention to information at the start and end of the context. Stuff in the middle gets ignored. If you're feeding 10 chunks to the model, consider reranking them so the most relevant appears first and last.
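In practice that just means reordering the ranked list so the strongest chunks sit at the edges of the context and the weakest end up in the middle. A sketch:

```python
def reorder_for_attention(chunks_by_relevance):
    """Alternate ranked chunks to the front and back so the best ones
    land at the start and end of the prompt, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```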
2. Chunk Retrieval Without Context
You retrieve a chunk that says "This replaces the previous policy." Great, but which policy? The original chunk didn't include that context.
Solutions:
- Include surrounding chunks in the retrieval (grab chunk N-1 and N+1; see the sketch after this list)
- Add document/section headers to each chunk
- Use larger chunk sizes for documents where context matters
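Grabbing the neighbors is straightforward if each chunk's metadata records its position within its source document. A sketch, assuming a `chunk_index` that maps each document to its ordered chunks and a hypothetical `position` field:

```python
def expand_with_neighbors(hit, chunk_index, window=1):
    """Return the retrieved chunk together with its neighbors from the same document."""
    doc_chunks = chunk_index[hit["metadata"]["source"]]
    position = hit["metadata"]["position"]  # chunk's ordinal within its document (assumed field)
    start = max(0, position - window)
    return doc_chunks[start:position + window + 1]
```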
3. Outdated Information
Your vector database doesn't automatically update when source documents change. Build a refresh pipeline and version your chunks so you can track what's current.
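A content hash per document is usually enough to drive that refresh pipeline: re-chunk and re-embed only what actually changed. A minimal sketch:

```python
import hashlib

def needs_reindex(document, stored_hashes: dict[str, str]) -> bool:
    """Re-embed a document only when its content hash has changed since the last run."""
    digest = hashlib.sha256(document.text.encode("utf-8")).hexdigest()
    if stored_hashes.get(document.id) == digest:
        return False
    stored_hashes[document.id] = digest
    return True
```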
4. Query-Document Mismatch
Sometimes user questions and document phrasing are too different for good retrieval. "What's our PTO policy?" might not match a document titled "Paid Time Off Guidelines."
Solution: Query expansion. Rephrase the user's question multiple ways and search with all variations:
```python
original_query = "What's our PTO policy?"

# Use an LLM to generate paraphrased variations of the question,
# e.g. "paid time off policy", "vacation and sick leave guidelines",
# "time off request procedures"
variations = llm.generate(
    f"Rewrite this question three different ways: {original_query}"
)

# Search with the original plus all variations and combine results
all_results = []
for query in [original_query, *variations]:
    all_results.extend(vector_search(query))

top_chunks = rerank(all_results)[:5]
```

Evaluation: How to Know If Your RAG System Works
You can't improve what you don't measure. Track these metrics:
Key RAG Metrics
Retrieval Metrics:
- Recall@K: Of all relevant chunks, what % did you retrieve? (aim for 90%+)
- Precision@K: Of retrieved chunks, what % were actually relevant? (aim for 70%+)
- MRR (Mean Reciprocal Rank): How quickly do relevant results appear? (higher = better)
Generation Metrics:
- Answer Relevance: Does the answer actually address the question?
- Faithfulness: Is the answer grounded in the retrieved context?
- User Satisfaction: Thumbs up/down on answers (most important)
Build a test set of question-answer pairs from your domain. Run your RAG system against it regularly. When you change chunk size, embedding model, or retrieval strategy, you'll see immediately if things got better or worse.
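A small evaluation harness over that test set is enough to catch regressions. A sketch that computes Recall@K and MRR, assuming each test case lists the chunk IDs a good retrieval should return:

```python
def evaluate_retrieval(test_set, retrieve, k=5):
    """test_set: list of (question, relevant_chunk_ids) pairs.
    retrieve: function mapping a question to a ranked list of chunk IDs."""
    recalls, reciprocal_ranks = [], []
    for question, relevant_ids in test_set:
        retrieved = retrieve(question)[:k]
        hits = set(retrieved) & set(relevant_ids)
        recalls.append(len(hits) / len(relevant_ids))
        first_hit = next((i + 1 for i, cid in enumerate(retrieved) if cid in relevant_ids), None)
        reciprocal_ranks.append(1.0 / first_hit if first_hit else 0.0)
    return {
        "recall@k": sum(recalls) / len(recalls),
        "mrr": sum(reciprocal_ranks) / len(reciprocal_ranks),
    }
```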
Advanced Techniques (When Basic RAG Isn't Enough)
Re-ranking with Cross-Encoders
After initial retrieval, run chunks through a cross-encoder that scores query-chunk pairs. This is slower but much more accurate than pure vector search. Use it as a second stage:
- Vector search retrieves 50 candidates (fast)
- Cross-encoder reranks top 50 down to top 5 (slower, more accurate)
- Feed top 5 to LLM
This two-stage approach gives you 90% of the accuracy benefit at a fraction of the computational cost.
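With the sentence-transformers library, the second stage is only a few lines. A sketch using a widely used public reranking model:

```python
from sentence_transformers import CrossEncoder

# A commonly used open reranking model; swap in whatever suits your domain
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_candidates(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, chunk) pair jointly and keep the top_k highest-scoring chunks."""
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]
```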
Self-Querying: Let the LLM Handle Metadata
Instead of manually parsing metadata filters, let the LLM generate them:
```python
user_query = "What changed in our remote work policy after 2024?"

# LLM turns the question into a semantic query plus structured filters
structured_query = llm.extract(user_query)
# Expected output:
# {
#     "semantic_query": "remote work policy changes",
#     "filters": {
#         "date": {"$gt": "2024-01-01"},
#         "document_type": "policy"
#     }
# }

# Now you can search with both semantic and structured components
results = hybrid_search(structured_query)
```

This is particularly powerful when users ask complex questions with multiple constraints.
Iterative Retrieval (Multi-Hop RAG)
Sometimes the first retrieval isn't enough. The LLM might need to ask follow-up questions to get complete information. This is where agentic RAG comes in—letting the LLM decide what to retrieve and when.
Frameworks like LangChain make this easier, though it adds complexity and latency. Use it sparingly, for cases where a single retrieval genuinely isn't sufficient.
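Stripped of any framework, the control loop is simple enough to own yourself. A sketch with a hypothetical `llm.decide` call that either answers or asks for another search:

```python
def multi_hop_answer(question, retrieve, llm, max_hops=3):
    """Let the model request additional retrievals until it can answer (or we hit the hop limit)."""
    context, query = [], question
    for _ in range(max_hops):
        context.extend(retrieve(query))
        # decision is assumed to be a dict: either {"answer": ...} or {"search": "follow-up query"}
        decision = llm.decide(question=question, context=context)
        if "answer" in decision:
            return decision["answer"]
        query = decision["search"]
    return llm.answer(question=question, context=context)
```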
Production Deployment Considerations
Latency Budgets
RAG is inherently slower than pure LLM calls. You're adding:
- Query embedding: ~50-100ms
- Vector search: ~100-500ms (depends on database and index size)
- LLM generation: 2-5 seconds
Total latency is typically 3-6 seconds for a single query. If that's too slow, consider:
- Caching frequent queries (see the sketch after this list)
- Using faster embedding models
- Streaming responses so users see progress
- Pre-computing embeddings for common questions
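Caching is the cheapest win of the four. Even an in-memory dictionary keyed on the normalized question goes a long way before you reach for Redis or semantic caching. A minimal sketch:

```python
import hashlib

answer_cache: dict[str, str] = {}  # swap for Redis or similar in production

def cached_answer(query: str, rag_pipeline) -> str:
    """Serve repeated questions from the cache instead of re-running retrieval and generation."""
    key = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
    if key not in answer_cache:
        answer_cache[key] = rag_pipeline(query)
    return answer_cache[key]
```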
Cost Management
RAG costs add up fast at scale:
- Embedding costs: Every query needs embedding, every document needs chunking and embedding
- Vector database costs: Monthly fees based on # of vectors and queries
- LLM costs: Now you're using way more tokens per query (all that context)
For a system doing 10K queries/day with 5 retrieved chunks averaging 500 tokens each:
- Embedding: ~$0.60/day
- Vector DB: ~$50-200/month (varies by provider)
- LLM (GPT-4): ~$150/day just for the context tokens
This is where Claude 3.5 Sonnet's better pricing shines. Same quality, 70% cost reduction on the LLM side.
Real-World Example: What Good RAG Looks Like
We built a RAG system for a healthcare client's internal policy documentation. 15,000 pages of guidelines, procedures, and regulations. Here's what worked:
- Chunk size: 600 tokens with 100 token overlap
- Embedding: OpenAI text-embedding-3-small
- Vector DB: Pinecone (managed service, less ops overhead)
- Retrieval: Hybrid search with metadata filtering by department and document date
- LLM: Claude 3.5 Sonnet (lower hallucination rate critical for healthcare)
- Re-ranking: Cross-encoder as second stage
Results after 3 months in production:
- 92% retrieval recall on test set
- 87% user satisfaction (thumbs up/down)
- Average response time: 4.2 seconds
- Cost: ~$0.08 per query (acceptable for their use case)
The key was iterative improvement. We started simple, measured everything, and added complexity only where it moved the metrics.
Wrapping Up: Start Simple, Iterate Based on Data
Don't try to build the perfect RAG system from day one. Start with:
- Basic chunking (500 tokens, 50 overlap)
- Simple vector search
- Standard prompt template
- Measurement framework
Get that working, then optimize based on what the metrics tell you. Maybe you need better chunking. Maybe hybrid search. Maybe re-ranking. But you won't know until you have data.
RAG is incredibly powerful when done right. It lets you build LLM applications that are grounded in truth, specific to your domain, and updateable without retraining. Just respect the complexity and build iteratively.
Need Help Building Production RAG?
We've built RAG systems for healthcare, finance, and manufacturing clients. Happy to discuss your specific requirements and help you avoid the common pitfalls.
Let's Talk

About the Author: Glenn Anderson has implemented RAG systems across multiple industries, with hands-on experience in document chunking strategies, vector database optimization, and production deployment patterns.