
January 31, 2026


RAG That Doesn't Suck: Build a Search That Beats 'Chat With Your Docs'

Most RAG implementations are garbage. This one isn't. A comprehensive tutorial covering semantic chunking, retrieval-optimized embeddings, reranking, citation-grounded generation, and an eval harness—in ~300 lines of code.

Pio Greeff

Founder & Lead Developer



60-Second Quickstart
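
A sketch of the full loop, assuming the pieces described in Parts 1–7 are saved as `rag.py` with a small CLI (the filename and subcommands are illustrative, not a published tool):

```shell
# install the CPU-only stack
pip install numpy faiss-cpu sentence-transformers anthropic

# index a folder of documents, then ask a question
export ANTHROPIC_API_KEY=...
python rag.py index ./docs
python rag.py ask "Can I expense a co-working space membership?"
```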

The output is a short answer with [n] citation markers pointing at specific passages in your documents.

That's it. Cited answers from your docs in under a minute.

Want to understand why it works? Keep reading.


The Problem: Why 90% of RAG Fails

You've seen the demos. Someone spins up LangChain, chunks documents at 500 characters, throws embeddings into Pinecone, and calls it "AI-powered search."

Then users ask real questions. The system returns vaguely related paragraphs. The LLM hallucinates connections. Everyone pretends it works.

Here's what's actually broken:

| Failure | What They Do | What Happens |
|---|---|---|
| Dumb chunking | Fixed 500-char splits | Sentences torn apart. "...financial statements. Chapter 4: Marketing Strategy begins with..." tells the model nothing. |
| Wrong embeddings | Generic sentence-transformers | "Contract terms?" matches "The contract was signed" because both contain "contract." One is useless. |
| No reranking | Trust vector similarity | A tangential mention ranks higher than a direct answer using different vocabulary. |
| top-k too low | Retrieve 3-5 chunks | Miss relevant context scattered across the document. |
| top-k too high | Retrieve 50 chunks | Flood the LLM with noise. It hallucinates to fill gaps. |
| No query rewriting | Pass raw user query | "What about the thing from yesterday's meeting?" retrieves nothing. |
| No metadata filtering | Search everything always | 2019 policy docs contaminate answers about 2024 procedures. |
| Phantom citations | Append "[Source: doc.pdf]" to everything | Users can't verify which passage supports which claim. Decoration, not citation. |
| No eval harness | "Looks good to me" | You have no idea if changes improve or break the system. |

This tutorial fixes all of them.


Before vs After: The Same Query, Two Systems

Document: A 40-page employee handbook.

Query: "Can I expense a co-working space membership?"

❌ Naive RAG (fixed chunks, no reranking)

Retrieved chunks: a passage on professional membership fees, one on the remote home-office stipend, and one encouraging co-working for team collaboration.

Generated answer:

Professional membership fees are reimbursable up to $500 annually [1]. Remote employees also receive a $200 home office stipend [2]. Co-working is encouraged for team collaboration [3].

What's wrong: None of these chunks answer the question. The system found semantically similar text (membership, workspace, co-working) but not the actual policy. The answer sounds plausible but doesn't address the question.


✅ This RAG (semantic chunks + reranking)

Retrieved chunks (after reranking): the handbook's remote-work expense policy (monthly co-working limits, approval, receipts) and the section defining who qualifies as remote or hybrid.

Generated answer:

Yes, you can expense a co-working space membership if you're a remote employee. Full-time remote employees can expense up to $300/month, while part-time remote employees are eligible for up to $150/month [1]. You'll need manager approval and must submit monthly receipts [1]. Note: you must be designated as "remote" or "hybrid" in Workday to qualify—office-based employees aren't eligible [2].

What's different:

  • Semantic chunking kept the full policy section intact
  • Reranking promoted the directly relevant chunk from position 11 → position 1
  • Citations point to specific, verifiable passages

When NOT to Use RAG

RAG isn't always the answer. Sometimes it's expensive overhead for a problem that doesn't need it.

Skip RAG when:

| Situation | Better Alternative |
|---|---|
| Your docs fit in context | Just stuff them in the prompt. Claude handles 200K tokens. If your entire knowledge base is under 100K tokens, RAG adds latency and complexity for no benefit. |
| Questions are predictable | Build a structured FAQ with exact-match lookup. Faster, cheaper, more reliable. |
| You need 100% accuracy | RAG is probabilistic. For compliance-critical answers, use deterministic retrieval (SQL queries, structured data) with RAG as a fallback. |
| Docs change hourly | Re-embedding is expensive. If your corpus churns constantly, consider a keyword-based system (Elasticsearch) with semantic reranking on top. |
| Users ask one-word queries | RAG needs context to work. "Benefits?" is ambiguous. Build a UI that forces specificity or use query expansion. |

Use RAG when:

  • Corpus is too large for context window (500K+ tokens)
  • Questions are unpredictable and varied
  • You need to cite specific sources
  • Documents are semi-structured (policies, contracts, technical docs)
  • Acceptable accuracy is 85-95%, not 100%

Know your constraints before you build.


Architecture Overview

The key insight: vector search is coarse retrieval—it narrows millions of chunks to dozens. Reranking is fine retrieval—it identifies the actually relevant ones. The LLM is synthesis—it reads the winners and writes an answer with verifiable citations.


Part 1: Semantic Chunking That Respects Boundaries

Forget fixed-size chunks. We split on semantic boundaries: paragraph breaks, section headers, and sentence endings. Then we merge small chunks until they hit a target size, preserving context.

Key decisions:

  • Target 512 tokens, max 1024. Small enough for precise retrieval, large enough for context.
  • Section tracking. We store the section header with each chunk—this appears in citations.
  • Sentence-aware splitting. Never cut mid-sentence.
  • Aggressive merging. A 50-token chunk is noise.
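
A condensed sketch of that strategy: split on markdown-style headers and blank lines, track the current section, and greedily merge pieces toward the target size. Whitespace-token counts stand in for a real tokenizer here (an assumption; swap in your model's tokenizer for production):

```python
import re

def semantic_chunks(text, target_size=512, max_size=1024):
    """Split on headers/paragraph breaks, then merge small pieces
    until they reach target_size. Never cuts inside a paragraph."""
    parts = re.split(r"\n(?=#{1,6} )|\n\s*\n", text)
    section = ""
    chunks, buf, buf_len = [], [], 0

    def flush():
        nonlocal buf, buf_len
        if buf:
            chunks.append({"section": section, "text": "\n\n".join(buf)})
            buf, buf_len = [], 0

    for part in parts:
        part = part.strip()
        if not part:
            continue
        first_line = part.splitlines()[0]
        if re.match(r"#{1,6} ", first_line):
            flush()  # a new section never merges into the old one
            section = first_line.lstrip("# ")
        n = len(part.split())  # crude token count (assumption)
        if buf_len + n > max_size:
            flush()
        buf.append(part)
        buf_len += n
        if buf_len >= target_size:  # merged enough context; emit
            flush()
    flush()
    return chunks
```

Each chunk carries its section header, which is what later shows up in citations.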

Part 2: Embeddings Optimized for Retrieval

Not all embedding models are equal. General-purpose models optimize for similarity. Retrieval models optimize for query-document matching.

The instruction prefix matters. Without "Represent this sentence for searching relevant passages:", you lose 5-10% recall. Most tutorials skip this.
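
A sketch of the query-side handling: `as_query` applies the BGE instruction prefix to queries only (documents are embedded without it), and vectors are normalized so a plain inner product gives cosine similarity. The model name is the standard BGE checkpoint; the helper names are this sketch's own:

```python
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def as_query(text: str) -> str:
    """BGE-family models were trained with this instruction on the
    QUERY side only; document chunks get no prefix."""
    return QUERY_PREFIX + text

def load_embedder(name: str = "BAAI/bge-base-en-v1.5"):
    # lazy import keeps the pure helpers usable without the dependency
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)

def embed(model, texts, queries=False):
    """Normalized vectors make inner product == cosine similarity."""
    if queries:
        texts = [as_query(t) for t in texts]
    return model.encode(texts, normalize_embeddings=True)

# model = load_embedder()
# doc_vecs = embed(model, [c["text"] for c in chunks])
# query_vec = embed(model, ["co-working expenses?"], queries=True)[0]
```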


Part 3: Vector Store (Keep It Simple)

You don't need Pinecone. For under 1M chunks, numpy + faiss-cpu is faster to set up and fast enough to query.
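
A minimal store along those lines: pure numpy brute force, which is exact and plenty fast well below 1M chunks. `faiss.IndexFlatIP` is a drop-in upgrade when the matmul gets slow. This assumes normalized embeddings throughout, so inner product is cosine similarity:

```python
import numpy as np

class VectorStore:
    """Exact inner-product search over normalized embeddings."""

    def __init__(self, dim):
        self.vecs = np.empty((0, dim), dtype=np.float32)
        self.meta = []  # one metadata dict (or label) per vector

    def add(self, vecs, metas):
        self.vecs = np.vstack([self.vecs, np.asarray(vecs, dtype=np.float32)])
        self.meta.extend(metas)

    def search(self, query_vec, k=20):
        # brute-force scores against every stored vector
        scores = self.vecs @ np.asarray(query_vec, dtype=np.float32)
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), self.meta[i]) for i in top]
```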


Part 4: Reranking—Where Relevance Actually Happens

Vector search is fuzzy. Reranking tells you which results actually answer the question.

The impact is dramatic. In testing, reranking consistently promotes correct answers from position 8-15 to position 1-3.
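
A sketch of the rerank step. The scorer is passed in as a callable so local and API rerankers stay swappable; with sentence-transformers the callable is `CrossEncoder(...).predict`, which scores (query, passage) pairs:

```python
def rerank(query, candidates, score_fn, keep=5):
    """Re-order vector-search candidates by cross-encoder relevance.

    score_fn takes a list of (query, passage) pairs and returns one
    float per pair; candidates are chunk dicts with a "text" field.
    """
    scores = score_fn([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: -p[0])
    return [c for _, c in ranked[:keep]]

# Typical wiring (model download required):
# from sentence_transformers import CrossEncoder
# ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# top = rerank(query, raw_hits, ce.predict, keep=5)
```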


Part 5: Citation-Grounded Generation

The model must cite specific chunks, and we verify those citations map to real sources.
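
One way to enforce that (prompt wording and helper names are this sketch's own): number each chunk in the context, instruct the model to cite with `[n]` markers, then verify every cited number maps to a real chunk:

```python
import re

SYSTEM = (
    "Answer ONLY from the numbered context chunks below. Cite every "
    "claim with [n] markers matching chunk numbers. If the context is "
    "insufficient, say so instead of guessing."
)

def build_prompt(query, chunks):
    # Include the section header captured during chunking so citations
    # are human-verifiable, not just numbers.
    ctx = "\n\n".join(
        f"[{i}] ({c.get('section', 'unknown')}) {c['text']}"
        for i, c in enumerate(chunks, 1)
    )
    return f"Context:\n{ctx}\n\nQuestion: {query}"

def extract_citations(answer, n_chunks):
    """Split cited numbers into (valid, phantom) lists."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    valid = sorted(c for c in cited if 1 <= c <= n_chunks)
    phantom = sorted(c for c in cited if c not in valid)
    return valid, phantom

# Generation via the Anthropic SDK (API key required):
# import anthropic
# msg = anthropic.Anthropic().messages.create(
#     model="claude-sonnet-4-20250514", max_tokens=1024,
#     system=SYSTEM,
#     messages=[{"role": "user", "content": build_prompt(query, chunks)}],
# )
```

Any phantom citation is a hard failure: either drop the claim or regenerate.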


Part 6: Putting It Together
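
Glue code along these lines ties the parts together: retrieve wide, rerank narrow, generate with citations. The component signatures here are assumptions of this sketch (plain callables), not a pinned API:

```python
class RAGPipeline:
    """Retrieve wide, rerank narrow, generate with citations."""

    def __init__(self, embed, store, rerank, generate,
                 retrieval_k=20, rerank_k=5):
        self.embed = embed          # embed(texts, queries=False) -> vectors
        self.store = store          # store.search(vec, k) -> [(score, chunk)]
        self.rerank = rerank        # rerank(query, chunks, keep) -> chunks
        self.generate = generate    # generate(query, chunks) -> answer
        self.retrieval_k = retrieval_k
        self.rerank_k = rerank_k

    def ask(self, query):
        qvec = self.embed([query], queries=True)[0]
        hits = [chunk for _, chunk
                in self.store.search(qvec, k=self.retrieval_k)]
        top = self.rerank(query, hits, keep=self.rerank_k)
        return self.generate(query, top)
```

The earlier helpers need light wrapping to fit these signatures, e.g. binding the embedding model into `embed` and the cross-encoder scorer into `rerank` with `functools.partial`.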


Part 7: Eval Harness (Measure or You're Guessing)

A RAG system without evaluation is a guess. Here's a minimal harness that measures what matters.
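
A minimal harness in that spirit. Each case names the expected source, section, and answer phrases; the pipeline is assumed to return a dict with `answer`, `chunks`, and `phantom_citations` fields (field names are this sketch's own):

```python
def run_eval(pipeline, cases):
    """Score a pipeline against hand-written eval cases.

    Each case: {"query", "expect_source", "expect_section",
    "expect_phrases"}. Returns per-metric hit rates in [0, 1].
    """
    totals = {"source": 0, "section": 0, "content": 0, "no_hallucination": 0}
    for case in cases:
        out = pipeline(case["query"])
        sources = {c.get("source") for c in out["chunks"]}
        sections = {c.get("section") for c in out["chunks"]}
        totals["source"] += case["expect_source"] in sources
        totals["section"] += case["expect_section"] in sections
        totals["content"] += all(p.lower() in out["answer"].lower()
                                 for p in case["expect_phrases"])
        totals["no_hallucination"] += not out["phantom_citations"]
    n = len(cases)
    return {metric: hits / n for metric, hits in totals.items()}
```

Ten to twenty hand-written cases covering your hardest real queries beat a thousand synthetic ones.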

Each run prints per-case results plus aggregate hit rates.

What this catches:

  • Source hit rate: Are we retrieving from the right documents?
  • Section hit rate: Are we finding the right part of the document?
  • Content hit rate: Does the answer contain expected information?
  • Hallucination-free rate: Are we making things up or citing phantom sources?

Run this after every change. If the pass rate drops, you broke something.


Cost Breakdown (Because Money Is Real)

Everyone pretends infrastructure is free. It isn't. Here's what this system actually costs.

One-Time Indexing Costs

| Component | Cost per 1M tokens | Notes |
|---|---|---|
| BGE-base embeddings | $0 (local) | ~3 min on M1 Mac, 10 min on CPU |
| OpenAI text-embedding-3-small | $0.02 | If you want cloud embeddings |
| OpenAI text-embedding-3-large | $0.13 | Higher quality, 6x cost |

Example: A 500-page technical manual ≈ 250K tokens → $0.005 with OpenAI small, $0 with local BGE.

Per-Query Costs

| Component | Cost | Latency |
|---|---|---|
| Vector search (FAISS) | $0 | ~5ms for 100K chunks |
| Reranking (local MiniLM) | $0 | ~50ms for 20 chunks |
| Reranking (Cohere API) | $0.001 | ~200ms, higher accuracy |
| Claude Sonnet generation | ~$0.003-0.01 | Depends on chunk size + answer length |
| Claude Haiku generation | ~$0.0005-0.002 | 10x cheaper, slightly lower quality |

Example query breakdown: vector search and local reranking are effectively free, so a typical query costs roughly $0.004 in Sonnet generation tokens (or about $0.0005 with Haiku). That per-query cost is where the at-scale numbers below come from.

At scale:

| Queries/month | Claude Sonnet | Claude Haiku |
|---|---|---|
| 1,000 | $4 | $0.50 |
| 10,000 | $40 | $5 |
| 100,000 | $400 | $50 |

Cost Optimization Strategies

  1. Use Haiku for simple queries. Route based on query complexity.
  2. Cache frequent queries. Same question = same answer.
  3. Reduce rerank_k. 3 chunks vs 5 chunks saves 40% on generation tokens.
  4. Batch index updates. Re-embed documents nightly, not on every change.

Tuning Guide

Chunking

| Parameter | Default | Adjust when |
|---|---|---|
| target_size | 512 | ↑ for long-form docs (contracts). ↓ for FAQs. |
| max_size | 1024 | ↑ if using Claude (200K context). |

Retrieval

| Parameter | Default | Adjust when |
|---|---|---|
| retrieval_k | 20 | ↑ if relevant docs aren't found. ↓ if reranking is slow. |
| rerank_k | 5 | ↑ for complex questions. ↓ for cost savings. |

Models

| Component | Budget | Balanced | Premium |
|---|---|---|---|
| Embeddings | bge-small (33M) | bge-base (109M) | bge-large (335M) |
| Reranker | Skip (not recommended) | ms-marco-MiniLM (22M) | bge-reranker-base (278M) |
| Generator | Claude Haiku | Claude Sonnet | Claude Opus |

Common Failure Modes & Fixes

| Problem | Diagnosis | Fix |
|---|---|---|
| Relevant docs not retrieved | Run search_raw(), check if they appear anywhere | Increase retrieval_k to 50. Try larger embeddings. |
| Wrong docs ranked first | Vector search finds them, reranking demotes them | Upgrade reranker to bge-reranker-base. |
| Citations don't match claims | Model is misattributing | Increase rerank_k. Add "cite conservatively" to prompt. |
| Answers are too vague | Chunks are too short | Increase target_size to 768+. |
| "I don't have enough information" (but you do) | Relevant chunk isn't in top-k | Check reranker scores. Chunk may be split wrong. |
| Hallucinated citation numbers | Model generates [6] when only 5 chunks provided | Our code catches this—check citations for errors. |

Dependencies & Installation
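
Everything here runs on the Python stack below (versions unpinned; pin them for production):

```shell
pip install numpy faiss-cpu sentence-transformers anthropic
```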

For GPU: pip install faiss-gpu


What This Doesn't Cover

This is a foundation, not a production system. You'll need to add:

  1. PDF/DOCX parsing — Use pypdf, pdfplumber, or unstructured
  2. Authentication — Wrap endpoints with your auth layer
  3. Streaming — Anthropic's API supports it; wire it through
  4. Hybrid search — Combine BM25 keyword search with vector for exact-match queries
  5. Metadata filtering — Filter by date, author, category before vector search
  6. Query rewriting — Expand "the thing from yesterday" into searchable terms

Each is a tutorial on its own.


The Bottom Line

Most RAG fails because people skip the hard parts: intelligent chunking, retrieval-optimized embeddings, reranking, and verifiable citations.

This implementation does all of them in ~300 lines of code. No frameworks. No abstractions hiding footguns.

Build it. Eval it. Ship it.


Found a bug? Have a question? Open an issue or hit me up at [email protected]
