
January 31, 2026


RAG That Doesn't Suck: Build a Search That Beats 'Chat With Your Docs'

Most RAG implementations are garbage. This one isn't. A comprehensive tutorial covering semantic chunking, retrieval-optimized embeddings, reranking, citation-grounded generation, and an eval harness—in ~300 lines of code.

Pio Greeff

Founder & Lead Developer



60-Second Quickstart
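
A sketch of the full loop, assuming the pieces described in Parts 1–7 are saved as `rag.py` with a small CLI (the filename and subcommands are illustrative, not a published tool):

```shell
# install the CPU-only stack
pip install numpy faiss-cpu sentence-transformers anthropic

# index a folder of documents, then ask a question
export ANTHROPIC_API_KEY=...
python rag.py index ./docs
python rag.py ask "Can I expense a co-working space membership?"
```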

The output is a short answer with [n] citation markers pointing at specific passages in your documents.

That's it. Cited answers from your docs in under a minute.

Want to understand why it works? Keep reading.


The Problem: Why 90% of RAG Fails

You've seen the demos. Someone spins up LangChain, chunks documents at 500 characters, throws embeddings into Pinecone, and calls it "AI-powered search."

Then users ask real questions. The system returns vaguely related paragraphs. The LLM hallucinates connections. Everyone pretends it works.

Here's what's actually broken:

| Failure | What They Do | What Happens |
|---|---|---|
| Dumb chunking | Fixed 500-char splits | Sentences torn apart. "...financial statements. Chapter 4: Marketing Strategy begins with..." tells the model nothing. |
| Wrong embeddings | Generic sentence-transformers | "Contract terms?" matches "The contract was signed" because both contain "contract." One is useless. |
| No reranking | Trust vector similarity | A tangential mention ranks higher than a direct answer using different vocabulary. |
| top-k too low | Retrieve 3-5 chunks | Miss relevant context scattered across the document. |
| top-k too high | Retrieve 50 chunks | Flood the LLM with noise. It hallucinates to fill gaps. |
| No query rewriting | Pass raw user query | "What about the thing from yesterday's meeting?" retrieves nothing. |
| No metadata filtering | Search everything always | 2019 policy docs contaminate answers about 2024 procedures. |
| Phantom citations | Append "[Source: doc.pdf]" to everything | Users can't verify which passage supports which claim. Decoration, not citation. |
| No eval harness | "Looks good to me" | You have no idea if changes improve or break the system. |

This tutorial fixes all of them.


Before vs After: The Same Query, Two Systems

Document: A 40-page employee handbook.

Query: "Can I expense a co-working space membership?"

❌ Naive RAG (fixed chunks, no reranking)

Retrieved chunks: a passage on professional membership fees, one on the remote home-office stipend, and one encouraging co-working for team collaboration.

Generated answer:

Professional membership fees are reimbursable up to $500 annually [1]. Remote employees also receive a $200 home office stipend [2]. Co-working is encouraged for team collaboration [3].

What's wrong: None of these chunks answer the question. The system found semantically similar text (membership, workspace, co-working) but not the actual policy. The answer sounds plausible but doesn't address the question.


✅ This RAG (semantic chunks + reranking)

Retrieved chunks (after reranking): the handbook's remote-work expense policy (monthly co-working limits, approval, receipts) and the section defining who qualifies as remote or hybrid.

Generated answer:

Yes, you can expense a co-working space membership if you're a remote employee. Full-time remote employees can expense up to $300/month, while part-time remote employees are eligible for up to $150/month [1]. You'll need manager approval and must submit monthly receipts [1]. Note: you must be designated as "remote" or "hybrid" in Workday to qualify—office-based employees aren't eligible [2].

What's different:

  • Semantic chunking kept the full policy section intact
  • Reranking promoted the directly relevant chunk from position 11 → position 1
  • Citations point to specific, verifiable passages

When NOT to Use RAG

RAG isn't always the answer. Sometimes it's expensive overhead for a problem that doesn't need it.

Skip RAG when:

| Situation | Better Alternative |
|---|---|
| Your docs fit in context | Just stuff them in the prompt. Claude handles 200K tokens. If your entire knowledge base is under 100K tokens, RAG adds latency and complexity for no benefit. |
| Questions are predictable | Build a structured FAQ with exact-match lookup. Faster, cheaper, more reliable. |
| You need 100% accuracy | RAG is probabilistic. For compliance-critical answers, use deterministic retrieval (SQL queries, structured data) with RAG as a fallback. |
| Docs change hourly | Re-embedding is expensive. If your corpus churns constantly, consider a keyword-based system (Elasticsearch) with semantic reranking on top. |
| Users ask one-word queries | RAG needs context to work. "Benefits?" is ambiguous. Build a UI that forces specificity or use query expansion. |

Use RAG when:

  • Corpus is too large for context window (500K+ tokens)
  • Questions are unpredictable and varied
  • You need to cite specific sources
  • Documents are semi-structured (policies, contracts, technical docs)
  • Acceptable accuracy is 85-95%, not 100%

Know your constraints before you build.


Architecture Overview

The key insight: vector search is coarse retrieval—it narrows millions of chunks to dozens. Reranking is fine retrieval—it identifies the actually relevant ones. The LLM is synthesis—it reads the winners and writes an answer with verifiable citations.


Part 1: Semantic Chunking That Respects Boundaries

Forget fixed-size chunks. We split on semantic boundaries: paragraph breaks, section headers, and sentence endings. Then we merge small chunks until they hit a target size, preserving context.

Key decisions:

  • Target 512 tokens, max 1024. Small enough for precise retrieval, large enough for context.
  • Section tracking. We store the section header with each chunk—this appears in citations.
  • Sentence-aware splitting. Never cut mid-sentence.
  • Aggressive merging. A 50-token chunk is noise.
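
A condensed sketch of that strategy: split on markdown-style headers and blank lines, track the current section, and greedily merge pieces toward the target size. Whitespace-token counts stand in for a real tokenizer here (an assumption; swap in your model's tokenizer for production):

```python
import re

def semantic_chunks(text, target_size=512, max_size=1024):
    """Split on headers/paragraph breaks, then merge small pieces
    until they reach target_size. Never cuts inside a paragraph."""
    parts = re.split(r"\n(?=#{1,6} )|\n\s*\n", text)
    section = ""
    chunks, buf, buf_len = [], [], 0

    def flush():
        nonlocal buf, buf_len
        if buf:
            chunks.append({"section": section, "text": "\n\n".join(buf)})
            buf, buf_len = [], 0

    for part in parts:
        part = part.strip()
        if not part:
            continue
        first_line = part.splitlines()[0]
        if re.match(r"#{1,6} ", first_line):
            flush()  # a new section never merges into the old one
            section = first_line.lstrip("# ")
        n = len(part.split())  # crude token count (assumption)
        if buf_len + n > max_size:
            flush()
        buf.append(part)
        buf_len += n
        if buf_len >= target_size:  # merged enough context; emit
            flush()
    flush()
    return chunks
```

Each chunk carries its section header, which is what later shows up in citations.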

Part 2: Embeddings Optimized for Retrieval

Not all embedding models are equal. General-purpose models optimize for similarity. Retrieval models optimize for query-document matching.

The instruction prefix matters. Without "Represent this sentence for searching relevant passages:", you lose 5-10% recall. Most tutorials skip this.
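
A sketch of the query-side handling: `as_query` applies the BGE instruction prefix to queries only (documents are embedded without it), and vectors are normalized so a plain inner product gives cosine similarity. The model name is the standard BGE checkpoint; the helper names are this sketch's own:

```python
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def as_query(text: str) -> str:
    """BGE-family models were trained with this instruction on the
    QUERY side only; document chunks get no prefix."""
    return QUERY_PREFIX + text

def load_embedder(name: str = "BAAI/bge-base-en-v1.5"):
    # lazy import keeps the pure helpers usable without the dependency
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)

def embed(model, texts, queries=False):
    """Normalized vectors make inner product == cosine similarity."""
    if queries:
        texts = [as_query(t) for t in texts]
    return model.encode(texts, normalize_embeddings=True)

# model = load_embedder()
# doc_vecs = embed(model, [c["text"] for c in chunks])
# query_vec = embed(model, ["co-working expenses?"], queries=True)[0]
```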


Part 3: Vector Store (Keep It Simple)

You don't need Pinecone. For under 1M chunks, numpy + faiss-cpu is faster to set up and fast enough to query.
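
A minimal store along those lines: pure numpy brute force, which is exact and plenty fast well below 1M chunks. `faiss.IndexFlatIP` is a drop-in upgrade when the matmul gets slow. This assumes normalized embeddings throughout, so inner product is cosine similarity:

```python
import numpy as np

class VectorStore:
    """Exact inner-product search over normalized embeddings."""

    def __init__(self, dim):
        self.vecs = np.empty((0, dim), dtype=np.float32)
        self.meta = []  # one metadata dict (or label) per vector

    def add(self, vecs, metas):
        self.vecs = np.vstack([self.vecs, np.asarray(vecs, dtype=np.float32)])
        self.meta.extend(metas)

    def search(self, query_vec, k=20):
        # brute-force scores against every stored vector
        scores = self.vecs @ np.asarray(query_vec, dtype=np.float32)
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), self.meta[i]) for i in top]
```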


Part 4: Reranking—Where Relevance Actually Happens

Vector search is fuzzy. Reranking tells you which results actually answer the question.

The impact is dramatic. In testing, reranking consistently promotes correct answers from position 8-15 to position 1-3.
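
A sketch of the rerank step. The scorer is passed in as a callable so local and API rerankers stay swappable; with sentence-transformers the callable is `CrossEncoder(...).predict`, which scores (query, passage) pairs:

```python
def rerank(query, candidates, score_fn, keep=5):
    """Re-order vector-search candidates by cross-encoder relevance.

    score_fn takes a list of (query, passage) pairs and returns one
    float per pair; candidates are chunk dicts with a "text" field.
    """
    scores = score_fn([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: -p[0])
    return [c for _, c in ranked[:keep]]

# Typical wiring (model download required):
# from sentence_transformers import CrossEncoder
# ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# top = rerank(query, raw_hits, ce.predict, keep=5)
```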


Part 5: Citation-Grounded Generation

The model must cite specific chunks, and we verify those citations map to real sources.
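
One way to enforce that (prompt wording and helper names are this sketch's own): number each chunk in the context, instruct the model to cite with `[n]` markers, then verify every cited number maps to a real chunk:

```python
import re

SYSTEM = (
    "Answer ONLY from the numbered context chunks below. Cite every "
    "claim with [n] markers matching chunk numbers. If the context is "
    "insufficient, say so instead of guessing."
)

def build_prompt(query, chunks):
    # Include the section header captured during chunking so citations
    # are human-verifiable, not just numbers.
    ctx = "\n\n".join(
        f"[{i}] ({c.get('section', 'unknown')}) {c['text']}"
        for i, c in enumerate(chunks, 1)
    )
    return f"Context:\n{ctx}\n\nQuestion: {query}"

def extract_citations(answer, n_chunks):
    """Split cited numbers into (valid, phantom) lists."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    valid = sorted(c for c in cited if 1 <= c <= n_chunks)
    phantom = sorted(c for c in cited if c not in valid)
    return valid, phantom

# Generation via the Anthropic SDK (API key required):
# import anthropic
# msg = anthropic.Anthropic().messages.create(
#     model="claude-sonnet-4-20250514", max_tokens=1024,
#     system=SYSTEM,
#     messages=[{"role": "user", "content": build_prompt(query, chunks)}],
# )
```

Any phantom citation is a hard failure: either drop the claim or regenerate.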


Part 6: Putting It Together
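
Glue code along these lines ties the parts together: retrieve wide, rerank narrow, generate with citations. The component signatures here are assumptions of this sketch (plain callables), not a pinned API:

```python
class RAGPipeline:
    """Retrieve wide, rerank narrow, generate with citations."""

    def __init__(self, embed, store, rerank, generate,
                 retrieval_k=20, rerank_k=5):
        self.embed = embed          # embed(texts, queries=False) -> vectors
        self.store = store          # store.search(vec, k) -> [(score, chunk)]
        self.rerank = rerank        # rerank(query, chunks, keep) -> chunks
        self.generate = generate    # generate(query, chunks) -> answer
        self.retrieval_k = retrieval_k
        self.rerank_k = rerank_k

    def ask(self, query):
        qvec = self.embed([query], queries=True)[0]
        hits = [chunk for _, chunk
                in self.store.search(qvec, k=self.retrieval_k)]
        top = self.rerank(query, hits, keep=self.rerank_k)
        return self.generate(query, top)
```

The earlier helpers need light wrapping to fit these signatures, e.g. binding the embedding model into `embed` and the cross-encoder scorer into `rerank` with `functools.partial`.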


Part 7: Eval Harness (Measure or You're Guessing)

A RAG system without evaluation is a guess. Here's a minimal harness that measures what matters.
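
A minimal harness in that spirit. Each case names the expected source, section, and answer phrases; the pipeline is assumed to return a dict with `answer`, `chunks`, and `phantom_citations` fields (field names are this sketch's own):

```python
def run_eval(pipeline, cases):
    """Score a pipeline against hand-written eval cases.

    Each case: {"query", "expect_source", "expect_section",
    "expect_phrases"}. Returns per-metric hit rates in [0, 1].
    """
    totals = {"source": 0, "section": 0, "content": 0, "no_hallucination": 0}
    for case in cases:
        out = pipeline(case["query"])
        sources = {c.get("source") for c in out["chunks"]}
        sections = {c.get("section") for c in out["chunks"]}
        totals["source"] += case["expect_source"] in sources
        totals["section"] += case["expect_section"] in sections
        totals["content"] += all(p.lower() in out["answer"].lower()
                                 for p in case["expect_phrases"])
        totals["no_hallucination"] += not out["phantom_citations"]
    n = len(cases)
    return {metric: hits / n for metric, hits in totals.items()}
```

Ten to twenty hand-written cases covering your hardest real queries beat a thousand synthetic ones.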

Each run prints per-case results plus aggregate hit rates.

What this catches:

  • Source hit rate: Are we retrieving from the right documents?
  • Section hit rate: Are we finding the right part of the document?
  • Content hit rate: Does the answer contain expected information?
  • Hallucination-free rate: Are we making things up or citing phantom sources?

Run this after every change. If the pass rate drops, you broke something.


Cost Breakdown (Because Money Is Real)

Everyone pretends infrastructure is free. It isn't. Here's what this system actually costs.

One-Time Indexing Costs

| Component | Cost per 1M tokens | Notes |
|---|---|---|
| BGE-base embeddings | $0 (local) | ~3 min on M1 Mac, 10 min on CPU |
| OpenAI text-embedding-3-small | $0.02 | If you want cloud embeddings |
| OpenAI text-embedding-3-large | $0.13 | Higher quality, 6x cost |

Example: A 500-page technical manual ≈ 250K tokens → $0.005 with OpenAI small, $0 with local BGE.

Per-Query Costs

| Component | Cost | Latency |
|---|---|---|
| Vector search (FAISS) | $0 | ~5ms for 100K chunks |
| Reranking (local MiniLM) | $0 | ~50ms for 20 chunks |
| Reranking (Cohere API) | $0.001 | ~200ms, higher accuracy |
| Claude Sonnet generation | ~$0.003-0.01 | Depends on chunk size + answer length |
| Claude Haiku generation | ~$0.0005-0.002 | 10x cheaper, slightly lower quality |

Example query breakdown: vector search and local reranking are effectively free, so a typical query costs roughly $0.004 in Sonnet generation tokens (or about $0.0005 with Haiku). That per-query cost is where the at-scale numbers below come from.

At scale:

| Queries/month | Claude Sonnet | Claude Haiku |
|---|---|---|
| 1,000 | $4 | $0.50 |
| 10,000 | $40 | $5 |
| 100,000 | $400 | $50 |

Cost Optimization Strategies

  1. Use Haiku for simple queries. Route based on query complexity.
  2. Cache frequent queries. Same question = same answer.
  3. Reduce rerank_k. 3 chunks vs 5 chunks saves 40% on generation tokens.
  4. Batch index updates. Re-embed documents nightly, not on every change.

Tuning Guide

Chunking

| Parameter | Default | Adjust when |
|---|---|---|
| target_size | 512 | ↑ for long-form docs (contracts). ↓ for FAQs. |
| max_size | 1024 | ↑ if using Claude (200K context). |

Retrieval

| Parameter | Default | Adjust when |
|---|---|---|
| retrieval_k | 20 | ↑ if relevant docs aren't found. ↓ if reranking is slow. |
| rerank_k | 5 | ↑ for complex questions. ↓ for cost savings. |

Models

| Component | Budget | Balanced | Premium |
|---|---|---|---|
| Embeddings | bge-small (33M) | bge-base (109M) | bge-large (335M) |
| Reranker | Skip (not recommended) | ms-marco-MiniLM (22M) | bge-reranker-base (278M) |
| Generator | Claude Haiku | Claude Sonnet | Claude Opus |

Common Failure Modes & Fixes

| Problem | Diagnosis | Fix |
|---|---|---|
| Relevant docs not retrieved | Run search_raw(), check if they appear anywhere | Increase retrieval_k to 50. Try larger embeddings. |
| Wrong docs ranked first | Vector search finds them, reranking demotes them | Upgrade reranker to bge-reranker-base. |
| Citations don't match claims | Model is misattributing | Increase rerank_k. Add "cite conservatively" to prompt. |
| Answers are too vague | Chunks are too short | Increase target_size to 768+. |
| "I don't have enough information" (but you do) | Relevant chunk isn't in top-k | Check reranker scores. Chunk may be split wrong. |
| Hallucinated citation numbers | Model generates [6] when only 5 chunks provided | Our code catches this—check citations for errors. |

Dependencies & Installation
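
Everything here runs on the Python stack below (versions unpinned; pin them for production):

```shell
pip install numpy faiss-cpu sentence-transformers anthropic
```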

For GPU: pip install faiss-gpu


What This Doesn't Cover

This is a foundation, not a production system. You'll need to add:

  1. PDF/DOCX parsing — Use pypdf, pdfplumber, or unstructured
  2. Authentication — Wrap endpoints with your auth layer
  3. Streaming — Anthropic's API supports it; wire it through
  4. Hybrid search — Combine BM25 keyword search with vector for exact-match queries
  5. Metadata filtering — Filter by date, author, category before vector search
  6. Query rewriting — Expand "the thing from yesterday" into searchable terms

Each is a tutorial on its own.


The Bottom Line

Most RAG fails because people skip the hard parts: intelligent chunking, retrieval-optimized embeddings, reranking, and verifiable citations.

This implementation does all of them in ~300 lines of code. No frameworks. No abstractions hiding footguns.

Build it. Eval it. Ship it.


Found a bug? Have a question? Open an issue or hit me up at [email protected]
