January 31, 2026
25 min read
Most RAG implementations are garbage. This one isn't. A comprehensive tutorial covering semantic chunking, retrieval-optimized embeddings, reranking, citation-grounded generation, and an eval harness—in ~300 lines of code.

Pio Greeff
Founder & Lead Developer
Most RAG implementations are garbage. This one isn't.
The full quickstart (clone, install, index, query) and its sample output appear in the code listings at the end of this article.
That's it: cited answers from your docs in under a minute.
Want to understand why it works? Keep reading.
You've seen the demos. Someone spins up LangChain, chunks documents at 500 characters, throws embeddings into Pinecone, and calls it "AI-powered search."
Then users ask real questions. The system returns vaguely related paragraphs. The LLM hallucinates connections. Everyone pretends it works.
Here's what's actually broken:
| Failure | What They Do | What Happens |
|---|---|---|
| Dumb chunking | Fixed 500-char splits | Sentences torn apart. "...financial statements. Chapter 4: Marketing Strategy begins with..." tells the model nothing. |
| Wrong embeddings | Generic sentence-transformers | "Contract terms?" matches "The contract was signed" because both contain "contract." One is useless. |
| No reranking | Trust vector similarity | A tangential mention ranks higher than a direct answer using different vocabulary. |
| top-k too low | Retrieve 3-5 chunks | Miss relevant context scattered across the document. |
| top-k too high | Retrieve 50 chunks | Flood the LLM with noise. It hallucinates to fill gaps. |
| No query rewriting | Pass raw user query | "What about the thing from yesterday's meeting?" retrieves nothing. |
| No metadata filtering | Search everything always | 2019 policy docs contaminate answers about 2024 procedures. |
| Phantom citations | Append "[Source: doc.pdf]" to everything | Users can't verify which passage supports which claim. Decoration, not citation. |
| No eval harness | "Looks good to me" | You have no idea if changes improve or break the system. |
This tutorial fixes all of them.
Document: A 40-page employee handbook.
Query: "Can I expense a co-working space membership?"
Retrieved chunks (naive vector search): excerpts about professional membership fees, the $200 home office stipend, and quarterly co-working meetups. Full excerpts are in the listings at the end of this article.
Generated answer:
Professional membership fees are reimbursable up to $500 annually [1]. Remote employees also receive a $200 home office stipend [2]. Co-working is encouraged for team collaboration [3].
What's wrong: None of these chunks answer the question. The system found semantically similar text (membership, workspace, co-working) but not the actual policy. The answer sounds plausible but doesn't address the question.
Retrieved chunks (after reranking): the actual "Co-working Space Reimbursement" and "Expense Eligibility" policy passages. Full excerpts are in the listings at the end of this article.
Generated answer:
Yes, you can expense a co-working space membership if you're a remote employee. Full-time remote employees can expense up to $300/month, while part-time remote employees are eligible for up to $150/month [1]. You'll need manager approval and must submit monthly receipts [1]. Note: you must be designated as "remote" or "hybrid" in Workday to qualify—office-based employees aren't eligible [2].
What's different: the reranked chunks contain the actual policy, the answer addresses the question directly, and every claim carries a citation that maps to a specific passage.
RAG isn't always the answer. Sometimes it's expensive overhead for a problem that doesn't need it.
Skip RAG when:
| Situation | Better Alternative |
|---|---|
| Your docs fit in context | Just stuff them in the prompt. Claude handles 200K tokens. If your entire knowledge base is under 100K tokens, RAG adds latency and complexity for no benefit. |
| Questions are predictable | Build a structured FAQ with exact-match lookup. Faster, cheaper, more reliable. |
| You need 100% accuracy | RAG is probabilistic. For compliance-critical answers, use deterministic retrieval (SQL queries, structured data) with RAG as a fallback. |
| Docs change hourly | Re-embedding is expensive. If your corpus churns constantly, consider a keyword-based system (Elasticsearch) with semantic reranking on top. |
| Users ask one-word queries | RAG needs context to work. "Benefits?" is ambiguous. Build a UI that forces specificity or use query expansion. |
Use RAG when the inverse holds: the corpus is too large for the context window, questions are open-ended and unpredictable, and answers must be traceable to specific source passages.
Know your constraints before you build.
The key insight: vector search is coarse retrieval—it narrows millions of chunks to dozens. Reranking is fine retrieval—it identifies the actually relevant ones. The LLM is synthesis—it reads the winners and writes an answer with verifiable citations.
Forget fixed-size chunks. We split on semantic boundaries: paragraph breaks, section headers, and sentence endings. Then we merge small chunks until they hit a target size, preserving context.
Key decisions: split on headers and paragraph breaks first, fall back to sentence boundaries for oversized sections, merge undersized pieces with neighbors toward a ~512-token target (1,024-token hard ceiling), and keep each chunk's section header so citations carry context.
Not all embedding models are equal. General-purpose models optimize for similarity. Retrieval models optimize for query-document matching.
The instruction prefix matters. Without "Represent this sentence for searching relevant passages:", you lose 5-10% recall. Most tutorials skip this.
You don't need Pinecone. For under 1M chunks, numpy + faiss-cpu is faster to set up and fast enough to query.
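For small corpora you can go even simpler than FAISS: with normalized embeddings, exact inner-product search is a one-liner in numpy. A sketch (illustrative 4-dimensional vectors; real BGE embeddings are 768-dimensional):

```python
import numpy as np

def top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int) -> np.ndarray:
    """Exact search: inner product equals cosine sim for normalized vectors."""
    scores = doc_embs @ query_emb
    return np.argsort(-scores)[:k]  # best-scoring indices first

docs = np.eye(4)                     # 4 orthogonal "documents"
q = np.array([0.1, 0.9, 0.0, 0.0])
print(top_k(q, docs, 2))             # → [1 0]
```

FAISS's `IndexFlatIP` does exactly this, just faster and without holding scores for every document in Python.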
Vector search is fuzzy. Reranking tells you which results actually answer the question.
The impact is dramatic. In testing, reranking consistently promotes correct answers from position 8-15 to position 1-3.
The model must cite specific chunks, and we verify those citations map to real sources.
A RAG system without evaluation is a guess. Here's a minimal harness that measures what matters.
Sample output appears in the listings at the end of this article.
What this catches: missing expected sources, wrong sections, answers that omit required phrases, forbidden phrases that signal hallucination, and citation numbers that point to nonexistent chunks.
Run this after every change. If the pass rate drops, you broke something.
Everyone pretends infrastructure is free. It isn't. Here's what this system actually costs.
| Component | Cost per 1M tokens | Notes |
|---|---|---|
| BGE-base embeddings | $0 (local) | ~3 min on M1 Mac, 10 min on CPU |
| OpenAI text-embedding-3-small | $0.02 | If you want cloud embeddings |
| OpenAI text-embedding-3-large | $0.13 | Higher quality, 6x cost |
Example: A 500-page technical manual ≈ 250K tokens → $0.005 with OpenAI small, $0 with local BGE.
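The arithmetic behind that example, as a sketch you can reuse for your own corpus (prices from the table above, per 1M tokens):

```python
def embedding_cost(n_tokens: int, price_per_million: float) -> float:
    """Embedding cost in dollars at a given per-1M-token price."""
    return n_tokens / 1_000_000 * price_per_million

# 500-page manual ≈ 250K tokens
print(embedding_cost(250_000, 0.02))  # text-embedding-3-small, ≈ $0.005
print(embedding_cost(250_000, 0.13))  # text-embedding-3-large, ≈ $0.0325
```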
| Component | Cost | Latency |
|---|---|---|
| Vector search (FAISS) | $0 | ~5ms for 100K chunks |
| Reranking (local MiniLM) | $0 | ~50ms for 20 chunks |
| Reranking (Cohere API) | $0.001 | ~200ms, higher accuracy |
| Claude Sonnet generation | ~$0.003-0.01 | Depends on chunk size + answer length |
| Claude Haiku generation | ~$0.0005-0.002 | 10x cheaper, slightly lower quality |
An example query breakdown appears in the listings at the end of this article: a single question costs about $0.004 and ~300ms end-to-end.
At scale:
| Queries/month | Claude Sonnet | Claude Haiku |
|---|---|---|
| 1,000 | $4 | $0.50 |
| 10,000 | $40 | $5 |
| 100,000 | $400 | $50 |
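Why the table scales linearly: vector search and local reranking cost nothing, so the monthly bill is just queries times per-query generation cost (~$0.004 for Sonnet and ~$0.0005 for Haiku, the low end of the ranges above):

```python
def monthly_generation_cost(queries_per_month: int, cost_per_query: float) -> float:
    """Generation dominates; retrieval and local reranking are free."""
    return queries_per_month * cost_per_query

print(monthly_generation_cost(1_000, 0.004))     # ≈ $4/month on Sonnet
print(monthly_generation_cost(100_000, 0.0005))  # ≈ $50/month on Haiku
```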
| Parameter | Default | Adjust when |
|---|---|---|
| `target_size` | 512 | ↑ for long-form docs (contracts). ↓ for FAQs. |
| `max_size` | 1024 | ↑ if using Claude (200K context). |
| Parameter | Default | Adjust when |
|---|---|---|
| `retrieval_k` | 20 | ↑ if relevant docs aren't found. ↓ if reranking is slow. |
| `rerank_k` | 5 | ↑ for complex questions. ↓ for cost savings. |
| Component | Budget | Balanced | Premium |
|---|---|---|---|
| Embeddings | bge-small (33M) | bge-base (109M) | bge-large (335M) |
| Reranker | Skip (not recommended) | ms-marco-MiniLM (22M) | bge-reranker-base (278M) |
| Generator | Claude Haiku | Claude Sonnet | Claude Opus |
| Problem | Diagnosis | Fix |
|---|---|---|
| Relevant docs not retrieved | Call `vector_store.search()` directly, check if they appear anywhere | Increase `retrieval_k` to 50. Try larger embeddings. |
| Wrong docs ranked first | Vector search finds them, reranking demotes them | Upgrade reranker to bge-reranker-base. |
| Citations don't match claims | Model is misattributing | Increase rerank_k. Add "cite conservatively" to prompt. |
| Answers are too vague | Chunks are too short | Increase target_size to 768+. |
| "I don't have enough information" (but you do) | Relevant chunk isn't in top-k | Check reranker scores. Chunk may be split wrong. |
| Hallucinated citation numbers | Model generates [6] when only 5 chunks provided | The generator flags these: check `citations` entries for an `error` field. |
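That last check is trivial to replicate anywhere. A minimal sketch of the same regex-based validation the generator's `_extract_citations` performs: any `[n]` in the answer that falls outside the provided chunk numbers is a hallucinated citation.

```python
import re

def hallucinated_refs(answer: str, n_chunks: int) -> list:
    """Return citation numbers that don't map to a provided chunk."""
    refs = {int(m) for m in re.findall(r'\[(\d+)\]', answer)}
    return sorted(r for r in refs if r < 1 or r > n_chunks)

print(hallucinated_refs("Policy allows X [1][2] and Y [6].", 5))  # → [6]
```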
For GPU: `pip install faiss-gpu`
This is a foundation, not a production system. You'll need to add more, starting with document parsing: `pypdf`, `pdfplumber`, or `unstructured` for PDFs. Each is a tutorial on its own.
Most RAG fails because people skip the hard parts: intelligent chunking, retrieval-optimized embeddings, reranking, and verifiable citations.
This implementation does all of them in ~300 lines of code. No frameworks. No abstractions hiding footguns.
Build it. Eval it. Ship it.
Found a bug? Have a question? Open an issue or hit me up at [email protected]
Full code listings and example outputs:

Quick start:

```bash
# Clone
git clone https://github.com/yourusername/rag-that-works.git
cd rag-that-works

# Install
pip install -r requirements.txt

# Add your API key
export ANTHROPIC_API_KEY="sk-ant-..."

# Index a document
python cli.py index ./docs/employee_handbook.md

# Ask a question
python cli.py search "What's the PTO policy for contractors?"
```

Sample output:

```
Answer:
Contractors receive 15 days of paid time off annually [1], prorated based on
their start date [2]. Unused PTO does not roll over to the following year [1].

Sources:
[1]: employee_handbook.md (section: "Contractor Benefits", chars 2847-3201)
[2]: employee_handbook.md (section: "Onboarding", chars 892-1104)
```

Retrieved chunks, naive vector search:

```
[1] "...membership fees for professional organizations are reimbursable up to
    $500 annually. Submit receipts through Expensify within 30..."
[2] "...workspace ergonomics are important. The company provides a $200 home
    office stipend for remote employees to purchase..."
[3] "...co-working is encouraged for team collaboration. Regional teams should
    coordinate quarterly meetups at designated..."
```

Retrieved chunks, after reranking:

```
[1] "Co-working Space Reimbursement: Full-time remote employees may expense
    co-working space memberships up to $300/month. Part-time remote employees
    are eligible for up to $150/month. Requires manager approval and monthly
    receipt submission."
[2] "Expense Eligibility: To qualify for workspace reimbursements, employees
    must be designated as 'remote' or 'hybrid' in Workday. Office-based
    employees are not eligible for co-working benefits."
```

The pipeline:

```
┌─────────────────────────────────────────────────┐
│                   USER QUERY                    │
└─────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│                 QUERY EMBEDDING                 │
│         (same model as document chunks)         │
└─────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│            VECTOR RETRIEVAL (Top 20)            │
│          Coarse filtering—cast a wide net       │
└─────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│             CROSS-ENCODER RERANKING             │
│          Scores each chunk against query        │
│      Returns top 5—the actually relevant ones   │
└─────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│                 LLM GENERATION                  │
│      Answer with inline citations [1], [2]      │
└─────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│                CITATION MAPPING                 │
│    Link each [n] to exact chunk + char offset   │
└─────────────────────────────────────────────────┘
```

```python
# chunker.py
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    source: str
    chunk_id: int
    start_char: int
    end_char: int
    section: str = ""  # Optional: section header for context


class SemanticChunker:
    def __init__(
        self,
        target_size: int = 512,      # target tokens (roughly)
        max_size: int = 1024,        # hard ceiling
        overlap_sentences: int = 1   # context carry-over
    ):
        self.target_size = target_size
        self.max_size = max_size
        self.overlap_sentences = overlap_sentences
        self.chars_per_token = 4  # Rough estimate for English

    def chunk(self, text: str, source: str) -> List[Chunk]:
        """
        Split text into semantic chunks.

        Strategy:
        1. Split on major boundaries (double newlines, headers)
        2. Split long sections on sentence boundaries
        3. Merge tiny chunks with neighbors
        4. Track section headers for citation context
        """
        text = re.sub(r'\n{3,}', '\n\n', text.strip())

        # Extract section structure
        sections = self._split_by_headers(text)

        all_chunks = []
        for section_title, section_text in sections:
            section_chunks = self._chunk_section(section_text, source, section_title)
            all_chunks.extend(section_chunks)

        # Assign sequential IDs and character offsets
        char_offset = 0
        for i, chunk in enumerate(all_chunks):
            chunk.chunk_id = i
            start = text.find(chunk.text[:50], char_offset)
            if start == -1:
                start = char_offset
            chunk.start_char = start
            chunk.end_char = start + len(chunk.text)
            char_offset = start + len(chunk.text) // 2

        return all_chunks

    def _split_by_headers(self, text: str) -> List[tuple]:
        """Split text by headers, returning (header, content) tuples."""
        # Match markdown headers or ALL CAPS lines
        header_pattern = r'^(#{1,6}\s+.+|[A-Z][A-Z\s]{10,})$'

        sections = []
        current_header = ""
        current_content = []

        for line in text.split('\n'):
            if re.match(header_pattern, line.strip()):
                if current_content:
                    sections.append((current_header, '\n'.join(current_content)))
                current_header = line.strip().lstrip('#').strip()
                current_content = []
            else:
                current_content.append(line)

        if current_content:
            sections.append((current_header, '\n'.join(current_content)))

        return sections if sections else [("", text)]

    def _chunk_section(self, text: str, source: str, section: str) -> List[Chunk]:
        """Chunk a single section."""
        paragraphs = re.split(r'\n\n+', text)

        # Split large paragraphs on sentences
        pieces = []
        for para in paragraphs:
            if self._estimate_tokens(para) > self.max_size:
                pieces.extend(self._split_sentences(para))
            else:
                pieces.append(para)

        # Merge small pieces
        merged = self._merge_pieces(pieces)

        return [
            Chunk(text=t.strip(), source=source, chunk_id=0,
                  start_char=0, end_char=0, section=section)
            for t in merged if t.strip()
        ]

    def _estimate_tokens(self, text: str) -> int:
        return len(text) // self.chars_per_token

    def _split_sentences(self, text: str) -> List[str]:
        pattern = r'(?<=[.!?])\s+(?=[A-Z])'
        return [s.strip() for s in re.split(pattern, text) if s.strip()]

    def _merge_pieces(self, pieces: List[str]) -> List[str]:
        if not pieces:
            return []

        merged = []
        current = pieces[0]

        for piece in pieces[1:]:
            combined_size = self._estimate_tokens(current + " " + piece)
            if combined_size <= self.target_size:
                current = current + "\n\n" + piece
            elif self._estimate_tokens(current) < self.target_size // 3:
                current = current + "\n\n" + piece
            else:
                merged.append(current)
                current = piece

        merged.append(current)
        return merged
```

```python
# embedder.py
import numpy as np
from typing import List
from sentence_transformers import SentenceTransformer


class Embedder:
    def __init__(self, model_name: str = "BAAI/bge-base-en-v1.5"):
        """
        BGE-base: trained on retrieval tasks, handles query/doc asymmetry,
        768 dimensions, instruction-following for query enhancement.
        """
        self.model = SentenceTransformer(model_name)
        self.dimension = 768

    def embed_documents(self, texts: List[str]) -> np.ndarray:
        """Embed document chunks. No prefix needed."""
        return self.model.encode(
            texts,
            normalize_embeddings=True,
            show_progress_bar=True
        )

    def embed_query(self, query: str) -> np.ndarray:
        """
        Embed a search query with instruction prefix.
        This is critical: BGE models gain 5-10% recall with the prefix.
        """
        instruction = "Represent this sentence for searching relevant passages: "
        return self.model.encode(
            instruction + query,
            normalize_embeddings=True
        )
```

```python
# vector_store.py
import numpy as np
import faiss
import pickle
from pathlib import Path
from typing import List, Tuple, Optional
from dataclasses import asdict

from chunker import Chunk
from embedder import Embedder


class VectorStore:
    def __init__(self, embedder: Embedder, index_path: Optional[str] = None):
        self.embedder = embedder
        self.index: Optional[faiss.IndexFlatIP] = None
        self.chunks: List[Chunk] = []
        self.index_path = index_path
        if index_path and Path(index_path).exists():
            self.load(index_path)

    def add_documents(self, chunks: List[Chunk]):
        if not chunks:
            return
        texts = [c.text for c in chunks]
        embeddings = self.embedder.embed_documents(texts)
        if self.index is None:
            self.index = faiss.IndexFlatIP(embeddings.shape[1])
        self.index.add(embeddings.astype(np.float32))
        self.chunks.extend(chunks)

    def search(self, query: str, k: int = 20) -> List[Tuple[Chunk, float]]:
        if self.index is None or not self.chunks:
            return []
        query_emb = self.embedder.embed_query(query).reshape(1, -1).astype(np.float32)
        scores, indices = self.index.search(query_emb, min(k, len(self.chunks)))
        return [(self.chunks[idx], float(score))
                for score, idx in zip(scores[0], indices[0])
                if idx < len(self.chunks)]

    def save(self, path: str):
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        if self.index:
            faiss.write_index(self.index, str(path / "index.faiss"))
        with open(path / "chunks.pkl", "wb") as f:
            pickle.dump([asdict(c) for c in self.chunks], f)

    def load(self, path: str):
        path = Path(path)
        if (path / "index.faiss").exists():
            self.index = faiss.read_index(str(path / "index.faiss"))
        if (path / "chunks.pkl").exists():
            with open(path / "chunks.pkl", "rb") as f:
                self.chunks = [Chunk(**d) for d in pickle.load(f)]
```

```python
# reranker.py
from typing import List, Tuple
from sentence_transformers import CrossEncoder

from chunker import Chunk


class Reranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        """
        Cross-encoder: processes query+doc together, catches interactions
        bi-encoders miss. Slower but far more accurate.
        For critical accuracy: use BAAI/bge-reranker-base (10x larger).
        """
        self.model = CrossEncoder(model_name)

    def rerank(
        self,
        query: str,
        chunks: List[Tuple[Chunk, float]],
        top_k: int = 5
    ) -> List[Tuple[Chunk, float]]:
        if not chunks:
            return []
        pairs = [(query, chunk.text) for chunk, _ in chunks]
        scores = self.model.predict(pairs)
        reranked = [(chunks[i][0], float(scores[i])) for i in range(len(chunks))]
        reranked.sort(key=lambda x: x[1], reverse=True)
        return reranked[:top_k]
```

```python
# generator.py
import re
from typing import List, Tuple, Dict, Any
from dataclasses import dataclass
import anthropic

from chunker import Chunk


@dataclass
class CitedAnswer:
    answer: str
    citations: List[Dict[str, Any]]
    chunks_used: List[Chunk]


class Generator:
    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)

    def generate(
        self,
        query: str,
        chunks: List[Tuple[Chunk, float]],
        model: str = "claude-sonnet-4-20250514"
    ) -> CitedAnswer:
        # Build numbered context
        context_parts = []
        chunk_map = {}
        for i, (chunk, score) in enumerate(chunks, 1):
            section_info = f" (section: \"{chunk.section}\")" if chunk.section else ""
            context_parts.append(
                f"[{i}] Source: {chunk.source}{section_info}\n{chunk.text}"
            )
            chunk_map[i] = chunk

        context = "\n\n---\n\n".join(context_parts)

        system_prompt = """You are a precise research assistant. Answer using ONLY the provided excerpts.

Rules:
1. Use ONLY information from the excerpts. No outside knowledge.
2. Cite with [1], [2], etc. Every factual claim needs a citation.
3. If excerpts don't answer the question, say so explicitly.
4. Be concise. Don't repeat information.
5. If sources conflict, note the discrepancy."""

        response = self.client.messages.create(
            model=model,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user",
                       "content": f"Excerpts:\n\n{context}\n\n---\n\nQuestion: {query}"}]
        )

        answer_text = response.content[0].text
        citations = self._extract_citations(answer_text, chunk_map)

        return CitedAnswer(
            answer=answer_text,
            citations=citations,
            chunks_used=[chunk_map[i] for i in chunk_map]
        )

    def _extract_citations(self, text: str, chunk_map: Dict[int, Chunk]) -> List[Dict]:
        refs = set(int(m) for m in re.findall(r'\[(\d+)\]', text))
        citations = []
        for ref in sorted(refs):
            if ref in chunk_map:
                chunk = chunk_map[ref]
                citations.append({
                    "reference": f"[{ref}]",
                    "source": chunk.source,
                    "section": chunk.section,
                    "start_char": chunk.start_char,
                    "end_char": chunk.end_char,
                    "text_preview": chunk.text[:200] + "..."
                })
            else:
                citations.append({
                    "reference": f"[{ref}]",
                    "error": "Hallucinated citation—refers to non-existent source"
                })
        return citations
```

```python
# rag_search.py
from pathlib import Path
from typing import List, Optional
import os

from chunker import SemanticChunker, Chunk
from embedder import Embedder
from vector_store import VectorStore
from reranker import Reranker
from generator import Generator, CitedAnswer


class RAGSearch:
    def __init__(
        self,
        index_path: str = "./rag_index",
        anthropic_api_key: Optional[str] = None
    ):
        self.embedder = Embedder()
        self.chunker = SemanticChunker()
        self.vector_store = VectorStore(self.embedder, index_path)
        self.reranker = Reranker()
        self.generator = Generator(
            api_key=anthropic_api_key or os.environ.get("ANTHROPIC_API_KEY")
        )
        self.index_path = index_path

    def index_document(self, text: str, source: str) -> int:
        chunks = self.chunker.chunk(text, source)
        self.vector_store.add_documents(chunks)
        return len(chunks)

    def index_file(self, filepath: str) -> int:
        path = Path(filepath)
        return self.index_document(path.read_text(), path.name)

    def save_index(self):
        self.vector_store.save(self.index_path)

    def search(
        self,
        query: str,
        retrieval_k: int = 20,
        rerank_k: int = 5
    ) -> CitedAnswer:
        retrieved = self.vector_store.search(query, k=retrieval_k)
        if not retrieved:
            return CitedAnswer(
                answer="No documents indexed.",
                citations=[],
                chunks_used=[]
            )
        reranked = self.reranker.rerank(query, retrieved, top_k=rerank_k)
        return self.generator.generate(query, reranked)
```

```python
# eval_harness.py
from dataclasses import dataclass
from typing import List, Dict, Optional
import json

from rag_search import RAGSearch


@dataclass
class EvalCase:
    query: str
    expected_sources: List[str]    # Filenames that should be cited
    expected_sections: List[str]   # Section titles that should appear
    must_contain: List[str]        # Key phrases the answer must include
    must_not_contain: List[str]    # Phrases indicating hallucination


@dataclass
class EvalResult:
    query: str
    passed: bool
    source_hit: bool        # Did we cite expected sources?
    section_hit: bool       # Did we cite expected sections?
    content_hit: bool       # Did answer contain required phrases?
    no_hallucination: bool  # Did we avoid forbidden phrases?
    details: Dict


class EvalHarness:
    def __init__(self, rag: RAGSearch):
        self.rag = rag

    def evaluate(self, cases: List[EvalCase]) -> Dict:
        results = []
        for case in cases:
            result = self._evaluate_single(case)
            results.append(result)

        # Aggregate metrics
        total = len(results)
        passed = sum(1 for r in results if r.passed)
        source_hits = sum(1 for r in results if r.source_hit)
        section_hits = sum(1 for r in results if r.section_hit)
        content_hits = sum(1 for r in results if r.content_hit)
        no_hallucinations = sum(1 for r in results if r.no_hallucination)

        return {
            "summary": {
                "total_cases": total,
                "passed": passed,
                "pass_rate": f"{100 * passed / total:.1f}%",
                "source_hit_rate": f"{100 * source_hits / total:.1f}%",
                "section_hit_rate": f"{100 * section_hits / total:.1f}%",
                "content_hit_rate": f"{100 * content_hits / total:.1f}%",
                "hallucination_free_rate": f"{100 * no_hallucinations / total:.1f}%"
            },
            "results": [
                {"query": r.query, "passed": r.passed, "details": r.details}
                for r in results
            ]
        }

    def _evaluate_single(self, case: EvalCase) -> EvalResult:
        answer = self.rag.search(case.query)

        # Check source hits
        cited_sources = [c.get("source", "") for c in answer.citations]
        source_hit = any(
            exp in cited
            for exp in case.expected_sources
            for cited in cited_sources
        ) if case.expected_sources else True

        # Check section hits
        cited_sections = [c.get("section", "") for c in answer.citations]
        section_hit = any(
            exp.lower() in cited.lower()
            for exp in case.expected_sections
            for cited in cited_sections
        ) if case.expected_sections else True

        # Check content requirements
        answer_lower = answer.answer.lower()
        content_hit = all(
            phrase.lower() in answer_lower for phrase in case.must_contain
        ) if case.must_contain else True

        # Check for hallucinations
        no_hallucination = not any(
            phrase.lower() in answer_lower for phrase in case.must_not_contain
        ) if case.must_not_contain else True

        # Check for hallucinated citations
        hallucinated_citations = [c for c in answer.citations if c.get("error")]
        if hallucinated_citations:
            no_hallucination = False

        passed = source_hit and section_hit and content_hit and no_hallucination

        return EvalResult(
            query=case.query,
            passed=passed,
            source_hit=source_hit,
            section_hit=section_hit,
            content_hit=content_hit,
            no_hallucination=no_hallucination,
            details={
                "answer_preview": answer.answer[:300],
                "cited_sources": cited_sources,
                "cited_sections": cited_sections,
                "hallucinated_citations": hallucinated_citations
            }
        )


# Example eval set
EVAL_CASES = [
    EvalCase(
        query="What is the PTO policy for full-time employees?",
        expected_sources=["employee_handbook.md"],
        expected_sections=["Benefits", "Time Off", "PTO"],
        must_contain=["days", "annual"],
        must_not_contain=["I don't know", "not mentioned"]
    ),
    EvalCase(
        query="What's the process for requesting a leave of absence?",
        expected_sources=["employee_handbook.md"],
        expected_sections=["Leave", "Absence"],
        must_contain=["request", "approval", "HR"],
        must_not_contain=[]
    ),
    EvalCase(
        query="What are the expense limits for client dinners?",
        expected_sources=["expense_policy.md"],
        expected_sections=["Entertainment", "Meals", "Client"],
        must_contain=["$", "per person"],
        must_not_contain=[]
    ),
    EvalCase(
        query="How do I report a security incident?",
        expected_sources=["security_policy.md"],
        expected_sections=["Incident", "Security"],
        must_contain=["report", "immediately"],
        must_not_contain=[]
    ),
    EvalCase(
        query="What's the dress code policy?",
        expected_sources=["employee_handbook.md"],
        expected_sections=["Dress", "Appearance"],
        must_contain=[],
        must_not_contain=["I cannot find"]
    ),
]


# Run evaluation
if __name__ == "__main__":
    rag = RAGSearch(index_path="./my_index")
    harness = EvalHarness(rag)
    results = harness.evaluate(EVAL_CASES)

    print(json.dumps(results["summary"], indent=2))

    # Show failures
    for r in results["results"]:
        if not r["passed"]:
            print(f"\n❌ FAILED: {r['query']}")
            print(f"   Details: {r['details']}")
```

Sample eval output:

```json
{
  "total_cases": 5,
  "passed": 4,
  "pass_rate": "80.0%",
  "source_hit_rate": "100.0%",
  "section_hit_rate": "80.0%",
  "content_hit_rate": "100.0%",
  "hallucination_free_rate": "100.0%"
}
```

Example query breakdown:

```
Query: "What's the termination policy?"
├── Vector search: $0.000 (5ms)
├── Reranking:     $0.000 (50ms, local)
├── Generation:    $0.004 (5 chunks @ 500 tokens + 200 token response)
└── Total:         $0.004 (~300ms end-to-end)
```
```
# requirements.txt
anthropic>=0.18.0
sentence-transformers>=2.2.0
faiss-cpu>=1.7.4
numpy>=1.24.0
```

```bash
pip install -r requirements.txt
```