Most RAG tutorials show you the same thing: split your documents into fixed-size chunks, embed them, store them in a vector database, and retrieve the top-k at query time. This works in demos. It fails in production.
The problem is that real documents are structurally complex. Legal contracts have defined sections. Technical manuals have nested hierarchies. Financial filings have tables, footnotes, and cross-references. Fixed-size chunking destroys this structure and produces low-quality retrieval.
Semantic chunking
Instead of splitting at character boundaries, semantic chunking identifies natural content boundaries in the document. Section headers, paragraph breaks, list items, and table cells are treated as chunk boundaries. This produces chunks that contain coherent, self-contained pieces of information.
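As a minimal sketch of the idea, the splitter below (function name and size limit are illustrative, not from the original) breaks text at headings and blank lines rather than at fixed character offsets, merging small blocks up to a soft size cap:

```python
import re

def semantic_chunks(text, max_chars=800):
    """Split text at structural boundaries (headings, blank lines)
    instead of fixed character offsets."""
    # Blank lines separate structural blocks (paragraphs, headings, lists).
    blocks = re.split(r"\n\s*\n", text.strip())
    chunks, current = [], ""
    for block in blocks:
        block = block.strip()
        # A heading always starts a new chunk; otherwise merge small
        # blocks into the current chunk until the soft size limit.
        if re.match(r"^#{1,6}\s", block) or len(current) + len(block) > max_chars:
            if current:
                chunks.append(current)
            current = block
        else:
            current = (current + "\n\n" + block) if current else block
    if current:
        chunks.append(current)
    return chunks
```

A real pipeline would also treat list items and table cells as boundaries, but the principle is the same: each chunk ends up as a coherent section rather than an arbitrary 500-character slice.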
Metadata filtering
Before vector search, we apply metadata filters to reduce the candidate set. For a question about a specific policy version, we filter to chunks from that document version before running the similarity search. This improves precision and reduces latency, because the similarity search runs over a much smaller candidate set.
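In pseudocode terms, filtering happens before scoring. Here is a toy version over an in-memory chunk list (the chunk schema and cosine scorer are illustrative assumptions; a production system would push the filter down into the vector database):

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filtered_search(chunks, query_vec, filters, k=5):
    """Apply exact-match metadata filters, then score only the survivors."""
    candidates = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    return candidates[:k]
```

The key design point is ordering: the filter is cheap and exact, so it runs first; the embedding comparison is fuzzy and relatively expensive, so it only sees chunks that already satisfy the hard constraints.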
Hybrid search
Dense vector search is good at semantic similarity but poor at exact term matching. For queries containing specific identifiers, product names, or technical terms, BM25 keyword search often outperforms embedding-based retrieval. Our pipeline runs both in parallel and combines results using reciprocal rank fusion.
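Reciprocal rank fusion itself is only a few lines: each document scores the sum of 1/(k + rank) across the ranked lists it appears in, with k conventionally set to 60. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum of 1/(k + rank_i(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only consumes ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.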
Re-ranking
After the initial retrieval pass, we run a cross-encoder re-ranker over the top-20 candidates to produce a final top-k. Cross-encoders are too slow to run over the full corpus, but excellent at discriminating between the near-misses that dense retrieval produces. This step consistently improves end-to-end answer quality.
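The two-stage shape can be sketched as follows. The scorer here is a stand-in: real pipelines would plug in a cross-encoder (for example, `CrossEncoder.predict` from sentence-transformers), which is far too slow to run over the whole corpus but cheap over 20 candidates; the token-overlap function below is a placeholder so the sketch stays self-contained:

```python
def rerank(query, candidates, score_fn, final_k=5):
    """Re-score a small candidate set with an expensive scorer,
    then keep the best final_k."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:final_k]

def overlap_score(query, doc):
    # Placeholder scorer: token overlap standing in for a cross-encoder,
    # which would jointly encode (query, doc) and output a relevance score.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / (len(q) or 1)
```

The structure is what matters: a fast retriever over-fetches (top-20), and the slow, accurate model only has to separate near-misses from true hits within that small set.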
Every improvement in retrieval quality directly improves answer quality. It's the highest-leverage layer of the RAG stack to optimise.