RAG Application Development — Retrieval-Augmented Generation Done Right
April 1, 2026 · Blog | LLM & AI Engineering · 15 min read


Retrieval-Augmented Generation has become the default architecture for connecting LLMs to proprietary data. The concept is simple: instead of relying solely on what the model learned during pretraining, retrieve relevant information from your knowledge base and include it in the prompt. The execution is anything but simple. The difference between a RAG prototype that works on demo day and a RAG system that works reliably in production comes down to dozens of engineering decisions about chunking, embedding, retrieval, reranking, and generation that compound to determine overall system quality.

Most RAG projects start well and stall in the same place. The team builds a basic pipeline — chunk documents, embed them, store in a vector database, retrieve top-k results, pass to the LLM — and gets encouraging initial results. Then they test with real user queries and discover that retrieval misses relevant documents, retrieved chunks lack sufficient context, the model hallucinates despite having correct context, and answers are inconsistent across similar queries. These are not signs that RAG does not work. They are signs that the basic pipeline needs the engineering refinements that separate prototypes from products.

At ESS ENN Associates, we have been building software systems for global clients since 1993. Our AI engineering practice has shipped RAG applications for knowledge management, customer support, legal research, and technical documentation. This guide covers the architecture decisions, optimization techniques, and evaluation practices that make RAG applications reliable in production.

Chunking: The Foundation of Retrieval Quality

Chunking — how you split your documents into pieces for embedding and retrieval — is the single most impactful decision in RAG pipeline design. Get chunking wrong and no amount of embedding model quality or retrieval sophistication will compensate. The chunk must be large enough to contain meaningful, self-contained information but small enough to be relevant to specific queries and fit within the LLM's context window alongside other chunks.

Fixed-size chunking with recursive character splitting is the simplest approach. You define a target chunk size (typically 256-512 tokens) and an overlap (50-100 tokens), and the splitter breaks text at natural boundaries (paragraph breaks, sentence ends) as close to the target size as possible. The overlap ensures that information spanning chunk boundaries is not lost. This approach works reasonably well for homogeneous text and is the right starting point for any RAG project.
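The splitting logic can be sketched in a few lines. This is a simplified illustration only: sizes are measured in characters rather than tokens, and the sentence splitter is a regex rather than a proper tokenizer. Production splitters (e.g. recursive character splitters in common RAG frameworks) handle nested separators and token counting.

```python
import re

def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping chunks, breaking at sentence ends.

    Sizes are in characters here for simplicity; a real splitter
    would count tokens with the embedding model's tokenizer.
    """
    # Split into sentences at ., !, or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > chunk_size:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one
            # so information spanning the boundary is not lost.
            current = current[-overlap:] + " " + sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks
```

Because the overlap is carried forward verbatim, each chunk after the first begins with the last `overlap` characters of its predecessor.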

Semantic chunking uses embedding similarity to identify topic boundaries within documents. Instead of splitting at fixed intervals, the system computes embeddings for each sentence or paragraph, measures similarity between consecutive segments, and splits where similarity drops below a threshold (indicating a topic change). This produces chunks that are topically coherent, which improves retrieval precision because each chunk is about one thing rather than straddling two topics.
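A minimal sketch of that boundary-detection loop, with the embedding model left pluggable: `embed_fn` would be a real sentence-embedding model in practice; the bag-of-words `bow` stand-in below exists only to make the example self-contained.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, embed_fn, threshold=0.3):
    """Group consecutive sentences; start a new chunk when similarity
    between neighbors drops below the threshold (a topic change)."""
    chunks, current = [], [sentences[0]]
    prev = embed_fn(sentences[0])
    for sentence in sentences[1:]:
        vec = embed_fn(sentence)
        if cosine(prev, vec) < threshold:  # similarity drop: split here
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev = vec
    chunks.append(" ".join(current))
    return chunks

# Toy stand-in for a real embedding model: word-count vectors.
bow = lambda s: Counter(s.lower().split())
```

The threshold is the main tuning knob: too high fragments coherent passages, too low merges unrelated topics.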

Document-structure-aware chunking uses the document's native structure — headings, sections, tables, lists — to define chunk boundaries. Technical manuals, legal contracts, and academic papers have explicit structure that communicates topical organization. Respecting this structure produces more meaningful chunks than any algorithm that treats the document as flat text. This approach requires document parsing that extracts structural metadata, which adds pipeline complexity but significantly improves retrieval quality for structured documents.

Hierarchical chunking stores documents at multiple granularity levels simultaneously: full sections, individual paragraphs, and even individual sentences. The retrieval system can then match queries at the appropriate level — specific factual queries retrieve sentence-level chunks while broad conceptual queries retrieve section-level chunks. This approach requires more storage and index management but provides the best retrieval precision across diverse query types.

Embedding Models: Choosing and Optimizing

The embedding model converts text chunks and queries into vectors that capture semantic meaning. The quality of these embeddings directly determines retrieval accuracy — if similar concepts are not close in embedding space, retrieval fails regardless of everything else in the pipeline.

Model selection in 2026 offers strong options across the performance spectrum. OpenAI's text-embedding-3-large provides excellent quality through a managed API. Cohere's Embed v3 offers multilingual strength and compression flexibility. For self-hosted deployment, BGE-M3, E5-Mistral-7B, and GTE-Qwen2 provide state-of-the-art quality without API dependency. The MTEB leaderboard provides benchmarks, but the most meaningful evaluation is always against your own data — retrieval accuracy on your documents with your queries, not generic benchmarks.

Embedding dimensions involve a tradeoff between quality and storage/compute. Higher dimensions (1536, 3072) capture more nuance but require more storage and slower similarity computation. Lower dimensions (384, 768) are more efficient but may lose subtle semantic distinctions. For most RAG applications, 768-1024 dimensions provide an excellent balance. Models like text-embedding-3-large support dimension reduction through Matryoshka representations, letting you choose your dimension at inference time without retraining.
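With a Matryoshka-trained model, reducing dimension at inference time amounts to keeping the leading components and re-normalizing, as sketched below. This assumes the model was trained with Matryoshka representation learning; truncating an ordinary embedding this way degrades quality much more.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the leading `dim` components of a Matryoshka-style
    embedding and re-normalize to unit length, so cosine
    similarity remains meaningful at the reduced dimension."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```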

Fine-tuned embeddings significantly improve retrieval accuracy for domain-specific applications. The process uses contrastive learning on pairs of queries and relevant documents from your domain to adjust the embedding space. A fine-tuned embedding model that understands your domain's terminology and conceptual relationships consistently outperforms a general-purpose model. The training data can be generated synthetically: use an LLM to generate questions that each document chunk answers, creating query-document pairs for contrastive training.

Vector Databases: Storage and Retrieval Infrastructure

The vector database stores your embeddings and provides fast similarity search. The choice of vector database affects performance, scalability, operational complexity, and available search features.

Pinecone is the leading managed vector database. Its fully managed infrastructure means zero operational overhead — no capacity planning, index management, or upgrade maintenance. It scales seamlessly and provides consistent performance. The tradeoff is cost (managed services are more expensive than self-hosted) and vendor lock-in. Pinecone is the right choice for teams that want to focus on application logic rather than infrastructure management.

Weaviate provides hybrid search natively, combining vector similarity with BM25 keyword search in a single query. This hybrid capability is valuable because it eliminates the need to maintain separate search systems. Weaviate also supports generative search (running LLM generation directly within the database query) and multi-tenancy for SaaS applications. It is available as both a managed cloud service and a self-hosted deployment.

Qdrant offers high performance with rich metadata filtering. Its filtering capabilities allow complex queries that combine vector similarity with attribute-based constraints (retrieve documents similar to this query AND published after 2025 AND in the medical domain). Written in Rust for performance, Qdrant handles large-scale deployments efficiently. It is available both as a managed cloud service and self-hosted.

pgvector adds vector search to PostgreSQL, which is already running in most application stacks. For teams that want to avoid adding a new infrastructure component, pgvector provides good vector search performance within the familiar PostgreSQL ecosystem. It supports exact and approximate nearest neighbor search, integrates with PostgreSQL's existing indexing and query optimization, and benefits from PostgreSQL's mature operational tooling. For RAG applications with under 5-10 million vectors, pgvector often provides the simplest path to production.

Hybrid Search and Reranking

Hybrid search combines semantic vector search with keyword-based BM25 search. The motivation is that each method has complementary strengths. Vector search finds conceptually relevant results even when different words are used, but can miss exact matches on specific terms, identifiers, and proper nouns. BM25 search finds exact term matches reliably but misses semantically related content that uses different vocabulary. Combining both approaches with reciprocal rank fusion or learned score weighting typically improves retrieval recall by 10-25% compared to either method alone.

The implementation approach depends on your vector database. Weaviate supports hybrid search natively. For other databases, you run parallel vector and keyword searches, then merge results using reciprocal rank fusion (RRF). RRF assigns each result a score based on its rank in each search result list, then sorts by combined score. The formula is simple, requires no training, and works surprisingly well across diverse query types.
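RRF itself fits in a few lines. Each document's fused score is the sum of `1 / (k + rank)` over every result list it appears in, with `k = 60` being the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists: each document scores
    sum(1 / (k + rank)) across the lists it appears in."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately in both lists typically outscores one ranked highly in only one, which is exactly the behavior hybrid search wants.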

Reranking is a second-stage retrieval step that improves the precision of the initial retrieval results. The first-stage retrieval (vector search, keyword search, or hybrid) returns a candidate set of perhaps 20-50 chunks. The reranker then scores each candidate against the original query using a more sophisticated model than embedding similarity, and returns the top results ranked by relevance.

Cross-encoder rerankers such as Cohere Rerank and BGE-reranker (and late-interaction models such as ColBERT) provide significantly more accurate relevance scores than embedding cosine similarity because they process the query and document jointly rather than independently. The tradeoff is latency: reranking adds roughly 50-200ms to the retrieval pipeline. For most production applications, this latency is acceptable and the quality improvement (typically 5-15% on ranking metrics such as NDCG@k) justifies the cost.
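The two-stage shape is the same regardless of which reranker you use, so it helps to keep the scoring model pluggable. In the sketch below, `score_fn` stands in for a cross-encoder call (e.g. a hosted rerank API or a local model); the term-overlap scorer is a toy stand-in, not a real relevance model.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Second-stage reranking: score each first-stage candidate
    jointly against the query and return the top_n by score.
    `score_fn(query, doc)` stands in for a cross-encoder model."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Toy stand-in scorer: term overlap between query and document.
overlap_score = lambda q, d: len(set(q.lower().split()) & set(d.lower().split()))
```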

Advanced RAG Patterns

Query transformation improves retrieval by reformulating the user's query before searching. Techniques include HyDE (Hypothetical Document Embeddings), which generates a hypothetical answer to the query and uses its embedding for retrieval, capturing the semantics of what a good answer looks like. Multi-query generation creates multiple reformulations of the user's query to increase recall. Step-back prompting generates a more general version of specific queries to retrieve broader contextual information.
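The HyDE flow is short enough to sketch end to end, with the LLM, embedder, and vector search left as injected functions (all three names below are placeholders for your own components):

```python
def hyde_retrieve(query, generate_fn, embed_fn, search_fn, top_k=5):
    """HyDE: embed a hypothetical answer instead of the raw query.
    `generate_fn` stands in for an LLM call and `search_fn` for a
    vector-database query by embedding."""
    hypothetical = generate_fn(f"Write a short passage answering: {query}")
    return search_fn(embed_fn(hypothetical), top_k)
```

The point of the indirection is that a hypothetical answer usually lands closer in embedding space to the documents that contain the real answer than the question itself does.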

Contextual compression extracts only the relevant portions of retrieved chunks before passing them to the LLM. Retrieved chunks often contain both relevant and irrelevant information. A compression step (using a smaller LLM or an extraction model) identifies and extracts only the sentences or paragraphs that are relevant to the specific query, reducing noise in the generation context and improving answer accuracy.
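A minimal sketch of that extraction step, assuming a pluggable `relevance_fn` (in production, a small LLM or a trained extractor; the term-overlap lambda in the test is a toy stand-in):

```python
def compress_context(query, chunks, relevance_fn, min_score=1):
    """Keep only the sentences from retrieved chunks that the
    relevance model judges related to the query, reducing noise
    in the generation context."""
    kept = []
    for chunk in chunks:
        for sentence in chunk.split(". "):
            if relevance_fn(query, sentence) >= min_score:
                kept.append(sentence)
    return " ".join(kept)
```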

Self-RAG and Corrective RAG add self-reflection to the RAG pipeline. After generating an initial answer, the system evaluates whether the answer is grounded in the retrieved context and whether additional retrieval is needed. If the answer fails grounding checks, the system performs corrective retrieval with refined queries and regenerates. This iterative approach catches and corrects hallucinations that would slip through a single-pass pipeline.
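The control flow of that corrective loop can be sketched as below. All three injected functions are placeholders: `grounded_fn` stands in for a grounding check (itself usually an LLM-as-judge call), and the query-refinement step is deliberately naive.

```python
def corrective_rag(query, retrieve_fn, generate_fn, grounded_fn, max_rounds=3):
    """Generate, check grounding against the retrieved context, and
    re-retrieve with a refined query until the answer passes the
    check or the round budget is exhausted."""
    search_query = query
    for _ in range(max_rounds):
        context = retrieve_fn(search_query)
        answer = generate_fn(query, context)
        if grounded_fn(answer, context):
            return answer
        # Naive refinement: fold the failed answer's terms into the query.
        search_query = f"{query} {answer}"
    return answer  # budget exhausted; return the last attempt
```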

Graph RAG combines vector retrieval with knowledge graph traversal. Documents are not just chunked and embedded — entities and relationships are extracted to build a knowledge graph. Retrieval then follows graph relationships to find relevant context that might not be semantically similar to the query but is logically connected through entity relationships. This is particularly valuable for complex questions that require connecting information across multiple documents.

RAG Evaluation with RAGAS and Beyond

RAG systems require evaluation at two levels: retrieval quality (are we finding the right documents?) and generation quality (are we producing the right answers?). Evaluating only the final answer misses the opportunity to diagnose whether failures originate in retrieval or generation, which is essential for targeted improvement.

RAGAS (Retrieval Augmented Generation Assessment) is the leading evaluation framework for RAG applications. It provides automated metrics for faithfulness (does the answer stay grounded in retrieved context?), answer relevancy (does the answer address the question?), context precision (are the retrieved documents relevant?), and context recall (were all relevant documents retrieved?). RAGAS uses LLM-as-judge evaluation, meaning it does not require human-labeled ground truth for every test case — though a human-labeled subset for calibration is strongly recommended.

Building your evaluation set requires at minimum 100-200 question-answer-context triplets that represent your production query distribution. Include questions of varying complexity, questions that require information from multiple documents, questions where the answer is not in your knowledge base (to test appropriate refusal), and adversarial questions designed to trigger hallucination. This evaluation set should be maintained as a living asset that grows as new failure modes are discovered in production.

For a comprehensive treatment of evaluation methodology that applies beyond RAG to all LLM applications, see our LLM evaluation and benchmarking guide.

"The most common RAG failure is not a technology problem. It is an evaluation problem. Teams that build RAG pipelines without systematic retrieval metrics end up tuning generation prompts to compensate for retrieval failures, which is like adjusting the thermostat because the window is open. Measure retrieval quality first. Fix retrieval first. Then optimize generation."

— Karan Checker, Founder, ESS ENN Associates

Production RAG Optimization

Caching at multiple levels reduces latency and cost. Embedding cache stores computed embeddings for frequently seen queries. Retrieval cache stores search results for common queries. Answer cache stores complete responses for repeated identical queries. Semantic caching extends this by identifying queries that are similar enough to share cached results, using embedding distance as the similarity measure. A well-tuned caching layer can reduce LLM API calls by 30-60% for applications with repetitive query patterns.
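The semantic layer of that caching stack can be sketched as follows. `embed_fn` is a placeholder for a real embedding model, and the linear scan over entries would be replaced by a vector index at any real scale; the threshold (cosine similarity, not distance) is the knob that trades hit rate against the risk of serving a wrong cached answer.

```python
import math

class SemanticCache:
    """Cache answers keyed by query embedding; a new query whose
    cosine similarity to a cached query meets the threshold
    reuses that query's answer, skipping the LLM call."""

    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn, self.threshold = embed_fn, threshold
        self.entries = []  # list of (embedding, answer) pairs

    def _cosine(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (math.sqrt(sum(x * x for x in a))
                * math.sqrt(sum(x * x for x in b)))
        return dot / norm if norm else 0.0

    def get(self, query):
        vec = self.embed_fn(query)
        for cached_vec, answer in self.entries:
            if self._cosine(vec, cached_vec) >= self.threshold:
                return answer  # cache hit
        return None  # cache miss: caller must invoke the LLM

    def put(self, query, answer):
        self.entries.append((self.embed_fn(query), answer))
```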

Metadata filtering reduces the search space and improves relevance. Rather than searching the entire knowledge base for every query, metadata filters constrain retrieval to relevant subsets: documents from the correct department, documents within the relevant date range, documents at the appropriate security classification. Filtering before vector search is dramatically faster than retrieving from the full index and filtering afterward. Structured metadata also enables access control, ensuring users only receive information they are authorized to see.
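The filter-then-search ordering can be illustrated with a toy in-memory index (real vector databases apply filters inside the index structure, not as a Python list comprehension, but the principle is the same):

```python
def filtered_search(query_vec, index, filters, sim_fn, top_k=3):
    """Apply metadata filters first, then rank only the surviving
    subset by vector similarity. `index` is a list of entries like
    {"id": ..., "vec": [...], "meta": {...}}."""
    candidates = [e for e in index
                  if all(e["meta"].get(k) == v for k, v in filters.items())]
    candidates.sort(key=lambda e: sim_fn(query_vec, e["vec"]), reverse=True)
    return [e["id"] for e in candidates[:top_k]]
```

Because non-matching entries never reach the similarity computation, the same mechanism doubles as an access-control gate when the filter encodes the user's permissions.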

Document freshness management ensures the knowledge base reflects current information. Implement automated pipelines that detect document changes, recompute embeddings for modified content, and update the vector index. Track document versions so that retrieval can reference the specific version used for each answer, providing an audit trail for compliance-sensitive applications. Stale data in the knowledge base is a common source of incorrect RAG answers that is easily preventable with proper data management.

For teams deploying RAG at scale, our LLM deployment infrastructure guide covers the serving infrastructure needed for production RAG applications, including GPU selection, auto-scaling, and cost management for the LLM component of the pipeline.

Frequently Asked Questions

What is RAG and why does it matter?

RAG enhances LLM responses by retrieving relevant information from external knowledge sources before generating an answer. Instead of relying solely on trained knowledge, RAG searches your documents and databases for relevant context. This reduces hallucinations, keeps responses grounded in your data, allows knowledge updates without retraining, and provides source attribution for answers.

What is the best chunking strategy for RAG?

The optimal strategy depends on document types and query patterns. Recursive character splitting (256-512 tokens with overlap) works for general text. Semantic chunking splits on topic boundaries for long documents. Document-structure-aware chunking respects headings and sections for structured content. The most effective systems use different strategies per document type and evaluate empirically against retrieval accuracy.

Which vector database should I use for RAG?

Pinecone offers fully managed scaling with minimal overhead. Weaviate provides native hybrid search. Qdrant offers high performance with rich filtering. pgvector adds vector search to existing PostgreSQL. For under 10 million documents, pgvector or Qdrant provide the best balance. Above that, Pinecone or Weaviate managed services reduce operational burden.

What is hybrid search and why is it better for RAG?

Hybrid search combines semantic vector search with keyword BM25 search. Vector search finds conceptually similar content; keyword search finds exact term matches. Combining both with reciprocal rank fusion improves retrieval accuracy by 10-25%, particularly for queries mixing conceptual questions with specific terminology, product names, or codes.

How do you evaluate RAG application quality?

Evaluate both retrieval and generation independently. Retrieval metrics include recall and precision at k. Generation metrics include faithfulness, answer relevancy, and completeness. The RAGAS framework automates these using LLM-as-judge evaluation. Production systems should also track user feedback, citation accuracy, and hallucination rates as ongoing quality indicators.

At ESS ENN Associates, our AI engineering services team builds production RAG applications with the retrieval sophistication and evaluation rigor described in this guide. We bring 30+ years of software delivery experience to every engagement, operating on dedicated GPU infrastructure for embedding and inference workloads. If you are building a RAG application and want to discuss architecture, chunking strategy, or evaluation methodology, contact us for a free technical consultation.

Tags: RAG Vector Database Embeddings Hybrid Search Reranking AI Engineering

Ready to Build a Production RAG Application?

From chunking strategy and embedding optimization to hybrid search, reranking, and RAGAS evaluation — our AI engineering team builds RAG applications that deliver accurate, grounded answers. 30+ years of IT services. ISO 9001 and CMMI Level 3 certified.

Get a Free Consultation