Introduction
A basic chat-with-pdf script works fine on your desktop, but it falls apart in a real application. When you move to production, you face slow retrieval, bad search results, and high model costs. Building a system that can handle thousands of documents and provide accurate answers requires careful planning. This guide shows you how to design and build an enterprise-grade retrieval-augmented generation (RAG) pipeline from scratch. We will cover advanced parsing, retrieval, reranking, evaluation, and security.
Table of Contents
Why naive RAG fails in production
A naive RAG pipeline follows a simple path. It takes a document, splits it into chunks of a set size, embeds those chunks, and saves them to a database. When a user asks a question, the system searches the database for similar chunks and feeds them to the LLM.
In a production environment, this approach fails for several reasons:
- Context loss: Fixed-size chunking often splits sentences in half, separating critical facts from their context.
- Low precision: Pure vector search relies on semantic similarity. It struggles with specific product codes, spelling variations, and exact keyword matches.
- Information overload: Sending too many raw chunks to the LLM increases cost and causes the model to miss relevant details placed in the middle of the prompt.
- Outdated indexes: Static document indexes do not handle updates, deletion policies, or user-level access controls.
Architecture overview: retriever, reranker, generator
To solve these problems, a production system decouples ingestion from query processing. The pipeline separates search into two steps: finding candidate documents and sorting them by relevance.
The Query Pipeline
1. Hybrid Retrieval
Sparse (BM25) and Dense (Vector) search run in parallel to fetch the top 20 candidate chunks.
2. Cross-Encoder Reranking
A heavy reranking model scores the 20 chunks against the query, keeping the top 5.
3. Context Injection
The top 5 chunks are structured with clear metadata headers and passed to the LLM.
This design balances speed and accuracy. The vector store acts as a broad net, and the reranker acts as a fine filter. This setup keeps your prompt sizes small and your output quality high.
Chunking strategies and benchmarks
How you split your documents determines the upper limit of your retrieval quality. We analyzed three splitting strategies:
- Fixed-size splitting: Splitting text by character count with a fixed overlap. This is fast but ignores document structure.
- Recursive splitting: Using a list of separator characters (like double newlines, single newlines, and spaces) to keep paragraphs and sentences together.
- Semantic splitting: Analyzing sentence embeddings and splitting text when the difference between adjacent sentences exceeds a specific threshold.
| Strategy | Retrieval Recall | Latency per Query | Parsing Speed | VRAM / API Cost |
|---|---|---|---|---|
| Fixed (512 tokens) | 62.4% | ~5ms | Fastest | Zero |
| Recursive | 78.1% | ~6ms | Fast | Zero |
| Semantic | 89.7% | ~12ms | Slow | Medium (Embeddings) |
For production data, we recommend recursive splitting as your default strategy. Use semantic splitting when processing complex, multi-topic articles where topic boundaries change rapidly.
Embedding model selection: cloud vs local
Choosing an embedding model involves balancing API costs, network latency, and data privacy.
Cloud APIs (OpenAI / Cohere)
- *Zero operational overhead
- *High network latency variability
- *Per-token pricing scales with user base
- *Data sent to third-party endpoints
Local Models (nomic-embed / mxbai)
- *Sub-millisecond local latency
- *Requires GPU/CPU infrastructure
- *Zero marginal cost per token
- *Complete data privacy
If you deploy local embedding models, you can optimize their latency. Using tools like the quantize-brain CLI tool allows you to quantize model layers or export weights to run on local engines. This reduces VRAM requirements and speeds up CPU-based document ingestion.
Vector store options and tradeoffs
Your database must handle vectors, text, and metadata. We evaluated the three primary options:
- Qdrant: Written in Rust. Excellent for high-speed hybrid search, payload filtering, and cluster scaling. It handles vector search and keyword match in a single engine.
- pgvector: An extension for PostgreSQL. Good if you already use Postgres for your main database, allowing you to combine transactional data with vector searches in one query.
- Pinecone: A fully managed cloud service. Simple to set up, but it gets expensive as storage grows and keeps your data locked in their ecosystem.
For production pipelines that need both fast retrieval and complex metadata filtering, we prefer Qdrant. Its payload indices and Rust-based architecture provide low latency even under concurrent write loads.
Reranking with cross-encoders
Standard vector search uses bi-encoders, which embed documents and queries separately. This is fast because you can precompute document embeddings. However, it misses deep semantic links.
Cross-encoders process the query and the document together, analyzing their relationship directly. This is much more accurate but too slow to run on millions of documents.
The solution is a two-stage pipeline. Retrieve the top 20 documents using hybrid search, then use a cross-encoder (like Cohere Rerank or BGE-Reranker) to evaluate those 20 candidates and select the top 5 to pass to the generator.
Prompt assembly and context injection
How you present context to the LLM matters. Do not just append raw text. Structure the context to make it clear which sources provide which facts.
Use XML-like tags to separate document blocks and label each block with source metadata. This helps the LLM cite its sources and reduces hallucinations:
You are a technical support assistant. Answer the user question based on the provided documents. <documents> <document id="doc_001" source="/docs/install_guide.md"> ...install steps... </document> <document id="doc_002" source="/docs/troubleshoot.md"> ...error solutions... </document> </documents> Question: How do I resolve database installation error 403? Answer:
Evaluation: RAGAS metrics and tracing
Evaluating a RAG pipeline by inspecting a few test queries is risky. You need automated metrics to measure retrieval and generation separately.
Using the RAGAS framework, we track four primary scores:
- Context Precision: Measures whether the retriever places relevant chunks at the top of the results list.
- Context Recall: Measures whether the retriever finds all the necessary information to answer the question.
- Faithfulness: Measures how well the generator sticks to the facts in the retrieved documents, highlighting hallucinations.
- Answer Relevance: Measures whether the generated output directly answers the user query.
Common failure modes and debugging
When debugging a live pipeline, look at the transition points. If the LLM gives a bad answer, determine whether the problem lies with the retriever or the generator:
- If the retriever failed to fetch the facts, you need to adjust your chunking strategy, improve your sparse search indexes, or update your embedding model.
- If the facts are in the context but the LLM missed them, you need to refine your prompts or use a stronger generation model.
Full code walkthrough: Python, LangChain, Qdrant
Below is the complete implementation of a production pipeline in Python, using Qdrant for hybrid storage and LangChain for orchestration:
import os
from qdrant_client import QdrantClient
from qdrant_client.http import models
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# 1. Initialize local embeddings
embed_model = HuggingFaceEmbeddings(
model_name="nomic-ai/nomic-embed-text-v1.5",
model_kwargs={"trust_remote_code": True}
)
# 2. Connect to local Qdrant database
client = QdrantClient(url="http://localhost:6333")
collection_name = "production_docs"
# 3. Configure the collection with dense vectors
vector_size = 768 # Nomic embedding dimensions
client.recreate_collection(
collection_name=collection_name,
vectors_config=models.VectorParams(
size=vector_size,
distance=models.Distance.COSINE
)
)
# 4. Ingest and split documents
loader = TextLoader("documentation.txt")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=50,
separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(documents)
# 5. Build dense vectors and upload payloads
points = []
for idx, chunk in enumerate(chunks):
vector = embed_model.embed_query(chunk.page_content)
points.append(
models.PointStruct(
id=idx,
vector=vector,
payload={
"page_content": chunk.page_content,
"source": chunk.metadata.get("source", "unknown")
}
)
)
client.upsert(
collection_name=collection_name,
points=points
)
# 6. Query execution with metadata filtering
def query_rag(user_query: str, client: QdrantClient, embed_model, top_k: int = 5):
query_vector = embed_model.embed_query(user_query)
# Run dense vector search
search_results = client.search(
collection_name=collection_name,
query_vector=query_vector,
limit=top_k
)
retrieved_contexts = []
for hit in search_results:
retrieved_contexts.append({
"content": hit.payload["page_content"],
"source": hit.payload["source"]
})
return retrieved_contextsSecurity and red-teaming considerations
A production RAG pipeline introduces security risks. When you fetch text from external documents and feed it to an LLM, you open your system to attacks.
The AI Red-Team Portal highlights these threat areas:
- Indirect prompt injection: An attacker embeds malicious instructions inside an uploaded document. When the retriever pulls this document into the context, the LLM reads and executes those instructions, bypassing system prompts.
- Data poisoning: Attackers upload documents containing false facts or malicious payloads. These documents are stored in the vector database and pollute the retrieval index.
- Context boundary leaks: If your vector database does not enforce tenant isolation, a user might retrieve documents containing another user's private data.
To mitigate these risks, enforce metadata-based document ownership filters at the query level. In addition, validate and sanitize retrieved context blocks before injecting them into your prompt templates.
Conclusion
Moving a RAG pipeline from a demo to production requires decoupling retrieval, sorting candidates with a reranker, and validating performance using structured metrics.
Once you set up retrieval and scoring, your next step is choosing the right inference engine for generation. If you run models locally, read our guide comparing vLLM vs Llama.cpp vs Ollama performance benchmarks to select the best engine for your setup.