How to Build a Production-Ready RAG Pipeline: A Developer’s Guide to Vector Search, Chunking, and LLM Integration
Let’s be honest. Most RAG implementations you see in tutorials are toys.
They work great on a single PDF about “The Great Gatsby.” But throw 10,000 technical documents at them, and they fall apart. Bad retrievals. Hallucinations. Latency spikes.
I’ve been there. Recently, I helped a client in Ho Chi Minh City migrate their legacy knowledge base—over 50,000 internal support tickets—into a RAG system. The first version was a disaster. We rebuilt it from scratch.
Here’s what actually works in production.
Why Most RAG Pipelines Fail
The problem isn’t the LLM. It’s the retrieval layer.
Most developers spend 80% of their time tuning prompts and 20% on the data pipeline. That’s backwards. If your retrieval is garbage, no prompt engineering in the world will save you.
The three killers:
- Bad chunking — Splitting text arbitrarily, breaking semantic meaning
- Poor embedding quality — Using generic models on domain-specific data
- No reranking — Trusting cosine similarity blindly
Let’s fix all three.
Step 1: Smart Chunking Strategies
Don’t just split by character count. That’s amateur hour.
Here’s what I use in production:
```python
from langchain.text_splitter import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# For markdown documents with structure
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

# Then split each section recursively, preferring paragraph and sentence boundaries
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # measured by length_function: characters here
    chunk_overlap=128,
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
    length_function=len,
)
```
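Why 512 with 128 overlap? In my experience it's the sweet spot. Too small (under 256) and you lose context. Too large (over 1024) and your embeddings get diluted. The overlap ensures you don't lose meaning at chunk boundaries. One caveat: with `length_function=len`, the splitter measures characters, not tokens. To size chunks in tokens, pass a token-counting function as `length_function`.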
For code-heavy documentation, I actually prefer 768 tokens. Code blocks need more room to stay intact.
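To make the relationship between chunk size and overlap concrete, here is a minimal sketch of a sliding-window splitter in plain Python. It counts whitespace-separated words as a stand-in for tokens (an assumption; a real pipeline would count with the embedding model's tokenizer), and `chunk_words` is a hypothetical helper, not part of LangChain.

```python
def chunk_words(text, chunk_size=512, chunk_overlap=128):
    """Split text into overlapping windows of whitespace-separated words.

    Words stand in for tokens here; swap in a real tokenizer for production.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each window starts `chunk_size - chunk_overlap` words after the previous one, so the last 128 words of one chunk reappear at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from either side.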
Step 2: Choosing the Right Embedding Model
Don’t default to `text-embedding-ada-002` for everything. It’s good, but not always the best.
Here’s my current stack:
| Use Case | Model | Dimensions | Cost per 1K tokens |
|---|---|---|---|
| General text | `text-embedding-3-small` | 1536 (default; can be shortened to 512 via the `dimensions` parameter) | $0.00002 |
| Code/documentation | `BAAI/bge-large-en-v1.5` | 1024 | Free (self-hosted) |
| Technical/domain-specific | `intfloat/e5-mistral-7b-instruct` | 4096 | Free (self-hosted) |
| Multilingual | `intfloat/multilingual-e5-large` | 1024 | Free (self-hosted) |
For our support ticket project, we used `BAAI/bge-large-en-v1.5`. It's open-source, runs on a single GPU, and often outperforms OpenAI's embeddings on technical retrieval benchmarks. Results vary by dataset, so evaluate on your own data before committing.
Pro tip: Always normalize your embeddings. It’s a one-liner but most people forget:
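```python
import numpy as np

def normalize_embedding(embedding):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    embedding = np.asarray(embedding, dtype=float)
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm > 0 else embedding
```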
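Why bother normalizing? Once vectors have unit length, a plain dot product gives the same value as cosine similarity, so scores stay comparable across queries and stores. The sketch below checks that with plain Python lists; the helper names are illustrative and not from any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed the long way, for comparison."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0

a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization, the plain dot product matches cosine similarity.
assert math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```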
Step 3: Vector Search That Doesn’t Suck
Pinecone is great. But for most teams, Qdrant or Weaviate self-hosted is cheaper and gives you more control.
Here’s a production config for Qdrant:
```python
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(
    url="http://localhost:6333",
    prefer_grpc=True,  # gRPC is typically faster than REST for batch operations
)

# Create collection with tuned parameters.
# Note: recreate_collection deletes any existing collection with the same name.
client.recreate_collection(
    collection_name="knowledge_base",
    vectors_config=models.VectorParams(
        size=1024,  # Match your embedding dimension
        distance=models.Distance.COSINE,
    ),
    hnsw_config=models.HnswConfigDiff(
        m=16,              # Higher = more accurate, more memory
        ef_construct=200,  # Higher = better index quality, slower builds
    ),
    optimizers_config=models.OptimizersConfigDiff(
        indexing_threshold=10000,  # Size (in KB) of vector data before HNSW indexing starts
    ),
)
```
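The HNSW parameters matter. `m=16` and `ef_construct=200` are a solid starting point: expect recall around 95% with reasonable memory usage, though the exact figure depends on your data and search-time settings. Going to `m=32` can push recall to about 98%, but it roughly doubles the memory used by the graph links. For most use cases, it's not worth it.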
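Recall figures like these only mean something when measured on your own vectors. A minimal sketch of that measurement, in plain Python: compute the exact top-k by brute force on a sample of queries, then check how many of those IDs the approximate index returned. Both helpers are hypothetical, and the brute-force search assumes normalized vectors so that dot product ranks the same as cosine.

```python
def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbours by dot product (vectors assumed normalized).

    vectors: {doc_id: [float, ...]}
    """
    scored = [(sum(q * v for q, v in zip(query, vec)), doc_id)
              for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

Run it over a few hundred sampled queries, average the result, and compare index settings on that number rather than on defaults from someone else's dataset.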
Step 4: Reranking—The Secret Sauce
This is where most tutorials stop. But you’re not most developers.
After your initial vector search (top 20 results), run a cross-encoder reranker. It’s slower but dramatically more accurate:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, documents, top_k=5):
    """Score each (query, document) pair and return the top_k documents."""
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)
    # Sort by score descending
    scored_docs = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in scored_docs[:top_k]]
```
Real numbers from our project:
- Vector search alone: 72% retrieval accuracy
- With reranking: 94% retrieval accuracy
That's a 22-point improvement (roughly 31% in relative terms). Worth the extra 50ms latency? Absolutely.
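If you want to produce numbers like these for your own system, here is one way to do it. This sketch assumes "retrieval accuracy" means hit rate at k: the share of test queries whose known-good document shows up in the top k results. That definition is an assumption for illustration, and both helpers are hypothetical.

```python
def hit_rate_at_k(results, relevant, k=5):
    """Share of queries whose relevant document appears in the top-k results.

    results:  {query_id: [doc_id, ...]} ranked best first
    relevant: {query_id: doc_id} with one known-good document per query
    """
    if not relevant:
        return 0.0
    hits = sum(1 for qid, doc in relevant.items()
               if doc in results.get(qid, [])[:k])
    return hits / len(relevant)

def compare(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    return (after - before) * 100, (after - before) / before * 100

points, relative = compare(0.72, 0.94)
print(f"{points:.0f} points absolute, {relative:.0f}% relative")
```

Reporting both the absolute and the relative gain avoids the usual confusion between "22 points" and "22 percent".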
Step 5: LLM Integration with Context Management
Now for the fun part. Feeding the retrieved context to your LLM.
Here’s the prompt template I use:
```python
RAG_PROMPT = """You are a technical support engineer. Answer the question based ONLY on the provided context.

Context:
{context}

Question: {question}

Instructions:
- If the context doesn't contain the answer, say "I cannot find this information in the available documentation."
- Cite specific sections from the context when possible.
- Keep answers concise and technical.

Answer:"""
```
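The "cite specific sections" instruction works better when the model has something to point at. One option, sketched here as an assumption rather than a required step, is to number each retrieved chunk before filling the template. `PROMPT_SKETCH` is a shortened stand-in for the full template and `build_prompt` is a hypothetical helper.

```python
PROMPT_SKETCH = """Context:
{context}

Question: {question}

Answer:"""  # shortened stand-in for the full template

def build_prompt(template, chunks, question):
    """Number each chunk so the model can cite it as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return template.format(context=context, question=question)

prompt = build_prompt(
    PROMPT_SKETCH,
    ["Reset the device from the settings menu.", "Use the CLI for bulk resets."],
    "How do I reset a device?",
)
```

Because `str.format` only parses the template, braces inside the retrieved chunks are passed through untouched. Literal braces in the template itself would need doubling.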
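Critical detail: Always truncate context to fit within the model's context window, and leave room for the instructions and the response. For GPT-4, I limit retrieved context to 8,000 tokens. For Claude, 16,000. Those budgets assume a long-context model variant: the original GPT-4 has an 8,192-token window, so 8,000 tokens of context would leave almost nothing for the answer.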
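```python
import tiktoken

# Assumption: cl100k_base matches your target model; swap in the right encoding
tokenizer = tiktoken.get_encoding("cl100k_base")

def truncate_context(context, max_tokens=8000):
    tokens = tokenizer.encode(context)
    if len(tokens) > max_tokens:
        # Chunks are already sorted by the reranker, so the head is the most relevant
        context = tokenizer.decode(tokens[:max_tokens])
    return context
```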
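Cutting at a raw token offset can slice the last chunk in half. An alternative worth considering is to pack whole chunks until the budget runs out. The sketch below is one way to do that; `pack_context` is a hypothetical helper, and its default token counter is a crude whitespace split that you would replace with your model's tokenizer.

```python
def pack_context(chunks, max_tokens=8000, count_tokens=None, separator="\n\n"):
    """Keep whole chunks, best first, until the token budget is used up."""
    if count_tokens is None:
        # Crude stand-in: whitespace words. Replace with a real tokenizer.
        count_tokens = lambda text: len(text.split())
    kept, used = [], 0
    for chunk in chunks:  # assumed sorted by reranker score, best first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return separator.join(kept)
```

The separators are not counted against the budget here, so leave a small margin. Stopping at the first chunk that does not fit, rather than skipping it and trying smaller ones, keeps the context in strict relevance order.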
Step 6: Monitoring and Evaluation
You can’t improve what you don’t measure.
Set up these metrics from day one:
```python
# Simple evaluation pipeline.
# calculate_precision, cosine_similarity, embed, and check_hallucinations are
# placeholders for your own implementations.
def evaluate_rag(query, expected_answer, pipeline):
    result = pipeline(query)
    metrics = {
        "retrieval_precision": calculate_precision(result.retrieved_docs, expected_answer),
        "answer_relevance": cosine_similarity(
            embed(result.answer),
            embed(expected_answer),
        ),
        "latency_ms": result.latency_ms,
        "hallucination_score": check_hallucinations(
            result.answer,
            result.retrieved_docs,
        ),
    }
    return metrics
```
Target thresholds for production:
- Retrieval precision: > 0.85
- Answer relevance: > 0.80
- Latency: < 2 seconds (including reranking)
- Hallucination score: < 0.05
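Thresholds are only useful if something checks them. A minimal sketch of a gate you could run in CI or a nightly job, assuming the metric names from the evaluation function above and treating a missing metric as a failure; `failing_metrics` is a hypothetical helper.

```python
THRESHOLDS = {
    "retrieval_precision": (">", 0.85),
    "answer_relevance": (">", 0.80),
    "latency_ms": ("<", 2000),          # 2 seconds, including reranking
    "hallucination_score": ("<", 0.05),
}

def failing_metrics(metrics, thresholds=THRESHOLDS):
    """Return the names of metrics that miss their production target."""
    failed = []
    for name, (op, limit) in thresholds.items():
        value = metrics.get(name)
        ok = value is not None and (value > limit if op == ">" else value < limit)
        if not ok:
            failed.append(name)
    return failed
```

An empty list means the run passed. Anything else is a list of metric names to put in the alert.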
The Real Cost Breakdown
Here’s what running this pipeline costs for 10,000 queries/month:
| Component | Cost |
|---|---|
| Embedding model (self-hosted) | $0 (GPU: ~$50/month) |
| Vector DB (Qdrant self-hosted) | $0 (server: ~$30/month) |
| Reranker (self-hosted) | $0 (same GPU) |
| LLM API (GPT-4) | ~$200/month |
| Total | ~$280/month |
Compare that to a fully managed solution at $1,000+/month. The self-hosted approach saves you roughly 70% and gives you full control.
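The arithmetic behind those figures, as a quick sanity check you can rerun with your own prices. The variable names are illustrative, and the managed figure uses the low end of the $1,000+ estimate.

```python
monthly_costs = {"gpu": 50, "vector_db_server": 30, "llm_api": 200}  # USD, from the table

total = sum(monthly_costs.values())               # self-hosted total per month
managed = 1000                                    # low end of the managed estimate
savings_pct = (managed - total) * 100 / managed   # percent saved vs. managed
cost_per_query = total / 10_000                   # at 10,000 queries/month

print(f"total=${total}, savings={savings_pct:.0f}%, per query=${cost_per_query:.3f}")
```

Note that the LLM API is the dominant line item, so query volume and model choice move the total far more than the infrastructure does.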
Why This Matters for Your Team
Look, I get it. Building this from scratch takes time. That's why teams at ECOA AI use this exact pipeline as a template. Our developers in Can Tho and Ho Chi Minh City have deployed this for clients in fintech, healthcare, and e-commerce.
The pattern is always the same: smart chunking → good embeddings → vector search → reranking → clean LLM integration.
It’s not magic. It’s engineering.
And honestly, once you’ve built it once, you can reuse 80% of the code for any domain. The only thing that changes is the data and the evaluation criteria.
Frequently Asked Questions
Q: Should I use dense or sparse embeddings for RAG?
A: Use dense embeddings (like BGE or E5) for semantic search, but consider hybrid search with BM25 for keyword-heavy queries. In production, I’ve seen hybrid search improve recall by 10-15% on technical documentation.
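Hybrid search needs a way to merge the dense and BM25 result lists. One common option is reciprocal rank fusion (RRF), which only looks at ranks and so sidesteps the problem that the two scores live on different scales. The sketch below is an illustration, not a claim about what any particular vector DB does internally; `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first ranked lists of doc IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_hits = ["a", "b", "c"]   # from vector search
bm25_hits = ["b", "d", "a"]    # from keyword search
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Documents that appear in both lists rise to the top, which is usually what you want for queries that mix natural language with exact identifiers like error codes.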
Q: How do I handle real-time updates to the knowledge base?
A: Use a streaming ingestion pipeline. When a new document arrives, chunk it, embed it, and upsert into the vector DB. Qdrant supports point-level updates. For high-throughput systems, batch updates every 5 minutes rather than real-time.
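The batching idea can be sketched as a small buffer that flushes when it is full or when its oldest item has waited too long. `UpsertBuffer` and `flush_fn` are hypothetical names; in practice `flush_fn` would wrap your vector DB's upsert call. The clock is injectable so the behavior can be tested without waiting five minutes.

```python
import time

class UpsertBuffer:
    """Collect points and flush them in batches by size or age."""

    def __init__(self, flush_fn, max_items=500, max_age_s=300, clock=time.monotonic):
        self.flush_fn = flush_fn      # e.g. a wrapper around the vector DB upsert
        self.max_items = max_items
        self.max_age_s = max_age_s    # 300 s = the 5-minute window
        self.clock = clock
        self.items = []
        self.oldest = None            # arrival time of the oldest buffered item

    def add(self, item):
        if not self.items:
            self.oldest = self.clock()
        self.items.append(item)
        too_many = len(self.items) >= self.max_items
        too_old = self.clock() - self.oldest >= self.max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self.items:
            self.flush_fn(self.items)
            self.items = []
```

This version only checks the age when a new item arrives, so a quiet stream could leave items waiting. A real deployment would also call `flush()` from a timer and on shutdown.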
Q: What’s the minimum viable setup for a small team?
A: Start with `text-embedding-3-small` (shortened to 512 dimensions), the Qdrant Cloud free tier, and GPT-4o mini. Skip the reranker initially. Expect roughly 70% retrieval accuracy. Add the reranker when you need 90%+. Total cost: under $50/month for light usage.
Q: How do I prevent the LLM from hallucinating with retrieved context?
A: Three techniques: (1) Always include a “don’t know” instruction in your prompt, (2) Set temperature to 0 for factual queries, (3) Implement a confidence threshold—if the top retrieval score is below 0.7, return a fallback response instead of the LLM answer.
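The confidence threshold from technique (3) can be sketched as a thin gate in front of the LLM call. `answer_with_gate` and `generate` are hypothetical names: `generate` stands for whatever function calls your model (at temperature 0, per technique 2), and the scores are assumed to be similarity values where higher is better.

```python
FALLBACK = "I cannot find this information in the available documentation."

def answer_with_gate(question, hits, generate, min_score=0.7):
    """Call the LLM only when the best retrieval score clears the threshold.

    hits:     list of (chunk_text, score) pairs, best first
    generate: callable(question, chunks) -> str, e.g. an LLM call at temperature 0
    """
    if not hits or hits[0][1] < min_score:
        return FALLBACK
    return generate(question, [text for text, _ in hits])
```

The 0.7 cutoff is tied to the scoring function. Cross-encoder scores and cosine similarities are not on the same scale, so calibrate the threshold against labeled queries for whichever score you gate on.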