Stop Hallucinations: 7 Battle-Tested RAG Techniques That Actually Work in Production
Everyone loves RAG. Everyone *also* has a story about a RAG pipeline that confidently invented a CEO’s email address or cited a document that didn’t exist.
I’ve seen it. You’ve probably seen it, too.
The truth is, naive RAG is just autocomplete with Wikipedia access. It looks smart until it confidently hands a customer support agent the wrong refund policy. We’ve been running RAG systems in production for clients across fintech, healthcare, and logistics. Let me share what *actually* stops the nonsense.
Here are 7 techniques. No theory. Just code, metrics, and hard-won tradeoffs.
1. Chunking Strategy: Size Matters More Than You Think
You can’t just dump a PDF into a vector store. The chunk size determines everything downstream.
We benchmarked three strategies on a 3,200-page product manual:
| Strategy | Chunk Size | Retrieval Recall (top-5) | Context Fit (tokens) |
|---|---|---|---|
| Fixed 512 chars | 512 | 68% | 512 |
| Fixed 1024 chars | 1024 | 82% | 1024 |
| Semantic (by paragraph) | 300–1200 | 91% | ~450 avg |
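The winner is semantic chunking, because it respects natural boundaries. Here’s the code pattern we use with LangChain. The recursive splitter tries paragraph breaks first and only falls back to smaller separators when a paragraph is too long: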
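```python
# Newer LangChain releases also expose this class from `langchain_text_splitters`.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # measured by length_function, so characters here
    chunk_overlap=100,   # characters shared between neighboring chunks
    separators=["\n\n", "\n", ".", " ", ""],  # tried in order, paragraphs first
    length_function=len,
)
```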
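That overlap is critical, so don’t skip it. We’ve seen a 12% drop in recall when overlap drops below 50 tokens. Keep the units straight, though: with `length_function=len`, the splitter above counts characters, not tokens.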
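If you want to reproduce a comparison like the table above on your own data, you need a recall metric first. Here is a minimal sketch of one common definition of recall@k. The ID scheme (for example, labeling each chunk with its source page) and the function names are assumptions for illustration, not the exact harness behind our numbers.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant items that show up in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = relevant_ids.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per test question."""
    if not runs:
        return 0.0
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)
```

Run the same labeled questions against each chunking strategy and compare the averages.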
2. HyDE: Let the LLM Write the Answer First
Hypothetical Document Embeddings (HyDE) sound fancy. They’re not. You just ask the LLM: “What would the perfect answer look like?”
Then you embed that *hypothetical answer* instead of the user’s raw query. Why? Because user queries are messy. A good answer is well-structured.
The results from our internal benchmark: HyDE improved top-3 recall from 74% to 89% on a customer FAQ dataset. That’s a massive gain for one extra API call.
The tradeoff? It adds ~500ms latency and costs ~0.2 cents per query. Honestly, for any production system, that’s a steal.
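The HyDE flow is short enough to sketch in a few lines. This version assumes you pass in two callables of your own, `generate` (prompt in, text out) and `embed` (text in, vector out). Both names and the prompt wording are placeholders, not a specific provider’s API.

```python
from typing import Callable, Sequence


def hyde_embedding(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Embed a hypothetical answer to the query instead of the raw query."""
    prompt = (
        "Write a short passage that would perfectly answer the question below.\n"
        f"Question: {query}\nPassage:"
    )
    hypothetical_answer = generate(prompt)  # the one extra LLM call
    return embed(hypothetical_answer)       # this vector is what you search with
```

The returned vector goes to the vector store in place of the raw query embedding. The hypothetical answer itself is never shown to the user.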
3. Query Rewriting: Three Versions, One Truth
Here’s a dirty secret: users don’t ask good questions. They type “thing broken” or “how fix.”
We run every user query through a quick rewriting step that generates three variations — one technical, one generic, and one conversational. We retrieve documents for *all three* and deduplicate.
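```python
import json
from typing import Callable


def rewrite_query(raw_query: str, llm: Callable[[str], str]) -> list[str]:
    """Return technical, generic, and conversational rewrites of raw_query."""
    prompt = f"""Rewrite this query in three styles:
- Technical: precise, terms-heavy
- Generic: plain English
- Conversational: casual user
Original: {raw_query}
Return as JSON list."""
    # `llm` stands in for your own client call: prompt in, model text out.
    # The reply is expected to be a JSON list of three strings.
    return json.loads(llm(prompt))
```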
This single trick cut hallucination rate by 31% in our logistics client’s support system. More importantly, it caught edge cases where a user spelled a product name wrong. The conversational variant matched the misspelling, and we still got the right docs.
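The “retrieve for all three and deduplicate” step could look something like this. It is a sketch that assumes a `search` callable of your own returning `(doc_id, score)` pairs where a higher score means a closer match. If your store returns distances, flip the comparison.

```python
from typing import Callable


def retrieve_for_variants(
    variants: list[str],
    search: Callable[[str], list[tuple[str, float]]],
) -> list[tuple[str, float]]:
    """Search with every query variant and keep each document once, at its best score."""
    best: dict[str, float] = {}
    for variant in variants:
        for doc_id, score in search(variant):
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Highest-scoring documents first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Keeping the best score per document means a chunk matched only by the conversational variant still makes it into the candidate list.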
4. Reranking: The 20→5 Filter That Saves Your Answers
Vector similarity is a blunt instrument. Cosine distance doesn’t know about relevance. That’s why we always — *always* — slap a cross-encoder reranker on top.
We retrieve 20 candidates with the vector store, then rerank down to the top-5 that matter.
Here’s the hard number: without reranking, our customer’s compliance checker had 76% accuracy on regulatory queries. After adding a `cross-encoder/ms-marco-MiniLM-L-6-v2` reranker, accuracy jumped to 92%.
Yes, the reranker adds 200–400ms. Does your user care about 400ms? Or do they care about *not* getting the wrong law cited?
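Here is a rough sketch of the 20→5 step. To keep it runnable without downloading a model, the scorer is passed in as `score_pairs`, an assumed callable that takes `(query, document)` pairs and returns one relevance score per pair.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[str]:
    """Score each (query, candidate) pair and keep the top_n candidates."""
    if not candidates:
        return []
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)  # one score per pair, higher = more relevant
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```

With sentence-transformers installed, `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` accepts a list of text pairs and returns a score for each, so it should slot in as `score_pairs`.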
5. Context Window Budgeting: Don’t Fill the Prompt
“Let the LLM decide what’s useful” was our old approach. We used to cram 8,000 tokens of context into every prompt. Bad idea.
Now we use a simple budget: max 3,000 tokens of retrieved context, even if the model supports 128K. Why? Because LLMs are *distracted* by irrelevant fluff. More context is not better. Better context is better.
We log the “context utilization rate” — what percentage of provided context the LLM actually cited in its answer. When that rate drops below 40%, we flag the embedding pipeline for investigation.
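Both ideas fit in a few lines. This sketch makes two assumptions you should swap out: tokens are approximated by whitespace-separated words (use your model’s tokenizer in practice), and the answer comes back with the IDs of the chunks it cited. All names are illustrative.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Rough token estimate from whitespace words; replace with a real tokenizer."""
    return len(text.split())


def fit_to_budget(
    ranked_chunks: list[str],
    max_tokens: int = 3000,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> list[str]:
    """Keep chunks in ranked order, stopping at the first one that would exceed the budget."""
    kept: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


def context_utilization(provided_ids: list[str], cited_ids: set[str]) -> float:
    """Share of the provided chunks that the answer actually cited."""
    if not provided_ids:
        return 0.0
    cited = sum(1 for chunk_id in provided_ids if chunk_id in cited_ids)
    return cited / len(provided_ids)
```

A result below 0.4 from `context_utilization` corresponds to the 40% threshold above.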
6. Self-Correction Loops: Ask “Did I Just Lie?”
This one’s brutally simple and brutally effective.
After the LLM generates its answer, we run a second call:
“Based **only** on the retrieved documents below, does the previous answer contain any information not present? Respond YES or NO. If YES, explain.”
If the answer is YES, we either regenerate (with a strict warning) or flag the response for human review.
We’ve seen this catch ~15% of hallucinations that would have otherwise gone to users. Don’t skip it.
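As a sketch, the loop might look like this. `llm` is an assumed prompt-in, text-out callable, and the answer prompt and “STRICT” warning wording are placeholders. Only the check prompt follows the text above.

```python
from typing import Callable

CHECK_PROMPT = (
    "Based only on the retrieved documents below, does the previous answer "
    "contain any information not present? Respond YES or NO. If YES, explain.\n\n"
    "Documents:\n{documents}\n\nPrevious answer:\n{answer}"
)


def answer_with_check(
    question: str,
    documents: list[str],
    llm: Callable[[str], str],
    max_retries: int = 1,
) -> tuple[str, bool]:
    """Return (answer, needs_human_review)."""
    context = "\n\n".join(documents)
    prompt = f"Answer using only these documents.\n\n{context}\n\nQuestion: {question}"
    answer = llm(prompt)
    for attempt in range(max_retries + 1):
        verdict = llm(CHECK_PROMPT.format(documents=context, answer=answer))
        if not verdict.strip().upper().startswith("YES"):
            return answer, False  # checker found nothing unsupported
        if attempt < max_retries:
            # Regenerate with a strict warning, then check again.
            answer = llm("STRICT: use only facts stated in the documents.\n\n" + prompt)
    return answer, True  # still failing after retries: flag for review
```

The second element of the tuple tells the caller whether to route the response to a human.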
7. Chunk Metadata Filtering: Pre-Filter Before You Search
Most teams embed everything. Smart teams pre-filter.
We tag every chunk with three metadata fields:
- source_type: “pdf” | “web” | “manual” | “email”
- version: SemVer string
- confidence: “high” | “medium” | “low”
When a user asks about the current pricing, we *pre-filter* to `version >= 2.0 AND confidence == "high"`. This cuts the search space by 50–70% and eliminates stale or low-quality chunks from ever being candidates.
The vector store query becomes:
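```python
# Chroma-style query. Range operators such as $gte compare numbers, not SemVer
# strings, so this assumes each chunk also stores a numeric `version_major`
# field (a hypothetical name) next to the SemVer string.
results = collection.query(
    query_embeddings=[embedding],
    n_results=20,
    where={
        "$and": [
            {"version_major": {"$gte": 2}},
            {"confidence": {"$eq": "high"}},
        ]
    },
)
```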
Simple. Massive impact.
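One caveat on the version field: comparing SemVer strings as text is fragile, since `"10.0.0"` sorts before `"2.0.0"`. One option is to store the numeric parts next to the original string when you tag each chunk. The sketch below assumes plain `MAJOR.MINOR.PATCH` versions with an optional pre-release or build suffix. The `version_major`, `version_minor`, and `version_patch` field names are made up for illustration.

```python
def version_metadata(version: str) -> dict[str, object]:
    """Split 'MAJOR.MINOR.PATCH' into numeric fields that range filters can compare."""
    # Drop any pre-release ("-beta") or build ("+build5") suffix first.
    core = version.split("-", 1)[0].split("+", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return {
        "version": version,  # keep the original string for display
        "version_major": major,
        "version_minor": minor,
        "version_patch": patch,
    }
```

A filter for “2.0 or later” then becomes a numeric comparison on `version_major`.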
The Stack We Actually Use
We’ve tested Pinecone, Weaviate, Qdrant, and Chroma. For production at scale? Qdrant wins on latency. Weaviate wins on hybrid search (dense + sparse vectors). For small projects, Chroma is fine.
If you’re starting today and want to move fast, grab a managed Qdrant instance and slap the 7 techniques above on top. That’s a production-ready system.
On Vietnamese Developers and RAG
Here’s something that surprised me. Our team in Can Tho (ECOA AI’s second hub) built the reranking layer for a US healthcare client in three days. They’d already seen this exact failure pattern — irrelevant context dominating the answer — in two previous projects. That experience is what you get when you hire people who’ve debugged RAG pipelines before.
When we say “5x efficiency,” this is what we mean. A mid-level developer who’s already fixed 12 RAG hallucinations is far more valuable than a junior who’s reading the docs for the first time. It’s that simple.
Frequently Asked Questions
Q: Do I need a reranker if I already have good embeddings?
Yes. OpenAI’s `text-embedding-3-large` embeddings are excellent, but semantic similarity is not the same as relevance. A reranker considers query-document pairs directly. Expect a 10–20% accuracy boost regardless of your embedding model.
Q: How many chunks should I retrieve before reranking?
Retrieve 15–25, then rerank to 3–5. Fewer than 15 and you miss relevant docs. More than 25 and the reranker latency becomes painful. We use 20 → 5 as our default.
Q: Should I use HyDE for every query?
No. HyDE shines on ambiguous queries like “tell me about refunds.” For precise queries like “what’s the error code 4032?”, it adds latency for marginal gains. We toggle HyDE based on query length — if it’s under 5 words, HyDE runs.
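The length toggle is a one-liner. This sketch counts whitespace-separated words, which is an assumption about how “words” are measured. The function name is illustrative.

```python
def should_use_hyde(query: str, max_words: int = 5) -> bool:
    """Run HyDE only for short queries with fewer than max_words words."""
    return len(query.split()) < max_words
```

With the two example queries above, “tell me about refunds” (4 words) turns HyDE on and “what’s the error code 4032?” (5 words) leaves it off.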
Q: What’s the cheapest way to test these techniques?
Use LangChain + ChromaDB on a small dataset (1,000 docs). Implement chunking, rewriting, and reranking with free local models (sentence-transformers for embeddings, MiniLM for reranking). You can validate all 7 techniques for under $5 in API costs.