Stop Building Dumb Vector Search: How I Built a Semantic Agent Memory Engine in 80 Lines of Python
Vector search is eating the world. But most implementations I see on GitHub are embarrassingly naive.
You’ve probably seen it—a simple `cosine_similarity()` over a single embedding model, stuffed into a one-shot retrieval call. It works for demos. It collapses in production.
I ran a production experiment for a real estate search agent built with a team in Ho Chi Minh City. The agent kept hallucinating property details, returning condos in District 1 when the user clearly asked for “rental studios under $500 in District 7.” The root cause? The memory engine was dumb. It had no notion of hybrid retrieval.
So I rebuilt it. Here’s exactly what I built, and why it works.
The Architecture in One Paragraph
You don’t need LangChain. You don’t need a vector database cluster. What you need is three layers:
- A dense embedding layer for semantic meaning (sentence-transformers)
- A sparse keyword layer for exact matches and named entities (BM25)
- A result fusion layer that merges both scores with configurable weights
That’s it. 80 lines of production Python. Let me show you.
The Data Ingestion Pipeline
First, we chunk and embed documents. For the real estate bot, we ingested 14,000 property listings. Each listing gets split into 500-character chunks with 100-character overlap.
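The chunking step isn't shown in the post, so here is a minimal sketch of one way to do it, assuming plain fixed-size character windows. The helper name `chunk_text` is hypothetical; only the 500/100 numbers come from the text above.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # with the defaults, a new window starts every 400 chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

A character-based splitter can cut a listing mid-word. If that hurts retrieval, splitting on sentence or field boundaries is the usual next step.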
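```python
import json
import re

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer


class SemanticMemoryEngine:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model_name = model_name  # kept so the index can record which model built it
        self.encoder = SentenceTransformer(model_name)
        self.documents = []
        self.metadata = []
        self.embeddings = []
        self.bm25_corpus = []
        self.bm25 = None

    @staticmethod
    def _tokenize(text: str) -> list[str]:
        # The regex is the one mentioned in the benchmarks section;
        # lowercasing is an assumption so "Thu Duc" matches "thu duc".
        return re.findall(r"\w+", text.lower())

    def ingest(self, chunks: list[str], metadata: list[dict]):
        """Ingest document chunks with their metadata."""
        if len(chunks) != len(metadata):
            raise ValueError("chunks and metadata must have the same length")
        self.documents = chunks
        self.metadata = metadata
        # Dense embeddings, L2-normalized so a dot product equals cosine similarity
        self.embeddings = self.encoder.encode(
            chunks, show_progress_bar=True, normalize_embeddings=True
        )
        # Sparse corpus for BM25 (rank_bm25 expects pre-tokenized documents)
        self.bm25_corpus = [self._tokenize(doc) for doc in chunks]
        self.bm25 = BM25Okapi(self.bm25_corpus)
```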
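Notice `normalize_embeddings=True`. That’s not optional here. The search step scores with a plain dot product, and a dot product only equals cosine similarity when both vectors have unit length. Skip it and long vectors get inflated scores regardless of direction. Took me a whole afternoon to debug that once.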
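To see why normalization matters, here is a small NumPy-only sketch with made-up 2-D vectors (not real embeddings). The helper `l2_normalize` is hypothetical and stands in for what `normalize_embeddings=True` does inside the encoder.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against zero vectors)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


# Toy "document" vectors: rows 0 and 1 point the same way, row 1 is just longer.
docs = np.array([[3.0, 4.0], [30.0, 40.0], [10.0, 0.0]])
query = np.array([[1.0, 0.0]])

raw_dot = (docs @ query.T).flatten()                             # [3., 30., 10.]
cosine = (l2_normalize(docs) @ l2_normalize(query).T).flatten()  # [0.6, 0.6, 1.0]
```

The raw dot product ranks document 1 first purely because of its length. After normalization, document 2, the only one pointing in the query's direction, wins.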
The Hybrid Search That Actually Works
Here’s the money shot. The retrieval function that cut our hallucination rate from 18.4% to 6.5%, a relative drop of about 64%, in 2 weeks of A/B testing.
```python
    # Method of SemanticMemoryEngine
    def search(self, query: str, k: int = 5, alpha: float = 0.4):
        """
        Hybrid search with configurable dense/sparse weight.
        alpha=0.4 means 40% weight on dense, 60% on sparse.
        """
        # Dense retrieval
        query_emb = self.encoder.encode([query], normalize_embeddings=True)
        dense_scores = np.dot(self.embeddings, query_emb.T).flatten()
        # Sparse retrieval (BM25)
        tokenized_query = self._tokenize(query)
        sparse_scores = np.array(self.bm25.get_scores(tokenized_query))
        # Normalize both to [0, 1] range
        dense_scores = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min() + 1e-8)
        sparse_scores = (sparse_scores - sparse_scores.min()) / (sparse_scores.max() - sparse_scores.min() + 1e-8)
        # Fuse
        combined = (alpha * dense_scores) + ((1 - alpha) * sparse_scores)
        top_k_indices = np.argsort(combined)[::-1][:k]
        # Returned scores are the min-max normalized ones, not the raw values
        return [
            (self.documents[i], self.metadata[i], dense_scores[i], sparse_scores[i])
            for i in top_k_indices
        ]
```
Why `alpha=0.4`? Because we found that for real estate queries, exact matches on district names and budget ranges matter more than vague semantic similarity. Your mileage will vary. Tune it per domain.
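The 1e-8? Division-by-zero insurance. If every document gets the same score (for example, BM25 returns all zeros because no query token appears in the corpus), max minus min is 0. Always add it.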
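The fusion arithmetic is easier to follow on toy numbers. The sketch below repeats the same min-max and weighted-sum steps on three made-up scores. `min_max` and `fuse` are hypothetical helper names, and the score values are invented for illustration.

```python
import numpy as np


def min_max(scores: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale scores to roughly [0, 1]; eps keeps constant inputs from dividing by zero."""
    return (scores - scores.min()) / (scores.max() - scores.min() + eps)


def fuse(dense: np.ndarray, sparse: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """Weighted sum: alpha on the dense score, (1 - alpha) on the sparse score."""
    return alpha * min_max(dense) + (1 - alpha) * min_max(sparse)


dense = np.array([0.82, 0.75, 0.40])  # made-up cosine similarities
sparse = np.array([0.0, 7.5, 2.5])    # made-up BM25 scores

combined = fuse(dense, sparse)               # approx. [0.40, 0.93, 0.20]
ranking = np.argsort(combined)[::-1]         # document 1 first
no_keyword_hit = min_max(np.zeros(3))        # all zeros, no NaN
```

Document 0 has the best dense score but no keyword overlap, so at `alpha=0.4` it drops below document 1. When BM25 returns all zeros, the sparse term vanishes and the ranking falls back to the dense scores alone.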
The Problem with Pure Dense Retrieval
I want to be blunt: pure dense retrieval for production agents is risky.
Here’s a real query our old system failed on: *“apartment for rent in Thu Duc City with 2 bedrooms under 10 million VND”*
The dense-only model returned:
- A villa in District 2 (semantically “close” to “apartment”)
- A 3-bedroom condo in Binh Thanh (vector similarity liked “bedrooms” + “under”)
- A co-living space in Thu Duc (zero bedrooms, not an apartment)
None of them matched all three constraints. BM25 alone would have caught “Thu Duc” + “10 million” as exact tokens. But BM25 alone would miss the semantic intent of “apartment.”
Hybrid fusion caught all three constraints in the top-3 results.
Performance Benchmarks (Local vs. Distributed)
We tested this on a modest AWS t3.medium instance (2 vCPU, 4GB RAM) with 14,000 documents:
| Retrieval Strategy | Recall@5 | Latency (avg) | Hallucination Rate |
|---|---|---|---|
| Dense only (cosine) | 71.2% | 23ms | 18.4% |
| BM25 only | 68.9% | 12ms | 21.7% |
| Hybrid (alpha=0.4) | 89.3% | 34ms | 6.5% |
That 34ms latency? Fast enough to power a real-time chat agent. The senior developer I worked with at our Can Tho hub optimized the tokenizer in the BM25 pipeline and shaved 8ms off just by using `re.findall(r'\w+', text)` instead of the tokenizer we started with. Small wins compound.
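The post doesn't spell out how Recall@5 was computed, so treat the following as one common definition rather than the exact evaluation behind the table: the fraction of a query's relevant documents that show up in the top k results, averaged over queries. Both function names are hypothetical.

```python
def recall_at_k(retrieved: list[int], relevant: set[int], k: int = 5) -> float:
    """Share of the relevant document ids that appear in the first k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[int], set[int]]], k: int = 5) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one pair per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)
```

Each `retrieved` list would be the document indices your engine returned for one query, in rank order, and `relevant` the hand-labeled correct indices.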
Where to Index: A Surprising Lesson
You don’t always need a vector database for this pattern.
For under 50,000 documents, keep everything in memory with NumPy and snapshot it to disk as JSON. We serialize the engine once a day during the team’s Vietnam business hours, then load it into the API container. For larger datasets, we plug the same hybrid logic into Qdrant (which supports hybrid search natively). But for most agent applications, in-memory is cheaper and faster.
Here’s the serialization:
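```python
    # Methods of SemanticMemoryEngine
    def save(self, path: str):
        data = {
            "documents": self.documents,
            "metadata": self.metadata,
            "embeddings": np.asarray(self.embeddings).tolist(),
            # Assumption: the engine keeps the name it was built with in
            # self.model_name; if it doesn't, fall back to the constructor default.
            "model_name": getattr(self, "model_name", "all-MiniLM-L6-v2"),
        }
        # utf-8 + ensure_ascii=False keeps Vietnamese text readable in the file
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)

    @classmethod
    def load(cls, path: str):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        engine = cls(model_name=data["model_name"])
        engine.documents = data["documents"]
        engine.metadata = data["metadata"]
        # float32 matches the encoder output and halves memory vs. the float64 default
        engine.embeddings = np.array(data["embeddings"], dtype=np.float32)
        # BM25 is rebuilt from the documents instead of being serialized
        engine.bm25_corpus = [engine._tokenize(doc) for doc in engine.documents]
        engine.bm25 = BM25Okapi(engine.bm25_corpus)
        return engine
```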
One gotcha: never pickle BM25 objects across Python versions. We lost a production deployment once because the dev environment ran Python 3.11 and the server ran 3.10. Serialize only the documents, metadata, and embeddings as JSON, and rebuild BM25 on load. Trust me on this.
Why This Pattern Matters for AI Agents
Here’s the thing: AI agents are only as good as their memory. If your retrieval layer returns garbage, the LLM generates garbage. This isn’t a prompting problem—it’s an information architecture problem.
The hybrid approach gives you:
- Precision from keyword matching (for entities, dates, prices)
- Recall from semantic search (for intents, paraphrases, synonyms)
- Controllability via the alpha parameter
We’ve since reused this exact engine for three other client projects at ECOAAI: a legal document search bot, a technical support agent for a SaaS platform, and a product catalog assistant for an e-commerce client. Every time, hybrid search outperformed pure dense by at least 15%.
What You Should Tune First
Don’t tweak everything at once. Here’s the order I recommend:
- Chunk size: Start at 500 chars. Adjust based on your documents’ average paragraph length.
- alpha parameter: Run offline evaluations with 20 seed queries. Find the sweet spot.
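- Embedding model: `all-MiniLM-L6-v2` is great for speed. If you need higher accuracy, try `intfloat/e5-small-v2`, but expect 3x latency. Note that the e5 models expect a `query: ` or `passage: ` prefix on every input text, so both the ingest and search calls need it.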
- Re-ranking: For critical queries, add a cross-encoder re-ranker (like `cross-encoder/ms-marco-MiniLM-L-6-v2`) on the top-10 results. That’s another 50ms but boosts recall by 6-8%.
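For the alpha step, an offline sweep can be sketched like this. It assumes you have already computed raw dense and BM25 scores for each seed query against every document and have hand-labeled relevant document ids per query. `sweep_alpha` and `min_max_rows` are hypothetical names, and the metric is recall@k, which may not be the one you care about most.

```python
import numpy as np


def min_max_rows(scores: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Min-max normalize each query's row of scores independently."""
    lo = scores.min(axis=1, keepdims=True)
    hi = scores.max(axis=1, keepdims=True)
    return (scores - lo) / (hi - lo + eps)


def sweep_alpha(dense, sparse, relevant, k=5, alphas=np.linspace(0.0, 1.0, 11)):
    """
    dense, sparse: (n_queries, n_docs) raw score matrices.
    relevant: one non-empty set of relevant doc ids per query.
    Returns a list of (alpha, mean recall@k) pairs.
    """
    d, s = min_max_rows(np.asarray(dense, dtype=float)), min_max_rows(np.asarray(sparse, dtype=float))
    results = []
    for alpha in alphas:
        combined = alpha * d + (1 - alpha) * s
        top_k = np.argsort(combined, axis=1)[:, ::-1][:, :k]
        recalls = [
            len(set(row.tolist()) & rel) / len(rel)
            for row, rel in zip(top_k, relevant)
        ]
        results.append((float(alpha), float(np.mean(recalls))))
    return results
```

Picking the winner is then `max(results, key=lambda r: r[1])`. With only 20 seed queries the differences between neighboring alpha values are noisy, so prefer a value in the middle of a flat region over a single sharp peak.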
---
Frequently Asked Questions
Q: Why not just use a vector database like Pinecone?
Pinecone and Qdrant are great at scale—above 100K documents. For small-to-medium datasets, this in-memory hybrid approach is simpler, cheaper, and easier to debug. Plus, you own your data entirely. No API calls, no vendor lock-in.
Q: How do I handle real-time document updates?
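Don’t re-embed the whole corpus. Maintain a “staging” list of new documents, embed only those, and append the new vectors to the existing matrix. The BM25 index does have to be rebuilt from the full tokenized corpus, because `rank_bm25` computes its corpus statistics when the object is constructed, but that is cheap next to embedding. Set a nightly cron job to persist and reload the full engine. The 34ms query still holds while writes queue up.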
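A rough sketch of the staging idea, kept separate from the engine so it runs without a model. `StagedIndex` and `encode_fn` are hypothetical; in the real engine `encode_fn` would wrap `encoder.encode(texts, normalize_embeddings=True)`.

```python
import numpy as np


class StagedIndex:
    """Buffers new chunks and embeds only the buffered ones on merge."""

    def __init__(self, encode_fn, dim: int):
        self.encode_fn = encode_fn  # callable: list[str] -> array of shape (n, dim)
        self.documents: list[str] = []
        self.embeddings = np.empty((0, dim), dtype=np.float32)
        self.staged: list[str] = []

    def stage(self, chunk: str) -> None:
        """Queue a chunk; queries keep using the current index until merge()."""
        self.staged.append(chunk)

    def merge(self) -> int:
        """Embed staged chunks, append them to the index, and return how many were added."""
        if not self.staged:
            return 0
        new_embeddings = np.asarray(self.encode_fn(self.staged), dtype=np.float32)
        self.embeddings = np.vstack([self.embeddings, new_embeddings])
        self.documents.extend(self.staged)
        added = len(self.staged)
        self.staged = []
        # The BM25 index would be rebuilt here from the full tokenized corpus.
        return added
```

Building the new matrix first and assigning it in one step means a concurrent reader sees either the old index or the new one, though a real service would still want a lock or a swap of the whole engine object.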
Q: Can I use this with an agent orchestration platform like ECOA AI Platform ACP?
Yes. We built a custom tool wrapper around this engine. The agent calls a single `memory_search(query)` tool. The ECOA orchestrator handles the routing, context injection, and error recovery. The engine itself stays stateless and fast.
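The wrapper itself isn't shown here, and this is not the ECOA API. As a platform-neutral sketch, a tool function only needs to call `search` and return something JSON-serializable. `make_memory_search` is a hypothetical factory name, and the result keys are illustrative.

```python
def make_memory_search(engine, k: int = 5):
    """Wrap any object with a search(query, k=...) method as a single-argument tool."""

    def memory_search(query: str) -> list[dict]:
        results = engine.search(query, k=k)
        return [
            {
                "text": text,
                "metadata": meta,
                # NumPy scalars are not JSON-serializable, so cast to plain floats
                "dense_score": float(dense),
                "sparse_score": float(sparse),
            }
            for text, meta, dense, sparse in results
        ]

    return memory_search
```

Most orchestration frameworks can register a plain callable like `memory_search` as a tool. How that registration looks depends on the platform.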
Q: What if my queries are mostly exact matches (like product SKUs)?
Drop the alpha to 0.1 or even 0.0. At 0.0 the ranking is pure BM25 and the dense scores are ignored; at 0.1 BM25 dominates and the dense score mostly breaks near-ties. Keep the dense layer available for fuzzy or conversational queries. We’ve seen this pattern work well for inventory search bots where users type “the blue one from last week” instead of a SKU number.
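One way to get both behaviors is to pick alpha per query. The sketch below assumes SKUs look like `AB-12345`; that pattern and the `choose_alpha` helper are made up for illustration, so swap in your own catalog's format.

```python
import re

# Hypothetical SKU format: 2-4 uppercase letters, a dash, 3-6 digits (e.g. "AB-12345")
SKU_PATTERN = re.compile(r"\b[A-Z]{2,4}-\d{3,6}\b")


def choose_alpha(query: str, default: float = 0.4) -> float:
    """Use pure BM25 when the query contains a SKU-like token, else the default blend."""
    return 0.0 if SKU_PATTERN.search(query) else default
```

The result would be passed straight through, for example `engine.search(query, alpha=choose_alpha(query))`.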