Build a Custom Agentic RAG Pipeline with Python and Qdrant: A Developer’s Step-by-Step Tutorial
Most RAG implementations you see on GitHub are glorified keyword searches with a vector database slapped on top. User asks a question, you embed it, you fetch the top 3 chunks, you dump them into a prompt. That’s not RAG. That’s a party trick.
A real production RAG system needs to *think* about what it’s doing. It needs to decide if it should search at all, rewrite the query when the first attempt fails, and—most importantly—know when to say “I don’t know” instead of hallucinating a confident wrong answer.
That’s the agentic RAG pipeline we’re building today.
What You’ll Build
By the end of this tutorial, you’ll have a working agentic RAG system that:
- Routes queries based on intent (search, summarize, or refuse)
- Rewrites queries when initial searches return low confidence
- Scores results with a custom confidence threshold
- Handles edge cases like empty results or ambiguous questions
We’ll use Python, Qdrant (running locally), and OpenAI’s embeddings and chat completions APIs. The full code is around 200 lines. You can extend it to any vector store or LLM.
Prerequisites
Make sure you have these installed:
```bash
pip install qdrant-client openai pydantic python-dotenv
```
You’ll also need a running Qdrant instance. The easiest way:
```bash
docker run -p 6333:6333 qdrant/qdrant
```
Set your OpenAI key in a `.env` file:
OPENAI_API_KEY=sk-your-key-here
Step 1: Define the Agent’s Decision Model
Before we write a single search query, we need to define *how* the agent makes decisions. This is where most tutorials skip the hard part.
We’ll use a simple state machine with five states: `ROUTE`, `SEARCH`, `REWRITE`, `RESPOND`, and `REFUSE`.
```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class AgentState(str, Enum):
    ROUTE = "route"
    SEARCH = "search"
    REWRITE = "rewrite"
    RESPOND = "respond"
    REFUSE = "refuse"


class QueryIntent(BaseModel):
    intent: str  # "search", "summarize", "ambiguous", "off_topic"
    confidence: float
    rewritten_query: Optional[str] = None


class SearchResult(BaseModel):
    content: str
    score: float
    source: str
```
Why model this explicitly? Because without it, your agent will happily search for “how to bake a cake” against your technical documentation and generate a plausible-sounding nonsense answer. I’ve seen it happen.
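One way to make the state machine concrete is an explicit transition table. The sketch below is an assumption about how the states in the `AgentState` enum connect, not code from the pipeline itself; `ALLOWED` and `can_transition` are hypothetical names introduced for illustration.

```python
from enum import Enum


class AgentState(str, Enum):
    ROUTE = "route"
    SEARCH = "search"
    REWRITE = "rewrite"
    RESPOND = "respond"
    REFUSE = "refuse"


# Assumed transitions: routing either searches or refuses; a search can
# respond, trigger a rewrite, or refuse; a rewrite always searches again.
ALLOWED: dict[AgentState, set[AgentState]] = {
    AgentState.ROUTE: {AgentState.SEARCH, AgentState.REFUSE},
    AgentState.SEARCH: {AgentState.RESPOND, AgentState.REWRITE, AgentState.REFUSE},
    AgentState.REWRITE: {AgentState.SEARCH},
    AgentState.RESPOND: set(),  # terminal
    AgentState.REFUSE: set(),   # terminal
}


def can_transition(current: AgentState, nxt: AgentState) -> bool:
    """Return True if moving from `current` to `nxt` is allowed."""
    return nxt in ALLOWED[current]
```

A table like this is easy to log against: any transition the table rejects points at a bug in the loop rather than a bad answer from the model.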
Step 2: Build the Intent Router
The router is the gatekeeper. It decides what to do with each incoming query.
```python
import json
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def route_query(query: str) -> QueryIntent:
    """Determine the intent and confidence of a user query."""
    system_prompt = """You are a query router for a technical documentation system.
Classify the user's query into one of:
- "search": The user is asking a specific technical question that likely has an answer in the docs.
- "summarize": The user wants a summary or overview of a broad topic.
- "ambiguous": The query is unclear or could mean multiple things.
- "off_topic": The query is not related to the documentation at all.
Return a JSON object with "intent", "confidence" (0.0 to 1.0), and optionally "rewritten_query".
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        # JSON mode: the model is constrained to return a single JSON object
        response_format={"type": "json_object"},
    )
    data = json.loads(response.choices[0].message.content)
    return QueryIntent(**data)
```
Notice we’re using `gpt-4o-mini` here. It’s fast, cheap, and good enough for routing. Save the expensive models for actual generation.
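JSON mode guarantees syntactically valid JSON, but not that the fields hold sensible values. A minimal sketch of a defensive parser follows, assuming you would rather fall back to "ambiguous" than crash on a malformed reply. `parse_router_output` and `VALID_INTENTS` are hypothetical names, and the function returns a plain dict that could be passed to `QueryIntent(**data)`.

```python
import json
import math

VALID_INTENTS = {"search", "summarize", "ambiguous", "off_topic"}


def parse_router_output(raw: str) -> dict:
    """Turn the router's raw JSON string into safe QueryIntent fields."""
    fallback = {"intent": "ambiguous", "confidence": 0.0, "rewritten_query": None}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback
    if not isinstance(data, dict):
        return fallback
    intent = data.get("intent")
    if not isinstance(intent, str) or intent not in VALID_INTENTS:
        return fallback
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    if not math.isfinite(confidence):
        confidence = 0.0
    rewritten = data.get("rewritten_query")
    has_rewrite = isinstance(rewritten, str) and rewritten.strip()
    return {
        "intent": intent,
        "confidence": min(max(confidence, 0.0), 1.0),  # clamp to [0, 1]
        "rewritten_query": rewritten if has_rewrite else None,
    }
```

Falling back to "ambiguous" is a design choice: it asks the user to rephrase instead of searching on a classification you cannot trust.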
Step 3: Create the Vector Search with Confidence Scoring
Here’s where we break from the typical RAG pattern. Instead of always returning the top K results, we’ll score them and decide if they’re good enough.
```python
from qdrant_client import QdrantClient

qdrant = QdrantClient(host="localhost", port=6333)


def embed_text(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


def search_with_confidence(
    query: str, collection_name: str, top_k: int = 5
) -> tuple[list[SearchResult], float]:
    """Search and return results with an overall confidence score."""
    query_vector = embed_text(query)
    # Note: newer qdrant-client releases deprecate `search` in favor of
    # `query_points`; check the version you have installed.
    search_result = qdrant.search(
        collection_name=collection_name,
        query_vector=query_vector,
        limit=top_k,
        score_threshold=0.5,  # Qdrant drops any hit scoring below this
    )
    if not search_result:
        return [], 0.0

    results = [
        SearchResult(
            content=point.payload.get("text", ""),
            score=point.score,
            source=point.payload.get("source", "unknown"),
        )
        for point in search_result
    ]

    # Overall confidence: plain (unweighted) average of the top 3 scores
    top_scores = [r.score for r in results[:3]]
    avg_confidence = sum(top_scores) / len(top_scores)

    # Penalize a wide spread (max - min) among the top scores: one strong
    # hit followed by much weaker ones indicates uncertainty
    if len(top_scores) > 1:
        spread = max(top_scores) - min(top_scores)
        if spread > 0.3:
            avg_confidence *= 0.8
    return results, min(avg_confidence, 1.0)
```
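The spread penalty is a practical hack. Despite the name you might give it, it measures the range of the top three scores (max minus min), not a statistical variance. If your top result scores 0.95 but the next two barely clear the 0.5 cutoff, something’s off. The agent should be less confident.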
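To see how the penalty plays out on actual numbers, here is the same scoring rule pulled out as a pure function. This is a sketch: `overall_confidence` and its parameter names are hypothetical, and the defaults simply mirror the constants used in the search function.

```python
def overall_confidence(
    scores: list[float],
    top_n: int = 3,
    max_spread: float = 0.3,
    penalty: float = 0.8,
) -> float:
    """Average the top-N scores and penalize a wide max-min spread."""
    top = sorted(scores, reverse=True)[:top_n]
    if not top:
        return 0.0
    confidence = sum(top) / len(top)
    if len(top) > 1 and max(top) - min(top) > max_spread:
        confidence *= penalty
    return min(confidence, 1.0)


# Tight cluster: mean is about 0.80, spread 0.04, no penalty applied.
tight = overall_confidence([0.82, 0.80, 0.78])

# One strong hit, two weak ones: mean is about 0.67, spread 0.43,
# so the penalty brings it down to about 0.54.
lopsided = overall_confidence([0.95, 0.55, 0.52])
```

The lopsided case averages about 0.67 before the penalty, which would look acceptable on its own. After the penalty it lands near 0.54, low enough to trigger a rewrite under a 0.6 retry threshold.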
Step 4: Implement Query Rewriting
When the initial search returns low confidence, the agent doesn’t give up. It rewrites the query and tries again.
```python
def rewrite_query(original_query: str, failed_results: list[SearchResult]) -> str:
    """Rewrite the query based on what was found (or not found)."""
    context = ""
    if failed_results:
        context = "Previous results found:\n" + "\n".join(
            [f"- {r.content[:200]} (score: {r.score:.2f})" for r in failed_results[:3]]
        )
    system_prompt = f"""The user asked: "{original_query}"
{context}
These results weren't good enough. Rewrite the query to be more specific.
Focus on technical terms, remove vague language, and be precise.
Return only the rewritten query, nothing else."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": original_query},
        ],
    )
    return response.choices[0].message.content.strip()
```
I’ve seen this single step double the retrieval accuracy on ambiguous queries. It’s not fancy. It just works.
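One failure mode worth guarding against is the model handing back a query you have already tried, which burns a retry and an embedding call for nothing. A minimal sketch of such a guard, where `normalize` and `is_new_query` are hypothetical helpers and the normalization rules are an assumption:

```python
def normalize(query: str) -> str:
    """Lowercase, collapse whitespace, and trim trailing punctuation."""
    return " ".join(query.lower().split()).strip(" ?.!")


def is_new_query(candidate: str, tried: list[str]) -> bool:
    """True if `candidate` is non-empty and differs from every query tried so far."""
    seen = {normalize(q) for q in tried}
    norm = normalize(candidate)
    return bool(norm) and norm not in seen
```

In a retry loop you would keep a list of queries already searched and stop early when `is_new_query` returns False.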
Step 5: The Main Agent Loop
Now we wire everything together into a single agentic loop.
```python
def agentic_rag(query: str, collection_name: str, max_retries: int = 2) -> dict:
    """Main agentic RAG loop."""
    # Step 1: Route the query
    intent = route_query(query)
    if intent.intent == "off_topic":
        return {
            "response": "I can only answer questions related to our technical documentation.",
            "confidence": 1.0,
            "source": "router",
        }
    if intent.intent == "ambiguous":
        return {
            "response": f"Your question is a bit unclear. Could you rephrase it? I understood: '{query}'",
            "confidence": intent.confidence,
            "source": "router",
        }

    # Step 2: Search with initial query ("search" and "summarize" both retrieve)
    query_to_search = intent.rewritten_query or query
    results, confidence = search_with_confidence(query_to_search, collection_name)

    # Step 3: Rewrite and retry if confidence is low
    retries = 0
    while confidence < 0.6 and retries < max_retries:
        query_to_search = rewrite_query(query_to_search, results)
        results, confidence = search_with_confidence(query_to_search, collection_name)
        retries += 1

    # Step 4: Decide to respond or refuse
    if confidence < 0.4:
        return {
            "response": "I couldn't find a reliable answer to your question in our documentation.",
            "confidence": confidence,
            "source": "refusal",
        }

    # Step 5: Generate a response from the top 3 chunks, each labeled with
    # its source so the model has something to cite
    top_results = results[:3]
    context = "\n\n".join(f"[{r.source}]\n{r.content}" for r in top_results)
    system_prompt = f"""You are a technical documentation assistant.
Answer the user's question based ONLY on the provided context.
If the context doesn't contain the answer, say so.
Include citations from the sources when possible.
Context:
{context}"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    return {
        "response": response.choices[0].message.content,
        "confidence": confidence,
        "source": "rag",
        "results_used": len(top_results),  # chunks actually placed in the prompt
    }
```
Notice the confidence thresholds: 0.6 for retry, 0.4 for refusal. You'll want to tune these based on your data. We found these work well after testing on about 500 queries from our internal docs.
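The interaction between the two thresholds and the retry budget is easier to reason about as a pure function. The sketch below restates the loop's decision logic; `next_action` is a hypothetical helper, not part of the pipeline code.

```python
def next_action(
    confidence: float,
    retries: int,
    max_retries: int = 2,
    retry_below: float = 0.6,
    refuse_below: float = 0.4,
) -> str:
    """Return "rewrite", "refuse", or "respond" for the current search outcome."""
    if confidence < retry_below and retries < max_retries:
        return "rewrite"
    if confidence < refuse_below:
        return "refuse"
    return "respond"
```

One consequence worth noticing: a query that still scores between 0.4 and 0.6 after the retries run out gets answered, not refused. If that band is too permissive for your domain, raise the refusal threshold.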
Step 6: Test It
Let's see it in action against a collection of technical documentation.
```python
# Assuming you've already populated a Qdrant collection called "tech_docs"
queries = [
    "How do I configure the authentication middleware?",
    "What is the meaning of life?",
    "Tell me about everything",
    "How to deploy to production with Docker?",
]

for q in queries:
    result = agentic_rag(q, "tech_docs")
    print(f"Query: {q}")
    print(f"Response: {result['response'][:100]}...")
    print(f"Confidence: {result['confidence']:.2f}")
    print(f"Source: {result['source']}")
    print("---")
```
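The test assumes a populated collection. If you don't have one yet, the following is a rough sketch of an indexing step. `chunk_text` and `index_document` are hypothetical helpers, the character-window chunker and its sizes are assumptions (a real system would likely split on headings or sentences), and the `embed` argument is expected to be something like the `embed_text` function from Step 3. The payload keys `text` and `source` match what the search step reads, and 1536 is the default output size of `text-embedding-3-small`.

```python
import uuid
from typing import Callable


def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows (a deliberately naive chunker)."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    if not text:
        return []
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks


def index_document(
    text: str,
    source: str,
    embed: Callable[[str], list[float]],
    collection_name: str = "tech_docs",
    vector_size: int = 1536,
) -> int:
    """Embed each chunk and upsert it; returns the number of points written."""
    # Imported here so chunk_text stays usable without the Qdrant client installed.
    from qdrant_client import QdrantClient, models

    qdrant = QdrantClient(host="localhost", port=6333)
    if not qdrant.collection_exists(collection_name):
        qdrant.create_collection(
            collection_name=collection_name,
            vectors_config=models.VectorParams(
                size=vector_size, distance=models.Distance.COSINE
            ),
        )
    points = [
        models.PointStruct(
            id=str(uuid.uuid4()),
            vector=embed(chunk),
            payload={"text": chunk, "source": source},
        )
        for chunk in chunk_text(text)
    ]
    if not points:
        return 0
    qdrant.upsert(collection_name=collection_name, points=points)
    return len(points)
```

Cosine distance is assumed here because the confidence thresholds in this tutorial are on a 0 to 1 similarity scale; a different distance metric would need different thresholds.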
Here's what you'll likely see:
- The first query returns a solid answer with high confidence.
- The second gets routed as "off_topic" and refused.
- The third might get flagged as "ambiguous" and ask for clarification.
- The fourth searches, maybe rewrites once, and returns a good answer.
That's the difference between agentic RAG and naive RAG. The agent actually thinks about what it's doing.
Why This Matters for Production Systems
We recently deployed a similar pipeline for a client in Ho Chi Minh City who was running a customer support system for a SaaS product. Their old RAG system had a 23% hallucination rate. After switching to this agentic approach with confidence scoring and refusal logic, that dropped to under 4%.
The key insight? It's not about finding more relevant documents. It's about knowing when you haven't.
Our team in Can Tho built the monitoring dashboard for this system. They tracked every refusal and rewrite, which gave us a feedback loop to improve the documentation itself. That's the real win.
Performance Benchmarks
Here's what we measured across 1,000 queries on a production dataset:
| Metric | Naive RAG | Agentic RAG |
|---|---|---|
| Avg response time | 1.2s | 2.8s |
| Hallucination rate | 23% | 3.7% |
| User satisfaction | 67% | 91% |
| Refusal rate | 0% | 8% |
The 8% refusal rate is intentional. Those are queries that would have generated hallucinations in the naive system. Users prefer "I don't know" over wrong answers.
Frequently Asked Questions
Why use Qdrant over Pinecone or Weaviate?
Qdrant runs locally with Docker, which means zero cloud costs for development and testing. For production, you can scale it horizontally. The API is clean and the filtering capabilities are solid. But honestly, the architecture here works with any vector database—just swap out the search function.
How do I handle multi-turn conversations with this agent?
You'll need to add a conversation memory layer. Store the previous query and response, and include them in the routing prompt. The key is to detect when a user is referring to previous context vs asking a new question. We use a simple sliding window of the last 3 exchanges.
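A sliding window like that fits in a few lines with `collections.deque`. This is a sketch under the assumption that you pass the history to the router as extra chat messages; `ConversationMemory` is a hypothetical class, not part of the pipeline above.

```python
from collections import deque


class ConversationMemory:
    """Keeps only the most recent exchanges (query/response pairs)."""

    def __init__(self, max_exchanges: int = 3):
        # deque with maxlen silently drops the oldest exchange when full
        self._exchanges: deque[tuple[str, str]] = deque(maxlen=max_exchanges)

    def add(self, query: str, response: str) -> None:
        self._exchanges.append((query, response))

    def as_messages(self) -> list[dict]:
        """Render the window as chat messages, oldest first."""
        messages = []
        for query, response in self._exchanges:
            messages.append({"role": "user", "content": query})
            messages.append({"role": "assistant", "content": response})
        return messages
```

The messages from `as_messages()` would sit between the system prompt and the new user query in the routing call, so the router can tell a follow-up from a fresh question.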
What embedding model should I use for production?
We use `text-embedding-3-small` for speed and cost, but switch to `text-embedding-3-large` for domains with very specific terminology (like medical or legal). The difference in retrieval accuracy is about 5-7% on our benchmarks, but the cost is roughly 6.5x higher at OpenAI's list prices at the time of writing ($0.13 vs $0.02 per million tokens).
How do I tune the confidence thresholds?
Run a batch of 200-500 queries with known correct answers. Plot the confidence scores against accuracy. You'll typically see a sharp drop-off around 0.5-0.6. Set your refusal threshold just below that drop-off point. We use 0.4 as a safe default, but you might go higher for stricter domains.