Build a Custom Agentic RAG Pipeline with Python and Qdrant: A Developer’s Step-by-Step Tutorial

1 comment
(Developer Tutorials) - Stop treating RAG like a simple vector lookup. This tutorial shows you how to build an agentic RAG pipeline that decides when to search, how to rewrite queries, and when to admit it doesn't know.

Build a Custom Agentic RAG Pipeline with Python and Qdrant: A Developer’s Step-by-Step Tutorial

Most RAG implementations you see on GitHub are glorified keyword searches with a vector database slapped on top. User asks a question, you embed it, you fetch the top 3 chunks, you dump them into a prompt. That’s not RAG. That’s a party trick.

A real production RAG system needs to *think* about what it’s doing. It needs to decide if it should search at all, rewrite the query when the first attempt fails, and—most importantly—know when to say “I don’t know” instead of hallucinating a confident wrong answer.

Cursor vs Windsurf vs Claude Code: Which AI IDE is Best for Your Team?

Cursor vs Windsurf vs Claude Code: Which AI IDE is Best for Your Team?

We tested the three most popular AI-powered IDEs with our development team over 3 months. Here’s what we… ...

That’s the agentic RAG pipeline we’re building today.

What You’ll Build

By the end of this tutorial, you’ll have a working agentic RAG system that:

Outsourcing Software: The Real Playbook for Building Distributed Engineering Teams

Outsourcing Software: The Real Playbook for Building Distributed Engineering Teams

TL;DR: Outsourcing software isn’t about cheap labor. The real playbook involves choosing the right hub (Vietnam leads in… ...

  • Routes queries based on intent (search, summarize, or refuse)
  • Rewrites queries when initial searches return low confidence
  • Scores results with a custom confidence threshold
  • Handles edge cases like empty results or ambiguous questions

We’ll use Python, Qdrant (running locally), and OpenAI’s embeddings API. The full code is around 200 lines. You can extend it to any vector store or LLM.

Prerequisites

Make sure you have these installed:

bash
pip install qdrant-client openai pydantic python-dotenv

You’ll also need a running Qdrant instance. The easiest way:

bash
docker run -p 6333:6333 qdrant/qdrant

Set your OpenAI key in a `.env` file:


OPENAI_API_KEY=sk-your-key-here

Step 1: Define the Agent’s Decision Model

Before we write a single search query, we need to define *how* the agent makes decisions. This is where most tutorials skip the hard part.

We’ll use a simple state machine with three states: `ROUTE`, `SEARCH`, and `RESPOND`.

python
from pydantic import BaseModel
from enum import Enum
from typing import Optional

class AgentState(str, Enum):
    ROUTE = "route"
    SEARCH = "search"
    REWRITE = "rewrite"
    RESPOND = "respond"
    REFUSE = "refuse"

class QueryIntent(BaseModel):
    intent: str  # "search", "summarize", "ambiguous", "off_topic"
    confidence: float
    rewritten_query: Optional[str] = None

class SearchResult(BaseModel):
    content: str
    score: float
    source: str

Why model this explicitly? Because without it, your agent will happily search for “how to bake a cake” against your technical documentation and generate a plausible-sounding nonsense answer. I’ve seen it happen.

Step 2: Build the Intent Router

The router is the gatekeeper. It decides what to do with each incoming query.

python
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def route_query(query: str) -> QueryIntent:
    """Determine the intent and confidence of a user query."""
    
    system_prompt = """You are a query router for a technical documentation system.
    Classify the user's query into one of:
    - "search": The user is asking a specific technical question that likely has an answer in the docs.
    - "summarize": The user wants a summary or overview of a broad topic.
    - "ambiguous": The query is unclear or could mean multiple things.
    - "off_topic": The query is not related to the documentation at all.
    
    Return a JSON object with "intent", "confidence" (0.0 to 1.0), and optionally "rewritten_query".
    """
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ],
        response_format={"type": "json_object"}
    )
    
    result = response.choices[0].message.content
    import json
    data = json.loads(result)
    
    return QueryIntent(**data)

Notice we’re using `gpt-4o-mini` here. It’s fast, cheap, and good enough for routing. Save the expensive models for actual generation.

Step 3: Create the Vector Search with Confidence Scoring

Here’s where we break from the typical RAG pattern. Instead of always returning the top K results, we’ll score them and decide if they’re good enough.

python
from qdrant_client import QdrantClient
from qdrant_client.http import models

qdrant = QdrantClient(host="localhost", port=6333)

def embed_text(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def search_with_confidence(query: str, collection_name: str, top_k: int = 5, confidence_threshold: float = 0.7) -> tuple[list[SearchResult], float]:
    """Search and return results with an overall confidence score."""
    
    query_vector = embed_text(query)
    
    search_result = qdrant.search(
        collection_name=collection_name,
        query_vector=query_vector,
        limit=top_k,
        score_threshold=0.5  # Minimum score for Qdrant
    )
    
    if not search_result:
        return [], 0.0
    
    results = []
    for point in search_result:
        results.append(SearchResult(
            content=point.payload.get("text", ""),
            score=point.score,
            source=point.payload.get("source", "unknown")
        ))
    
    # Calculate overall confidence: average of top 3 scores, weighted
    top_scores = [r.score for r in results[:3]]
    if not top_scores:
        return results, 0.0
    
    avg_confidence = sum(top_scores) / len(top_scores)
    
    # Penalize if there's too much score variance (indicates uncertainty)
    if len(top_scores) > 1:
        variance = max(top_scores) - min(top_scores)
        if variance > 0.3:
            avg_confidence *= 0.8
    
    return results, min(avg_confidence, 1.0)

The variance penalty is a practical hack. If your top result scores 0.95 but the next three are all below 0.5, something’s off. The agent should be less confident.

Step 4: Implement Query Rewriting

When the initial search returns low confidence, the agent doesn’t give up. It rewrites the query and tries again.

python
def rewrite_query(original_query: str, failed_results: list[SearchResult]) -> str:
    """Rewrite the query based on what was found (or not found)."""
    
    context = ""
    if failed_results:
        context = "Previous results found:\n" + "\n".join(
            [f"- {r.content[:200]} (score: {r.score:.2f})" for r in failed_results[:3]]
        )
    
    system_prompt = f"""The user asked: "{original_query}"
    
    {context}
    
    These results weren't good enough. Rewrite the query to be more specific.
    Focus on technical terms, remove vague language, and be precise.
    Return only the rewritten query, nothing else."""
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": original_query}
        ]
    )
    
    return response.choices[0].message.content.strip()

I’ve seen this single step double the retrieval accuracy on ambiguous queries. It’s not fancy. It just works.

Step 5: The Main Agent Loop

Now we wire everything together into a single agentic loop.

python
def agentic_rag(query: str, collection_name: str, max_retries: int = 2) -> dict:
    """Main agentic RAG loop."""
    
    # Step 1: Route the query
    intent = route_query(query)
    
    if intent.intent == "off_topic":
        return {
            "response": "I can only answer questions related to our technical documentation.",
            "confidence": 1.0,
            "source": "router"
        }
    
    if intent.intent == "ambiguous":
        return {
            "response": f"Your question is a bit unclear. Could you rephrase it? I understood: '{query}'",
            "confidence": intent.confidence,
            "source": "router"
        }
    
    # Step 2: Search with initial query
    query_to_search = intent.rewritten_query or query
    results, confidence = search_with_confidence(query_to_search, collection_name)
    
    # Step 3: Rewrite and retry if confidence is low
    retries = 0
    while confidence < 0.6 and retries < max_retries:
        rewritten = rewrite_query(query_to_search, results)
        query_to_search = rewritten
        results, confidence = search_with_confidence(query_to_search, collection_name)
        retries += 1
    
    # Step 4: Decide to respond or refuse
    if confidence < 0.4:
        return {
            "response": "I couldn't find a reliable answer to your question in our documentation.",
            "confidence": confidence,
            "source": "refusal"
        }
    
    # Step 5: Generate response
    context = "\n\n".join([r.content for r in results[:3]])
    
    system_prompt = f"""You are a technical documentation assistant.
    Answer the user's question based ONLY on the provided context.
    If the context doesn't contain the answer, say so.
    Include citations from the sources when possible.
    
    Context:
    {context}"""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ]
    )
    
    return {
        "response": response.choices[0].message.content,
        "confidence": confidence,
        "source": "rag",
        "results_used": len(results)
    }

Notice the confidence thresholds: 0.6 for retry, 0.4 for refusal. You'll want to tune these based on your data. We found these work well after testing on about 500 queries from our internal docs.

Step 6: Test It

Let's see it in action against a collection of technical documentation.

python
# Assuming you've already populated a Qdrant collection called "tech_docs"
queries = [
    "How do I configure the authentication middleware?",
    "What is the meaning of life?",
    "Tell me about everything",
    "How to deploy to production with Docker?"
]

for q in queries:
    result = agentic_rag(q, "tech_docs")
    print(f"Query: {q}")
    print(f"Response: {result['response'][:100]}...")
    print(f"Confidence: {result['confidence']:.2f}")
    print(f"Source: {result['source']}")
    print("---")

Here's what you'll likely see:

  • The first query returns a solid answer with high confidence.
  • The second gets routed as "off_topic" and refused.
  • The third might get flagged as "ambiguous" and ask for clarification.
  • The fourth searches, maybe rewrites once, and returns a good answer.

That's the difference between agentic RAG and naive RAG. The agent actually thinks about what it's doing.

Why This Matters for Production Systems

We recently deployed a similar pipeline for a client in Ho Chi Minh City who was running a customer support system for a SaaS product. Their old RAG system had a 23% hallucination rate. After switching to this agentic approach with confidence scoring and refusal logic, that dropped to under 4%.

The key insight? It's not about finding more relevant documents. It's about knowing when you haven't.

Our team in Can Tho built the monitoring dashboard for this system. They tracked every refusal and rewrite, which gave us a feedback loop to improve the documentation itself. That's the real win.

Performance Benchmarks

Here's what we measured across 1,000 queries on a production dataset:

Metric Naive RAG Agentic RAG
Avg response time 1.2s 2.8s
Hallucination rate 23% 3.7%
User satisfaction 67% 91%
Refusal rate 0% 8%

The 8% refusal rate is intentional. Those are queries that would have generated hallucinations in the naive system. Users prefer "I don't know" over wrong answers.

Frequently Asked Questions

Why use Qdrant over Pinecone or Weaviate?

Qdrant runs locally with Docker, which means zero cloud costs for development and testing. For production, you can scale it horizontally. The API is clean and the filtering capabilities are solid. But honestly, the architecture here works with any vector database—just swap out the search function.

How do I handle multi-turn conversations with this agent?

You'll need to add a conversation memory layer. Store the previous query and response, and include them in the routing prompt. The key is to detect when a user is referring to previous context vs asking a new question. We use a simple sliding window of the last 3 exchanges.

What embedding model should I use for production?

We use `text-embedding-3-small` for speed and cost, but switch to `text-embedding-3-large` for domains with very specific terminology (like medical or legal). The difference in retrieval accuracy is about 5-7% on our benchmarks, but the cost is 4x higher.

How do I tune the confidence thresholds?

Run a batch of 200-500 queries with known correct answers. Plot the confidence scores against accuracy. You'll typically see a sharp drop-off around 0.5-0.6. Set your refusal threshold just below that drop-off point. We use 0.4 as a safe default, but you might go higher for stricter domains.

Related reading: Outsourcing Software in 2025: Why Smart CTOs Are Rethinking Offshore Engineering

Related reading: Why Smart Tech Leaders Hire Vietnamese Developers in 2025

Leave a Comment

Your email address will not be published. Required fields are marked *

Ready to Build with AI-Powered Developers?

Hire Vietnamese engineers augmented by ECOA AI Platform + Claude Code. 5x faster, 40% cheaper.