Build a Production-Ready RAG Pipeline: A Developer’s Guide to Vector Search, Chunking, and LLM Integration

Stop treating RAG like a toy. This guide walks you through building a production-ready Retrieval-Augmented Generation pipeline, from chunking strategies to vector search optimization and LLM integration. Real code included.

Let’s be honest: most RAG implementations you see in tutorials are toys. They work on three PDFs and a simple query. But put them under real production load—thousands of documents, concurrent users, complex queries—and they fall apart.

I’ve spent the last year building RAG systems for clients in Ho Chi Minh City and Can Tho. We’ve seen what breaks, what scales, and what actually delivers accurate results.

Here’s the thing: a production-ready RAG pipeline isn’t about fancy models. It’s about the boring stuff—chunking strategies, embedding pipelines, retrieval optimization, and error handling. Get those right, and your RAG system actually works.

Let’s build one.

Why Most RAG Pipelines Fail in Production

Before we write code, let’s talk about the three biggest mistakes I see:

  1. Naive chunking – Splitting text by character count. Works for demos, fails on real documents.
  2. Ignoring metadata – You retrieve chunks but can’t tell which document they came from.
  3. No evaluation – You ship it, users hate it, and you have no idea why.

Fix these three things, and you’re 80% of the way there.

Setting Up Your RAG Stack

Here’s what we’ll use:

  • Python 3.11+ – Because async matters
  • pgvector – PostgreSQL with vector extensions. We run this on our production clusters in Vietnam. It’s battle-tested.
  • OpenAI embeddings – text-embedding-3-small (1536 dimensions). Good balance of cost and quality.
  • Claude 3.5 Sonnet – For generation. We’ve benchmarked it against GPT-4 and Gemini. For technical Q&A, it wins.
  • LangChain – For orchestration. I know, I know. But it’s mature and well-documented.

Step 1: Smart Chunking (Not Just Character Splitting)

Fixed-size character splitting is the enemy. Here’s why: it breaks sentences, loses context, and creates garbage chunks.

Use structure-aware chunking instead. Split on natural boundaries (paragraphs, sections, code blocks) and fall back to smaller units only when a piece is still too big.

python
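from langchain_text_splitters import RecursiveCharacterTextSplitter

def create_semantic_chunker():
    # Separators are tried in order: paragraphs, lines, sentences, clauses, words.
    return RecursiveCharacterTextSplitter(
        chunk_size=1000,      # measured in characters because length_function=len
        chunk_overlap=200,    # 20% overlap between neighbouring chunks
        separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
        length_function=len,
    )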

# Real-world example: processing a technical document
text = """
## Authentication
To authenticate, send a POST request to /auth/token with your API key.
The response includes an access token valid for 3600 seconds.

## Rate Limiting
We allow 1000 requests per hour per API key.
Exceeding this returns a 429 status code.
"""

chunks = create_semantic_chunker().split_text(text)
print(f"Generated {len(chunks)} chunks")

Notice the separators. We prioritize paragraph breaks first, then line breaks, then sentence punctuation. This preserves logical units.

Pro tip: For code-heavy documentation, add "\n```" as a separator. It helps keep code blocks intact.
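To see why separator order matters, here is a stripped-down splitter in plain Python. It is a sketch of the general idea, not LangChain’s actual implementation: `recursive_split` is a hypothetical helper, it skips overlap entirely, and it measures size in characters. It tries the first separator, packs pieces into chunks up to `chunk_size`, and only falls back to the next separator when a single piece is still too big.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; recurse with the remaining ones for oversized pieces.

    Simplified illustration: separators must be non-empty strings, and no overlap is added.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the chunk being built
            continue
        if current.strip():
            chunks.append(current)  # close the chunk that was being built
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too big: try the next, finer separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current.strip():
        chunks.append(current)
    return chunks


sample = (
    "## Authentication\n"
    "Send a POST request to /auth/token.\n\n"
    "## Rate Limiting\n"
    "We allow 1000 requests per hour."
)
chunks = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=60)
for chunk in chunks:
    print(repr(chunk))
```

Each section fits within 60 characters, so the paragraph break alone is enough and neither heading gets separated from its body. Put `" "` first in the list and the same text would be packed word by word instead.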

Step 2: Building the Embedding Pipeline

Don’t embed every chunk naively. That’s expensive and slow.

Here’s our production approach:

python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def embed_chunks(chunks: list[str], batch_size: int = 20) -> list[list[float]]:
    """Embed chunks in batches, pausing briefly between requests."""
    all_embeddings = []

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]

        # One API call per batch instead of one per chunk
        response = await client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )

        embeddings = [data.embedding for data in response.data]
        all_embeddings.extend(embeddings)

        # Crude pacing: a 0.02 s pause keeps this loop under 3000 requests
        # per minute. Check the rate limit for your own account tier.
        await asyncio.sleep(0.02)

    return all_embeddings

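Why async? Because when you’re processing 10,000 documents, synchronous calls take forever. We cut ingestion time by 60% just by moving to async batching. One caveat: the function above still awaits one batch at a time, so on its own it mostly benefits from batching. The async part pays off when the event loop has other work to overlap with, such as several documents being ingested at once.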
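If you want the batches themselves to overlap, one option is a semaphore plus `asyncio.gather`. This is a minimal sketch under a few assumptions: `embed_concurrently` and `max_in_flight` are names made up for illustration, and the embedding call is passed in as `embed_batch` so the example runs without an API key. In a real pipeline you would pass a small wrapper around `client.embeddings.create`.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical type: any coroutine that turns a batch of texts into vectors.
EmbedBatch = Callable[[list[str]], Awaitable[list[list[float]]]]


async def embed_concurrently(
    chunks: list[str],
    embed_batch: EmbedBatch,
    batch_size: int = 20,
    max_in_flight: int = 4,
) -> list[list[float]]:
    """Embed batches concurrently, with at most max_in_flight requests at once."""
    semaphore = asyncio.Semaphore(max_in_flight)
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]

    async def run(batch: list[str]) -> list[list[float]]:
        async with semaphore:
            return await embed_batch(batch)

    # gather returns results in the order the awaitables were passed in,
    # so the vectors still line up with the original chunks.
    results = await asyncio.gather(*(run(batch) for batch in batches))
    return [vector for batch_vectors in results for vector in batch_vectors]


async def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    """Stand-in for the real API call: one-dimensional 'embedding' = text length."""
    await asyncio.sleep(0.01)
    return [[float(len(text))] for text in batch]


vectors = asyncio.run(
    embed_concurrently(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
)
print(vectors)
```

The semaphore is what keeps this polite: raise `max_in_flight` until you start hitting rate-limit errors, then back off.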

Step 3: Vector Search with pgvector

PostgreSQL with pgvector is underrated. It’s not as fast as Pinecone at 10M+ vectors, but for most production workloads (up to 1M vectors), it’s perfect. And you don’t need another infrastructure piece.

sql
-- Create the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the documents table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    metadata JSONB,
    embedding vector(1536)  -- matches text-embedding-3-small
);

-- Create an index for similarity search.
-- ivfflat builds its clusters from existing rows, so create it after loading data.
CREATE INDEX ON documents 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
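
Where does `lists = 100` come from? The pgvector README suggests starting points of rows / 1000 for tables up to 1M rows and sqrt(rows) beyond that, with `ivfflat.probes` set to about sqrt(lists) at query time. The helper below just encodes that rule of thumb. `suggest_ivfflat_params` is a hypothetical name, and the numbers are starting points to tune against your own recall and latency measurements, not guarantees.

```python
import math


def suggest_ivfflat_params(row_count: int) -> tuple[int, int]:
    """Return (lists, probes) using the starting points from the pgvector README."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = int(math.sqrt(row_count))
    probes = max(1, int(math.sqrt(lists)))
    return lists, probes


lists, probes = suggest_ivfflat_params(100_000)
print(f"WITH (lists = {lists});")
print(f"SET ivfflat.probes = {probes};")
```

For a table of around 100,000 chunks this lands on the `lists = 100` used above. More probes means better recall and slower queries, so treat it as a dial.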

Here’s the Python integration:

python
import os

import psycopg2
from psycopg2.extras import Json

def store_embeddings(chunks: list[str], embeddings: list[list[float]], metadata: dict):
    conn = psycopg2.connect(os.getenv("DATABASE_URL"))
    cur = conn.cursor()
    
    for chunk, embedding in zip(chunks, embeddings):
        cur.execute(
            """
            INSERT INTO documents (content, metadata, embedding)
            VALUES (%s, %s, %s)
            """,
            (chunk, Json(metadata), embedding)
        )
    
    conn.commit()
    cur.close()
    conn.close()

def search_similar(query_embedding: list[float], top_k: int = 5):
    conn = psycopg2.connect(os.getenv("DATABASE_URL"))
    cur = conn.cursor()
    
    cur.execute(
        """
        SELECT content, metadata, 
               1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (query_embedding, query_embedding, top_k)
    )
    
    results = cur.fetchall()
    cur.close()
    conn.close()
    
    return results

The `<=>` operator computes cosine distance. Lower is better. We convert it to similarity for readability.
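If the distance-to-similarity conversion feels abstract, here is the same arithmetic in plain Python. This is only an illustration of the formula (cosine distance = 1 - cosine similarity), not how pgvector computes it internally, and `cosine_distance` is a helper written for this example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

# The SQL above reports 1 - distance, so higher similarity means a closer match.
print("similarity, same direction:", 1 - same_direction)
print("similarity, orthogonal:", 1 - orthogonal)
```

Vector length does not matter here, only direction. That is why the first pair counts as a perfect match even though one vector is twice as long.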

Step 4: The Generation Layer

This is where most tutorials stop. But production RAG needs more—context trimming, citation tracking, and fallback logic.

python
from langchain_anthropic import ChatAnthropic

# Claude 3.5 Sonnet (needs the langchain-anthropic package and ANTHROPIC_API_KEY)
llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")

def build_rag_prompt(query: str, context_chunks: list[str]) -> str:
    # Trim context to fit token limits
    max_context_tokens = 4000
    trimmed_context = ""
    total_tokens = 0
    
    for chunk in context_chunks:
        # Word count is only a rough proxy for tokens; real token counts run higher
        chunk_tokens = len(chunk.split())
        if total_tokens + chunk_tokens > max_context_tokens:
            break
        trimmed_context += chunk + "\n\n"
        total_tokens += chunk_tokens
    
    prompt = f"""You are a technical documentation assistant. Answer the user's question based ONLY on the provided context. If the context doesn't contain enough information, say so.

Context:
{trimmed_context}

Question: {query}

Answer:"""
    
    return prompt

async def rag_query(query: str, top_k: int = 5):
    # 1. Embed the query
    query_embedding = await embed_chunks([query])
    
    # 2. Retrieve relevant chunks
    results = search_similar(query_embedding[0], top_k)
    
    # 3. Build prompt with context
    context = [r[0] for r in results]
    prompt = build_rag_prompt(query, context)
    
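    # 4. Generate answer (ainvoke returns a message object; the text is in .content)
    response = await llm.ainvoke(prompt)
    
    return {
        "answer": response.content,
        "sources": [{
            "content": r[0],
            "metadata": r[1],  # keep the source document info with each chunk
            "similarity": r[2]
        } for r in results]
    }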

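Notice the context trimming. Claude 3.5 Sonnet has a 200K context window, but you don’t need to fill it. More context means more noise and higher latency. We cap at roughly 4000 tokens, which is enough for most technical queries. The code approximates tokens by counting words, so swap in a real tokenizer if you need an exact budget.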
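Citation tracking and fallback logic can start out very simple. The sketch below numbers each retrieved chunk so the model can cite it as [1], [2], and so on, and drops weak matches so the caller can skip the LLM call when nothing relevant came back. Everything here is an assumption for illustration: the `Retrieved` shape, the `source` field, the fallback wording, and especially the 0.5 similarity threshold, which you would need to tune on your own data.

```python
from dataclasses import dataclass


@dataclass
class Retrieved:
    content: str
    source: str        # hypothetical metadata field, e.g. a file name
    similarity: float


FALLBACK_ANSWER = "I couldn't find this in the documentation."


def build_cited_context(
    results: list[Retrieved], min_similarity: float = 0.5
) -> tuple[str, list[str]]:
    """Number the chunks that clear the threshold and return (context, source list)."""
    kept = [r for r in results if r.similarity >= min_similarity]
    context = "\n\n".join(f"[{i}] {r.content}" for i, r in enumerate(kept, start=1))
    sources = [f"[{i}] {r.source}" for i, r in enumerate(kept, start=1)]
    return context, sources


results = [
    Retrieved("Access tokens are valid for 3600 seconds.", "auth.md", 0.82),
    Retrieved("Unrelated release notes.", "changelog.md", 0.21),
]
context, sources = build_cited_context(results)

if not sources:
    print(FALLBACK_ANSWER)  # nothing relevant retrieved: skip the LLM call
else:
    print(context)
    print(sources)
```

Ask the model to reference the bracketed numbers in its answer, then return `sources` alongside it so users can check where each claim came from.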

Step 5: Evaluation (The Step Everyone Skips)

You can’t improve what you don’t measure. Here’s our evaluation framework:

python
def evaluate_rag(query: str, expected_answer: str, rag_response: str):
    # One slot per rating requested in the prompt below
    metrics = {
        "answer_relevance": None,
        "factual_accuracy": None,
        "hallucination": None
    }
    
    # Use LLM to evaluate
    eval_prompt = f"""
    Evaluate the RAG system's response:
    
    Query: {query}
    Expected: {expected_answer}
    Actual: {rag_response}
    
    Rate (1-5):
    1. Answer relevance: Does the answer address the query?
    2. Factual accuracy: Is the answer factually correct?
    3. Hallucination: Does it include info not in the context?
    
    Return JSON only.
    """
    
    # ... evaluation logic

We run this against a test set of 200 queries every deployment. If relevance drops below 4.0, we don’t ship.
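The part after the judge call is mostly plumbing: parse the JSON, validate it, and turn the scores into a ship/no-ship decision. Here is a rough sketch of that plumbing. The JSON keys (`relevance`, `accuracy`, `hallucination`) and the function names are assumptions, since the prompt above does not pin down a schema, and the judge’s reply is a hard-coded string so the example runs on its own.

```python
import json
from statistics import mean

EXPECTED_KEYS = ("relevance", "accuracy", "hallucination")  # assumed schema


def parse_scores(judge_output: str) -> dict[str, float]:
    """Parse the judge's JSON reply; raise ValueError on missing or out-of-range scores."""
    data = json.loads(judge_output)  # JSONDecodeError is a subclass of ValueError
    scores: dict[str, float] = {}
    for key in EXPECTED_KEYS:
        if key not in data:
            raise ValueError(f"judge reply is missing '{key}'")
        value = float(data[key])
        if not 1 <= value <= 5:
            raise ValueError(f"'{key}' must be between 1 and 5, got {value}")
        scores[key] = value
    return scores


def should_ship(all_scores: list[dict[str, float]], min_relevance: float = 4.0) -> bool:
    """Deployment gate: mean relevance across the test set must reach the threshold."""
    return mean(s["relevance"] for s in all_scores) >= min_relevance


example_reply = '{"relevance": 5, "accuracy": 4, "hallucination": 1}'
scores = parse_scores(example_reply)
print(scores, should_ship([scores]))
```

Validating the reply matters more than it looks. LLM judges occasionally return prose or a half-finished object, and you want that to fail loudly instead of silently skewing the average.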

Real-World Performance Numbers

Here’s what we’re seeing in production across our Vietnam-based teams:

| Metric | Before Optimization | After Optimization |
|---|---|---|
| Average retrieval latency | 450ms | 120ms |
| P99 retrieval latency | 1200ms | 320ms |
| Answer relevance score | 3.2/5 | 4.6/5 |
| Hallucination rate | 12% | 2% |

The biggest improvement came from chunking strategy and metadata filtering.

Common Pitfalls and How We Fixed Them

Pitfall 1: Metadata loss

We stored metadata as JSONB in PostgreSQL. This lets us filter by document source, date, or category before vector search. Cuts irrelevant results by 40%.
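As a sketch of what that filter can look like, the helper below builds a parameterized query that narrows rows with JSONB containment (`@>`) before ranking by cosine distance. `build_filtered_search` and the `source` key are made up for illustration; the table and column names follow the schema from Step 3.

```python
import json


def build_filtered_search(
    query_embedding: list[float], metadata_filter: dict, top_k: int = 5
) -> tuple[str, tuple]:
    """Return (sql, params) for a metadata-filtered similarity search."""
    sql = """
        SELECT content, metadata,
               1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        WHERE metadata @> %s::jsonb
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    # Values travel as parameters, never via string formatting.
    params = (query_embedding, json.dumps(metadata_filter), query_embedding, top_k)
    return sql, params


sql, params = build_filtered_search([0.1, 0.2], {"source": "api-docs"}, top_k=3)
print(sql)
print(params)
```

Pass the pair straight to `cur.execute(sql, params)`. One thing to watch: with an approximate index like ivfflat, the filter is applied after the index scan, so a very selective filter can return fewer than `top_k` rows unless you raise `ivfflat.probes`.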

Pitfall 2: Cold start

First query after deployment is slow. We pre-warm the cache with common queries. Cuts first-query latency by 70%.
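A pre-warm step can be as small as looping over a list of known queries at startup. The sketch below assumes an in-memory answer cache keyed by a normalized query string. `QueryCache` and its methods are hypothetical, and a real deployment would more likely use a shared cache such as Redis, plus an expiry policy.

```python
from collections.abc import Callable


class QueryCache:
    """Tiny in-memory answer cache with a pre-warm step (illustrative only)."""

    def __init__(self, answer_fn: Callable[[str], str]) -> None:
        self._answer_fn = answer_fn
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivial variants share one entry.
        return " ".join(query.lower().split())

    def get(self, query: str) -> str:
        key = self._key(query)
        if key not in self._store:
            self._store[key] = self._answer_fn(query)
        return self._store[key]

    def prewarm(self, common_queries: list[str]) -> None:
        """Run known-popular queries once at startup so users never pay for them."""
        for query in common_queries:
            self.get(query)


calls: list[str] = []


def slow_pipeline(query: str) -> str:
    """Stand-in for the full RAG pipeline."""
    calls.append(query)
    return f"answer to: {query}"


cache = QueryCache(slow_pipeline)
cache.prewarm(["How do I authenticate?"])
answer = cache.get("how do I   authenticate?")  # served from cache, no second call
print(answer, len(calls))
```

Pre-warming also has a side effect worth having: it opens database connections and loads index pages before the first real user shows up.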

Pitfall 3: Token waste

Embedding every chunk regardless of quality wastes money. We now drop chunks shorter than 50 characters before embedding. Saves 15% on embedding costs.
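The filter itself is a one-liner. A minimal version, assuming length is measured after trimming whitespace (`filter_short_chunks` is a name chosen for this example):

```python
MIN_CHUNK_CHARS = 50  # threshold from the text above


def filter_short_chunks(chunks: list[str], min_chars: int = MIN_CHUNK_CHARS) -> list[str]:
    """Drop chunks whose stripped length is below min_chars."""
    return [chunk for chunk in chunks if len(chunk.strip()) >= min_chars]


raw_chunks = [
    "## Rate Limiting",
    "   \n\n  ",
    "We allow 1000 requests per hour per API key. Exceeding this returns a 429 status code.",
]
kept = filter_short_chunks(raw_chunks)
print(f"Embedding {len(kept)} of {len(raw_chunks)} chunks")
```

Run it between splitting and embedding. If you find it discarding short but important lines, such as a one-line config value, lower the threshold for that document type.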

The ECOA AI Advantage

Building this in-house takes time. Our teams in Vietnam have built dozens of these pipelines. With the ECOA AI Platform ACP, we orchestrate the entire workflow—embedding, retrieval, generation, and evaluation—as a single agent pipeline.

Our senior developers handle this at $3,000/month. That’s less than what most US companies pay for a junior’s lunch budget.

Frequently Asked Questions

Q: Should I use pgvector or a dedicated vector database like Pinecone?

A: For under 1M vectors, pgvector is simpler and cheaper—no extra infrastructure, same PostgreSQL you already know. For 10M+ vectors with sub-50ms latency requirements, Pinecone or Qdrant make sense. We use pgvector for 95% of our clients.

Q: What chunk size works best for technical documentation?

A: 500-1000 tokens with 10-20% overlap. Code-heavy docs need smaller chunks (300-500 tokens) to keep code blocks intact. We’ve tested 50+ chunking strategies. This range consistently wins.
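One catch: the splitter in Step 1 measures `chunk_size` in characters, not tokens. A quick way to translate a token target into splitter settings is the common rule of thumb of about 4 characters per token for English text. That ratio is an approximation and varies with language and content, and `splitter_sizes` is a helper invented for this example.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text; an assumption


def splitter_sizes(target_tokens: int, overlap_ratio: float) -> tuple[int, int]:
    """Convert a token target and overlap ratio into character-based settings."""
    chunk_size = target_tokens * CHARS_PER_TOKEN
    chunk_overlap = round(chunk_size * overlap_ratio)
    return chunk_size, chunk_overlap


chunk_size, chunk_overlap = splitter_sizes(target_tokens=500, overlap_ratio=0.15)
print(f"chunk_size={chunk_size}, chunk_overlap={chunk_overlap}")
```

By this estimate, `chunk_size=1000` characters is only about 250 tokens. For exact counts, use a real tokenizer as the splitter’s length function instead of `len`.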

Q: How do you handle updates to existing documents?

A: We use a versioning system. Each document has a `last_updated` timestamp. When a document changes, we re-embed only the affected chunks and update the vector store. Full re-indexing is a cron job that runs weekly.
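One way to find the affected chunks is to store a content hash next to each chunk and compare hashes when the document changes. The sketch below shows that idea and is not necessarily the exact mechanism described above. `chunk_hash` and `diff_chunks` are hypothetical helpers, and it assumes you persist each chunk’s hash (for example in the metadata column).

```python
import hashlib


def chunk_hash(chunk: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def diff_chunks(stored_hashes: set[str], new_chunks: list[str]) -> tuple[list[str], set[str]]:
    """Return (chunks that need embedding, stored hashes that should be deleted)."""
    new_by_hash = {chunk_hash(chunk): chunk for chunk in new_chunks}
    to_embed = [chunk for h, chunk in new_by_hash.items() if h not in stored_hashes]
    to_delete = stored_hashes - set(new_by_hash)
    return to_embed, to_delete


old_chunks = ["Tokens are valid for 3600 seconds.", "We allow 1000 requests per hour."]
stored = {chunk_hash(chunk) for chunk in old_chunks}

new_chunks = ["Tokens are valid for 3600 seconds.", "We allow 5000 requests per hour."]
to_embed, to_delete = diff_chunks(stored, new_chunks)
print(to_embed)
print(len(to_delete))
```

Only the edited chunk gets re-embedded, and the stale row is flagged for deletion. Keep in mind that an edit near the top of a document can shift every chunk boundary after it, which is one reason a periodic full re-index is still worth having.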

Q: What’s the most common mistake in production RAG systems?

A: Not testing with real user queries. Teams optimize for benchmark datasets but fail on actual usage patterns. We collect anonymized queries from day one and build our evaluation set from real traffic.
