How to Build a Custom AI Code Search Engine with OpenAI Embeddings and PostgreSQL
Ever spent twenty minutes scrolling through `grep` results trying to find that one function in a 200K-line repo? Yeah, me too. Keyword search is fast, but it’s dumb. It doesn’t understand meaning. Semantic code search does.
Imagine typing “find where we handle JWT token refresh in the auth module” and getting the exact function, even if it is named `refresh_token_handler` and never mentions JWT. That’s what we’re building today.
We’ll use OpenAI embeddings to convert code chunks into vectors, PostgreSQL with `pgvector` for storage and search, and a simple FastAPI server to glue it together. By the end, you’ll have a local code search engine that actually understands your codebase.
Let’s go.
Why Semantic Search Beats Regex
Regular expressions and `grep` are great for exact matches. But they fail when:
- The code uses different variable names
- The documentation is sparse
- You don’t know the exact phrasing
Semantic search maps code chunks to high-dimensional vectors. Similar code gets similar vectors. So “token refresh” and “refresh_jwt_token” end up close in vector space. It works because embeddings capture meaning.
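To make “close in vector space” concrete, here is a minimal sketch of cosine similarity, the measure used for ranking later in this tutorial. The three-dimensional vectors are made-up stand-ins (real `text-embedding-3-small` vectors have 1536 dimensions), and the variable names are only illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the numbers are invented for illustration.
token_refresh = [0.9, 0.1, 0.2]
refresh_jwt_token = [0.8, 0.2, 0.3]
parse_csv_row = [0.1, 0.9, 0.1]

related = cosine_similarity(token_refresh, refresh_jwt_token)
unrelated = cosine_similarity(token_refresh, parse_csv_row)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

The related pair scores much closer to 1 than the unrelated pair, which is the signal the search relies on.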
We’re using OpenAI’s `text-embedding-3-small` model (1536 dimensions) because it’s cheap and accurate. For storage, PostgreSQL with the `pgvector` extension. Why? Because you probably already use Postgres. No need for a separate vector database.
Prerequisites
- Python 3.10+
- PostgreSQL 15+ with `pgvector` extension installed
- An OpenAI API key (or you can swap in a local embedding model like `all-MiniLM-L6-v2`)
- Your codebase (let’s assume it’s a monorepo or a single project)
Step 1: Set Up PostgreSQL with pgvector
First, enable the extension:
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```
Create a table for storing code chunks:
```sql
CREATE TABLE code_embeddings (
    id SERIAL PRIMARY KEY,
    file_path TEXT NOT NULL,
    chunk_index INT NOT NULL,
    code_text TEXT NOT NULL,
    embedding vector(1536)
);

CREATE INDEX idx_code_embeddings ON code_embeddings
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```
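The `ivfflat` index speeds up similarity search by clustering vectors into `lists` buckets and scanning only the nearest ones. Adjust `lists` based on your data size: the pgvector docs suggest roughly rows / 1000 for up to 1M rows, so 100 suits a table of about 100K rows. Because the clusters are computed from the rows present when the index is built, you’ll get better recall if you create (or `REINDEX`) it after Step 2 has loaded your embeddings.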
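If you’d rather compute a starting value than guess, the sketch below encodes the rule of thumb from the pgvector documentation: rows / 1000 lists for up to 1M rows, the square root of the row count above that, and roughly sqrt(lists) probes at query time. The function name `suggest_ivfflat_params` is hypothetical, and the results are starting points to tune, not guarantees.

```python
import math

def suggest_ivfflat_params(row_count):
    """Return (lists, probes) starting points for an ivfflat index."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = math.isqrt(row_count)
    # More probes means better recall but slower queries.
    probes = max(1, math.isqrt(lists))
    return lists, probes

for rows in (20_000, 100_000, 4_000_000):
    lists, probes = suggest_ivfflat_params(rows)
    print(f"{rows} rows -> lists={lists}, probes={probes}")
```

The `probes` value is applied per session with `SET ivfflat.probes = 10;` before running the search query.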
Step 2: Prepare the Embedding Pipeline
Install dependencies:
```bash
pip install openai psycopg2-binary fastapi uvicorn python-dotenv
```
Create a file `embed_code.py`:
```python
import os

import psycopg2
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def chunk_code(file_path, max_tokens=500):
    """Split a file into chunks of roughly max_tokens tokens, breaking on line boundaries."""
    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        content = f.read()
    chunks = []
    current_chunk = []
    current_length = 0
    for line in content.split("\n"):
        current_chunk.append(line)
        current_length += len(line) + 1  # +1 for the newline
        if current_length >= max_tokens * 4:  # rough char-to-token ratio
            chunks.append("\n".join(current_chunk))
            current_chunk = []
            current_length = 0
    if current_chunk:
        chunks.append("\n".join(current_chunk))
    # The embeddings API rejects empty input, so drop blank chunks
    # (for example, from an empty __init__.py).
    return [chunk for chunk in chunks if chunk.strip()]


def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


def embed_repository(repo_path):
    conn = psycopg2.connect(
        dbname="yourdb", user="youruser", password="yourpass", host="localhost"
    )
    cur = conn.cursor()
    for root, _, files in os.walk(repo_path):
        for fname in files:
            if fname.endswith(".py"):  # adjust for your languages
                fpath = os.path.join(root, fname)
                for idx, chunk in enumerate(chunk_code(fpath)):
                    emb = get_embedding(chunk)
                    # psycopg2 sends the list as an array; Postgres casts it to vector.
                    cur.execute(
                        "INSERT INTO code_embeddings (file_path, chunk_index, code_text, embedding) VALUES (%s, %s, %s, %s)",
                        (fpath, idx, chunk, emb),
                    )
                    print(f"Embedded {fpath} chunk {idx}")
    conn.commit()
    cur.close()
    conn.close()


if __name__ == "__main__":
    embed_repository("/path/to/your/codebase")
```
Run it with `python embed_code.py`. Expect it to take a few minutes, depending on your codebase size, because each chunk is embedded with its own API call.
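The chunker in `embed_code.py` splits on character counts, which can cut a function in half. For Python files, one possible refinement is to split on top-level definitions using the standard `ast` module. The sketch below is an assumption about how you might do that, not part of the pipeline above; `chunk_python_by_definition` is a hypothetical name, and it needs Python 3.8+ for `end_lineno`.

```python
import ast

def chunk_python_by_definition(source):
    """Return one chunk per top-level function or class, plus leftover module-level code."""
    tree = ast.parse(source)  # raises SyntaxError on files that do not parse
    lines = source.splitlines()
    chunks = []
    leftover = []
    cursor = 0  # index of the first line not yet assigned to a chunk
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start from the earliest one.
            first_line = min([node.lineno] + [d.lineno for d in node.decorator_list])
            start = first_line - 1
            end = node.end_lineno  # 1-based inclusive, so it works as a slice end
            leftover.extend(lines[cursor:start])
            chunks.append("\n".join(lines[start:end]))
            cursor = end
    leftover.extend(lines[cursor:])
    module_level = "\n".join(leftover).strip()
    if module_level:
        chunks.insert(0, module_level)  # imports, constants, and similar
    return chunks

sample = '''import os

def refresh_token(user):
    return user.token

class Auth:
    def login(self):
        pass
'''
for chunk in chunk_python_by_definition(sample):
    print(repr(chunk.splitlines()[0]))
```

A very large class still becomes one oversized chunk, and files that fail to parse raise `SyntaxError`, so in practice you would likely fall back to the line-based `chunk_code` in both cases.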
Step 3: Build the Search API with FastAPI
Create `search_api.py`:
```python
import psycopg2
from dotenv import load_dotenv
from fastapi import FastAPI, Query
from openai import OpenAI

load_dotenv()
client = OpenAI()  # reads OPENAI_API_KEY from the environment
app = FastAPI()


def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


# A plain `def` lets FastAPI run this blocking handler in its thread pool
# instead of stalling the event loop.
@app.get("/search")
def search(
    q: str = Query(..., description="Natural language query"),
    top_k: int = Query(5, ge=1, le=50),
):
    query_emb = get_embedding(q)
    conn = psycopg2.connect(
        dbname="yourdb", user="youruser", password="yourpass", host="localhost"
    )
    try:
        with conn.cursor() as cur:
            # <=> is pgvector's cosine distance; 1 - distance is cosine similarity.
            cur.execute(
                """
                SELECT file_path, chunk_index, code_text,
                       1 - (embedding <=> %s::vector) AS similarity
                FROM code_embeddings
                ORDER BY embedding <=> %s::vector
                LIMIT %s
                """,
                (query_emb, query_emb, top_k),
            )
            results = cur.fetchall()
    finally:
        conn.close()
    return [
        {"file": r[0], "chunk": r[1], "code": r[2], "score": r[3]}
        for r in results
    ]


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```
Start it: `python search_api.py`. Then query:
curl "http://localhost:8000/search?q=JWT%20token%20refresh%20logic"
You’ll get the most relevant code chunks, ranked by semantic similarity. No more grep guessing.
Step 4: Where the Real Value Lies
A custom code search engine is a force multiplier. Here’s a recent example: we helped a client in Ho Chi Minh City migrate a legacy Java codebase. Their senior devs spent hours hunting for business logic. After embedding the entire repo, a junior developer could find the correct validation function in under 2 seconds.
We’ve seen teams reduce onboarding time for new hires by 40% just by giving them a semantic search tool. And because it’s built on PostgreSQL, maintenance is trivial. No additional infrastructure to manage.
But Won’t This Be Slow for Large Repos?
Good question. For a repo with 100,000 chunks, the `ivfflat` index returns results in under 50ms on modest hardware. OpenAI’s embedding API adds ~300ms per query. If you need lower latency, cache the embeddings of repeated queries or switch to a local model like `sentence-transformers/all-MiniLM-L6-v2`. For most teams, the OpenAI API is fine.
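One way to read the caching suggestion above is an in-process cache keyed on the query string, so a repeated search skips the embedding round trip. This is a minimal sketch using the standard library’s `functools.lru_cache`; `make_cached_embedder` and `fake_embedding` are hypothetical names, and the fake function stands in for `get_embedding` so the example runs without an API key.

```python
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize=1024):
    """Wrap an embedding function so repeated queries skip the API call."""
    @lru_cache(maxsize=maxsize)
    def cached(query):
        # Store a tuple so callers cannot mutate the cached value.
        return tuple(embed_fn(query))
    return cached

# Stand-in for get_embedding; records each call so we can count them.
calls = []
def fake_embedding(text):
    calls.append(text)
    return [float(len(text)), 0.0, 1.0]

cached_embedding = make_cached_embedder(fake_embedding)
first = cached_embedding("JWT token refresh logic")
second = cached_embedding("JWT token refresh logic")  # served from the cache
print(f"embedding calls made: {len(calls)}")
```

If you wire this into the search endpoint, convert the result back with `list(...)` before passing it to psycopg2, which adapts tuples and lists differently. The cache lives in one process only, so multiple workers each keep their own copy.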
Going Further: Multi-Language and IDE Integration
You can extend this to support multiple file types (`.js`, `.go`, `.rs`). Add a file browser to the frontend, or turn it into a VS Code extension. One of our teams in Can Tho built an internal tool that indexes both code and documentation — it’s now used by 50+ engineers daily.
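Supporting more file types mostly comes down to deciding which paths to index. The following sketch shows one possible filter to use in place of the `fname.endswith('.py')` check in `embed_repository`; the extension set, the skipped directory names, and the function name `should_embed` are all assumptions to adapt to your repository.

```python
from pathlib import Path

# Hypothetical defaults; adjust both sets for your repository.
INDEXED_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".md"}
SKIPPED_DIRS = {".git", "node_modules", "__pycache__", ".venv"}

def should_embed(path):
    """Return True if the file at `path` should be chunked and embedded."""
    p = Path(path)
    # Skip anything inside vendored or generated directories.
    if any(part in SKIPPED_DIRS for part in p.parts):
        return False
    return p.suffix.lower() in INDEXED_EXTENSIONS

for candidate in ("src/auth/refresh.go", "node_modules/x/index.js", "assets/logo.png"):
    print(candidate, should_embed(candidate))
```

The line-based `chunk_code` function works on any text file, so no per-language parsing is required to get started.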
To be fair, this isn’t a replacement for a full Code Search product like Sourcegraph. But it’s free, customizable, and you own the data. That’s a win.
—
Frequently Asked Questions
Q: Is this tutorial suitable for a production deployment?
A: Yes, with a few tweaks. Add authentication, use connection pooling, and schedule re-indexing on code changes. The architecture scales to millions of chunks.
Q: Can I use a different embedding model?
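A: Absolutely. Swap OpenAI for any embedding model from Hugging Face. Change the embedding dimension in the table schema to match (e.g., 384 for `all-MiniLM-L6-v2`), and re-embed the whole codebase, because vectors from different models are not comparable. Queries must be embedded with the same model as the stored chunks.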
Q: What’s the cost of OpenAI embeddings for a large codebase?
A: `text-embedding-3-small` costs $0.02 per 1M tokens. A 100K-line Python project might have ~300K tokens. That’s about 0.6 cents to embed the whole thing. Cheap.
Q: How do I handle binary files or non-code files?
A: Filter by extension. Only embed files you care about (`.py`, `.js`, `.ts`, `.md`). Binary files like `.png` or `.exe` should be skipped in `embed_repository`.
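The cost answer above is simple arithmetic, and you can rerun it for your own repository. This sketch assumes the $0.02 per 1M tokens price quoted in the FAQ and the same rough 4-characters-per-token ratio the chunker uses; both are estimates to check against current pricing, and `estimate_embedding_cost` is a hypothetical helper.

```python
def estimate_embedding_cost(total_chars, price_per_million_tokens=0.02, chars_per_token=4):
    """Return (estimated_tokens, estimated_dollars) for embedding total_chars of text."""
    tokens = total_chars / chars_per_token
    dollars = tokens * price_per_million_tokens / 1_000_000
    return tokens, dollars

# 1.2M characters is roughly the ~300K tokens used in the FAQ example.
tokens, dollars = estimate_embedding_cost(1_200_000)
print(f"~{tokens:,.0f} tokens, about ${dollars:.4f}")
```

To get `total_chars` for a real project, sum `len(chunk)` over the chunks produced by `chunk_code` before sending anything to the API.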