Build a Custom AI-Powered Unit Test Generator with Python and GPT-4o: A Step-by-Step Developer Tutorial

Let’s be honest. Writing unit tests is the least glamorous part of shipping software. You know they’re critical. You know your CI pipeline will fail without them. But when a deadline looms, tests are the first thing to get cut.

I’ve been there. We all have.

But what if you could generate 80% of your unit tests automatically? Not flaky, generic nonsense — but real tests that cover edge cases, mock external dependencies, and actually run in your CI pipeline.

That’s what we’re building today.

Why Most “AI Test Generators” Suck

Before we write code, let’s talk about why existing solutions fail.

The problem is context. Most AI test generators just feed a function signature to an LLM and pray. They don’t understand your project’s mocking patterns, your fixture conventions, or your database setup. The result? Tests that look right but fail immediately.

We’re going to fix that by building a generator that:

Parses your actual source code with AST (Abstract Syntax Trees)
Extracts function signatures, docstrings, and type hints
Detects external dependencies (database calls, API requests, file I/O)
Generates pytest-compatible tests with proper mocking using `unittest.mock`
Uses GPT-4o to write the actual test logic

Let’s build it.

Project Setup

Create a new directory and set up a virtual environment:

bash
mkdir ai-test-generator
cd ai-test-generator
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

Install the dependencies:

bash
pip install openai==1.55.0 pytest==8.3.4 python-dotenv==1.0.1

Create a `.env` file with your OpenAI API key:


OPENAI_API_KEY=sk-your-key-here

Step 1: The AST Analyzer

We need to understand what we’re testing. Python’s `ast` module lets us parse source code and extract meaningful structures.

Create `analyzer.py`:

python
import ast
from typing import List, Dict, Optional

class FunctionInfo:
    def __init__(self, name: str, args: List[str], 
                 return_type: Optional[str], 
                 has_docstring: bool,
                 decorators: List[str],
                 source_lines: str):
        self.name = name
        self.args = args
        self.return_type = return_type
        self.has_docstring = has_docstring
        self.decorators = decorators
        self.source_lines = source_lines

def analyze_functions(source_code: str) -> List[FunctionInfo]:
    """Parse Python source code and extract function metadata."""
    tree = ast.parse(source_code)
    functions = []
    
    for node in ast.walk(tree):
        # Match both `def` and `async def`; ast.FunctionDef alone skips coroutines
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Get function arguments
            args = [arg.arg for arg in node.args.args]
            
            # Get return type annotation
            return_type = None
            if node.returns:
                return_type = ast.unparse(node.returns)
            
            # Check for docstring
            has_docstring = ast.get_docstring(node) is not None
            
            # Get decorators
            decorators = [ast.unparse(d) for d in node.decorator_list]
            
            # Regenerate the function's source from the AST
            # (ast.unparse drops comments and original formatting)
            source_lines = ast.unparse(node)
            
            functions.append(FunctionInfo(
                name=node.name,
                args=args,
                return_type=return_type,
                has_docstring=has_docstring,
                decorators=decorators,
                source_lines=source_lines
            ))
    
    return functions

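This gives us structured metadata about every function in the file, including methods and nested functions, because `ast.walk` visits every node in the tree. For each one we know its name, what arguments it expects, its return annotation (if any), and whether it has a docstring.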
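
To see what that metadata looks like, here is a self-contained sketch of the same extraction run against a made-up snippet. `SAMPLE` and the `summarize` helper are hypothetical and exist only for this demonstration; the sketch also matches `async def`, which is an assumption about what you want covered.

```python
import ast

# Made-up sample source, used only for this demonstration.
SAMPLE = '''
def add(a: int, b: int = 0) -> int:
    """Add two numbers."""
    return a + b

class Cart:
    async def checkout(self, user_id):
        return user_id
'''

def summarize(source: str) -> dict:
    """Map each function name in `source` to the metadata the analyzer extracts."""
    summary = {}
    for node in ast.walk(ast.parse(source)):
        # Assumption: we also want `async def`, which ast.FunctionDef alone skips.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            summary[node.name] = {
                "args": [arg.arg for arg in node.args.args],
                "return_type": ast.unparse(node.returns) if node.returns else None,
                "has_docstring": ast.get_docstring(node) is not None,
            }
    return summary

summary = summarize(SAMPLE)
for name, info in summary.items():
    print(name, info)
```

Note that the `checkout` method is reported with `self` in its argument list and no class name attached, so a prompt built from this metadata alone cannot tell the model that it needs a `Cart` instance.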

Step 2: The Dependency Detector

This is where we get smart. We need to know if a function makes external calls so we can mock them properly.

Add this to `analyzer.py`:

python
def detect_external_dependencies(source_code: str) -> Dict[str, List[str]]:
    """Detect external dependencies like DB calls, HTTP requests, file I/O."""
    dependencies = {
        'database': [],
        'http': [],
        'file_io': [],
        'external_libs': []
    }
    
    tree = ast.parse(source_code)
    
    for node in ast.walk(tree):
        # Detect database calls
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Attribute):
                func_name = ast.unparse(node.func)
                if any(db in func_name.lower() for db in 
                       ['query', 'execute', 'session', 'cursor', 'commit']):
                    dependencies['database'].append(func_name)
                
                if any(http in func_name.lower() for http in 
                       ['get', 'post', 'put', 'delete', 'request']):
                    dependencies['http'].append(func_name)
        
        # Detect file operations
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                if node.func.id in ['open', 'read', 'write']:
                    dependencies['file_io'].append(node.func.id)
        
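        # Detect imports of external libraries (record module names)
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # `from requests import get` should record "requests", not "get"
            modules = [node.module] if node.module else []
        else:
            modules = []
        for module in modules:
            if module.split('.')[0] not in ['os', 'sys', 'json', 'datetime']:
                dependencies['external_libs'].append(module)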
    
    return dependencies

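Why does this matter? Because a test that makes a real HTTP call is not a unit test. It’s an integration test masquerading as one. Our generator will know to mock these. Keep in mind that the matching is a name-based heuristic: a call such as `config.get("key")` also contains `get` and will land in the HTTP list, so treat the result as hints for the prompt rather than ground truth.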
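
For reference, this is roughly the shape of test we want the model to produce once it knows a call is external. It is a minimal sketch: `fetch_user_name` and the `api.example.com` URL are hypothetical, and it uses the standard library’s `urllib.request` so it runs without extra packages.

```python
import json
import urllib.request
from unittest.mock import MagicMock, patch

def fetch_user_name(user_id: int) -> str:
    """Hypothetical function under test: performs a real HTTP request."""
    url = f"https://api.example.com/users/{user_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["name"]

def test_fetch_user_name_uses_mock():
    # Fake response object that works as a context manager and returns JSON bytes.
    fake_response = MagicMock()
    fake_response.__enter__.return_value = fake_response
    fake_response.read.return_value = b'{"name": "Ada"}'

    # Replace urlopen for the duration of the test so no network call happens.
    with patch("urllib.request.urlopen", return_value=fake_response) as mock_urlopen:
        assert fetch_user_name(7) == "Ada"

    mock_urlopen.assert_called_once_with("https://api.example.com/users/7")
```

The test checks both the return value and the exact URL requested, which is the kind of assertion that catches regressions without ever touching the network.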

Step 3: The Prompt Builder

Now we construct a prompt that gives GPT-4o everything it needs to write great tests.

Create `prompt_builder.py`:

python
from analyzer import FunctionInfo

def build_test_prompt(function: FunctionInfo, 
                      dependencies: dict,
                      project_context: str = "") -> str:
    """Build a detailed prompt for test generation."""
    
    prompt = f"""You are an expert Python developer. Generate pytest unit tests for the following function.

FUNCTION NAME: {function.name}
ARGUMENTS: {', '.join(function.args)}
RETURN TYPE: {function.return_type if function.return_type else 'Not annotated'}
HAS DOCSTRING: {function.has_docstring}
DECORATORS: {', '.join(function.decorators) if function.decorators else 'None'}

SOURCE CODE:

{function.source_lines}



EXTERNAL DEPENDENCIES DETECTED:
- Database calls: {', '.join(dependencies['database']) if dependencies['database'] else 'None'}
- HTTP calls: {', '.join(dependencies['http']) if dependencies['http'] else 'None'}
- File I/O: {', '.join(dependencies['file_io']) if dependencies['file_io'] else 'None'}
- External libraries: {', '.join(dependencies['external_libs']) if dependencies['external_libs'] else 'None'}

PROJECT CONTEXT:
{project_context}

REQUIREMENTS:
1. Use pytest framework with proper assertions
2. Mock ALL external dependencies using unittest.mock
3. Cover edge cases: empty inputs, None values, boundary conditions
4. Include at least one test for the happy path
5. Include at least one test for error handling or edge cases
6. Use pytest fixtures for shared setup
7. Do NOT make any real network calls or database queries
8. Return ONLY valid Python code, no explanations

Generate the test code:"""
    
    return prompt

Notice what we’re doing here. We’re not just dumping code and hoping. We’re explicitly telling the model what dependencies to mock and what patterns to follow, so it no longer has to guess which calls are external. That removes a lot of guesswork and tends to improve output quality.
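
The `PROJECT CONTEXT` slot only helps if something fills it. One possible source, assuming the project keeps its shared fixtures in `conftest.py` files, is to concatenate those files and pass the result as `project_context`. The helper name `collect_project_context` and the 4000-character cap are assumptions made for this sketch.

```python
import tempfile
from pathlib import Path

def collect_project_context(root: str, max_chars: int = 4000) -> str:
    """Hypothetical helper: join every conftest.py under `root`, capped at `max_chars`."""
    parts = []
    for path in sorted(Path(root).rglob("conftest.py")):
        header = f"# {path.relative_to(root).as_posix()}"
        parts.append(f"{header}\n{path.read_text(encoding='utf-8')}")
    # Truncate so a large fixture file cannot crowd out the rest of the prompt.
    return "\n\n".join(parts)[:max_chars]

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "conftest.py").write_text(
        "import pytest\n\n@pytest.fixture\ndef db_session():\n    ...\n",
        encoding="utf-8",
    )
    context = collect_project_context(tmp)

print(context)
```

Showing the model real fixture names such as `db_session` gives it a chance to reuse them instead of inventing its own setup.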

Step 4: The Generator Engine

This is where the magic happens. We’ll call GPT-4o with our structured prompt and parse the response.

Create `generator.py`:

python
import os
from openai import OpenAI
from analyzer import analyze_functions, detect_external_dependencies
from prompt_builder import build_test_prompt
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_tests(source_code: str, 
                   project_context: str = "",
                   model: str = "gpt-4o",
                   temperature: float = 0.2) -> str:
    """Generate unit tests for all functions in the source code."""
    
    # Analyze the source code
    functions = analyze_functions(source_code)
    dependencies = detect_external_dependencies(source_code)
    
    if not functions:
        return "# No functions found to test."
    
    all_tests = []
    
    for func in functions:
        print(f"Generating tests for: {func.name}...")
        
        prompt = build_test_prompt(func, dependencies, project_context)
        
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a senior Python developer specializing in writing clean, comprehensive unit tests."},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature,
            max_tokens=2000
        )
        
        # content can be None (for example on a refusal), so fall back to an empty string
        test_code = response.choices[0].message.content or ""
        
        # Clean up markdown code blocks if present
        if "```python" in test_code:
            test_code = test_code.split("```python")[1].split("```")[0]
        elif "```" in test_code:
            test_code = test_code.split("```")[1].split("```")[0]
        
        all_tests.append(test_code)
    
    # Combine all tests with imports
    combined = "import pytest\nfrom unittest.mock import patch, MagicMock\n\n"
    combined += "\n\n".join(all_tests)
    
    return combined

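Why temperature 0.2? We want consistent, reliable output for test generation. A low temperature does not make the model fully deterministic, but it narrows the sampling so repeated runs produce similar tests. Higher temperatures produce more varied but potentially incorrect tests. Low temperature keeps it grounded.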

Step 5: The CLI Interface

Let’s make this usable from the command line.

Create `cli.py`:

python
import argparse
from generator import generate_tests

def main():
    parser = argparse.ArgumentParser(
        description="Generate AI-powered unit tests for Python code"
    )
    parser.add_argument("input_file", help="Path to Python source file")
    parser.add_argument("-o", "--output", help="Output file for generated tests")
    parser.add_argument("-c", "--context", help="Project context file", default="")
    parser.add_argument("-m", "--model", default="gpt-4o")
    
    args = parser.parse_args()
    
    # Read source code
    with open(args.input_file, 'r') as f:
        source_code = f.read()
    
    # Read project context if provided
    project_context = ""
    if args.context:
        with open(args.context, 'r') as f:
            project_context = f.read()
    
    # Generate tests
    print(f"Analyzing {args.input_file}...")
    test_code = generate_tests(source_code, project_context, args.model)
    
    # Output
    if args.output:
        with open(args.output, 'w') as f:
            f.write(test_code)
        print(f"Tests written to {args.output}")
    else:
        print(test_code)

if __name__ == "__main__":
    main()

Real-World Test: Let’s Run It

I tested this on a real service file from a project we built for a client in Can Tho. Here’s the function we analyzed:

python
def process_order(order_id: int, db_session) -> dict:
    """
    Process a customer order.
    Validates inventory, charges payment, and updates order status.
    """
    order = db_session.query(Order).filter_by(id=order_id).first()
    if not order:
        raise ValueError(f"Order {order_id} not found")
    
    if order.status != "pending":
        raise ValueError(f"Order {order_id} is already {order.status}")
    
    inventory = db_session.query(Inventory).filter_by(
        product_id=order.product_id
    ).first()
    
    if not inventory or inventory.quantity < order.quantity:
        raise ValueError(f"Insufficient inventory for order {order_id}")
    
    # Process payment via external API
    payment_result = payment_gateway.charge(
        amount=order.total,
        token=order.payment_token
    )
    
    if not payment_result["success"]:
        raise RuntimeError(f"Payment failed for order {order_id}")
    
    inventory.quantity -= order.quantity
    order.status = "completed"
    db_session.commit()
    
    return {"order_id": order_id, "status": "completed", "amount": order.total}

Here's what the generator produced: