Docker Compose for Local AI Development: How I Spin Up a Full Multi-Agent Stack in Under 60 Seconds

Stop fighting Python virtual environments and conflicting CUDA versions. Here's my exact Docker Compose setup for running Ollama, Qdrant, Redis, and a custom agent service locally, with a Compose file you can copy and run today.

Let’s be real. Setting up a local development environment for AI agents is a pain.

Python virtual environments, conflicting CUDA versions, system dependencies that break your machine. I’ve wasted more hours debugging `libcudart.so` errors than I care to admit.

So I did what any self-respecting engineer would do. I threw it all into a Docker Compose file.

Here’s the exact stack I use daily for building multi-agent systems with local LLMs, vector search, and message queues. No cloud costs. No SSH tunneling. Just `docker compose up` and you’re running.

Why Bother With Docker Compose for AI Development?

You could install everything natively. But why would you?

  • Isolation: Your Ollama version won’t clash with your Python dependencies.
  • Reproducibility: That junior dev in Ho Chi Minh City can pull the same stack in 30 seconds.
  • Clean teardown: `docker compose down -v` removes the containers, the network, and the named volumes (including any pulled models). No zombie processes.

I work with a remote team based out of Can Tho, Vietnam. Every new hire gets our Compose file. Two hours of onboarding turned into five minutes.

The Exact Stack: What’s Running and Why

Here’s what I run locally before any code hits production:

| Service | Role | Why It Matters |
| --- | --- | --- |
| Ollama | Local LLM server | Runs Mistral, Llama, or Qwen on your GPU |
| Qdrant | Vector database | Stores embeddings for RAG pipelines |
| Redis | Message broker + cache | Agent state management and task queues |
| Agent API | Your custom Python service | The actual multi-agent orchestration logic |

Important: The agent service depends on the other three. We’ll use health checks to enforce that.

Step 1: The Docker Compose File

Create a `docker-compose.yml` in your project root:

yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ecoai-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=24h
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      # The ollama/ollama image does not ship curl, so probe with the bundled CLI
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 5

  qdrant:
    image: qdrant/qdrant:latest
    container_name: ecoai-qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_storage:/qdrant/storage
    environment:
      - QDRANT__SERVICE__GRPC_PORT=6334
    healthcheck:
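      # Recent qdrant images do not include curl; this only checks that the HTTP port accepts connections
      test: ["CMD-SHELL", "bash -c ':> /dev/tcp/127.0.0.1/6333' || exit 1"]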
      interval: 30s
      timeout: 10s
      retries: 5

  redis:
    image: redis:7-alpine
    container_name: ecoai-redis
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes --save 60 1
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 10s
      retries: 5

  agent-api:
    build: .
    container_name: ecoai-agent-api
    ports:
      - "8000:8000"
    volumes:
      - ./src:/app/src
      - ./models:/app/models
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - QDRANT_URL=http://qdrant:6333
      - REDIS_URL=redis://redis:6379
      - LOG_LEVEL=DEBUG
    depends_on:
      ollama:
        condition: service_healthy
      qdrant:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
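    # --app-dir points uvicorn at the bind-mounted source so --reload picks up host edits
    command: uvicorn main:app --app-dir /app/src --host 0.0.0.0 --port 8000 --reload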

volumes:
  ollama_data:
  qdrant_storage:
  redis_data:

Three things this file does right:

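  1. GPU passthrough: The `deploy.resources` block makes your NVIDIA GPU available to both Ollama and your agent API (the agent API only needs it if you later load models in-process). Without it, Ollama falls back to CPU inference, which is roughly an order of magnitude slower.
  2. Health checks: Your agent service won’t start until Ollama, Qdrant, and Redis are actually responding. This prevents those annoying “connection refused” race conditions. Note that Docker runs the first probe one `interval` after the container starts, so with `interval: 30s` the agent waits at least that long; lower the interval if startup time matters.
  3. Volume persistence: Embeddings, models, and Redis data survive container restarts. You don’t want to re-pull a multi-gigabyte model every time you change a line of code.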
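
If the health-check semantics feel abstract, here's a rough Python sketch of the loop Docker effectively runs for each dependency: wait one interval, probe, and give up after `retries` failures in a row. The `wait_until_healthy` helper and its parameters are my own illustration, not part of Docker or any library, and it skips details like `timeout` and `start_period`.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    interval: float = 30.0,
    retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True on the first passing probe, False after `retries` failures in a row."""
    for _ in range(retries):
        # Docker waits one interval before each probe, including the first one.
        sleep(interval)
        try:
            if probe():
                return True
        except Exception:
            # A probe that crashes counts as a failed check.
            pass
    return False


if __name__ == "__main__":
    # Simulated dependency that becomes ready on the third probe.
    answers = iter([False, False, True])
    print(wait_until_healthy(lambda: next(answers), interval=0.1))
```

With the values from the Compose file (30s interval, 5 retries), a dependency that never comes up is declared unhealthy after roughly two and a half minutes, and `agent-api` is never started.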

Step 2: The Python Agent Service (minimal example)

Your `Dockerfile`:

dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ .

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

And your `requirements.txt`:


fastapi
uvicorn
httpx
qdrant-client
redis
pydantic

Now for the actual agent code. This is the simplest possible multi-agent orchestrator you can build:

python
# src/main.py
from fastapi import FastAPI
from pydantic import BaseModel
import httpx
import os

app = FastAPI(title="ECOAI Local Agent Stack")

OLLAMA_URL = os.getenv("OLLAMA_BASE_URL", "http://ollama:11434")

class Query(BaseModel):
    question: str
    context: str = ""

@app.post("/agent/query")
async def agent_query(query: Query):
    """Route a query through a local LLM agent."""
    prompt = f"""You are a senior developer answering a technical question.
    
Context: {query.context}
Question: {query.question}

Answer concisely and provide code examples where relevant."""

    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(
            f"{OLLAMA_URL}/api/generate",
            json={
                "model": "mistral:7b",
                "prompt": prompt,
                "stream": False,
                "options": {
                    "temperature": 0.3,
                    "top_p": 0.9
                }
            }
        )
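        # Surface Ollama errors (e.g. model not pulled yet) instead of a KeyError below.
        response.raise_for_status()
        data = response.json()
        return {"response": data["response"], "model": "mistral:7b"}

@app.get("/health")
async def health():
    # Liveness only: this does not actually probe Ollama, Qdrant, or Redis.
    return {"status": "ok"}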

What this does: A single API endpoint that calls Ollama with a structured prompt. That’s it. From here, you can add agent routing, vector search, or a Redis-backed task queue.
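
To give a feel for what "agent routing" could look like, here's a minimal sketch that picks a system prompt based on keywords in the question. Everything in it is hypothetical: the `Agent` dataclass, the agent names, the prompts, and the keyword lists are placeholders I made up for illustration, and keyword counting is about the crudest routing strategy there is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    system_prompt: str
    keywords: tuple[str, ...]


# Hypothetical agents; names, prompts, and keywords are placeholders.
AGENTS = (
    Agent("reviewer", "You review code for bugs and style issues.", ("review", "bug", "refactor")),
    Agent("researcher", "You answer questions using the provided context.", ("explain", "why", "compare")),
)
DEFAULT_AGENT = Agent("generalist", "You are a senior developer answering a technical question.", ())


def route(question: str) -> Agent:
    """Pick the agent whose keywords appear most often in the question."""
    words = [w.strip("?.,!") for w in question.lower().split()]
    best, best_hits = DEFAULT_AGENT, 0
    for agent in AGENTS:
        hits = sum(1 for w in words if w in agent.keywords)
        if hits > best_hits:
            best, best_hits = agent, hits
    return best


def build_prompt(question: str, context: str = "") -> str:
    """Assemble the prompt that would be sent to Ollama for the routed agent."""
    agent = route(question)
    return f"{agent.system_prompt}\n\nContext: {context}\nQuestion: {question}"
```

In the endpoint above, you would swap the hard-coded prompt for `build_prompt(query.question, query.context)`. A real router would more likely ask the LLM itself to classify the request.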

Step 3: Pull a Model and Run It

First run: you need to pull a model.

bash
# Start everything
docker compose up -d

# Pull Mistral 7B (3.8GB download)
docker exec -it ecoai-ollama ollama pull mistral:7b

# Test the agent
curl -X POST http://localhost:8000/agent/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Write a Python function to merge two sorted lists"}'

You’ll get a response in 2-3 seconds on a modern GPU. On CPU? More like 20-30 seconds. That’s why the GPU passthrough matters.

Honestly, I keep both Mistral 7B and Qwen2.5:7B pulled locally. Mistral for fast iterations, Qwen for when I need better code generation.
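
If you keep several models around, a small check like the one below can tell you which ones still need pulling. It's a sketch that parses the JSON returned by Ollama's `GET /api/tags` endpoint (the same one the original health check hits); the helper names are my own, and `fetch_tags` assumes the stack is up and Ollama is published on `localhost:11434`.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> set[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return {m["name"] for m in tags_payload.get("models", []) if "name" in m}


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """Return the required models that are not pulled yet, preserving order."""
    available = model_names(tags_payload)
    return [name for name in required if name not in available]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Query the running Ollama container; needs the stack to be up."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    todo = missing_models(fetch_tags(), ["mistral:7b", "qwen2.5:7b"])
    print("Still need to pull:", todo or "nothing")
```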

Where This Breaks (And How to Fix It)

This stack isn’t perfect. Here are the three things that’ll bite you:

1. Docker Desktop on macOS has no GPU support

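You can’t pass an Apple Silicon GPU to Docker containers. Solution: Run Ollama natively on the host, remove the `ollama` service (and its entry under `depends_on`) and the `deploy` blocks from the Compose file, and change `OLLAMA_BASE_URL` to `http://host.docker.internal:11434` so the agent container can reach the host.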

2. NVIDIA Container Toolkit isn’t installed

If you get `docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].`, it means you haven’t installed the toolkit.

bash
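# Ubuntu/Debian (assumes NVIDIA's apt repository is already configured)
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker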

3. Port conflicts

If you already have Redis or Qdrant running locally, change the host port:

yaml
ports:
  - "6380:6379"  # Redis on host port 6380
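
Before running `docker compose up`, you can check for conflicts up front. The sketch below tries to connect to each host port the Compose file publishes; the `port_in_use` and `find_conflicts` helpers are my own names, and a successful TCP connection is only a rough signal that something is already listening there.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


# Host ports published by the Compose file above.
STACK_PORTS = {
    "ollama": 11434,
    "qdrant-http": 6333,
    "qdrant-grpc": 6334,
    "redis": 6379,
    "agent-api": 8000,
}


def find_conflicts(ports: dict[str, int]) -> dict[str, int]:
    """Return the subset of ports that are already taken on this machine."""
    return {name: port for name, port in ports.items() if port_in_use(port)}


if __name__ == "__main__":
    for name, port in find_conflicts(STACK_PORTS).items():
        print(f"{name}: host port {port} is already in use, remap it in docker-compose.yml")
```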

Making This Production-Ready

This is fine for local dev. But when you deploy to production, you’ll want:

  • A reverse proxy (Traefik or Nginx) in front of the agent API
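  • Persistent block storage for the vector database (Qdrant’s documentation warns against NFS-style network filesystems, so a NAS only works if it exposes block-level access such as iSCSI)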
  • Log aggregation via Loki or a centralized ELK stack
  • CPU limits on the agent service to prevent resource starvation

Our team at ECOA AI runs this exact pattern in production, but orchestrated via Kubernetes with horizontal pod autoscaling. The Compose file is our local dev mirror of the production environment.

Why I’ll Never Go Back to Native Setup

I used to think Docker was overkill for AI development. “It’s just Python,” I told myself. Then I spent three hours debugging a PyTorch CUDA mismatch between my laptop and our CI server.

Never again.

With Docker Compose for local AI development, I can:

  • Hand this stack to a junior developer in Vietnam and have them productive in 15 minutes
  • Blow away my entire environment and rebuild from scratch in 60 seconds
  • Run multiple agent stacks side-by-side without conflicts, as long as each copy gets its own `container_name` values and host ports

You want to build multi-agent systems? Stop fighting your environment. Containerize everything.

Now go write some agents.

Frequently Asked Questions

How do I use a different LLM model with this stack?

Change the model name in the POST request to Ollama. Run `docker exec -it ecoai-ollama ollama pull llama3.1:8b` to download it, then update the `model` field in your code. The container handles all model management—you just call the API.
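
One way to avoid editing code every time you switch models is to read the model name from the environment. This is a sketch under my own assumptions: the `OLLAMA_MODEL` variable and the `generate_payload` helper are hypothetical names, not something Ollama or the stack above defines. The request body mirrors the one in `src/main.py`.

```python
import os
from typing import Optional

DEFAULT_MODEL = "mistral:7b"


def generate_payload(prompt: str, model: Optional[str] = None) -> dict:
    """Build the JSON body for Ollama's POST /api/generate.

    Precedence: explicit argument, then the (hypothetical) OLLAMA_MODEL
    environment variable, then the default used in this article.
    """
    chosen = model or os.getenv("OLLAMA_MODEL") or DEFAULT_MODEL
    return {
        "model": chosen,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.3, "top_p": 0.9},
    }
```

You would then add something like `OLLAMA_MODEL=llama3.1:8b` to the `environment` list of the `agent-api` service and pass `generate_payload(prompt)` as the `json=` argument.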

Can I run this without a GPU?

Yes, but expect 10-20x slower inference. Remove the entire `deploy.resources` block from both the `ollama` and `agent-api` services. Ollama will fall back to CPU automatically. For serious development, invest in an NVIDIA RTX 3090 or better.

How do I add vector search to this stack?

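Qdrant is already running in this stack and is reachable from the agent service at `http://qdrant:6333`. `qdrant-client` is already listed in `requirements.txt`, so connect using the `QDRANT_URL` environment variable and create collections for your embeddings. The data lives in the `qdrant_storage` volume, so it persists across restarts (until you run `docker compose down -v`).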
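
For a rough idea of the moving parts, here's a dependency-free sketch that builds the three Qdrant REST requests you need first: create a collection, upsert points, and search. The helper names, the collection name, and the 4-dimensional vectors are placeholders of mine; in practice `qdrant-client` wraps these calls for you, and newer Qdrant versions also offer a more general query endpoint alongside `/points/search`.

```python
import json
import urllib.request

QDRANT_URL = "http://localhost:6333"  # inside the Compose network: http://qdrant:6333


def create_collection_request(name: str, dim: int) -> tuple[str, str, dict]:
    """PUT /collections/{name}: declare vector size and distance metric."""
    return "PUT", f"/collections/{name}", {"vectors": {"size": dim, "distance": "Cosine"}}


def upsert_request(name: str, points: list[tuple[int, list[float], dict]]) -> tuple[str, str, dict]:
    """PUT /collections/{name}/points: store (id, vector, payload) triples."""
    body = {"points": [{"id": i, "vector": v, "payload": p} for i, v, p in points]}
    return "PUT", f"/collections/{name}/points", body


def search_request(name: str, vector: list[float], limit: int = 3) -> tuple[str, str, dict]:
    """POST /collections/{name}/points/search: nearest neighbours for a query vector."""
    body = {"vector": vector, "limit": limit, "with_payload": True}
    return "POST", f"/collections/{name}/points/search", body


def send(method: str, path: str, body: dict, base_url: str = QDRANT_URL) -> dict:
    """Send one request to a running Qdrant instance (requires the stack to be up)."""
    req = urllib.request.Request(
        f"{base_url}{path}",
        data=json.dumps(body).encode(),
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    send(*create_collection_request("docs", dim=4))
    send(*upsert_request("docs", [(1, [0.1, 0.2, 0.3, 0.4], {"text": "hello"})]))
    print(send(*search_request("docs", [0.1, 0.2, 0.3, 0.4])))
```

The vectors themselves would come from an embedding model, and the collection's `size` has to match that model's output dimension.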

Can multiple developers share this stack over the network?

Yes, but don’t expose it directly to the internet. Use Tailscale or a VPN. The Compose file above sets no credentials, so add a Redis password (for example with `--requirepass`), enable an API key for Qdrant, and keep the services on an internal Docker network instead of publishing their ports. Security isn’t optional for shared environments.
