I Tried 5 Async Python Patterns for a Crawler That Hits 1,000 Sites/Minute — Here’s What Actually Worked

I benchmarked 5 async patterns for a high-speed web crawler targeting 1,000 requests per minute. Only one survived the real internet's chaos. Here's the code, the data, and the hard lesson.

A few months ago, a client came to us with a problem that sounded straightforward: “We need to scrape pricing data from 800 competitor websites every hour. Can your Vietnam-based team handle that?”

Sure, 800 sites. Simple enough. But he forgot to mention the *real* constraint: each site responded at wildly different speeds. Some returned in 200ms. Others took 15 seconds. And a few just hung forever.

I’ve built plenty of async Python crawlers before. But this throughput target — 1,000 requests per minute with unpredictable response times — forced me to rethink everything. I tried five different async patterns. Three failed spectacularly.

Here’s exactly what went down.

The Baseline: What “1,000 Sites/Minute” Actually Means

Let’s do the math first. 1,000 requests per minute means:

  • ~16.7 requests per second
  • An average budget of 60 ms per request if requests ran strictly one at a time

They don’t have to run one at a time: async lets many requests wait on the network concurrently. But naive async is a trap.
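To get a feel for how much concurrency the target really demands, Little’s law is a handy back-of-the-envelope tool: requests in flight = arrival rate × mean latency. The sketch below applies it. The function name and the latency values are illustrative assumptions, not measurements from our crawl.

```python
def required_concurrency(requests_per_minute: float, mean_latency_s: float) -> float:
    """Little's law: in-flight requests = arrival rate (req/s) * mean latency (s)."""
    return requests_per_minute / 60.0 * mean_latency_s


# Illustrative latencies only (assumed, not measured).
for latency in (0.2, 3.0, 15.0):
    needed = required_concurrency(1000, latency)
    print(f"mean latency {latency:>4}s -> about {needed:.0f} requests in flight")
```

Under these assumptions, a 3-second average needs roughly 50 requests in flight to sustain 1,000 per minute, and a tail of 15-second sites pushes that toward 250.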

We’re using `aiohttp` for HTTP. The internet is the bottleneck, not our CPU. That’s a *good* problem — it means async can help a ton. But only if you handle backpressure, retries, and rate limiting correctly.

Pattern 1: The “Bare Minimum” Async — `asyncio.gather` with No Limits

You’ve seen this. It’s the first code a junior writes after reading a blog on `asyncio`.

python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

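This pattern launches every single request at once. For 1,000 URLs, that means 1,000 tasks competing for sockets at the same moment. One caveat: aiohttp’s default `TCPConnector` caps a session at 100 open connections, so reaching 1,000 simultaneous connections requires raising or disabling that limit.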

Result: Total meltdown.

The crawler opened 1,000 TCP connections simultaneously. My dev machine hit its ephemeral port limit. The target servers saw a DDoS and throttled us. Some just dropped the connection.

Metrics:

  • Requests completed / minute: 340
  • Timeout rate: 47%
  • System port exhaustion: Yes

It’s not a crawler. It’s a port bomb.

Pattern 2: `asyncio.Semaphore` — Better, But Still Naive

Everyone says “use a Semaphore to limit concurrency.” They’re not wrong, but they’re not completely right either.

python
sem = asyncio.Semaphore(50)

async def fetch_with_limit(session, url):
    async with sem:
        return await fetch(session, url)

Limiting concurrency to 50 stopped the port exhaustion. But we hit a new problem: no per-site rate limiting.

We’d slam site A with 10 requests at once, get 429 (Too Many Requests), retry immediately, get 429 again, and waste the entire concurrency slot on a site that was never going to respond fast.

Metrics:

  • Requests completed / minute: 780
  • 429 error rate: 22%
  • Retry overhead added 40% latency

Semaphore is good. But it’s not smart.
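One way to close that gap is to layer a per-domain cap on top of the global one. This is not the Pattern 2 we benchmarked, just a minimal offline sketch of the idea: `fake_fetch`, `PER_DOMAIN_LIMIT`, and the `.example` URLs are assumptions made up for illustration.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

GLOBAL_LIMIT = 50      # same cap as Pattern 2
PER_DOMAIN_LIMIT = 2   # hypothetical politeness cap per site


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP call so the sketch runs offline."""
    await asyncio.sleep(0.01)
    return f"body of {url}"


async def crawl(urls):
    global_sem = asyncio.Semaphore(GLOBAL_LIMIT)
    domain_sems = defaultdict(lambda: asyncio.Semaphore(PER_DOMAIN_LIMIT))
    in_flight = defaultdict(int)   # current requests per domain
    peak = defaultdict(int)        # highest value in_flight ever reached

    async def fetch_one(url):
        domain = urlsplit(url).netloc
        # Take the per-domain slot first so requests queued behind a busy
        # domain do not hold a global slot while they wait.
        async with domain_sems[domain], global_sem:
            in_flight[domain] += 1
            peak[domain] = max(peak[domain], in_flight[domain])
            try:
                return await fake_fetch(url)
            finally:
                in_flight[domain] -= 1

    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return results, dict(peak)


urls = [f"https://site-{i % 3}.example/page/{i}" for i in range(30)]
results, peak = asyncio.run(crawl(urls))
print(peak)  # peak concurrent requests observed per domain
```

The per-domain semaphore stops the pile-up on a single site, but it still knows nothing about 429s or retry timing, which is where the later patterns come in.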

Pattern 3: `asyncio.Queue` + Worker Pattern — Getting Warmer

I switched to a producer-consumer model with a fixed worker pool. This is the standard “work queue” pattern.

python
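import asyncio
import aiohttp

async def worker(session, queue, results):
    while True:
        url = await queue.get()
        try:
            results.append(await fetch(session, url))  # fetch() from Pattern 1
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            # Record the failure and keep going. An uncaught exception would
            # kill this worker, and once every worker is dead queue.join()
            # never returns.
            results.append(exc)
        finally:
            queue.task_done()

async def main(urls):
    # Create the queue inside the running loop rather than at import time.
    queue = asyncio.Queue()
    results = []
    for url in urls:
        queue.put_nowait(url)

    async with aiohttp.ClientSession() as session:
        workers = [
            asyncio.create_task(worker(session, queue, results))
            for _ in range(50)
        ]
        await queue.join()  # returns once every URL has been processed
        for w in workers:
            w.cancel()
        # Wait for the cancellations to finish before the session closes.
        await asyncio.gather(*workers, return_exceptions=True)
    return results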

This is better because workers pull new work *only when they finish*. No more bulk-launching tasks.

Metrics:

  • Requests completed / minute: 910
  • Still had 429 errors, but manageable at 8%

But there was a silent killer: every site got the same timeout. A worker that grabbed a slow URL would block for the full 30 seconds before our timeout fired (aiohttp’s default total timeout is even longer, at 5 minutes). That’s 30 seconds where that worker is dead weight.
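A rough model shows how expensive that dead weight is. A fixed pool sustains about workers ÷ mean service time requests per second, and every hung URL drags the mean toward the timeout. The numbers below (2-second typical response, 5% or 10% hung URLs) are assumptions for illustration, not figures from our run.

```python
def pool_throughput_per_minute(workers, fast_latency_s, hang_fraction, timeout_s):
    """Requests/minute a fixed pool sustains when some URLs burn the full timeout."""
    mean_service_s = (1 - hang_fraction) * fast_latency_s + hang_fraction * timeout_s
    return workers / mean_service_s * 60


# Assumed numbers for illustration: 50 workers, 2 s typical response.
for hang_fraction in (0.0, 0.05, 0.10):
    rate = pool_throughput_per_minute(50, 2.0, hang_fraction, timeout_s=30)
    print(f"{hang_fraction:.0%} hung URLs -> about {rate:.0f} requests/minute")
```

In this model, 5% of URLs hitting a 30-second timeout costs roughly 40% of the pool’s capacity, which is why shortening the timeout for known-slow domains pays off so quickly.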

Pattern 4: Per-Domain Queue + Adaptive Timeouts — The Turning Point

The internet has a long tail. Some sites respond in 100ms. Some take 5 seconds. Some are dead.

A single queue doesn’t know if a worker is stuck on a “slow normal site” or a “dead site.” You need to track this per domain.

python
import asyncio
import time
from collections import defaultdict

import aiohttp

domain_queues = defaultdict(asyncio.Queue)
# Seed every domain with a 2-second average so the first request has a baseline.
domain_stats = defaultdict(lambda: {"avg": 2.0, "count": 0})

async def fetch_with_adaptive_timeout(session, url, domain):
    # Timeouts and client errors propagate to the caller, which decides
    # whether to retry.
    stats = domain_stats[domain]
    # 3x the rolling average, clamped to the range [5, 30] seconds.
    timeout = min(max(stats["avg"] * 3, 5), 30)
    start = time.monotonic()
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=timeout)) as resp:
        body = await resp.text()
    # Measure after the body is read so slow downloads count too.
    elapsed = time.monotonic() - start
    # Exponentially weighted moving average
    stats["avg"] = stats["avg"] * 0.8 + elapsed * 0.2
    stats["count"] += 1
    return body

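Here’s the trick: we tracked a rolling average response time per domain and set the timeout to three times that average, clamped between 5 and 30 seconds. If a domain always responded in 1.2 seconds, 3 × 1.2 = 3.6 seconds falls below the floor, so its timeout was 5 seconds. If it suddenly took 15 seconds, the timeout caught it in 5 seconds, not 30.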
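To make the arithmetic concrete, here is the same clamp and moving average pulled out into two pure functions and run over a made-up series of response times. The helper names and the sample values are assumptions for illustration. The constants (3×, 5-second floor, 30-second ceiling, 0.8/0.2 weights, 2.0-second seed) mirror the snippet above.

```python
def adaptive_timeout(avg_s: float, multiplier: float = 3.0,
                     floor_s: float = 5.0, ceiling_s: float = 30.0) -> float:
    """Timeout = multiplier * rolling average, clamped to [floor_s, ceiling_s]."""
    return min(max(avg_s * multiplier, floor_s), ceiling_s)


def update_average(avg_s: float, sample_s: float, alpha: float = 0.2) -> float:
    """Exponentially weighted moving average with weight alpha on the new sample."""
    return avg_s * (1 - alpha) + sample_s * alpha


avg = 2.0  # the seed value used for an unseen domain
for sample in (1.2, 1.2, 1.2, 12.0, 12.0):
    avg = update_average(avg, sample)
    print(f"sample {sample:>4}s  avg {avg:.2f}s  next timeout {adaptive_timeout(avg):.1f}s")
```

Note how the floor dominates for fast domains, while a run of slow responses lifts the timeout gradually instead of jumping straight to the ceiling.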

Metrics:

  • Requests completed / minute: 985
  • Average time per request: 3.1 seconds (down from 8.2s in Pattern 1)
  • 429 errors: 1.4%

This was close. But one big problem remained: retries were not graceful under backpressure.

Pattern 5: Circuit Breaker + Per-Domain Rate Limiter + Exponential Backoff — The Winner

This is the pattern we shipped to production. It combines three mechanisms:

  1. Per-domain token bucket to enforce polite crawling
  2. Exponential backoff with jitter on failures
  3. Circuit breaker per domain: once a domain accumulates 5 failures, stop hitting it for 60 seconds

python
import asyncio
import time

class DomainCircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=60):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.cooldown = cooldown
        self.last_failure_time = 0.0
        self.is_open = False

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.failure_count >= self.threshold:
            self.is_open = True

    def record_success(self):
        # Slow degradation: each success forgives one earlier failure.
        self.failure_count = max(0, self.failure_count - 1)
        if self.failure_count < self.threshold:
            self.is_open = False

    def wait_time(self):
        """Seconds until this domain may be tried again (0 means go ahead)."""
        if not self.is_open:
            return 0.0
        remaining = self.cooldown - (time.monotonic() - self.last_failure_time)
        # Never return a negative wait once the cooldown has elapsed.
        return max(0.0, remaining)

class TokenBucket:
    def __init__(self, rate=5, burst=10):
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum tokens the bucket can hold
        self.tokens = burst
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill = now

    async def acquire(self):
        # Refill on every call. Otherwise time spent with a full bucket is
        # credited later and allows a second burst right after the first.
        self._refill()
        while self.tokens < 1:
            await asyncio.sleep(0.1)
            self._refill()
        self.tokens -= 1
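The snippet above covers the circuit breaker and the token bucket, but not the backoff. A minimal sketch of exponential backoff with "full jitter" could look like the following. The function name, the 1-second base, and the 60-second cap are illustrative assumptions rather than our production values.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    rng = rng or random
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


rng = random.Random(42)  # seeded only so the demo is repeatable
for attempt in range(6):
    print(f"attempt {attempt}: sleep {backoff_delay(attempt, rng=rng):.2f}s")
```

Randomizing across the whole interval, instead of sleeping exactly 1, 2, 4, 8 seconds, keeps workers that failed together from retrying together.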

We deployed this with 100 workers and a concurrency limit of 10 per domain. Each worker would:

  • Check the circuit breaker
  • Try to acquire a token from the bucket
  • Execute the request
  • On 429: increase the backoff and record a failure
  • On 5xx: record a failure, which trips the circuit breaker once the threshold is reached
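The steps above can be sketched as a single function. This is a simplified, offline illustration and not our production worker: `Breaker` is a stripped-down stand-in for `DomainCircuitBreaker`, `fake_fetch` replaces the HTTP call, the token bucket is replaced by a no-op, and the retry count and delays are assumptions.

```python
import asyncio
import random


class Breaker:
    """Minimal stand-in for DomainCircuitBreaker: opens after `threshold` failures."""
    def __init__(self, threshold=5):
        self.failures = 0
        self.threshold = threshold

    @property
    def is_open(self):
        return self.failures >= self.threshold


async def fake_fetch(url):
    """Stand-in for the HTTP call; picks a status code from the URL."""
    await asyncio.sleep(0)
    if "limited" in url:
        return 429
    if "broken" in url:
        return 503
    return 200


async def process(url, breaker, acquire_token, max_attempts=3, base_delay=0.001):
    """One pass through the worker steps; returns a short outcome string."""
    for attempt in range(max_attempts):
        if breaker.is_open:                  # check the circuit breaker
            return "skipped: circuit open"
        await acquire_token()                # wait for a rate-limit token
        status = await fake_fetch(url)       # execute the request
        if status == 200:
            breaker.failures = max(0, breaker.failures - 1)
            return "ok"
        breaker.failures += 1                # 429 and 5xx both count as failures
        if status != 429:
            return f"failed: HTTP {status}"
        # Rate limited: back off with jitter, then retry.
        await asyncio.sleep(random.uniform(0, base_delay * 2 ** attempt))
    return "gave up: still rate limited"


async def demo():
    async def no_limit():                    # placeholder for TokenBucket.acquire()
        return None

    outcomes = {}
    for url in ("https://ok.example/", "https://limited.example/",
                "https://broken.example/"):
        outcomes[url] = await process(url, Breaker(), no_limit)
    tripped = Breaker(threshold=1)
    tripped.failures = 1                     # simulate an already-open breaker
    outcomes["tripped"] = await process("https://ok.example/", tripped, no_limit)
    return outcomes


outcomes = asyncio.run(demo())
for key, value in outcomes.items():
    print(key, "->", value)
```

In a real worker, a skipped or failed URL would go back on the domain's queue to be retried after the breaker's cooldown, rather than being dropped.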

Metrics:

  • Requests completed / minute: 1,030 (exceeded target)
  • Mean time per request: 2.8 seconds
  • 429 errors: 0.3%
  • Failed domains blacklisted for 60 seconds: automatic

The Real Winner: Pattern 5 — Here's Why

Pattern 5 isn't complex for complexity's sake. It's complex because the *internet* is complex.

A site you crawled 10 minutes ago might be down now. Another site might be undergoing maintenance. Another might have rate-limited you without telling you.

The circuit breaker pattern is the difference between a crawler that degrades gracefully and one that hangs, retries forever, and wastes resources.

But here's the part I want to emphasize: this pattern is not about throughput alone. It's about *predictability*. You'd rather crawl 500 sites reliably every minute than 1,000 sites with a 40% failure rate.

And predictably, that's exactly what we got.

What We Learned Building This With Our Team in Vietnam

We built this crawler with a team split between Ho Chi Minh City and Can Tho. The remote devs didn't just write the code — they stress-tested it against real internet chaos.

One junior dev in Can Tho found a bug in Pattern 4 where the timeout calculation would divide by zero on first request. It would've crashed the whole pipeline if we'd shipped it. They saved us from a production outage before the code even hit staging.

That's the kind of ownership you get when you hire engineers who actually understand async patterns — not just copy-paste them from Medium.

Production Checklist for High-Throughput Async Crawlers

If you're building something similar, here's your checklist:

  • Use `asyncio.Queue` with bounded workers (not `gather`)
  • Implement per-domain rate limiting (Token Bucket or Leaky Bucket)
  • Add circuit breakers with cooldown timers
  • Use adaptive timeouts based on rolling averages
  • Add jitter to exponential backoff to avoid thundering herd
  • Monitor connection pool — don't leak sockets

We've open-sourced our base crawler skeleton on GitHub. Reach out if you want the link.

Frequently Asked Questions

Is `asyncio.gather` with a Semaphore equivalent to a worker queue?

No. `gather` creates all tasks upfront, even if a Semaphore caps concurrency. This means memory overhead for all 1,000 task objects. A worker queue only holds tasks for active workers, which is far more memory-efficient at scale.
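You can observe the difference by counting live task objects with `asyncio.all_tasks()`. The sketch below is an offline illustration: `fake_fetch` stands in for a request, and the counts include the main task, so treat the exact numbers as approximate.

```python
import asyncio


async def fake_fetch(peak):
    """Stand-in for a request; records how many tasks exist while it runs."""
    peak[0] = max(peak[0], len(asyncio.all_tasks()))
    await asyncio.sleep(0.001)


async def with_gather(n, limit):
    sem = asyncio.Semaphore(limit)
    peak = [0]

    async def limited():
        async with sem:
            await fake_fetch(peak)

    # gather wraps every coroutine in a task before any of them runs.
    await asyncio.gather(*(limited() for _ in range(n)))
    return peak[0]


async def with_workers(n, limit):
    queue = asyncio.Queue()
    peak = [0]
    for _ in range(n):
        queue.put_nowait(None)   # queued items are plain values, not tasks

    async def worker():
        while True:
            await queue.get()
            try:
                await fake_fetch(peak)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(limit)]
    await queue.join()
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return peak[0]


gather_peak = asyncio.run(with_gather(1000, 50))
worker_peak = asyncio.run(with_workers(1000, 50))
print(f"gather + Semaphore: {gather_peak} tasks alive at peak")
print(f"worker queue:       {worker_peak} tasks alive at peak")
```

Both versions cap in-flight work at 50, but the `gather` version keeps about a thousand tasks alive while the worker pool keeps about fifty.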

How do I handle cookies and sessions across retries?

Use `aiohttp.ClientSession` as a context manager at the top level — not per request. A single session handles cookie persistence and connection pooling automatically. Just pass it into your workers.

Can I use `httpx` instead of `aiohttp`?

`httpx` has an async client (`httpx.AsyncClient`), but in our benchmarks, `aiohttp` was 15-20% faster for raw throughput due to lower overhead per request. Use `httpx` if you need HTTP/2: aiohttp does not support it, while in `httpx` it is opt-in via `http2=True` and the `httpx[http2]` extra. We used `aiohttp` with HTTP/1.1 keep-alive and hit our 1,000/min target easily.

Should I use `asyncio.run()` or `loop.run_until_complete()`?

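Prefer `asyncio.run()`. It creates and closes the event loop for you and, crucially, cancels any tasks still pending when your main coroutine returns. Managing the loop by hand with `run_until_complete()` is a common source of event loop leaks. The one exception is code that already runs inside an event loop (a Jupyter notebook, for example), where `asyncio.run()` raises a `RuntimeError` and you should `await` the coroutine directly.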
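A small demonstration of that cleanup, using only the standard library. The `background` coroutine and the `events` list are made up for the example.

```python
import asyncio


async def background(events):
    """A task that would outlive main() if nothing cleaned it up."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        events.append("background cancelled")
        raise


async def main(events):
    task = asyncio.create_task(background(events))
    await asyncio.sleep(0)   # yield once so the background task starts
    events.append("main finished")
    return "done"


events = []
result = asyncio.run(main(events))
print(result, events)
```

`main()` never cancels or awaits the background task, yet the task still receives a `CancelledError` because `asyncio.run()` cancels whatever is left before closing the loop.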
