I Tried 5 Async Python Patterns for a Crawler That Hits 1,000 Sites/Minute — Here’s What Actually Worked
A few months ago, a client came to us with a problem that sounded straightforward: “We need to scrape pricing data from 800 competitor websites every hour. Can your Vietnam-based team handle that?”
Sure, 800 sites. Simple enough. But he forgot to mention the *real* constraint: each site responded at wildly different speeds. Some returned in 200ms. Others took 15 seconds. And a few just hung forever.
I’ve built plenty of async Python crawlers before. But this throughput target — 1,000 requests per minute with unpredictable response times — forced me to rethink everything. I tried five different async patterns. Three failed spectacularly.
Here’s exactly what went down.
The Baseline: What “1,000 Sites/Minute” Actually Means
Let’s do the math first. 1,000 requests per minute means:
- ~16.7 requests per second
- Average budget of 60ms per request if they were synchronous
They’re not synchronous. So async buys us concurrency. But naive async is a trap.
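A quick way to size the worker pool is Little's law: requests in flight = arrival rate × average time per request. The sketch below applies it to the 1,000/minute target. The function name `required_concurrency` is made up for illustration, and the latencies plugged in are the per-request averages reported later in this post.

```python
import math

def required_concurrency(requests_per_minute: float, avg_latency_s: float) -> float:
    """Little's law: in-flight requests = arrival rate * time per request."""
    return requests_per_minute / 60.0 * avg_latency_s

# Average per-request times reported for Patterns 5, 4, and 1 in this post
for latency in (2.8, 3.1, 8.2):
    slots = math.ceil(required_concurrency(1000, latency))
    print(f"{latency:.1f}s average -> at least {slots} requests in flight")
```

Treat the result as a floor, not a target: it ignores retries, timeouts, and per-site limits, which is part of why the patterns below run 50 to 100 workers.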
We’re using `aiohttp` for HTTP. The internet is the bottleneck, not our CPU. That’s a *good* problem — it means async can help a ton. But only if you handle backpressure, retries, and rate limiting correctly.
Pattern 1: The “Bare Minimum” Async — `asyncio.gather` with No Limits
You’ve seen this. It’s the first code a junior writes after reading a blog on `asyncio`.
```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        # One coroutine per URL, all started at once: no concurrency cap
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results
```
This pattern launched every single request at once. For 1,000 URLs, that means 1,000 concurrent connections.
Result: Total meltdown.
The crawler opened 1,000 TCP connections simultaneously. My dev machine hit its ephemeral port limit. The target servers saw a DDoS and throttled us. Some just dropped the connection.
Metrics:
- Requests completed / minute: 340
- Timeout rate: 47%
- System port exhaustion: Yes
It’s not a crawler. It’s a port bomb.
Pattern 2: `asyncio.Semaphore` — Better, But Still Naive
Everyone says “use a Semaphore to limit concurrency.” They’re not wrong, but they’re not completely right either.
```python
sem = asyncio.Semaphore(50)  # at most 50 requests in flight at once

async def fetch_with_limit(session, url):
    async with sem:
        return await fetch(session, url)
```
Limiting concurrency to 50 stopped the port exhaustion. But we hit a new problem: no per-site rate limiting.
We’d slam site A with 10 requests at once, get 429 (Too Many Requests), retry immediately, get 429 again, and waste the entire concurrency slot on a site that was never going to respond fast.
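The gap is easy to reproduce without touching the network. The sketch below is an illustration, not our crawler: `peak_per_host` and the `.example` hostnames are hypothetical, and `asyncio.sleep` stands in for the HTTP call. It runs fake requests under one global semaphore and records the highest concurrency each host saw.

```python
import asyncio
from collections import Counter

async def peak_per_host(hosts: list[str], limit: int) -> Counter:
    """Run one fake request per entry in `hosts` under a single global
    semaphore and return the peak concurrency observed for each host."""
    sem = asyncio.Semaphore(limit)
    in_flight: Counter = Counter()
    peak: Counter = Counter()

    async def fake_fetch(host: str) -> None:
        async with sem:
            in_flight[host] += 1
            peak[host] = max(peak[host], in_flight[host])
            await asyncio.sleep(0.01)  # stands in for network latency
            in_flight[host] -= 1

    await asyncio.gather(*(fake_fetch(h) for h in hosts))
    return peak

if __name__ == "__main__":
    # 10 URLs on one site plus 40 URLs on 40 other sites
    hosts = ["site-a.example"] * 10 + [f"site-{i}.example" for i in range(40)]
    peaks = asyncio.run(peak_per_host(hosts, limit=50))
    print("peak concurrency on site-a.example:", peaks["site-a.example"])
```

The global cap of 50 holds, yet nothing stops all ten `site-a.example` requests from running at the same moment. That is the burst that earns the 429s.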
Metrics:
- Requests completed / minute: 780
- 429 error rate: 22%
- Retry overhead added 40% latency
Semaphore is good. But it’s not smart.
Pattern 3: `asyncio.Queue` + Worker Pattern — Getting Warmer
I switched to a producer-consumer model with a fixed worker pool. This is the standard “work queue” pattern.
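```python
import asyncio
import aiohttp

queue = asyncio.Queue()
results = []

async def worker(session, name):
    while True:
        url = await queue.get()
        try:
            data = await fetch(session, url)  # fetch() from Pattern 1
            results.append(data)
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            # Without this, one failed request kills the worker, and
            # queue.join() hangs once every worker has died
            results.append(exc)
        finally:
            queue.task_done()

async def main(urls):
    for url in urls:
        queue.put_nowait(url)
    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(worker(session, i)) for i in range(50)]
        await queue.join()  # returns once every URL has been processed
        for w in workers:
            w.cancel()
        # Let the cancellations finish before the session closes
        await asyncio.gather(*workers, return_exceptions=True)
```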
This is better because workers pull new work *only when they finish*. No more bulk-launching tasks.
Metrics:
- Requests completed / minute: 910
- Still had 429 errors, but manageable at 8%
But there was a silent killer: every URL got the same fixed timeout. A worker that grabbed a slow URL would block for the full 30 seconds before the timeout fired. That’s 30 seconds where that worker is dead weight.
Pattern 4: Per-Domain Queue + Adaptive Timeouts — The Turning Point
The internet has a long tail. Some sites respond in 100ms. Some take 5 seconds. Some are dead.
A single queue doesn’t know if a worker is stuck on a “slow normal site” or a “dead site.” You need to track this per domain.
```python
import asyncio
import time
from collections import defaultdict

import aiohttp

domain_queues = defaultdict(asyncio.Queue)  # one work queue per domain
# Rolling average response time per domain, seeded at 2.0 seconds
domain_stats = defaultdict(lambda: {"avg": 2.0, "count": 0})

async def fetch_with_adaptive_timeout(session, url, domain):
    stats = domain_stats[domain]
    # 3x the rolling average, clamped to between 5 and 30 seconds
    timeout = min(max(stats["avg"] * 3, 5), 30)
    start = time.monotonic()
    # Timeouts and client errors propagate to the caller
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=timeout)) as resp:
        body = await resp.text()
    elapsed = time.monotonic() - start  # includes reading the body
    # Exponentially weighted moving average
    stats["avg"] = stats["avg"] * 0.8 + elapsed * 0.2
    stats["count"] += 1
    return body
```
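Here’s the trick: we tracked a rolling average of response time per domain and set the timeout to three times that average, clamped to between 5 and 30 seconds. A domain that always responded in 1.2 seconds would get 3.6 seconds from the multiplier, which the floor raises to 5. If that domain suddenly stalled, the timeout fired after 5 seconds, not 30.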
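To make the arithmetic concrete, here is the timeout rule and the moving average pulled out as two pure functions. This is a sketch for illustration: the function names and keyword parameters are mine, and the constants (factor 3, floor 5 s, ceiling 30 s, weight 0.2) are copied from the code above.

```python
def adaptive_timeout(avg_s: float, factor: float = 3.0,
                     floor_s: float = 5.0, ceiling_s: float = 30.0) -> float:
    """Timeout = factor * rolling average, clamped to [floor_s, ceiling_s]."""
    return min(max(avg_s * factor, floor_s), ceiling_s)

def update_average(avg_s: float, elapsed_s: float, alpha: float = 0.2) -> float:
    """Exponentially weighted moving average; alpha weights the newest sample."""
    return avg_s * (1 - alpha) + elapsed_s * alpha

avg = 1.2
print(adaptive_timeout(avg))      # 3 * 1.2 = 3.6 is below the floor, so the floor applies
avg = update_average(avg, 4.5)    # one slower response that still beat the timeout
print(round(avg, 2), round(adaptive_timeout(avg), 2))
```

One thing worth noticing: in the code above, a request that times out raises before the averaging line runs, so the average only ever reflects responses that succeeded.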
Metrics:
- Requests completed / minute: 985
- Average time per request: 3.1 seconds (down from 8.2s in Pattern 1)
- 429 errors: 1.4%
This was close. But we still had one big problem: graceful retry under backpressure.
Pattern 5: Circuit Breaker + Per-Domain Rate Limiter + Exponential Backoff — The Winner
This is the pattern we shipped to production. It combines three mechanisms:
- Per-domain token bucket to enforce polite crawling
- Exponential backoff with jitter on failures
- Circuit breaker per domain — if it fails 5 times in a row, stop hitting it for 60 seconds
```python
import asyncio
import time

class DomainCircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=60):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.cooldown = cooldown
        self.last_failure_time = 0.0
        self.is_open = False

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.failure_count >= self.threshold:
            self.is_open = True

    def record_success(self):
        self.failure_count = max(0, self.failure_count - 1)  # slow recovery
        if self.failure_count < self.threshold:
            self.is_open = False

    def wait_time(self):
        """Seconds until the domain may be tried again (0 if allowed now)."""
        if not self.is_open:
            return 0
        remaining = self.cooldown - (time.monotonic() - self.last_failure_time)
        if remaining <= 0:
            # Cooldown over: allow a probe request. Without this reset the
            # breaker would stay open forever, since no request could succeed.
            self.is_open = False
            self.failure_count = self.threshold - 1  # one more failure re-opens
            return 0
        return remaining

class TokenBucket:
    def __init__(self, rate=5, burst=10):
        self.rate = rate    # tokens added per second
        self.burst = burst  # maximum tokens the bucket holds
        self.tokens = burst
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_refill = now

    async def acquire(self):
        # Refill on every call so idle time is never counted twice
        self._refill()
        while self.tokens < 1:
            await asyncio.sleep(0.1)
            self._refill()
        self.tokens -= 1
```
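The third mechanism, exponential backoff with jitter, can be sketched as below. This is a minimal version using the common "full jitter" variant, not necessarily the exact formula we shipped: `backoff_delay`, the 1-second base, and the 60-second cap are all assumptions for illustration.

```python
import random

def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Full-jitter backoff: a uniform random delay between 0 and
    min(cap_s, base_s * 2**attempt), where attempt starts at 0."""
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return random.uniform(0.0, ceiling)

# The ceiling doubles per attempt until it reaches the cap
for attempt in range(8):
    print(attempt, round(backoff_delay(attempt), 2))
```

A worker would `await asyncio.sleep(backoff_delay(attempt))` before retrying. The randomness matters because workers that failed together would otherwise all retry at the same instant.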
We deployed this with 100 workers and a concurrency limit of 10 per domain. Each worker would:
- Check the circuit breaker
- Try to acquire a token from the bucket
- Execute the request
- On 429: increase backoff, record failure
- On 5xx: trip the circuit breaker
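The steps above can be sketched as a single function. This is a simplified outline, not our production worker: `crawl_once` and `send` are hypothetical names, `send` is assumed to be a coroutine function returning `(status, body)`, and `breaker` and `bucket` are assumed to expose the methods of the two classes shown earlier. It treats 429 and 5xx the same way (one recorded failure each) and leaves the backoff sleep to the caller.

```python
import asyncio

async def crawl_once(url, breaker, bucket, send):
    """One pass through the worker steps for a single URL.

    Returns a (outcome, body) tuple where outcome is one of
    "skipped", "error", "retry", or "ok".
    """
    if breaker.wait_time() > 0:
        return ("skipped", None)      # circuit open: leave the domain alone
    await bucket.acquire()            # per-domain politeness
    try:
        status, body = await send(url)
    except (asyncio.TimeoutError, OSError):
        # With aiohttp you would also catch aiohttp.ClientError here
        breaker.record_failure()
        return ("error", None)
    if status == 429 or status >= 500:
        breaker.record_failure()      # caller backs off before retrying
        return ("retry", None)
    breaker.record_success()
    return ("ok", body)
```

Keeping the HTTP call behind `send` makes the control flow testable without a network, and the worker loop decides what to do with each outcome.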
Metrics:
- Requests completed / minute: 1,030 (exceeded target)
- Mean time per request: 2.8 seconds
- 429 errors: 0.3%
- Failed domains blacklisted for 60 seconds: automatic
The Real Winner: Pattern 5 — Here's Why
Pattern 5 isn't complex for complexity's sake. It's complex because the *internet* is complex.
A site you crawled 10 minutes ago might be down now. Another site might be undergoing maintenance. Another might have rate-limited you without telling you.
The circuit breaker pattern is the difference between a crawler that degrades gracefully and one that hangs, retries forever, and wastes resources.
But here's the part I want to emphasize: this pattern is not about throughput alone. It's about *predictability*. You'd rather crawl 500 sites reliably every minute than 1,000 sites with a 40% failure rate.
And predictably, that's exactly what we got.
What We Learned Building This With Our Team in Vietnam
We built this crawler with a team split between Ho Chi Minh City and Can Tho. The remote devs didn't just write the code — they stress-tested it against real internet chaos.
One junior dev in Can Tho found a bug in Pattern 4 where the timeout calculation would divide by zero on first request. It would've crashed the whole pipeline if we'd shipped it. They saved us from a production outage before the code even hit staging.
That's the kind of ownership you get when you hire engineers who actually understand async patterns — not just copy-paste them from Medium.
Production Checklist for High-Throughput Async Crawlers
If you're building something similar, here's your checklist:
- Use `asyncio.Queue` with bounded workers (not `gather`)
- Implement per-domain rate limiting (Token Bucket or Leaky Bucket)
- Add circuit breakers with cooldown timers
- Use adaptive timeouts based on rolling averages
- Add jitter to exponential backoff to avoid thundering herd
- Monitor connection pool — don't leak sockets
We've open-sourced our base crawler skeleton on GitHub. Reach out if you want the link.
Frequently Asked Questions
Is `asyncio.gather` with a Semaphore equivalent to a worker queue?
No. `gather` creates all tasks upfront, even if a Semaphore caps concurrency. This means memory overhead for all 1,000 task objects. A worker queue only holds tasks for active workers, which is far more memory-efficient at scale.
How do I handle cookies and sessions across retries?
Use `aiohttp.ClientSession` as a context manager at the top level — not per request. A single session handles cookie persistence and connection pooling automatically. Just pass it into your workers.
Can I use `httpx` instead of `aiohttp`?
`httpx` has an async client (`httpx.AsyncClient`), but in our benchmarks, `aiohttp` was 15-20% faster for raw throughput due to lower overhead per request. Use `httpx` if you need HTTP/2 support (aiohttp doesn't support it natively). We used `aiohttp` with HTTP/1.1 keep-alive and hit our 1,000/min target easily.
Should I use `asyncio.run()` or `loop.run_until_complete()`?
Always `asyncio.run()`. It handles creating and closing the event loop, and crucially, it cleans up unfinished tasks properly. Using `run_until_complete()` manually is a common source of event loop leaks.
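A minimal entry point looks like this. The body of `main` is a placeholder, not the crawler itself.

```python
import asyncio

async def main() -> int:
    await asyncio.sleep(0)  # placeholder for the real crawl
    return 42

if __name__ == "__main__":
    # Creates a fresh event loop, runs main() to completion, then closes the loop
    result = asyncio.run(main())
    print(result)
```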