I Tried 5 Async Python Patterns for a Crawler That Hits 1,000 Sites/Minute — Here’s What Actually Worked
A few months ago, a client came to us with a problem that sounded straightforward: “We need to scrape pricing data from 800 competitor websites every hour. Can your Vietnam-based team handle that?”
Sure, 800 sites. Simple enough. But he forgot to mention the *real* constraint: each site responded at wildly different speeds. Some returned in 200ms. Others took 15 seconds. And a few just hung forever.
I’ve built plenty of async Python crawlers before. But this throughput target — 1,000 requests per minute with unpredictable response times — forced me to rethink everything. I tried five different async patterns. Three failed spectacularly.
Here’s exactly what went down.
The Baseline: What “1,000 Sites/Minute” Actually Means
Let’s do the math first. 1,000 requests per minute means:
- ~16.7 requests per second
- Average budget of 60ms per request if they were synchronous
They’re not synchronous. So async buys us concurrency. But naive async is a trap.
We’re using `aiohttp` for HTTP. The internet is the bottleneck, not our CPU. That’s a *good* problem — it means async can help a ton. But only if you handle backpressure, retries, and rate limiting correctly.
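A quick way to size the problem is Little's Law: requests in flight ≈ throughput × average time per request. The sketch below applies it to the 1,000/minute target. The function name `required_concurrency` and the three mean latencies are assumptions chosen for illustration, not measurements from this project.

```python
import math

def required_concurrency(requests_per_minute: float, mean_latency_s: float) -> int:
    """Little's Law: requests in flight = throughput (req/s) * mean latency (s)."""
    return math.ceil(requests_per_minute * mean_latency_s / 60)

# Assumed mean latencies, for illustration only
for latency in (0.2, 3.0, 15.0):
    needed = required_concurrency(1000, latency)
    print(f"{latency:>5}s mean latency -> about {needed} requests in flight")
```

The takeaway is that the concurrency you need depends on latency as much as on the target rate, which is why a handful of very slow sites can eat most of a fixed worker pool.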
Pattern 1: The “Bare Minimum” Async — `asyncio.gather` with No Limits
You’ve seen this. It’s the first code a junior writes after reading a blog on `asyncio`.
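```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def crawl(urls):
    # aiohttp's default connector pools at most 100 connections;
    # TCPConnector(limit=0) removes that cap
    async with aiohttp.ClientSession() as session:
        # One coroutine per URL, all scheduled at once with no cap of our own
        tasks = [fetch(session, url) for url in urls]
        # return_exceptions=True collects failures instead of raising on the first
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results
```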
This pattern launched every single request at once. For 1,000 URLs, that means 1,000 concurrent connections.
Result: Total meltdown.
The crawler opened 1,000 TCP connections simultaneously. My dev machine hit its ephemeral port limit. The target servers saw a DDoS and throttled us. Some just dropped the connection.
Metrics:
- Requests completed / minute: 340
- Timeout rate: 47%
- System port exhaustion: Yes
It’s not a crawler. It’s a port bomb.
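To see the "everything at once" behavior without touching the network, here is a small simulation. It is a sketch under stated assumptions: `fake_fetch` and `peak_with_gather` are hypothetical stand-ins that use `asyncio.sleep` in place of a real HTTP request and simply count how many calls overlap.

```python
import asyncio

async def fake_fetch(state: dict, delay: float) -> None:
    """Stand-in for an HTTP request: tracks how many calls overlap."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    try:
        await asyncio.sleep(delay)  # simulated network wait
    finally:
        state["in_flight"] -= 1

async def peak_with_gather(n: int) -> int:
    state = {"in_flight": 0, "peak": 0}
    # Same shape as crawl(): one coroutine per URL, all handed to gather at once
    await asyncio.gather(*(fake_fetch(state, 0.05) for _ in range(n)))
    return state["peak"]

peak = asyncio.run(peak_with_gather(1000))
print(f"peak in-flight requests: {peak}")
```

Every coroutine reaches its first `await` before any of them finishes, so the peak equals the number of URLs. Nothing in `gather` pushes back.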
Pattern 2: `asyncio.Semaphore` — Better, But Still Naive
Everyone says “use a Semaphore to limit concurrency.” They’re not wrong, but they’re not completely right either.
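```python
import asyncio

sem = asyncio.Semaphore(50)  # at most 50 requests in flight at once

async def fetch_with_limit(session, url):
    # Waits here until one of the 50 slots is free, then reuses fetch() from Pattern 1
    async with sem:
        return await fetch(session, url)
```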
Limiting concurrency to 50 stopped the port exhaustion. But we hit a new problem: no per-site rate limiting.
We’d slam site A with 10 requests at once, get 429 (Too Many Requests), retry immediately, get 429 again, and waste the entire concurrency slot on a site that was never going to respond fast.
Metrics:
- Requests completed / minute: 780
- 429 error rate: 22%
- Retry overhead added 40% latency
Semaphore is good. But it’s not smart.
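One way to picture what a single global semaphore lacks is a second, smaller cap per host. The sketch below is illustrative only: `DomainLimiter`, its `slot` method, and the limits of 50 and 2 are assumptions, and the demo uses `asyncio.sleep` instead of real requests.

```python
import asyncio
from collections import defaultdict
from contextlib import asynccontextmanager
from urllib.parse import urlsplit

class DomainLimiter:
    """Global concurrency cap plus a smaller cap per host (assumed values)."""

    def __init__(self, total: int = 50, per_domain: int = 2):
        self._global = asyncio.Semaphore(total)
        self._per_domain = defaultdict(lambda: asyncio.Semaphore(per_domain))

    @asynccontextmanager
    async def slot(self, url: str):
        host = urlsplit(url).netloc
        # Take the per-host slot first so waiting on a busy host
        # does not tie up one of the global slots
        async with self._per_domain[host]:
            async with self._global:
                yield host

async def demo() -> dict:
    limiter = DomainLimiter(total=50, per_domain=2)
    in_flight = defaultdict(int)
    peak = defaultdict(int)

    async def fake_fetch(url: str) -> None:
        async with limiter.slot(url) as host:
            in_flight[host] += 1
            peak[host] = max(peak[host], in_flight[host])
            await asyncio.sleep(0.01)  # simulated request
            in_flight[host] -= 1

    # 30 URLs spread over 3 hypothetical hosts
    urls = [f"https://site-{i % 3}.example/p/{i}" for i in range(30)]
    await asyncio.gather(*(fake_fetch(u) for u in urls))
    return dict(peak)

peaks = asyncio.run(demo())
print(peaks)  # highest number of overlapping requests seen per host
```

A cap like this limits how many requests hit one site at the same moment, but it still says nothing about requests per second.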
Pattern 3: `asyncio.Queue` + Worker Pattern — Getting Warmer
I switched to a producer-consumer model with a fixed worker pool. This is the standard “work queue” pattern.
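```python
import asyncio
import aiohttp

async def worker(session, queue, results):
    while True:
        url = await queue.get()
        try:
            results.append(await fetch(session, url))  # fetch() from Pattern 1
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            # Keep the failure; an uncaught exception would kill this worker,
            # and with no workers left queue.join() would never return
            results.append(exc)
        finally:
            queue.task_done()

async def main(urls, num_workers=50):
    queue = asyncio.Queue()
    results = []
    for url in urls:
        queue.put_nowait(url)
    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(worker(session, queue, results))
                   for _ in range(num_workers)]
        await queue.join()  # returns once every queued URL has been processed
        for w in workers:
            w.cancel()
        # Let the cancellations finish before the session closes
        await asyncio.gather(*workers, return_exceptions=True)
    return results
```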
This is better because workers pull new work *only when they finish*. No more bulk-launching tasks.
Metrics:
- Requests completed / minute: 910
- 429 error rate: 8% (still present, but manageable)
But there was a silent killer: every URL got the same fixed timeout. Workers would grab a slow URL and block for 30 seconds before the timeout fired. That’s 30 seconds where that worker is dead weight.
Pattern 4: Per-Domain Queue + Adaptive Timeouts — The Turning Point
The internet has a long tail. Some sites respond in 100ms. Some take 5 seconds. Some are dead.
A single queue doesn’t know if a worker is stuck on a “slow normal site” or a “dead site.” You need to track this per domain.
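```python
import asyncio
import time
from collections import defaultdict

import aiohttp

domain_queues = defaultdict(asyncio.Queue)  # one work queue per domain
# Rolling stats per domain; "avg" starts at 2.0 seconds until real samples arrive
domain_stats = defaultdict(lambda: {"avg": 2.0, "count": 0})

async def fetch_with_adaptive_timeout(session, url, domain):
    """Timeouts and client errors propagate to the caller, which decides on retries."""
    stats = domain_stats[domain]
    # 3x the rolling average, clamped to the range [5, 30] seconds
    timeout = min(max(stats["avg"] * 3, 5), 30)
    start = time.monotonic()
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=timeout)) as resp:
        body = await resp.text()
    # Measure the full request, body included, since total= covers both
    elapsed = time.monotonic() - start
    # Exponentially weighted moving average
    stats["avg"] = stats["avg"] * 0.8 + elapsed * 0.2
    stats["count"] += 1
    return body
```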
Here’s the trick: we measured the rolling average response time per domain and set the timeout to 3x that average, clamped between 5 and 30 seconds. If a domain always responded in 1.2 seconds, 3x gives 3.6 seconds, so the 5-second floor applies. If that domain suddenly took 15 seconds, the timeout caught it in 5 seconds, not 30.
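The timeout rule is easier to reason about when it is pulled out into two pure functions. This is a sketch that mirrors the constants in the code above (factor 3, floor 5, ceiling 30, smoothing 0.2); the function names `adaptive_timeout` and `update_avg` and the sample latencies are assumptions for illustration.

```python
def adaptive_timeout(avg_s: float, factor: float = 3.0,
                     floor: float = 5.0, ceiling: float = 30.0) -> float:
    """Timeout = factor * rolling average, clamped to [floor, ceiling] seconds."""
    return min(max(avg_s * factor, floor), ceiling)

def update_avg(avg_s: float, sample_s: float, alpha: float = 0.2) -> float:
    """Exponentially weighted moving average: new samples get weight alpha."""
    return avg_s * (1 - alpha) + sample_s * alpha

avg = 2.0  # same seed value as domain_stats
for sample in (1.2, 1.2, 1.2, 15.0):  # made-up response times in seconds
    avg = update_avg(avg, sample)
    print(f"sample={sample:>5}s  avg={avg:.2f}s  next timeout={adaptive_timeout(avg):.1f}s")
```

Note that the floor dominates for fast domains: any domain averaging under about 1.67 seconds gets the 5-second minimum, and only slower domains see the 3x rule.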
Metrics:
- Requests completed / minute: 985
- Average time per request: 3.1 seconds (down from 8.2s in Pattern 1)
- 429 errors: 1.4%
This was close. But we still had one big problem: graceful retry under backpressure.
Pattern 5: Circuit Breaker + Per-Domain Rate Limiter + Exponential Backoff — The Winner
This is the pattern we shipped to production. It combines three mechanisms:
- Per-domain token bucket to enforce polite crawling
- Exponential backoff with jitter on failures
- Circuit breaker per domain — if it fails 5 times in a row, stop hitting it for 60 seconds
```python
import asyncio
import random  # for backoff jitter
import time

class DomainCircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=60):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.cooldown = cooldown
        self.last_failure_time = 0.0
        self.is_open = False

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.failure_count >= self.threshold:
            self.is_open = True

    def record_success(self):
        self.failure_count = max(0, self.failure_count - 1)  # slow degradation
        if self.failure_count < self.threshold:
            self.is_open = False

    def wait_time(self):
        """Seconds left before this domain may be tried again (0 if allowed now)."""
        if not self.is_open:
            return 0.0
        remaining = self.cooldown - (time.monotonic() - self.last_failure_time)
        return max(0.0, remaining)  # never negative once the cooldown has elapsed

class TokenBucket:
    def __init__(self, rate=5, burst=10):
        self.rate = rate    # tokens added per second
        self.burst = burst  # maximum tokens the bucket can hold
        self.tokens = burst
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now

    async def acquire(self):
        # Refill on every call so idle time can never bank more than `burst` tokens
        self._refill()
        while self.tokens < 1:
            # Sleep just long enough for the missing fraction of a token to accrue
            await asyncio.sleep((1 - self.tokens) / self.rate)
            self._refill()
        self.tokens -= 1
```
We deployed this with 100 workers and a concurrency limit of 10 per domain. Each worker would:
- Check the circuit breaker
- Try to acquire a token from the bucket
- Execute the request
- On 429: increase backoff, record failure
- On 5xx: trip the circuit breaker
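The steps above can be sketched as a single retry loop. This is a simplified illustration, not the production code: `fetch_with_policies`, `backoff_delay`, and the `send` coroutine are hypothetical names, the stub classes only mimic the method names of `DomainCircuitBreaker` and `TokenBucket`, and 429 and 5xx are both treated as "record a failure and back off".

```python
import asyncio
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

async def fetch_with_policies(url, send, breaker, bucket, max_attempts=4, base=0.5):
    """send(url) is a hypothetical coroutine returning (status, body)."""
    for attempt in range(max_attempts):
        if breaker.wait_time() > 0:
            return None  # circuit open: leave this domain alone for now
        await bucket.acquire()  # per-domain rate limit
        try:
            status, body = await send(url)
        except (asyncio.TimeoutError, OSError):
            status, body = None, None  # treat like a server-side failure
        if status is not None and status < 400:
            breaker.record_success()
            return body
        if status is not None and status != 429 and status < 500:
            return None  # other 4xx (e.g. 404): retrying will not help
        breaker.record_failure()  # 429, 5xx, timeout, or connection error
        await asyncio.sleep(backoff_delay(attempt, base=base))
    return None

class StubBreaker:  # stand-in with the same method names as DomainCircuitBreaker
    def __init__(self):
        self.failures = 0
    def wait_time(self):
        return 0
    def record_failure(self):
        self.failures += 1
    def record_success(self):
        self.failures = max(0, self.failures - 1)

class StubBucket:  # stand-in for TokenBucket that never waits
    async def acquire(self):
        pass

async def demo():
    responses = iter([(429, ""), (503, ""), (200, "ok")])  # scripted replies
    async def send(url):
        return next(responses)
    breaker = StubBreaker()
    body = await fetch_with_policies("https://example.com/", send, breaker,
                                     StubBucket(), base=0.01)
    return body, breaker.failures

body, failures = asyncio.run(demo())
print(body, failures)
```

The jitter matters because many workers tend to fail against the same domain at the same moment. Randomizing the delay keeps them from all retrying together.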
Metrics:
- Requests completed / minute: 1,030 (exceeded target)
- Mean time per request: 2.8 seconds
- 429 errors: 0.3%
- Failed domains blacklisted for 60 seconds: automatic
The Real Winner: Pattern 5 — Here's Why
Pattern 5 isn't complex for complexity's sake. It's complex because the *internet* is complex.
A site you crawled 10 minutes ago might be down now. Another site might be undergoing maintenance. Another might have rate-limited you without telling you.
The circuit breaker pattern is the difference between a crawler that degrades gracefully and one that hangs, retries forever, and wastes resources.
But here's the part I want to emphasize: this pattern is not about throughput alone. It's about *predictability*. You'd rather crawl 500 sites reliably every minute than 1,000 sites with a 40% failure rate.
And predictably, that's exactly what we got.
What We Learned Building This With Our Team in Vietnam
We built this crawler with a team split between Ho Chi Minh City and Can Tho. The remote devs didn't just write the code — they stress-tested it against real internet chaos.
One junior dev in Can Tho found a bug in Pattern 4 where the timeout calculation would divide by zero on first request. It would've crashed the whole pipeline if we'd shipped it. They saved us from a production outage before the code even hit staging.
That's the kind of ownership you get when you hire engineers who actually understand async patterns — not just copy-paste them from Medium.
Production Checklist for High-Throughput Async Crawlers
If you're building something similar, here's your checklist:
- Use `asyncio.Queue` with bounded workers (not `gather`)
- Implement per-domain rate limiting (Token Bucket or Leaky Bucket)
- Add circuit breakers with cooldown timers
- Use adaptive timeouts based on rolling averages
- Add jitter to exponential backoff to avoid thundering herd
- Monitor the connection pool and don't leak sockets
We've open-sourced our base crawler skeleton on GitHub. Reach out if you want the link.
Frequently Asked Questions
Is `asyncio.gather` with a Semaphore equivalent to a worker queue?
No. `gather` creates all tasks upfront, even if a Semaphore caps concurrency. This means memory overhead for all 1,000 task objects. A worker queue only holds tasks for active workers, which is far more memory-efficient at scale.
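The difference is easy to measure by counting live tasks. The sketch below is an illustration with simulated work (`asyncio.sleep` instead of HTTP); `count_gather` and `count_workers` are hypothetical helper names, and 1,000 jobs with a limit of 50 are assumed values.

```python
import asyncio

async def count_gather(n: int, limit: int) -> int:
    sem = asyncio.Semaphore(limit)

    async def job():
        async with sem:
            await asyncio.sleep(0.01)  # simulated request

    pending = asyncio.gather(*(job() for _ in range(n)))  # wraps every job in a task
    await asyncio.sleep(0)  # give the tasks one turn on the event loop
    alive = len(asyncio.all_tasks()) - 1  # exclude the task running this function
    await pending
    return alive

async def count_workers(n: int, limit: int) -> int:
    queue = asyncio.Queue()
    for i in range(n):
        queue.put_nowait(i)

    async def worker():
        while True:
            await queue.get()
            try:
                await asyncio.sleep(0.01)  # simulated request
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(limit)]
    await asyncio.sleep(0)
    alive = len(asyncio.all_tasks()) - 1
    await queue.join()
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return alive

async def main():
    return await count_gather(1000, 50), await count_workers(1000, 50)

gather_tasks, worker_tasks = asyncio.run(main())
print(f"gather + Semaphore: {gather_tasks} live tasks; worker pool: {worker_tasks}")
```

Both versions run at most 50 jobs at a time. The semaphore version still allocates a task for every job up front, while the worker pool only ever holds one task per worker plus the queued URLs.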
How do I handle cookies and sessions across retries?
Use `aiohttp.ClientSession` as a context manager at the top level — not per request. A single session handles cookie persistence and connection pooling automatically. Just pass it into your workers.
Can I use `httpx` instead of `aiohttp`?
`httpx` has an async client (`httpx.AsyncClient`), but in our benchmarks, `aiohttp` was 15-20% faster for raw throughput due to lower overhead per request. Use `httpx` if you need HTTP/2 support (aiohttp doesn't support it natively). We used `aiohttp` with HTTP/1.1 keep-alive and hit our 1,000/min target easily.
Should I use `asyncio.run()` or `loop.run_until_complete()`?
Always `asyncio.run()`. It handles creating and closing the event loop, and crucially, it cleans up unfinished tasks properly. Using `run_until_complete()` manually is a common source of event loop leaks.