Build a High-Performance Async Web Scraper in Python: A Step-by-Step Tutorial
Web scraping at scale is a classic “easy to start, hard to master” problem. A simple for-loop over a list of URLs works for 50 pages. For 50,000? You’ll be waiting hours. Worse—you’ll probably hit rate limits and get blocked.
I’ve been there. Recently, I helped a client who needed to scrape product data from an e-commerce aggregator with over 2 million pages. Synchronous requests would’ve taken weeks. Even threading wouldn’t cut it because most of the time is spent waiting on network I/O.
Enter `asyncio` and `aiohttp`. With the right concurrency patterns, we processed 500 pages per second on a single machine. Here’s exactly how we built it.
Why `asyncio` Wins for Network-Bound Tasks
Scraping is I/O-bound. You send a request, wait for the server, then parse the response. During that wait, the CPU sits idle. `asyncio` lets you overlap those waits so you’re doing something else while other requests are in flight.
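To see that overlap without touching a real server, here is a minimal sketch that uses `asyncio.sleep` as a stand-in for network latency. The `fake_request` helper and the delay values are illustrative assumptions, not part of the client project.

```python
import asyncio
import time

async def fake_request(delay: float) -> float:
    # Stand-in for a network call: asyncio.sleep yields to the event loop while "waiting".
    await asyncio.sleep(delay)
    return delay

async def run_sequential(n: int, delay: float) -> float:
    # Await each request before starting the next one.
    start = time.perf_counter()
    for _ in range(n):
        await fake_request(delay)
    return time.perf_counter() - start

async def run_concurrent(n: int, delay: float) -> float:
    # Start all requests first, then wait for them together.
    start = time.perf_counter()
    await asyncio.gather(*(fake_request(delay) for _ in range(n)))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"sequential: {asyncio.run(run_sequential(10, 0.1)):.2f}s")
    print(f"concurrent: {asyncio.run(run_concurrent(10, 0.1)):.2f}s")
```

The sequential version pays the delay once per request, while the concurrent version pays it roughly once in total, because all the waits happen at the same time.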
Key numbers:
- Synchronous: ~3 requests/second (network latency ~300ms)
- Threading (100 threads): ~100 req/s (but GIL pain with CPU-bound parsing)
- Async with 100 concurrent tasks: ~300 req/s (no GIL issue, lightweight tasks)
In production we ran 500 concurrent tasks and hit 1500+ req/s. The bottleneck became the server, not our code.
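A quick way to sanity-check figures like these is the back-of-the-envelope rule that throughput is at most concurrency divided by latency. The sketch below is an idealized ceiling only: it assumes every request takes the same time and that the server, the network and your parser all keep up. The function names are made up for illustration.

```python
def estimated_throughput(concurrency: int, latency_s: float) -> float:
    """Ideal upper bound on requests/second when every slot is always busy."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return concurrency / latency_s

def estimated_duration(total_urls: int, concurrency: int, latency_s: float) -> float:
    """Ideal lower bound on wall-clock seconds for a whole job."""
    return total_urls / estimated_throughput(concurrency, latency_s)

if __name__ == "__main__":
    for label, concurrency in [("synchronous", 1), ("100 tasks", 100)]:
        rate = estimated_throughput(concurrency, 0.3)
        hours = estimated_duration(50_000, concurrency, 0.3) / 3600
        print(f"{label}: up to {rate:.0f} req/s, 50,000 pages in about {hours:.2f} h")
```

Real numbers land below the ceiling once parsing, retries and rate limits enter the picture, which is why measuring on a small sample before a big run is worthwhile.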
The Core Setup: `aiohttp` + `asyncio`
Let’s start with a minimal scraper that fetches multiple pages concurrently.
```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_all(urls):
    # Cap open connections: 100 overall, 10 to any single host
    connector = aiohttp.TCPConnector(limit=100, limit_per_host=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [asyncio.create_task(fetch(session, url)) for url in urls]
        # return_exceptions=True: failures come back as exception objects
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

# Usage
urls = [f"https://example.com/page/{i}" for i in range(1000)]
results = asyncio.run(scrape_all(urls))
```
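That’s it. All 1000 requests are scheduled at once, and the connector decides how many are actually in flight (here, at most 10 to a single host), so total time depends on server latency and that cap. But this naive approach has problems: no retries, no user-agent rotation, no handling of HTTP errors. Let’s fix that.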
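One detail of the snippet above is easy to miss: with `return_exceptions=True`, `asyncio.gather` returns exception objects in the same positions as the URLs that failed, so the result list mixes HTML strings and errors. A small helper like the following (a hypothetical `split_results`, shown with a simulated fetch instead of real HTTP) separates the two.

```python
import asyncio

def split_results(urls, results):
    """Pair each URL with its outcome and separate successes from failures."""
    pages, failures = {}, {}
    # gather() preserves input order, so zip lines URLs up with their results.
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            failures[url] = result
        else:
            pages[url] = result
    return pages, failures

async def fake_fetch(url):
    # Simulated fetch: URLs ending in /bad fail, everything else "succeeds".
    if url.endswith("/bad"):
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"

async def demo():
    urls = ["https://example.com/ok", "https://example.com/bad"]
    results = await asyncio.gather(*(fake_fetch(u) for u in urls), return_exceptions=True)
    return split_results(urls, results)

if __name__ == "__main__":
    pages, failures = asyncio.run(demo())
    print(f"{len(pages)} ok, {len(failures)} failed")
```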
Adding Real-World Resilience
1. Rate Limiting the Right Way
You can’t blast a server with 500 concurrent requests; you’ll get banned. Use a semaphore to cap how many requests are in flight at once.
```python
import asyncio
import aiohttp

class RateLimiter:
    """Async context manager that caps the number of in-flight requests."""

    def __init__(self, max_concurrent=10):
        self.sem = asyncio.Semaphore(max_concurrent)

    async def __aenter__(self):
        await self.sem.acquire()
        return self

    async def __aexit__(self, *args):
        self.sem.release()

# On Python 3.9 and earlier, create this inside the running event loop instead
rate_limiter = RateLimiter(max_concurrent=10)

async def safe_fetch(session, url, retries=3):
    for attempt in range(retries):
        try:
            async with rate_limiter:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                    if resp.status == 429:
                        # Assumes Retry-After is a number of seconds
                        wait = int(resp.headers.get("Retry-After", 5))
                        await asyncio.sleep(wait)
                        continue
                    resp.raise_for_status()
                    return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # exponential backoff
    # Every attempt was answered with 429: fail loudly instead of returning None
    raise RuntimeError(f"Gave up on {url} after {retries} attempts")
```
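Notice the semaphore limits us to 10 requests in flight at a time. Because it is a single shared semaphore, that is a global cap; when every URL points at the same host, it works out to 10 per host, which is a common polite limit.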
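A single shared semaphore caps total concurrency. If your URL list spans several hosts and you want the cap applied to each one separately, one option is to keep a semaphore per hostname. The `PerHostLimiter` class below is a hypothetical sketch of that idea, exercised with simulated requests instead of real HTTP.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

class PerHostLimiter:
    """Hypothetical helper: hands out one semaphore per hostname."""

    def __init__(self, max_per_host=10):
        self._max = max_per_host
        self._sems = {}

    def for_url(self, url):
        host = urlsplit(url).hostname or ""
        # Create the semaphore lazily, the first time a host is seen.
        if host not in self._sems:
            self._sems[host] = asyncio.Semaphore(self._max)
        return self._sems[host]

async def demo():
    limiter = PerHostLimiter(max_per_host=2)
    active, peak = defaultdict(int), defaultdict(int)

    async def fake_fetch(url):
        host = urlsplit(url).hostname
        async with limiter.for_url(url):
            active[host] += 1
            peak[host] = max(peak[host], active[host])
            await asyncio.sleep(0.01)  # simulated network wait
            active[host] -= 1

    urls = [f"https://a.example/{i}" for i in range(6)]
    urls += [f"https://b.example/{i}" for i in range(6)]
    await asyncio.gather(*(fake_fetch(u) for u in urls))
    return dict(peak)  # highest concurrency observed per host

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real fetch function you would wrap the `session.get(...)` call in `async with limiter.for_url(url):`. Note that `aiohttp.TCPConnector(limit_per_host=...)` already caps connections per host, so the semaphore mainly matters when you want the cap to cover retries and sleeps as well.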
2. Error Handling and Retry with Exponential Backoff
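The `safe_fetch` above already retries with exponential backoff: with the default `retries=3`, it waits 1s after the first failure and 2s after the second, then re-raises on the third. For HTTP 429 (Too Many Requests), it sleeps for the number of seconds given in the `Retry-After` header, defaulting to 5.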
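Two refinements are worth sketching here. First, `Retry-After` may carry an HTTP date instead of a number of seconds, and `int()` raises on that form. Second, adding random jitter to the backoff keeps many tasks from retrying at the same instant. Jitter is an addition of mine, not something `safe_fetch` does, and both helper names below are hypothetical.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, default=5.0, now=None):
    """Return the number of seconds to wait for a Retry-After header value."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff with full jitter: a random delay up to base * 2**attempt."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Inside a retry loop these would replace the fixed waits, for example `await asyncio.sleep(parse_retry_after(resp.headers.get("Retry-After")))` for a 429 and `await asyncio.sleep(backoff_delay(attempt))` for other failures.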
3. Rotating User-Agents and Proxies
Don’t scrape with the default `aiohttp` user-agent. Rotate from a list.
```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    # add more real ones
]

async def fetch_with_rotate(session, url):
    # Pick a different User-Agent for each request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    async with session.get(url, headers=headers) as resp:
        return await resp.text()
```
For proxies, use `aiohttp_proxy` or pass the `proxy` argument to `session.get`. We used a pool of 50 rotating residential proxies to avoid IP bans.
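As a rough sketch of how the two rotations can be combined, the helper below picks a User-Agent and a proxy per request and returns them as keyword arguments for `session.get`. The helper name, the User-Agent strings and the proxy URLs are all placeholders I made up for illustration.

```python
import random

# Placeholder values: substitute real User-Agent strings and your own proxy endpoints.
USER_AGENTS = ["ExampleBrowser/1.0 (placeholder)", "ExampleBrowser/2.0 (placeholder)"]
PROXIES = ["http://proxy-1.example:8080", "http://proxy-2.example:8080"]

def pick_request_options(user_agents, proxies):
    """Choose a random User-Agent and, if any are configured, a random proxy."""
    options = {"headers": {"User-Agent": random.choice(user_agents)}}
    if proxies:
        options["proxy"] = random.choice(proxies)
    return options

# Usage inside a coroutine, with an existing aiohttp.ClientSession named `session`:
#     async with session.get(url, **pick_request_options(USER_AGENTS, PROXIES)) as resp:
#         html = await resp.text()
```

Choosing both values per request keeps the rotation logic in one place, which makes it easier to swap in smarter strategies later, such as retiring a proxy after repeated failures.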
Full Production Scraper Example
Here’s a combined version used in that client project. We scraped 50,000 product pages per hour.
```python
import asyncio
import random
import aiohttp
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ScrapeResult:
    url: str
    html: Optional[str]
    error: Optional[str]

class AsyncScraper:
    def __init__(self, max_concurrent_per_host: int = 10, total_conn_limit: int = 100):
        self.max_concurrent_per_host = max_concurrent_per_host
        self.total_conn_limit = total_conn_limit
        self.user_agents = [...]  # list of real UAs (placeholder: fill in before running)

    async def fetch_one(self, session: aiohttp.ClientSession, url: str) -> ScrapeResult:
        headers = {"User-Agent": random.choice(self.user_agents)}
        # The slot is held across retries, so a struggling URL still counts against the cap
        async with self.semaphore:
            for attempt in range(3):
                try:
                    async with session.get(url, headers=headers, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                        if resp.status == 429:
                            wait = int(resp.headers.get("Retry-After", 5))
                            await asyncio.sleep(wait)
                            continue
                        resp.raise_for_status()
                        html = await resp.text()
                        return ScrapeResult(url=url, html=html, error=None)
                except Exception as e:
                    if attempt == 2:
                        return ScrapeResult(url=url, html=None, error=str(e))
                    await asyncio.sleep(2 ** attempt)
            # Reached only when every attempt was answered with 429
            return ScrapeResult(url=url, html=None, error="Max retries")

    async def scrape_many(self, urls: List[str]) -> List[ScrapeResult]:
        # Create the semaphore and connector inside the running event loop
        self.semaphore = asyncio.Semaphore(self.max_concurrent_per_host)
        connector = aiohttp.TCPConnector(
            limit=self.total_conn_limit, limit_per_host=self.max_concurrent_per_host
        )
        async with aiohttp.ClientSession(connector=connector) as session:
            tasks = [self.fetch_one(session, url) for url in urls]
            return await asyncio.gather(*tasks)

# Usage
scraper = AsyncScraper()
urls = [f"https://target.com/item/{i}" for i in range(10000)]
results = asyncio.run(scraper.scrape_many(urls))
```
This pattern handled 10,000 URLs in about 40 seconds on a standard 8-core machine. The key was balancing concurrency and politeness.
When Async Goes Wrong: Common Pitfalls
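- Too many concurrent tasks → memory blowup. One task per URL keeps every pending coroutine and response body in memory, so feed very large URL lists in batches or through a queue, and keep `total_conn_limit` reasonable (100-200 per machine).
- Shared state → tasks run on one thread, so a single `list.append` is safe, but a read-modify-write that spans an `await` can interleave with other tasks. Guard those with an `asyncio.Lock`, or use `asyncio.Queue` for producer-consumer patterns.
- DNS resolution overhead → reuse one `aiohttp.ClientSession` across the whole run so connections and DNS lookups are shared. Don’t create a new session per URL.
- Parsing speed → if you parse HTML with BeautifulSoup inside a coroutine, you’ll block the event loop. Offload parsing with `loop.run_in_executor`: the default thread pool keeps the loop responsive, but because of the GIL you need a process pool to actually parallelize CPU-heavy parsing.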
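```python
import asyncio
from bs4 import BeautifulSoup

def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    # soup.title is None when the page has no <title>
    return soup.title.string if soup.title else None

# In your async function
async def extract_title(html):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor
    return await loop.run_in_executor(None, parse_html, html)
```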
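Coming back to the first two pitfalls, a fixed pool of workers fed by a bounded `asyncio.Queue` keeps memory flat regardless of how long the URL list is. The sketch below assumes a `fetch` coroutine you supply (any function taking a URL and returning a result); `worker` and `crawl` are illustrative names.

```python
import asyncio

async def worker(queue, results, fetch):
    # Each worker pulls URLs until it is cancelled.
    while True:
        url = await queue.get()
        try:
            results[url] = await fetch(url)
        except Exception as exc:
            results[url] = exc  # record the failure instead of killing the worker
        finally:
            queue.task_done()

async def crawl(urls, fetch, num_workers=10):
    # A bounded queue gives backpressure: put() waits while the queue is full.
    queue = asyncio.Queue(maxsize=num_workers * 2)
    results = {}
    workers = [asyncio.create_task(worker(queue, results, fetch)) for _ in range(num_workers)]
    for url in urls:
        await queue.put(url)
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Only `num_workers` fetches exist at any moment, so concurrency is capped by the pool size rather than by the number of URLs. Each write to `results` is a single assignment with no `await` in the middle, which is why no lock is needed here.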
How This Relates to AI Agent Workflows
At ECOA AI, we use similar async patterns under the hood in our ACP orchestration platform. Every API call to an LLM, every webhook response—it’s all I/O-bound. The same `asyncio` patterns allow our multi-agent systems to handle hundreds of concurrent AI agent calls without breaking a sweat.
Our development team in Ho Chi Minh City actually built a large part of our production scraper infrastructure. They know this stuff cold.
Frequently Asked Questions
Can I use `requests` with `asyncio`?
Not directly. The `requests` library is synchronous and blocks the event loop. You’d need to run each call in a thread (e.g., with `asyncio.to_thread`), which adds overhead. Stick with `aiohttp` or `httpx` (which supports async).
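If you are stuck with a blocking client for now, the thread-based workaround looks roughly like this. To keep the sketch runnable offline, `blocking_fetch` uses `time.sleep` as a stand-in for a call such as `requests.get(url).text`; `asyncio.to_thread` requires Python 3.9 or newer.

```python
import asyncio
import time

def blocking_fetch(url):
    # Stand-in for a synchronous HTTP call; time.sleep blocks the calling thread.
    time.sleep(0.05)
    return f"body of {url}"

async def fetch_all(urls):
    # Each call runs in the default thread pool, so the event loop stays free.
    return await asyncio.gather(*(asyncio.to_thread(blocking_fetch, u) for u in urls))

if __name__ == "__main__":
    print(asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"])))
```

Concurrency here is bounded by the size of the thread pool, and every in-flight request occupies a thread, which is the overhead that a native async client avoids.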
How many concurrent requests should I use?
Start conservatively: 10 per host. If the server allows and you have proxies, you can push to 50-100. Monitor response times and error rates. A good rule: keep the total connection limit under 200 unless you’re on a very beefy machine.
What about parsing 1M+ response bodies efficiently?
Don’t load all HTML into memory at once. Use streaming: `await resp.content.read(1024)` or process chunks. For large-scale scraping, persist raw responses to disk or S3 first, then parse in batch. This decouples the I/O and CPU phases.
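A minimal sketch of the "persist first, parse later" idea: write the body to disk chunk by chunk so no full response ever sits in memory. The `save_stream` helper is hypothetical and accepts any async iterator of bytes; with `aiohttp`, `resp.content.iter_chunked(...)` provides one.

```python
import asyncio

async def save_stream(chunks, path):
    """Write an async stream of byte chunks to `path` and return the bytes written."""
    written = 0
    with open(path, "wb") as fh:
        async for chunk in chunks:
            # Plain file writes block briefly; fine for small chunks on local disk.
            fh.write(chunk)
            written += len(chunk)
    return written

# With aiohttp, the chunk source would be the response body, for example:
#     async with session.get(url) as resp:
#         await save_stream(resp.content.iter_chunked(64 * 1024), "page.html")
```

A separate batch job can then read the saved files and do the parsing, ideally in worker processes, without competing with the network phase.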
Do I need Redis or a queue for truly large scraping?
For jobs over 100k URLs, yes. Use `asyncio.Queue` to feed URLs to workers, and persist results incrementally. We’ve used Redis to store crawled state and resume after crashes. At the largest scale, tools like Scrapy (asynchronous by design, with distributed crawling available through extensions such as scrapy-redis) are worth considering.
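To illustrate the resume-after-crash idea without a Redis server, here is a file-backed stand-in: an append-only JSON Lines log of finished URLs that is reloaded on startup. The `Checkpoint` class and its file format are assumptions for this sketch, not what we ran in production.

```python
import json
from pathlib import Path

class Checkpoint:
    """Hypothetical file-backed stand-in for crawl state kept in Redis."""

    def __init__(self, path):
        self.path = Path(path)
        self.done = set()
        if self.path.exists():
            with self.path.open() as fh:
                for line in fh:
                    try:
                        self.done.add(json.loads(line)["url"])
                    except (json.JSONDecodeError, KeyError):
                        continue  # skip blank or half-written lines left by a crash

    def pending(self, urls):
        """Return the URLs that have not been recorded as done yet."""
        return [u for u in urls if u not in self.done]

    def mark_done(self, url, status):
        # Append one record per URL so a crash loses at most the current line.
        with self.path.open("a") as fh:
            fh.write(json.dumps({"url": url, "status": status}) + "\n")
        self.done.add(url)
```

On restart you would build the work list with `checkpoint.pending(all_urls)` and call `mark_done` as each result is saved. A shared store such as Redis plays the same role when several machines crawl together.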