Build a High-Performance Async Web Scraper in Python: A Step-by-Step Tutorial

(Developer Tutorials) - Learn how to build a production-ready async web scraper using `aiohttp` and `asyncio`. We'll cover rate limiting, retry logic, and real-world concurrency patterns that handle 1000+ requests per second without breaking a sweat.

Web scraping at scale is a classic “easy to start, hard to master” problem. A simple for-loop over a list of URLs works for 50 pages. For 50,000? You’ll be waiting hours. Worse—you’ll probably hit rate limits and get blocked.

I’ve been there. Recently, I helped a client who needed to scrape product data from an e-commerce aggregator with over 2 million pages. Synchronous requests would’ve taken weeks. Threads help, since most of the time is spent waiting on network I/O, but each thread carries its own stack and scheduling overhead, so keeping thousands of requests in flight that way gets expensive.

Enter `asyncio` and `aiohttp`. With the right concurrency patterns, we processed 500 pages per second on a single machine. Here’s exactly how we built it.

Why `asyncio` Wins for Network-Bound Tasks

Scraping is I/O-bound. You send a request, wait for the server, then parse the response. During that wait, the CPU sits idle. `asyncio` lets you overlap those waits so you’re doing something else while other requests are in flight.
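
To make the overlap concrete, here is a small sketch that needs no network at all. `fake_fetch` and `LATENCY` are made up for the demo: `asyncio.sleep` stands in for the time spent waiting on a server.

```python
import asyncio
import time

LATENCY = 0.05  # simulated wait per request in seconds (assumed value)

async def fake_fetch(i: int) -> int:
    # asyncio.sleep yields to the event loop, the way a real network wait would
    await asyncio.sleep(LATENCY)
    return i

async def sequential(n: int) -> float:
    start = time.perf_counter()
    for i in range(n):
        await fake_fetch(i)  # each wait finishes before the next one starts
    return time.perf_counter() - start

async def concurrent(n: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(i) for i in range(n)))  # the waits overlap
    return time.perf_counter() - start

async def main():
    seq = await sequential(20)
    con = await concurrent(20)
    return seq, con

seq, con = asyncio.run(main())
print(f"sequential: {seq:.2f}s, concurrent: {con:.2f}s")
```

The sequential version pays the latency 20 times over; the concurrent one pays it roughly once.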

Key numbers:

  • Synchronous: ~3 requests/second (network latency ~300ms)
  • Threading (100 threads): ~100 req/s (but GIL pain with CPU-bound parsing)
  • Async with 100 concurrent tasks: ~300 req/s (one thread, so no GIL contention while requests wait; tasks are lightweight)

In production we ran 500 concurrent tasks and hit 1500+ req/s. The bottleneck became the server, not our code.
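
These figures follow from simple arithmetic: with a fixed number of requests in flight, throughput tops out at roughly concurrency divided by latency (Little's law rearranged). A quick sketch using the ~300ms latency from the list above; `estimated_throughput` is just a name I picked for illustration:

```python
def estimated_throughput(concurrency: int, latency_s: float) -> float:
    """Rough upper bound on requests/second with `concurrency` requests in flight."""
    return concurrency / latency_s

for label, concurrency in [("synchronous", 1), ("100 tasks", 100), ("500 tasks", 500)]:
    print(f"{label}: ~{estimated_throughput(concurrency, 0.3):.0f} req/s")
# synchronous: ~3 req/s
# 100 tasks: ~333 req/s
# 500 tasks: ~1667 req/s
```

Treat it as a ceiling. Parsing, connection setup, and server-side throttling all pull the real number down.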

The Core Setup: `aiohttp` + `asyncio`

Let’s start with a minimal scraper that fetches multiple pages concurrently.

python
import asyncio
import aiohttp

async def fetch(session, url):
    # Returns the body for any status code, including 404 and 500
    async with session.get(url) as response:
        return await response.text()

async def scrape_all(urls):
    # At most 100 open connections overall and 10 per host
    connector = aiohttp.TCPConnector(limit=100, limit_per_host=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [asyncio.create_task(fetch(session, url)) for url in urls]
        # return_exceptions=True puts failures in the list instead of raising
        results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

# Usage
urls = [f"https://example.com/page/{i}" for i in range(1000)]
results = asyncio.run(scrape_all(urls))

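That’s it: 1000 requests issued concurrently, with at most 10 in flight against the host at a time because of `limit_per_host=10`. How long that takes depends on the server’s latency. But this naive approach has problems: no retries, no user-agent rotation, no handling of HTTP errors. Let’s fix that.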
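
One thing to watch with `return_exceptions=True`: the list you get back mixes page bodies with exception objects. Here is a sketch of splitting them apart. `fake_fetch` is a hypothetical stand-in for `fetch(session, url)` so the example runs offline.

```python
import asyncio

async def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for fetch(session, url): fails for URLs ending in "3"
    await asyncio.sleep(0)
    if url.endswith("3"):
        raise ConnectionError(f"could not reach {url}")
    return f"<html>{url}</html>"

async def scrape_all_demo(urls):
    results = await asyncio.gather(*(fake_fetch(u) for u in urls), return_exceptions=True)
    pages, failures = {}, {}
    # gather preserves input order, so zip pairs each URL with its own outcome
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            failures[url] = result
        else:
            pages[url] = result
    return pages, failures

urls = [f"https://example.com/page/{i}" for i in range(5)]
pages, failures = asyncio.run(scrape_all_demo(urls))
print(len(pages), "ok;", len(failures), "failed")
```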

Adding Real-World Resilience

1. Rate Limiting the Right Way

You can’t blast a server with 500 concurrent requests; you’ll get banned. Use a semaphore to cap how many requests are in flight at once.

python
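import asyncio
import aiohttp

class RateLimiter:
    """Caps how many requests are in flight at once (a concurrency cap, not requests per second)."""
    def __init__(self, max_concurrent=10):
        self.sem = asyncio.Semaphore(max_concurrent)

    async def __aenter__(self):
        await self.sem.acquire()
        return self

    async def __aexit__(self, *args):
        self.sem.release()

rate_limiter = RateLimiter(max_concurrent=10)

async def safe_fetch(session, url, retries=3):
    for attempt in range(retries):
        wait = 2 ** attempt  # exponential backoff: 1s, 2s, ...
        try:
            async with rate_limiter:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                    if resp.status == 429:
                        # Retry-After may also be an HTTP date; fall back to 5s if it isn't a number
                        retry_after = resp.headers.get("Retry-After", "5")
                        wait = int(retry_after) if retry_after.isdigit() else 5
                    else:
                        resp.raise_for_status()
                        return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == retries - 1:
                raise
        if attempt < retries - 1:
            await asyncio.sleep(wait)  # sleep after releasing the semaphore slot
    # The last attempt was still a 429; fail loudly instead of returning None
    raise RuntimeError(f"{url}: still rate limited after {retries} attempts")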

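Notice the semaphore limits us to 10 requests in flight at once. It’s shared by every call to `safe_fetch`, so the cap is global; when you scrape a single host, that works out to 10 per host, which is a common polite limit.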
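
If one run covers several hosts and you want a separate cap for each, one option is a semaphore per hostname. This is a sketch, not what we shipped: `PerHostLimiter` and the `a.example` / `b.example` hosts are hypothetical, and `asyncio.sleep` stands in for the request.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

class PerHostLimiter:
    """Hypothetical helper: one semaphore per hostname, created on first use."""
    def __init__(self, max_per_host: int = 10):
        self._max = max_per_host
        self._sems = {}

    def for_url(self, url: str) -> asyncio.Semaphore:
        host = urlsplit(url).hostname or ""
        if host not in self._sems:
            self._sems[host] = asyncio.Semaphore(self._max)
        return self._sems[host]

async def demo():
    limiter = PerHostLimiter(max_per_host=2)
    in_flight = defaultdict(int)
    peak = defaultdict(int)

    async def fake_fetch(url):
        host = urlsplit(url).hostname
        async with limiter.for_url(url):
            in_flight[host] += 1
            peak[host] = max(peak[host], in_flight[host])
            await asyncio.sleep(0.01)  # stands in for the network round trip
            in_flight[host] -= 1

    urls = [f"https://{h}/p/{i}" for h in ("a.example", "b.example") for i in range(6)]
    await asyncio.gather(*(fake_fetch(u) for u in urls))
    return dict(peak)

peaks = asyncio.run(demo())
print(peaks)  # highest number of simultaneous requests seen per host
```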

2. Error Handling and Retry with Exponential Backoff

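The `safe_fetch` above already implements retry with exponential backoff: it waits `2 ** attempt` seconds after each failed attempt, so 1s and then 2s with the default three attempts. For HTTP 429 (Too Many Requests), we wait for the number of seconds given in the `Retry-After` header, or 5 seconds if the header is absent.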
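
Two refinements are worth sketching. `Retry-After` can hold either a number of seconds or an HTTP date, and fixed backoff delays make retries from many tasks land at the same moment, which random jitter spreads out. The helper names and default values below are my own choices, not part of `aiohttp`.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(value, default=5.0, now=None):
    """Turn a Retry-After header value into a delay in seconds."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable: fall back to the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff with full jitter: random delay up to base * 2**attempt, capped."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

print(retry_after_seconds("120"), retry_after_seconds(None))
```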

3. Rotating User-Agents and Proxies

Don’t scrape with the default `aiohttp` user-agent. Rotate from a list.

python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    # add more real ones
]

async def fetch_with_rotate(session, url):
    # Pick a User-Agent at random for each request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    async with session.get(url, headers=headers) as resp:
        return await resp.text()

For proxies, pass the `proxy` parameter to `session.get`, or use the third-party `aiohttp_proxy` package. We used a pool of 50 rotating residential proxies to avoid IP bans.
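
Rotation itself can be as simple as cycling through the pool. A sketch, assuming plain HTTP proxies: the proxy URLs and the placeholder user-agent strings are invented for the example, and `next_request_kwargs` is a hypothetical helper.

```python
import itertools
import random

# Hypothetical proxy endpoints; replace with the ones your provider gives you
PROXIES = [
    "http://proxy-1.example:8080",
    "http://proxy-2.example:8080",
    "http://proxy-3.example:8080",
]
USER_AGENTS_DEMO = ["ua-one (placeholder)", "ua-two (placeholder)"]

_proxy_cycle = itertools.cycle(PROXIES)

def next_request_kwargs():
    """Keyword arguments for one request: next proxy in round-robin order, random UA."""
    return {
        "proxy": next(_proxy_cycle),
        "headers": {"User-Agent": random.choice(USER_AGENTS_DEMO)},
    }

# Inside a coroutine, with an open aiohttp session:
#     async with session.get(url, **next_request_kwargs()) as resp:
#         html = await resp.text()
```

Round-robin keeps the load even across the pool; picking at random works too if you don't care about ordering.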

Full Production Scraper Example

Here’s a combined version used in that client project. We scraped 50,000 product pages per hour.

python
import asyncio
import random
from dataclasses import dataclass
from typing import List, Optional

import aiohttp

@dataclass
class ScrapeResult:
    url: str
    html: Optional[str]
    error: Optional[str]

class AsyncScraper:
    def __init__(self, max_concurrent_per_host: int = 10, total_conn_limit: int = 100):
        self.max_concurrent_per_host = max_concurrent_per_host
        self.total_conn_limit = total_conn_limit
        self.user_agents = [...]  # list of real UAs (fill in before running)
        self.semaphore: Optional[asyncio.Semaphore] = None  # created in scrape_many

    async def fetch_one(self, session: aiohttp.ClientSession, url: str) -> ScrapeResult:
        headers = {"User-Agent": random.choice(self.user_agents)}
        # The slot is held across retries, so a struggling URL doesn't add load
        async with self.semaphore:
            for attempt in range(3):
                try:
                    async with session.get(url, headers=headers, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                        if resp.status == 429:
                            wait = int(resp.headers.get("Retry-After", 5))
                            await asyncio.sleep(wait)
                            continue
                        resp.raise_for_status()
                        html = await resp.text()
                        return ScrapeResult(url=url, html=html, error=None)
                except Exception as e:
                    if attempt == 2:
                        return ScrapeResult(url=url, html=None, error=str(e))
                    await asyncio.sleep(2 ** attempt)
        # Reached when the final attempt was still a 429
        return ScrapeResult(url=url, html=None, error="Max retries")

    async def scrape_many(self, urls: List[str]) -> List[ScrapeResult]:
        # Build the semaphore and connector here so they bind to the loop started by asyncio.run
        self.semaphore = asyncio.Semaphore(self.max_concurrent_per_host)
        connector = aiohttp.TCPConnector(limit=self.total_conn_limit, limit_per_host=self.max_concurrent_per_host)
        async with aiohttp.ClientSession(connector=connector) as session:
            tasks = [self.fetch_one(session, url) for url in urls]
            return await asyncio.gather(*tasks)

# Usage
scraper = AsyncScraper()
urls = [f"https://target.com/item/{i}" for i in range(10000)]
results = asyncio.run(scraper.scrape_many(urls))

This pattern handled 10,000 URLs in about 40 seconds on a standard 8-core machine. The key was balancing concurrency and politeness.

When Async Goes Wrong: Common Pitfalls

  • Too many concurrent tasks → memory blowup. Keep `total_conn_limit` reasonable (100-200 per machine).
  • Shared state → don’t mutate global lists from different tasks without locks. Use `asyncio.Queue` for producer-consumer patterns.
  • DNS resolution overhead → reuse `aiohttp.ClientSession` across the whole run. Don’t create a new session per URL.
  • Parsing speed → if you parse HTML with BeautifulSoup inside the async loop, you’ll block the event loop. Offload CPU-heavy parsing to a thread pool (`loop.run_in_executor`).

python
import asyncio
from bs4 import BeautifulSoup

def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    # soup.title is None when the page has no <title> tag
    return soup.title.string if soup.title else None

async def parse_in_thread(html):
    # Passing None selects the loop's default thread pool
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, parse_html, html)
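
For the producer-consumer pattern mentioned in the list above, here is a minimal worker-pool sketch built on `asyncio.Queue`. The queue is bounded, so the producer pauses when workers fall behind instead of piling up tasks in memory. `run_pool`, `worker`, and `fake_handle` are illustrative names; in a real scraper `handle` would be your fetch coroutine.

```python
import asyncio

async def worker(queue, results, handle):
    while True:
        url = await queue.get()
        try:
            results.append((url, await handle(url)))
        except Exception as exc:
            results.append((url, exc))  # record the failure instead of killing the worker
        finally:
            queue.task_done()  # lets queue.join() know this item is finished

async def run_pool(urls, handle, num_workers=5, max_queued=100):
    queue = asyncio.Queue(maxsize=max_queued)  # bounded: put() waits when full
    results = []
    workers = [asyncio.create_task(worker(queue, results, handle)) for _ in range(num_workers)]
    for url in urls:
        await queue.put(url)
    await queue.join()  # wait until every queued URL has been processed
    for w in workers:
        w.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results

async def fake_handle(url):
    await asyncio.sleep(0.001)  # stands in for fetching the URL
    return url.upper()

out = asyncio.run(run_pool([f"u{i}" for i in range(20)], fake_handle))
print(len(out))
```

Results arrive in completion order, not input order, which is why each one is stored alongside its URL.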

How This Relates to AI Agent Workflows

At ECOA AI, we use similar async patterns under the hood in our ACP orchestration platform. Every API call to an LLM, every webhook response—it’s all I/O-bound. The same `asyncio` patterns allow our multi-agent systems to handle hundreds of concurrent AI agent calls without breaking a sweat.

Our development team in Ho Chi Minh City actually built a large part of our production scraper infrastructure. They know this stuff cold.

Frequently Asked Questions

Can I use `requests` with `asyncio`?

Not directly. The `requests` library is synchronous, so calling it inside a coroutine blocks the event loop. You can push each call onto a thread with `asyncio.to_thread`, but that brings back per-thread overhead. Stick with `aiohttp` or `httpx`, which has an async client.

How many concurrent requests should I use?

Start conservatively: 10 per host. If the server allows and you have proxies, you can push to 50-100. Monitor response times and error rates. A good rule: keep the total connection limit under 200 unless you’re on a very beefy machine.

What about parsing 1M+ response bodies efficiently?

Don’t load all HTML into memory at once. Use streaming: `await resp.content.read(1024)` or process chunks. For large-scale scraping, persist raw responses to disk or S3 first, then parse in batch. This decouples the I/O and CPU phases.
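
A sketch of the stream-to-disk idea: `save_stream` accepts any async iterator of byte chunks, so with `aiohttp` you could pass it `resp.content.iter_chunked(64 * 1024)`. Here `fake_body` is a made-up generator standing in for that iterator so the example runs offline.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import AsyncIterator

async def save_stream(chunks: AsyncIterator[bytes], path: Path) -> int:
    """Write chunks to path as they arrive; returns the number of bytes written."""
    written = 0
    with path.open("wb") as fh:
        async for chunk in chunks:
            fh.write(chunk)  # small blocking write; acceptable for a sketch
            written += len(chunk)
    return written

async def fake_body(total: int, chunk_size: int):
    # Stand-in for resp.content.iter_chunked(chunk_size) on an aiohttp response
    for start in range(0, total, chunk_size):
        await asyncio.sleep(0)
        yield b"x" * min(chunk_size, total - start)

async def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "page.html"
        written = await save_stream(fake_body(2500, 1024), path)
        return written, path.stat().st_size

written, size = asyncio.run(demo())
print(written, size)
```

Only one chunk is held in memory at a time, no matter how large the body is.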

Do I need Redis or a queue for truly large scraping?

For jobs over 100k URLs, yes. Use `asyncio.Queue` to feed URLs to workers, and persist results incrementally. We’ve used Redis to store crawled state and resume after crashes. For the ultimate scale, tools like Scrapy (asynchronous by design, with extensions available for distributed crawling) are worth considering.
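
The resume logic is simple enough to sketch without Redis. The `Checkpoint` class below is a hypothetical file-backed stand-in for the crawled-state store: it appends one JSON line per finished URL and skips those URLs after a restart.

```python
import json
import tempfile
from pathlib import Path

class Checkpoint:
    """Hypothetical file-backed stand-in for a Redis 'crawled' set."""
    def __init__(self, path: Path):
        self.path = path
        self.done = set()
        if path.exists():
            with path.open() as fh:
                self.done = {json.loads(line)["url"] for line in fh if line.strip()}

    def mark_done(self, url: str, status: str = "ok") -> None:
        # Append immediately so a crash loses at most the URL in progress
        with self.path.open("a") as fh:
            fh.write(json.dumps({"url": url, "status": status}) + "\n")
        self.done.add(url)

    def pending(self, urls):
        return [u for u in urls if u not in self.done]

with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "crawl_state.jsonl"
    urls = [f"https://target.com/item/{i}" for i in range(5)]
    first_run = Checkpoint(state_file)
    for url in urls[:3]:
        first_run.mark_done(url)
    # Simulate a restart: a fresh object reloads state from disk
    second_run = Checkpoint(state_file)
    remaining = second_run.pending(urls)
print(remaining)
```

Swapping the file for a Redis set keeps the same three operations: load, mark done, filter pending.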
