Designing a Web Crawler

Architect a distributed web crawler like Googlebot — covering URL frontier, politeness, content deduplication, and storage at petabyte scale.

A web crawler systematically browses the internet to index content for search engines.

Requirements

  • Crawl 1 billion pages in 30 days
  • Handle duplicates
  • Respect robots.txt
  • Store raw content + metadata

Scale Estimates

1B pages / 30 days = ~400 pages/second
Avg page size: 100KB
Total storage: 1B × 100KB = 100TB
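
These estimates can be sanity-checked with quick arithmetic:

```python
# Back-of-envelope check of the crawl rate and storage figures above.
pages = 1_000_000_000
seconds = 30 * 24 * 3600                 # 2,592,000 seconds in 30 days
rate = pages / seconds                   # ~386 pages/second, i.e. ~400

avg_page_kb = 100
total_tb = pages * avg_page_kb / 1_000_000_000   # KB -> TB (decimal units)

print(f"{rate:.0f} pages/s, {total_tb:.0f} TB")
```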

Architecture

URL Frontier (Queue)
    ↓
Fetcher Workers
    ↓
Content Parser
    ↙         ↘
Content Store   URL Extractor
(S3)               ↓
              Dedup Filter
                   ↓
           URL Frontier (new URLs)

URL Frontier

A priority queue of URLs to crawl.

  • Priority: Based on PageRank, freshness, domain importance
  • Back queues: Per-domain queues for politeness (crawl delay)
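The front/back split can be sketched as follows. This is an illustrative in-memory version (a production frontier would be backed by Kafka and Redis, as noted under Storage): a heap orders domains by priority, and each domain's back queue releases at most one URL per crawl-delay window.

```python
import heapq
import time
from collections import defaultdict, deque

class Frontier:
    """Sketch of a URL frontier: a priority heap over domains plus
    per-domain back queues that enforce a crawl delay."""

    def __init__(self, crawl_delay=1.0):
        self.heap = []                       # (priority, seq, domain)
        self.back = defaultdict(deque)       # domain -> URLs waiting
        self.next_ok = defaultdict(float)    # domain -> earliest fetch time
        self.delay = crawl_delay
        self.seq = 0                         # tie-breaker for heap entries

    def add(self, url, domain, priority):
        # Lower number = higher priority (e.g. derived from PageRank/freshness).
        if not self.back[domain]:
            heapq.heappush(self.heap, (priority, self.seq, domain))
            self.seq += 1
        self.back[domain].append(url)

    def next_url(self, now=None):
        now = time.monotonic() if now is None else now
        waiting = []                         # domains still in their delay window
        url = None
        while self.heap:
            prio, seq, domain = heapq.heappop(self.heap)
            if now >= self.next_ok[domain]:
                url = self.back[domain].popleft()
                self.next_ok[domain] = now + self.delay
                if self.back[domain]:        # domain has more URLs: re-queue it
                    heapq.heappush(self.heap, (prio, self.seq, domain))
                    self.seq += 1
                break
            waiting.append((prio, seq, domain))
        for item in waiting:                 # restore delayed domains
            heapq.heappush(self.heap, item)
        return url
```

Calling `next_url` twice in the same second returns URLs from two different domains, since the first domain is held back by its delay window.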

Politeness

Respect robots.txt:

User-agent: *
Disallow: /admin
Crawl-delay: 1

At most one request per domain per second, i.e. at least one second between successive requests to the same domain.
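
Python's standard library can parse these rules directly; a sketch (the `modified()` call marks the rules as loaded, since `can_fetch` conservatively denies everything until robots.txt has been read):

```python
from urllib import robotparser

# Parse the robots.txt rules shown above.
rules = """\
User-agent: *
Disallow: /admin
Crawl-delay: 1
"""
rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # mark rules as fetched so can_fetch consults them

print(rp.can_fetch("MyCrawler", "https://example.com/admin/users"))  # -> False
print(rp.can_fetch("MyCrawler", "https://example.com/blog/post-1"))  # -> True
print(rp.crawl_delay("MyCrawler"))                                   # -> 1
```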

Deduplication

URL Dedup

Use a Bloom filter to check whether a URL has already been crawled. False positives (occasionally skipping an uncrawled URL) are acceptable; false negatives never occur, so no URL is fetched twice.

hash(url) → Bloom filter lookup
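
A minimal Bloom filter sketch (the sizes and double-hashing scheme are illustrative; a real crawler would size the filter for billions of URLs at a target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for URL dedup. k bit positions per URL are
    derived from two halves of a SHA-256 digest (double hashing)."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(url))
```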

Content Dedup

SimHash of page content detects near-duplicate pages: similar documents produce fingerprints within a small Hamming distance of each other.
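
A sketch of the fingerprinting step (token weighting and hash choice are illustrative; production SimHash typically weights tokens by frequency and shingles the text):

```python
import hashlib

def simhash(text, bits=64):
    """SimHash: hash each token, sum signed bit contributions per position,
    then keep the sign of each position as the fingerprint bit."""
    v = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two pages flagged as near-duplicates would have `hamming(simhash(a), simhash(b))` below a small threshold (3 is a commonly cited cutoff for 64-bit fingerprints).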

DNS Cache

DNS resolution is slow. Cache DNS results per domain:

domain → IP (TTL: 1 hour)
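
A TTL cache sketch (the resolver is injected so it can be swapped for an async resolver; `socket.gethostbyname` is the blocking stdlib call):

```python
import socket
import time

class DNSCache:
    """Per-domain DNS cache with a 1-hour TTL, as described above."""

    def __init__(self, ttl=3600, resolver=socket.gethostbyname):
        self.ttl = ttl
        self.resolver = resolver
        self.cache = {}                  # domain -> (ip, expires_at)

    def resolve(self, domain, now=None):
        now = time.monotonic() if now is None else now
        entry = self.cache.get(domain)
        if entry and entry[1] > now:
            return entry[0]              # cache hit, TTL not expired
        ip = self.resolver(domain)       # cache miss: do the slow lookup
        self.cache[domain] = (ip, now + self.ttl)
        return ip
```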

Storage

  • Raw HTML: S3
  • Metadata: HBase (URL, crawl_time, content_hash)
  • URL Frontier: Apache Kafka + Redis
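
The metadata record might look like the following sketch. The reversed-domain row key is a common HBase pattern for locality (pages from the same site sort together); all field and bucket names here are illustrative:

```python
import hashlib

def row_key(url):
    """Illustrative HBase-style row key: reverse the domain so that
    'example.com/...' becomes 'com.example/...'."""
    _, rest = url.split("://", 1)
    domain, _, path = rest.partition("/")
    return ".".join(reversed(domain.split("."))) + "/" + path

record = {
    "row_key": row_key("https://example.com/blog/post-1"),
    "url": "https://example.com/blog/post-1",
    "crawl_time": 1700000000,                               # epoch seconds
    "content_hash": hashlib.sha256(b"<html>...</html>").hexdigest(),
    "s3_path": "s3://crawl-bucket/raw/com.example/blog/post-1",  # raw HTML
}
```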

Conclusion

Crawling at scale requires URL frontier management, politeness controls, content deduplication, and distributed fetcher workers.
