Designing a Web Crawler
Architect a distributed web crawler like Googlebot — covering URL frontier, politeness, content deduplication, and storage at petabyte scale.
A web crawler systematically browses the internet to index content for search engines.
Requirements
- Crawl 1 billion pages in 30 days
- Handle duplicates
- Respect robots.txt
- Store raw content + metadata
Scale Estimates
1B pages / 30 days ≈ 386 pages/second (call it ~400)
Average page size: 100 KB → fetch bandwidth ≈ 40 MB/s
Total raw storage: 1B × 100 KB = 100 TB
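These back-of-envelope numbers are easy to sanity-check (the 100 KB average is the assumption stated above):

```python
pages = 1_000_000_000
seconds = 30 * 24 * 3600                  # 2,592,000 seconds in 30 days
rate = pages / seconds                    # ≈ 386 pages/second

avg_page_kb = 100                         # assumed average page size
storage_tb = pages * avg_page_kb / 1e9    # KB → TB (decimal): 100 TB
bandwidth_mb_s = rate * avg_page_kb / 1e3 # ≈ 40 MB/s sustained ingest
```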
Architecture
URL Frontier (Queue)
        ↓
  Fetcher Workers
        ↓
  Content Parser
   ↙          ↘
Content       URL
Store (S3)    Extractor
                 ↓
            Dedup Filter
                 ↓
      URL Frontier (new URLs)
URL Frontier
A priority queue of URLs to crawl.
- Priority: Based on PageRank, freshness, domain importance
- Back queues: Per-domain queues for politeness (crawl delay)
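The two-level structure above can be sketched as a priority front heap feeding per-domain back queues that enforce a crawl delay. The numeric priorities and the 1-second delay here are illustrative assumptions, not values from a real crawler:

```python
import heapq
import time
from collections import defaultdict, deque

class URLFrontier:
    """Sketch: priority front queue + per-domain back queues for politeness."""

    def __init__(self, crawl_delay=1.0):
        self.front = []                  # min-heap of (-priority, url)
        self.back = defaultdict(deque)   # domain -> FIFO of URLs
        self.next_allowed = {}           # domain -> earliest allowed fetch time
        self.delay = crawl_delay

    def add(self, url, priority):
        heapq.heappush(self.front, (-priority, url))

    def _domain(self, url):
        return url.split("/")[2]         # naive host extraction for the sketch

    def next_url(self):
        # Drain prioritized URLs into their domain's back queue.
        while self.front:
            _, url = heapq.heappop(self.front)
            self.back[self._domain(url)].append(url)
        # Hand out a URL only from a domain whose crawl delay has elapsed.
        now = time.monotonic()
        for domain, queue in self.back.items():
            if queue and self.next_allowed.get(domain, 0) <= now:
                self.next_allowed[domain] = now + self.delay
                return queue.popleft()
        return None                      # nothing politely fetchable right now

frontier = URLFrontier(crawl_delay=1.0)
frontier.add("https://a.com/1", 5)
frontier.add("https://a.com/2", 3)
frontier.add("https://b.com/1", 1)
```

Calling `next_url()` repeatedly within one second yields one URL from each domain, then `None` until a.com's delay expires.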
Politeness
Respect robots.txt:
User-agent: *
Disallow: /admin
Crawl-delay: 1
Issue at most one request per domain per second, or honor the Crawl-delay directive when one is given.
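Python's standard library can parse these rules directly; a minimal sketch using the robots.txt snippet above (the example.com URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# The same robots.txt rules shown above, inlined for a self-contained example.
robots_txt = """\
User-agent: *
Disallow: /admin
Crawl-delay: 1
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

rp.can_fetch("*", "https://example.com/admin/users")  # False: under /admin
rp.can_fetch("*", "https://example.com/blog/post-1")  # True: not disallowed
rp.crawl_delay("*")                                   # 1 second
```

In production you would fetch `https://<domain>/robots.txt` once per domain (e.g. via `rp.set_url(...)` and `rp.read()`) and cache the parsed result.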
Deduplication
URL Dedup
A Bloom filter checks whether a URL was already crawled: no false negatives, and rare false positives only mean an occasional URL is skipped.
hash(url) → Bloom filter lookup
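A minimal, illustrative Bloom filter (sizes and hash count are arbitrary choices for the sketch; production crawlers use tuned, distributed variants):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k salted SHA-256 hashes set bits in a fixed bit array."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from independently salted hashes of the URL.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        # False → definitely never added; True → probably added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
seen.add("https://example.com/a")
seen.might_contain("https://example.com/a")  # True
seen.might_contain("https://example.com/b")  # almost certainly False
```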
Content Dedup
SimHash of page content detects near-duplicate pages: similar documents produce fingerprints with a small Hamming distance.
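A sketch of the idea, using whitespace tokens with equal weights (real implementations use shingles and term weighting):

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash: sum per-token hash bits as ±1 votes, keep the sign."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming(a, b):
    """Number of differing bits; small distance ⇒ likely near-duplicates."""
    return bin(a ^ b).count("1")
```

Two pages differing by a few words typically land within a small Hamming distance (a common threshold is ≤ 3 bits for 64-bit fingerprints), while unrelated pages differ in roughly half the bits.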
DNS Cache
DNS resolution is slow. Cache DNS results per domain:
domain → IP (TTL: 1 hour)
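A sketch of a TTL cache around a resolver function; the resolver is injected so the example runs without network access (in production it would be something like `socket.gethostbyname`). The IP below is a made-up illustration:

```python
import time

class DNSCache:
    """Per-domain DNS cache with a TTL; resolver is injectable for testing."""

    def __init__(self, resolver, ttl_seconds=3600):
        self.resolver = resolver     # e.g. socket.gethostbyname in production
        self.ttl = ttl_seconds
        self.cache = {}              # domain -> (ip, expiry timestamp)

    def resolve(self, domain):
        entry = self.cache.get(domain)
        now = time.monotonic()
        if entry and entry[1] > now:
            return entry[0]          # cache hit
        ip = self.resolver(domain)   # cache miss or expired: real lookup
        self.cache[domain] = (ip, now + self.ttl)
        return ip

calls = []
def fake_resolver(domain):
    calls.append(domain)
    return "93.184.216.34"           # hypothetical IP for illustration

dns = DNSCache(fake_resolver, ttl_seconds=3600)
dns.resolve("example.com")
dns.resolve("example.com")           # second call served from cache
```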
Storage
- Raw HTML: S3
- Metadata: HBase (URL, crawl_time, content_hash)
- URL Frontier: Apache Kafka + Redis
Conclusion
Crawling at scale requires URL frontier management, politeness controls, content deduplication, and distributed fetcher workers.