MongoDB
Designing a Search Engine
Build a full-text search engine from scratch — covering inverted indexes, TF-IDF ranking, tokenization, and query processing.
S
srikanthtelkalapally888@gmail.com
Designing a Search Engine
A search engine indexes text documents and retrieves the most relevant results for a query.
Core Components
Documents → Indexer → Inverted Index
↓
Query → Query Processor → Ranker → Results
Inverted Index
Maps each word to the list of documents containing it.
"apple" → [doc1, doc3, doc7]
"mango" → [doc2, doc3, doc9]
"fruit" → [doc1, doc2, doc3]
Tokenization Pipeline
Raw text: "Running Quickly through the Forest"
↓ Lowercase
"running quickly through the forest"
↓ Remove stop words (the, through)
"running quickly forest"
↓ Stemming
"run quick forest"
↓ Index tokens
TF-IDF Ranking
TF = term frequency in doc / total words in doc
IDF = log(total docs / docs containing term)
TF-IDF = TF × IDF
Words rare across corpus but frequent in document score highest.
BM25 (Modern Standard)
Improved TF-IDF with document length normalization — used by Elasticsearch and Lucene by default.
Query Processing
Query: "apple fruit"
↓
Lookup: apple → [doc1, doc3, doc7]
fruit → [doc1, doc2, doc3]
↓ Intersection (AND) or Union (OR)
Rank by BM25 score → [doc3, doc1, ...]
Query Types
Boolean: "apple AND fruit NOT vegetable"
Phrase: ""fresh apple juice""
Fuzzy: "appl~" (edit distance 1)
Wildcard: "app*"
Distributed Search
Documents → sharded across nodes
Query → broadcast to all shards
Each shard returns top-K
Coordinator merges and re-ranks
→ Global top-K returned
Conclusion
Inverted index + BM25 ranking is the foundation of modern search. Elasticsearch scales this to billions of documents across distributed shards.