MongoDB

Designing a Search Engine

Build a full-text search engine from scratch — covering inverted indexes, TF-IDF ranking, tokenization, and query processing.

S

srikanthtelkalapally888@gmail.com

Designing a Search Engine

A search engine indexes text documents and retrieves the most relevant results for a query.

Core Components

Documents → Indexer → Inverted Index
                           ↓
Query → Query Processor → Ranker → Results

Inverted Index

Maps each word to the list of documents containing it.

"apple" → [doc1, doc3, doc7]
"mango" → [doc2, doc3, doc9]
"fruit" → [doc1, doc2, doc3]

Tokenization Pipeline

Raw text: "Running Quickly through the Forest"
    ↓ Lowercase
"running quickly through the forest"
    ↓ Remove stop words (the, through)
"running quickly forest"
    ↓ Stemming
"run quick forest"
    ↓ Index tokens

TF-IDF Ranking

TF  = term frequency in doc / total words in doc
IDF = log(total docs / docs containing term)
TF-IDF = TF × IDF

Words rare across corpus but frequent in document score highest.

BM25 (Modern Standard)

Improved TF-IDF with document length normalization — used by Elasticsearch and Lucene by default.

Query Processing

Query: "apple fruit"
    ↓
Lookup: apple → [doc1, doc3, doc7]
        fruit → [doc1, doc2, doc3]
    ↓ Intersection (AND) or Union (OR)
Rank by BM25 score → [doc3, doc1, ...]

Query Types

Boolean: "apple AND fruit NOT vegetable"
Phrase:  ""fresh apple juice""
Fuzzy:   "appl~" (edit distance 1)
Wildcard: "app*"

Distributed Search

Documents → sharded across nodes
Query → broadcast to all shards
Each shard returns top-K
Coordinator merges and re-ranks
→ Global top-K returned

Conclusion

Inverted index + BM25 ranking is the foundation of modern search. Elasticsearch scales this to billions of documents across distributed shards.

Share this article