Anatomy of a Search Engine

At the most basic level, search engines share these three logical components:

Spider and/or Indexer (AKA "data prep")
A binary Fulltext Index (AKA "the index")
The engine that runs the searches and gives back results (AKA "the engine")

Each of these components depends on the one before it. A search engine can't run searches without a fulltext index, and there is no fulltext index unless the documents were first fetched and indexed.
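The dependency chain can be illustrated with a toy pipeline. This is only a minimal sketch under simplifying assumptions: the function names (`prepare`, `build_index`, `search`) and the in-memory dictionary standing in for a binary index are hypothetical and not taken from any particular engine.

```python
from collections import defaultdict

def prepare(documents):
    """Data prep: turn raw text into (doc_id, lowercase words) pairs."""
    return [(doc_id, text.lower().split()) for doc_id, text in documents.items()]

def build_index(prepared):
    """The index: map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, words in prepared:
        for word in words:
            index[word].add(doc_id)
    return index

def search(index, query):
    """The engine: return the documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)

documents = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy brown dog",
}
# Each stage consumes the output of the previous one.
index = build_index(prepare(documents))
print(search(index, "brown"))      # ['doc1', 'doc2']
print(search(index, "brown fox"))  # ['doc1']
```

Note that `search` never touches the original documents; it can only answer from whatever the earlier stages put into the index.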

Modern search engines have further subdivided the data prep, index and search functions into additional subsystems, in order to achieve better modularity and extreme scalability.

A fully exploded component view might look like:

Data Prep

Spider

Cross Page Links Database
Document Cache
Fetch Web Pages
Extract Links to Other Pages
Scheduling Fetches and Refetching
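
As a rough illustration of the spider pieces listed above, the sketch below fetches pages, extracts links, and keeps a document cache and a cross page links table. It is a simplified assumption of how these parts might fit together: the `fetch` callable, the `FAKE_WEB` dictionary and the example.test URLs are all hypothetical stand-ins so the example runs without network access, and refetch scheduling is left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch):
    """Breadth-first crawl; `fetch` returns HTML text or None."""
    frontier = deque([start_url])  # URLs scheduled for fetching
    cache = {}                     # document cache: URL -> HTML
    link_db = {}                   # cross page links: URL -> outgoing URLs
    while frontier:
        url = frontier.popleft()
        if url in cache:
            continue               # already fetched
        html = fetch(url)
        if html is None:
            continue               # fetch failed, skip this page
        cache[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links against the page they were found on.
        link_db[url] = [urljoin(url, href) for href in parser.links]
        frontier.extend(link for link in link_db[url] if link not in cache)
    return cache, link_db

# Hypothetical in-memory "web" used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="/">home</a>',
    "http://example.test/b.html": "no links here",
}
cache, link_db = crawl("http://example.test/", FAKE_WEB.get)
print(sorted(cache))
```

A production spider would add politeness delays, robots.txt handling and a refetch schedule; none of that is shown here.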

Processing

Determine Mime Type
Filter Document
Parse Meta Data
Entity Extraction
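
The first two processing steps might look something like the sketch below, which guesses a mime type and then hands the raw content to a matching document filter. The assumptions are significant: the mime type is guessed from the file name only (real engines often inspect the content as well), and `FILTERS`, `filter_html`, `filter_plain` and `process` are hypothetical names invented for this illustration.

```python
import mimetypes
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep only the text between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def filter_html(raw):
    """Strip markup and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())

def filter_plain(raw):
    """Plain text only needs its whitespace normalized."""
    return " ".join(raw.split())

# Hypothetical filter table: mime type -> function returning plain text.
FILTERS = {"text/html": filter_html, "text/plain": filter_plain}

def process(filename, raw):
    """Return (mime_type, extracted_text); text is None if no filter applies."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    document_filter = FILTERS.get(mime_type)
    if document_filter is None:
        return mime_type, None
    return mime_type, document_filter(raw)

print(process("page.html", "<h1>Title</h1><p>Some  text</p>"))
```

Meta data parsing and entity extraction would run on the filtered output and are not shown.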

Indexing

Determine Document Language
Separate into Paragraphs, Sentences and Words
Calculate Stemming, Thesaurus, etc.
Write to Fulltext Index
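
To make the indexing steps more concrete, here is a deliberately naive sketch of splitting text into paragraphs, sentences and words, followed by a crude suffix-stripping stemmer. Everything here is an assumption for illustration: the regular expressions only suit simple English text, and `crude_stem` is a hypothetical toy rather than a real stemming algorithm such as Porter's.

```python
import re

def split_paragraphs(text):
    """Paragraphs are separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def split_words(sentence):
    """Lowercase the sentence and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def crude_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

text = "Spiders fetch pages. Indexers parse them!\n\nEngines run searches."
for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        print([crude_stem(word) for word in split_words(sentence)])
```

With this stemmer, "indexing", "indexed" and "indexes" all reduce to "index", which is the property that lets a query for one form match documents containing another.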

Fulltext Index

Word Inversion Index
Special Indexes (e.g. Soundex, Casedex)
Meta Data Index
Word Vector Data, N-Gram
User Ratings and Tags
Periodically Validate and Optimize Fulltext Indexes
Replicate Fulltext Indexes
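
A word inversion index can be sketched as a mapping from each word to the documents and positions where it occurs. The version below is a minimal in-memory illustration, not a real binary index format; the names `build_positional_index` and `phrase_search` are hypothetical. Storing positions is one common way to support phrase queries.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def phrase_search(index, phrase):
    """Return documents where the words of `phrase` appear consecutively."""
    words = phrase.lower().split()
    if not words or words[0] not in index:
        return []
    matches = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Every later word must sit at the expected offset from `start`.
            if all(
                word in index and start + offset in index[word].get(doc_id, [])
                for offset, word in enumerate(words)
            ):
                matches.append(doc_id)
                break
    return sorted(matches)

docs = {
    1: "full text index",
    2: "index of full text",
    3: "text full",
}
index = build_positional_index(docs)
print(phrase_search(index, "full text"))  # [1, 2]
print(phrase_search(index, "text full"))  # [3]
```

The special indexes, meta data index and n-gram data in the outline would be additional structures kept alongside this one.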

Search Engine

Accept initial Query from the User
Preprocess Query (thesaurus, relevancy, recall, etc.)
Distributed Query
Check Actual Fulltext Index
Merge Intermediate Query Results
Calculate Relevancy
Sorting and Grouping
Calculate and Render Navigators
Render Results to User
Gather User Feedback and Tags
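
The distributed query, merge and relevancy steps can be sketched as follows. This is an assumption-laden toy: each shard is a small dictionary of the form word -> {doc_id: term frequency}, relevancy is simply the summed term frequency of the query words (a stand-in for whatever scoring a real engine uses), and the shards are queried in a plain loop rather than over a network. The names `search_shard` and `distributed_search` are hypothetical.

```python
import heapq
from collections import Counter

def search_shard(shard_index, query_words, k):
    """Check one shard's index and return its top-k (score, doc_id) pairs."""
    scores = Counter()
    for word in query_words:
        for doc_id, term_frequency in shard_index.get(word, {}).items():
            scores[doc_id] += term_frequency
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score in scores.items()))

def distributed_search(shards, query, k=3):
    """Send the query to every shard, then merge the intermediate results."""
    query_words = query.lower().split()
    partials = [search_shard(shard, query_words, k) for shard in shards]
    merged = heapq.nlargest(k, (hit for partial in partials for hit in partial))
    return [doc_id for _score, doc_id in merged]

# Two hypothetical shards, each holding part of the fulltext index.
shards = [
    {"search": {"d1": 2, "d2": 1}, "engine": {"d1": 1}},
    {"search": {"d3": 5}, "engine": {"d4": 2}},
]
print(distributed_search(shards, "search engine"))  # ['d3', 'd1', 'd4']
```

Because each shard returns only its own top k hits, the merge step handles at most k results per shard no matter how many documents each shard holds.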

Even this outline is oversimplified for larger, more complex engines.

Traditional Monolithic Search

What's NOT a Search Engine

It is technically possible to search in a single step by scanning the source material line by line each time a search term is entered. Because every query has to reread all of the content, this approach is slow and inefficient, and we do not consider such systems to be true search engines.

Examples of these linear scan based "pseudo-search-engines" include:

The Unix find and grep utilities
The SQL "LIKE" operator
The "Search" menu option in applications like Microsoft Word

Besides being much slower than designs based on a fulltext index, these simpler pseudo-engines typically lack advanced capabilities such as stemming or thesaurus support.
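
The difference between the two approaches can be shown with a small sketch. Both functions below are hypothetical illustrations: `linear_scan` mimics a grep-style search that examines every line on every query, while `build_line_index` does the scanning once up front so that later queries become a single dictionary lookup.

```python
def linear_scan(lines, term):
    """grep-style search: examine every line on every query."""
    term = term.lower()
    return [
        number
        for number, line in enumerate(lines, start=1)
        if term in line.lower()
    ]

def build_line_index(lines):
    """One-time pass: map each word to the line numbers that contain it."""
    index = {}
    for number, line in enumerate(lines, start=1):
        for word in set(line.lower().split()):
            index.setdefault(word, []).append(number)
    return index

lines = [
    "the spider fetches pages",
    "the indexer writes the index",
    "the engine reads the index",
]
index = build_line_index(lines)

print(linear_scan(lines, "index"))  # [2, 3], after reading all three lines
print(index.get("index", []))       # [2, 3], from a single lookup
```

The scan's cost grows with the amount of source material on every query, while the index pays that cost once when it is built. Note also that the scan matches raw substrings, whereas the index works on whole words, which is where features like stemming can be applied.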