Friday, July 30, 2010
Anatomy of a Search Engine

At the most basic level, search engines share three logical components:

  1. Spider and/or Indexer (AKA "data prep")
  2. A binary Fulltext Index (AKA "the index")
  3. The engine that runs the searches and gives back results (AKA "the engine")

Each of these systems depends on the one before it. A search engine can't run searches without a fulltext index, and there is no fulltext index unless the documents have first been fetched and indexed.
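The dependency chain above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine is built: the function names (fetch_documents, build_index, search) and the sample documents are hypothetical, the "spider" is just an in-memory dict, and the index is a plain in-memory mapping rather than a binary file.

```python
from collections import defaultdict

# Stage 1: data prep. A stand-in for the spider/indexer: the "fetched"
# documents are just an in-memory dict (hypothetical sample data).
def fetch_documents():
    return {
        1: "the quick brown fox",
        2: "the lazy dog",
        3: "quick thinking saves the day",
    }

def tokenize(text):
    return text.lower().split()

# Stage 2: the index. Maps each word to the set of document ids containing it.
def build_index(documents):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

# Stage 3: the engine. Answers queries from the index alone, requiring
# every query word to be present (AND semantics).
def search(index, query):
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)

documents = fetch_documents()
index = build_index(documents)
print(search(index, "quick"))      # [1, 3]
print(search(index, "the quick"))  # [1, 3]
print(search({}, "quick"))         # [] because there is no index to consult
```

Note that search never touches the documents themselves, only the index, which is why an empty index yields no results no matter what was fetched.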

Modern search engines have further subdivided the data prep, index and search functions into additional subsystems, in order to achieve better modularity and extreme scalability.

A fully exploded component view might look like:

Data Prep

  Spider
    • Cross-Page Links Database
    • Document Cache
    • Fetch Web Pages
    • Extract Links to Other Pages
    • Scheduling Fetches and Refetching

  Processing
    • Determine MIME Type
    • Filter Document
    • Parse Meta Data
    • Entity Extraction

  Indexing
    • Determine Document Language
    • Separate into Paragraphs, Sentences and Words
    • Calculate Stemming, Thesaurus, etc.
    • Write to Fulltext Index

Fulltext Index

  • Word Inversion Index
  • Special Indexes (e.g. Soundex, Casedex, etc.)
  • Meta Data Index
  • Word Vector Data, N-Gram
  • User Ratings and Tags
  • Periodically Validate and Optimize Fulltext Indexes
  • Replicate Fulltext Indexes

Search Engine

  • Accept Initial Query from the User
  • Preprocess Query (thesaurus, relevancy, recall, etc.)
  • Distributed Query
  • Check Actual Fulltext Index
  • Merge Intermediate Query Results
  • Calculate Relevancy
  • Sorting and Grouping
  • Calculate and Render Navigators
  • Render Results to User
  • Gather User Feedback and Tags

Even this outline is oversimplified for larger, more complex engines.
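To make the search-side steps in the outline more concrete (distributed query, checking each index, merging intermediate results, calculating relevancy, sorting), here is a rough Python sketch. Everything in it is an assumption made for illustration: the shard layout, the function names (build_shard, search_shard, distributed_search), and the toy relevancy score, which is simply summed term frequency. Real engines use far richer scoring and run the shards on separate machines.

```python
import heapq
from collections import Counter, defaultdict

def build_shard(documents):
    """Build one shard's index: term -> {doc_id: term frequency}."""
    postings = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, tf in Counter(text.lower().split()).items():
            postings[term][doc_id] = tf
    return postings

def search_shard(postings, terms, limit):
    """Check one shard's index; return (score, doc_id) pairs, best first."""
    scores = Counter()
    for term in terms:
        for doc_id, tf in postings.get(term, {}).items():
            scores[doc_id] += tf  # toy relevancy: summed term frequency
    ranked = sorted(((s, d) for d, s in scores.items()),
                    key=lambda hit: (-hit[0], hit[1]))
    return ranked[:limit]

def distributed_search(shards, query, limit=3):
    terms = query.lower().split()                               # preprocess query
    partials = [search_shard(p, terms, limit) for p in shards]  # fan out to shards
    # Each partial list is already sorted, so a k-way merge keeps the order.
    merged = heapq.merge(*partials, key=lambda hit: (-hit[0], hit[1]))
    return [(doc_id, score) for score, doc_id in merged][:limit]

# Hypothetical sample data split across two shards.
shards = [
    build_shard({"a1": "search engines index documents",
                 "a2": "cooking with garlic"}),
    build_shard({"b1": "search search search tips",
                 "b2": "engines and turbines"}),
]
print(distributed_search(shards, "search engines"))
# [('b1', 3), ('a1', 2), ('b2', 1)]
```

Because every shard returns its hits already ranked, the merge step only has to interleave sorted lists rather than re-sort the full result set.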

 
Diagram: Traditional Monolithic Search Architecture

What's NOT a Search Engine

Note that it is technically possible to search in just one step by scanning the source material line by line every time a search term is entered. This is slow and inefficient because every query has to reread all of the source material, so search time grows with the size of the collection. We do not consider these systems to be true search engines.

Examples of these linear-scan-based "pseudo search engines" include:

  • The Unix find and grep utilities
  • The SQL "LIKE" operator
  • The "Search" menu option in applications like Microsoft Word

In addition to being very slow (relative to designs based on a fulltext index), these simpler pseudo engines typically lack advanced capabilities like stemming or thesaurus support.
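The difference can be seen in a small Python sketch. It is illustrative only: linear_scan mimics a grep-style substring match, and crude_stem is a toy suffix stripper invented for this example, not a real stemming algorithm. The sample lines are hypothetical.

```python
from collections import defaultdict

LINES = [
    "Searching large archives takes time",
    "She searched the attic",
    "A research paper on indexing",
]

def linear_scan(lines, term):
    """grep-style search: reread every line for every query."""
    return [i for i, line in enumerate(lines) if term in line]

def crude_stem(word):
    """Toy suffix stripper, NOT a real stemming algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_stemmed_index(lines):
    """Index each line number under the stem of every word it contains."""
    index = defaultdict(set)
    for i, line in enumerate(lines):
        for word in line.split():
            index[crude_stem(word)].add(i)
    return index

def indexed_search(index, term):
    """Look the stemmed term up directly; no source text is reread."""
    return sorted(index.get(crude_stem(term), set()))

index = build_stemmed_index(LINES)
print(linear_scan(LINES, "search"))     # [1, 2]
print(indexed_search(index, "search"))  # [0, 1]
```

The scan misses "Searching" (different case) and wrongly matches "research" (substring hit), while the stemmed index finds both word forms and ignores the unrelated word. The index also answers with a single lookup instead of rereading every line.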

Copyright 1996-2009 by New Idea Engineering, Inc.