At the most basic level, search engines share three logical components:
- Spider and/or Indexer (AKA "data prep")
- A binary Fulltext Index (AKA "the index")
- The engine that runs the searches and gives back results (AKA "the engine")
Each of these components depends on the one before it. A search engine can't run searches without a fulltext index, and there is no fulltext index until the documents have been fetched and indexed.
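The dependency chain can be sketched in a few lines of Python. This is only an illustration, not how any particular engine is built: the function names `prepare`, `build_index`, and `search` are hypothetical, the "spider" is replaced by an in-memory dictionary of documents, and the index is a plain dictionary rather than a binary on-disk structure.

```python
from collections import defaultdict

def prepare(documents):
    """Data prep: turn raw text into (doc_id, tokens) pairs."""
    return [(doc_id, text.lower().split()) for doc_id, text in documents.items()]

def build_index(prepared):
    """The index: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in prepared:
        for token in tokens:
            index[token].add(doc_id)
    return dict(index)

def search(index, query):
    """The engine: return ids of documents containing every query word."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

# Hypothetical sample documents standing in for fetched pages.
docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog tricks"}
index = build_index(prepare(docs))

print(search(index, "quick dog"))  # [3]
print(search({}, "quick dog"))     # [] : no index, no results
```

The last call shows the dependency directly: with an empty index the engine still runs, but it has nothing to return.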
Modern search engines have further subdivided the data prep, index and search functions into additional subsystems, in order to achieve better modularity and extreme scalability.
A fully exploded component view might look like this:
- Data Prep
  - Spider
    - Cross-Page Links Database
    - Document Cache
    - Fetch Web Pages
    - Extract Links to Other Pages
    - Scheduling Fetches and Refetching
  - Processing
    - Determine MIME Type
    - Filter Document
    - Parse Metadata
    - Entity Extraction
  - Indexing
    - Determine Document Language
    - Separate into Paragraphs, Sentences and Words
    - Calculate Stemming, Thesaurus, etc.
    - Write to Fulltext Index
- Fulltext Index
  - Word Inversion Index
  - Special Indexes (e.g. Soundex, Casedex)
  - Metadata Index
  - Word Vector Data, N-Gram
  - User Ratings and Tags
  - Periodically Validate and Optimize Fulltext Indexes
  - Replicate Fulltext Indexes
- Search Engine
  - Accept Initial Query from the User
  - Preprocess Query (thesaurus, relevancy, recall, etc.)
  - Distributed Query
  - Check Actual Fulltext Index
  - Merge Intermediate Query Results
  - Calculate Relevancy
  - Sorting and Grouping
  - Calculate and Render Navigators
  - Render Results to User
  - Gather User Feedback and Tags
Even this outline is oversimplified for larger, more complex engines.
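To make the search-side steps of the outline more concrete (distributed query, checking each index, merging intermediate results, calculating relevancy, sorting), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `build_shard`, `query_shard`, and `distributed_search` are hypothetical, the shards live in one process instead of on separate machines, document ids are assumed to be unique across shards, and relevancy is reduced to a summed term frequency, which is far cruder than what a real engine computes.

```python
from collections import Counter

def build_shard(documents):
    """Build one index shard: word -> {doc_id: term frequency}."""
    shard = {}
    for doc_id, text in documents.items():
        for word, count in Counter(text.lower().split()).items():
            shard.setdefault(word, {})[doc_id] = count
    return shard

def query_shard(shard, terms):
    """Check one shard's index and return (doc_id, score) pairs.

    The score is the summed term frequency of the query terms,
    used here only as a stand-in for a real relevancy calculation.
    """
    scores = Counter()
    for term in terms:
        for doc_id, tf in shard.get(term, {}).items():
            scores[doc_id] += tf
    return list(scores.items())

def distributed_search(shards, query, limit=10):
    """Send the query to every shard, merge the partial results, and sort."""
    terms = query.lower().split()
    merged = []
    for shard in shards:
        merged.extend(query_shard(shard, terms))
    # Highest score first; ties broken by doc id for a stable order.
    merged.sort(key=lambda hit: (-hit[1], hit[0]))
    return merged[:limit]

# Hypothetical documents split across two shards.
shards = [
    build_shard({"a": "search engine index", "b": "web spider"}),
    build_shard({"c": "index index search", "d": "query engine"}),
]

print(distributed_search(shards, "search index"))  # [('c', 3), ('a', 2)]
```

Keeping the per-shard lookup separate from the merge step mirrors the outline: each shard only needs its own index, and only the merge step sees results from all of them.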