20+ Differences Between Internet vs. Enterprise Search - And Why You Should Care (Part 1)
By Mark Bennett, New Idea Engineering, Inc. - Volume 5 Number 2 - February/March 2008
Read Part 2 and Part 3 of this series.
Introduction
The perennial question of what separates Enterprise Search from the more familiar search engines that power the public Internet recently came up again. Dr. Search was planning to do a blog entry, but the list mushroomed, so we now present the first in a three-part series on the dozens of things that make Enterprise Search surprisingly difficult, and that sometimes flummox the engines that were created to power the public web.
As we hinted above, the public Internet was the inspiration and proving ground for a majority of the commercial and open source search engines out there. Solving that technical problem, indexing the Internet, has influenced both the architecture and implementation, as engineers have made hundreds of assumptions about data and usage patterns – assumptions that do not always apply behind the firewalls of corporations and agencies.
When vendors talk about their products, features and patents, they are usually talking about technology that was not specifically designed for the enterprise. This isn't just academic theory - as you'll see, these assumptions can actually break enterprise search, if not adjusted properly.
A Few Logistics
We've divided our list into "technical issues: user facing", "technical issues: back end data and indexing", and then "business and strategic" differences; we're doing the "easier" technical stuff in the first two parts, with the strategic and biz stuff as the finale. There's a bit of overlap, as some issues can be viewed from both a business and technical perspective, and data/indexing issues can affect what the user sees. Of course, not every item applies to every project and vendor; "your mileage may vary". And heck, you may already know some of these, but we're trying to be quite comprehensive in scope, though perhaps a bit brief on some items. If anything catches your eye and you'd like more details, please drop us a note. And we've decided to let you do your own "numbering"; this isn't late night TV after all.
Defining "Enterprise" for this article
To be clear, when we say "enterprise" search, we are referring to both the search engines that power private Intranets and Extranets, and to a lesser extent, the engines that companies have purchased to power their commerce and customer facing web sites. Broadly, "enterprise" search could be thought of as "all search engines EXCEPT the public Yahoo, Google and MSN", since you DO own and control the search engine that powers your public web site or online store. And again, your usage patterns and priorities are likely different from those of the Internet portals.
With all that said, let's get started!
Part 1: Outline / Contents [Part 2] [Part 3]
- Introduction
- High Level Internet / Intranet Mismatches
- The Enterprise is not just "a small Internet"
- Not just WPIWPO and "Search Dial Tone"
- Documents vs. Data
- Tech: User Experience: High Level Differences
- Single Shot Relevancy – Over-Reliance in Both Markets
- Enterprises Often Need Faceted Navigation
- Google Relevancy Trick "Broken" Behind the Firewall
- Web 2.0 Techniques Not DIRECTLY Applicable
- including Tagging, Blogs, Identities
- Different Specialized Search Clients
- Tech: Federated Enterprise Search
- Sidebar: Federated Search Details
- Flexible rules for combining results from all of the engines
- Maintaining Users Security Credentials
- Mapping User Security Credentials to other security
- Advanced Duplicate Detection and Removal
- Combining results list Navigators, such as Faceted Search links
- Handling other results list links such as "next
- Translating user searches into the different search syntaxes
- Extracting hits from HTML results, AKA "scraping"
- Tech: User Experience: Lower Level Differences
- Vocabulary and Thesaurus
- You Can Monitor "Performance"
- Search Syntax May Be Different
- Punctuation may be an important part of the actual Search
- Control over Duplicate Detection and Near Duplicates
- Different Security Requirements and Infrastructure
- Part 2: Technical Differences: Data and Spidering
- Part 3: Business and Strategic Differences
High Level Internet / Intranet Mismatches
These are some differences viewed from the broadest 10,000 foot level. We'll revisit some in more detail later.
The Enterprise is not just "a small Internet"
Imagine if you powered the Internet, and had a brand name that rivaled Coca-Cola. And then, imagine if you took all of that wonderful technological goodness with the wonderful brand name, and stuffed it into a brightly colored rack-mounted box. You would assume that, if you could handle the Internet, then of course you could handle a relatively puny private network - it just makes sense! You'd believe it, and so would your customers. To be fair, this was Google a few years back; their v5 appliance has clearly evolved beyond this simple model.
These seem like perfectly sane and compelling arguments, and this model has worked at some companies. If your Intranet has a few dozen (to a few thousand) company portals and departmental web sites, which mostly contain HTML and PDF documents, this would possibly work for you.
Or, suppose you had a portal that powered all of the Internet back in the 1990s. Slap that software on a CD, give it a nice Web based admin GUI, and ship it! This was actually the start of several well known search vendors. These products have also been iterated to add on enterprise functionality.
Ultraseek had been a great choice for more generic enterprise environments, and included some customization. Lately the Google Appliance is filling that segment, and can scale to reasonable sizes.
In contrast, some engines were not created for the Internet, but were always targeted at more specific business applications. As an example, DieselPoint was created to serve complex parts databases from its very beginning. It can also spider and search HTML and other document formats, but that was not its genesis.
Not just WPIWPO and "Search Dial Tone"
We define generic search engines as "Web Pages In – Web Pages Out" (WPIWPO). Basically, a spider crawls and indexes generic web content, and then the users run their searches from generic web browsers (such as Internet Explorer, Firefox or Safari). In the enterprise, however, content comes from many other sources, such as Content Management Systems, databases and archival storage appliances. And users are not always running searches from a web browser – more on that later.
When you have all the pages indexed and basic search up and running, you have achieved "Search Dial-Tone". Nothing fancy, but basic search functionality is online.
Documents vs. Data
Modern search engines employ "fulltext" search, looking for specific search terms in relatively unstructured text. The unstructured nature of the data is the key; it was assumed that most content would be composed of paragraphs of text, with very little formal structure, versus the more traditional fields in a database, with their more rigid INT, DATE and CHAR datatypes. About the only assumption made about a document's "structure" was that it would probably have a Title of some sort. References to specific numbers, colors or geographic locations may be blindly treated the same as every other word in the document. (See also Entity Extraction.) Hopefully, if a term appears in the title, it will be given a bit more weight, but that's about it in the basic world of documents.
Fulltext searches are also much more free-form. Think about how much easier it is to type a search into Google, versus creating an old-school SQL SELECT statement.
But Enterprises DO have data, lots of it, and it is often structured. And yes, they have millions of textual documents too. Enterprise search is often called upon to search across both textual documents and database records, often as part of the same user's search. Most modern engines support this type of hybrid search, and in many cases fields can be used to filter out extraneous matches and focus in on a particular set of results.
Some companies have content with both textual content and hundreds of "fields". For example, imagine a parts database for the airline industry. There would be descriptions of the parts, plus lots of meta data concerning part size, manufacturer, materials, applicable aircraft model, inventory levels, etc. Note that, in the document-centric world, fields are referred to as "meta data" or "attributes". A mechanic may want to search for "landing gear brackets" for the Boeing 747, made out of titanium, and that are less than 5 years old. This calls for a hybrid search, and possibly faceted navigation. (More on that later.)
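To make the mechanic's query concrete, here is a minimal sketch of a hybrid search in Python. The `Part` record, its field names and the sample parts are our own assumptions for illustration; a real engine would use an inverted index and its own query syntax rather than a linear scan.

```python
from dataclasses import dataclass


@dataclass
class Part:
    description: str   # unstructured text, searched as fulltext
    aircraft: str      # structured fields ("meta data"), hypothetical names
    material: str
    age_years: float


def hybrid_search(parts, text_query, aircraft, material, max_age_years):
    """Return parts whose description contains every query term
    AND whose structured fields pass all of the filters."""
    terms = text_query.lower().split()
    hits = []
    for part in parts:
        words = part.description.lower().split()
        text_match = all(term in words for term in terms)
        field_match = (
            part.aircraft == aircraft
            and part.material == material
            and part.age_years < max_age_years
        )
        if text_match and field_match:
            hits.append(part)
    return hits


# Made-up sample data, purely for illustration.
PARTS = [
    Part("main landing gear brackets", "Boeing 747", "titanium", 3.0),
    Part("nose landing gear brackets", "Boeing 747", "steel", 2.0),
    Part("landing gear brackets", "Boeing 747", "titanium", 8.0),
    Part("cabin door hinge", "Boeing 747", "titanium", 1.0),
]

results = hybrid_search(PARTS, "landing gear brackets", "Boeing 747", "titanium", 5)
for part in results:
    print(part.description, part.age_years)
```

The text match alone would return three parts; the field filters are what narrow the list to the one titanium bracket under 5 years old.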
Technical Differences between Internet and Enterprise Search
Here in Part 1, we'll focus on the "easier" technical differences. We've broken them into "User Experience" at the high and low level; data and spidering issues follow in Part 2.
User Experience: High Level
As we've said, the way users interact with a corporate search engine may be quite different than the casual use of Yahoo and Google on the web.
Single Shot Relevancy – Over-Reliance in Both Markets
Long time readers will recall that we've warned before that an over-reliance on "relevance" can reflect a flawed usage assumption: that ANY engine can find the "correct" answer for the typical 1 or 2 word search. On the Internet, when I type "bush", how could the system know with certainty whether I'm referring to President George Bush, junior or senior, or the Australian outback, or one of the many beers or musical groups with "bush" in their name (with or without that exact spelling), or perhaps a shrub to plant in the yard? In the enterprise, terms like "resource" or "schedule" are similarly ambiguous.
Ironically, this first issue is something public and private engines often have in common. Which leads us to…
Enterprises Often Need Faceted Navigation
Data in the enterprise often has more meta data, and can therefore allow users to drill down into search results; some vendors call this parametric search. We've talked about these terms in previous articles.
If the docs lack meta data, but are at least organized in some type of overall logical order, then a taxonomy might work.
If the content is completely unstructured, a sometimes-acceptable alternative is to use unsupervised clustering, but we usually view this as a workaround – if you have good meta data (or database fields), or at least some type of overall rule-based organization, faceted / parametric navigators or taxonomies will usually give more pleasing results.
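As a rough sketch of what a faceted navigator does with that meta data, the Python below counts the values of one field across a results list and then drills down on a chosen value. The field names ("department", "type") and the sample documents are assumptions of ours, not any vendor's schema.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of a meta data field."""
    return Counter(doc[field] for doc in results if field in doc)


def drill_down(results, field, value):
    """Keep only the results whose field equals the chosen facet value."""
    return [doc for doc in results if doc.get(field) == value]


# Hypothetical results list for the ambiguous query "schedule".
RESULTS = [
    {"title": "Q1 schedule", "department": "Finance", "type": "PDF"},
    {"title": "Release schedule", "department": "Engineering", "type": "HTML"},
    {"title": "Build schedule", "department": "Engineering", "type": "PDF"},
]

counts = facet_counts(RESULTS, "department")
narrowed = drill_down(RESULTS, "department", "Engineering")

for value, count in counts.most_common():
    print(f"{value} ({count})")
```

The counts become the clickable links beside the results list, and each click applies another `drill_down`. This only works when the field values are reasonably clean, which is why good meta data matters so much here.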
Google Relevancy Trick "Broken" Behind the Firewall
This isn't to say that Google's code is buggy, not at all!
But Google's main improvement years back, on their public Internet portal, was to consider all the links to each web page in their ranking calculations – pages that were referred to by more sites were presumed to be more authoritative, and were therefore given a higher score.
This is sometimes called "organic linking", in that nobody controls the public Internet, and people frequently link from their web site to other sites based on their personal opinions. If you plotted all of these links on a graph, it might look a bit chaotic, like thousands of little roots or tentacles, but it really did give a decent approximation of human-determined "goodness" for each page. We're seeing a resurgence of this with blogs linking to other blogs.
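The idea can be sketched as a simple inbound link count. To be clear, this is a deliberately simplified illustration and not Google's actual algorithm, which also weighs each link by the standing of the page it comes from; the page names and link data below are made up.

```python
from collections import defaultdict


def inbound_link_counts(links):
    """links is an iterable of (source_page, target_page) pairs.
    Count the distinct pages that link to each target."""
    sources = defaultdict(set)
    for source, target in links:
        if source != target:  # a page linking to itself is not a vote
            sources[target].add(source)
    return {page: len(srcs) for page, srcs in sources.items()}


def rank_by_links(candidates, links):
    """Order candidate pages by how many other pages link to them."""
    counts = inbound_link_counts(links)
    return sorted(candidates, key=lambda page: counts.get(page, 0), reverse=True)


# Made-up link graph, for illustration only.
LINKS = [
    ("blog-a", "page-x"),
    ("blog-b", "page-x"),
    ("blog-c", "page-x"),
    ("blog-a", "page-y"),
    ("page-y", "page-y"),
]

ranking = rank_by_links(["page-y", "page-x", "page-z"], LINKS)
print(ranking)
```

Each link acts as a small vote from an independent author, so the signal is only as good as the number and independence of the people doing the linking.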
But within corporations and agencies you don't have millions of users randomly creating links between pages, based solely on their personal preferences. Instead, the links between web pages inside a company are more orderly, and tend to approximate a logical org chart of sorts. It may seem more orderly, but it actually encodes less human-derived wisdom.
Thus, when the Google Appliance indexes data on a private network, it doesn't have the same advantage that it does on the public Internet, and therefore performs more on a par with other commercial search engines. It's not "bad", it's just not amazingly better than the other guys.
Of course Google realizes this, and we suspect they have been trying to tweak their algorithms to compensate. And of course there are SOME links on company Intranets, and some employees do cross link pages. But in search engine industry insider speak, they are put back on the same "TF/IDF" playing field as everybody else. (See http://en.wikipedia.org/wiki/TFIDF)
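For readers who haven't met TF/IDF before, here is a minimal sketch of one common textbook variant in Python: term frequency within a document, multiplied by the log of the inverse document frequency. Commercial engines use their own refinements, so treat the exact formula and the sample documents as assumptions.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document against the query with a basic TF/IDF sum."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for words in tokenized:
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized if term in other)
            if df == 0 or not words:
                continue
            tf = counts[term] / len(words)   # how prominent the term is here
            idf = math.log(n_docs / df)      # how rare the term is overall
            score += tf * idf
        scores.append(score)
    return scores


# Made-up documents, for illustration only.
DOCS = [
    "project schedule for the quarter",
    "schedule schedule resource plan",
    "cafeteria menu for the week",
]

scores = tf_idf_scores("schedule", DOCS)
for doc, score in zip(DOCS, scores):
    print(f"{score:.3f}  {doc}")
```

Note that nothing in this calculation looks at links between pages; the score comes entirely from the words in the documents themselves, which is the "playing field" every engine shares behind the firewall.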
Web 2.0 Techniques Not Directly Applicable
Social networking, "wisdom of the herd", blogs, the "user" as "man of the year". If you read any web or search related publications, you're very familiar with all these terms.
These trends have been used to enhance search on the World Wide Web, social networking sites, and leading ecommerce sites. Google's link ranking could even be considered an early form of this trend.
But these techniques have not worked as well on private networks. There are a bunch of technical reasons which we will detail in a future article. But here are a few examples:
Tagging
This is where users add descriptive keywords to documents, photos or video. The search engine looks at these tags for future searches.
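A bare-bones version of that mechanism might look like the Python sketch below. The `TagIndex` class and its method names are hypothetical, invented for this illustration rather than taken from any product.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical in-memory index from user-supplied tags to document IDs."""

    def __init__(self):
        self._docs_by_tag = defaultdict(set)

    def add_tag(self, doc_id, tag):
        # Normalize so "Vacation" and " vacation " land on the same tag.
        self._docs_by_tag[tag.strip().lower()].add(doc_id)

    def search(self, tag):
        """Return the IDs of all documents carrying the tag, sorted."""
        return sorted(self._docs_by_tag.get(tag.strip().lower(), set()))


index = TagIndex()
index.add_tag("photo-17", "Vacation")
index.add_tag("video-03", "vacation")
index.add_tag("photo-17", "beach")

print(index.search("vacation"))
```

The value of such an index depends entirely on how many users bother to tag, and how consistently they do it.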
Of course in the enterprise, where there are typically more documents than photos, and documents already h