Enterprise Search Matrix
It's Not Always About Size
Chances are good that if you work for a company large enough to purchase and operate software from any of the leading enterprise search vendors, you're familiar with Gartner's famous 'Magic Quadrant for Enterprise Search', which search vendors and prospective customers treat as an annual report card on search.
We find the Gartner report useful, and we would not claim to have anywhere near the insight into future growth that Gartner brings to the table. But we have implemented search using many of the Gartner quadrant vendors at Fortune 500 companies, and we have our own real-world opinion of the products in question. It's not that any of the largest vendor products are poorly designed; rather, we feel that customers should buy a solution that fits their current and anticipated needs, fits their budget, and fits the technical skill level of the people tasked with implementing and maintaining the solution.
To this end, we are introducing the 2005 New Idea Engineering Search Capability Matrix. We don't claim it is complete, but based on our experience it may help you select the right engine for the job - a process we discuss elsewhere in this month's newsletter.
The Search Capability Matrix
We present the matrix using two diagrams in a similar format. Both organize items into four columns. From left to right: low-end, mid-level, high-end and 3rd-party. “Higher-end” is a composite property that relates to properties like 'capabilities', 'complexity' and 'cost'.
The first chart shown in Figure 1 is a summary of core features that many enterprise search engine customers are concerned with. For brevity, we've consolidated some related feature sets, and omitted some of the less common ones.
As you move to the right, we're referring to higher capabilities, complexity and/or cost. The vertical position on the chart approximates the “commonality” of the feature, or where in the process customers tend to think about each item. The number of documents, for example, is a much more common concern than “federated” and “faceted” search capabilities, both of which we discuss at greater length below.
The second chart in the matrix, shown in Figure 2, lists our approximation of where each vendor lives in relation to the features described in the first chart. As you move to the right, vendors offer more and more features, but at a higher cost and often requiring more technical resources to implement and maintain. The top of the chart lists the 'big players' in the industry, followed by some of the newer candidates, and then by some of the more specialized offerings.
Let's take a look at how to read the charts.
Vendor / Feature Matching
Large Number of Documents
In our columns, we've put the break at about 100,000 docs. Below this number, almost any mid-level vendor should be able to cope. But above 100,000 docs, there are some additional levels of separation: the realms of “really big” and “huge” numbers of documents. What constitutes a “really big” or “huge” database has changed over the past 10 years. 100,000 used to be “really big” and 1 million was “huge”. Now “really big” is over 100 million, and “huge” is over 1 billion. And yes, there are folks out there with that much data, and a few vendors willing to take it on.
Right now the billion+ candidate list is pretty short: FAST, Verity K2 and Google. Unless your data is mostly simple web pages, PDF and simple database records, you can cross Google off, leaving K2 and FAST. Building and managing such a system is not going to be a trivial task, and it's not going to be cheap. Also keep your eyes on IBM; this type of high-end, high-visibility market is very lucrative. We also expect Microsoft and Oracle to make noises about handling these volumes of data.
In the million+ range, you have more choices. Rumor has it that Autonomy is comfortably in this space now, or at least would like to be. We also suspect Lucene can scale this high.
Below a million docs, you can add Verity Ultraseek and Endeca to the list.
On the low end of the spectrum, if you're in charge of a small to medium public site - say 100 to 1000 pages of content - you might consider a hosted search engine. Many of these work with various document formats including HTML, PDF, and popular office formats, and a few can even handle simple login security. For example, our public web site uses FreeFind, which is excellent for small to medium public web sites; we actually use the paid version to eliminate advertisements.
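To make the size tiers above concrete, here is a minimal Python sketch that turns them into a lookup. The ceilings are our own reading of the paragraphs above, not vendor specifications, and the names `ENGINE_CEILINGS` and `shortlist` are hypothetical. It only answers "who could cope with this many documents"; it says nothing about fit, budget or staff skills.

```python
# Rough ceilings read off the discussion above. These are an interpretation
# of the article's tiers, not published vendor limits.
MILLION = 1_000_000
BILLION = 1_000_000_000

# (engine, assumed upper bound on document count; None means "billion+")
ENGINE_CEILINGS = [
    ("FAST", None),
    ("Verity K2", None),
    ("Google", None),          # only if content is mostly simple pages/records
    ("Autonomy", BILLION),     # rumored to be comfortable in the million+ range
    ("Lucene", BILLION),       # suspected to scale into the million+ range
    ("Verity Ultraseek", MILLION),
    ("Endeca", MILLION),
    ("Hosted search (e.g. FreeFind)", 1_000),
]


def shortlist(doc_count):
    """Return the engines whose assumed ceiling covers doc_count."""
    if doc_count < 0:
        raise ValueError("doc_count must be non-negative")
    return [
        name
        for name, ceiling in ENGINE_CEILINGS
        if ceiling is None or doc_count <= ceiling
    ]
```

Calling `shortlist(2 * BILLION)` leaves only the three billion+ candidates, while `shortlist(500)` returns every entry, including hosted search. Treat the output as a starting list for vendor conversations, not a recommendation.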
Document Formats and Databases
Most mid to high-end vendors handle most document formats and common databases. It's best to double check, but you should be all set. If you have lots of data in a content management system, especially one that is not a market leader, it's best to ask more detailed questions, as this could be an issue.
Public versus Private or Secure Content
Really private content tends to be behind a firewall, and may also be password protected. In fact, a majority of our clients are deploying enterprise search engines on an Intranet, which is why this article and much of our other content and services focus there. If you need precise document level security, you'll need to sit down with your vendor. In particular, SSO (Single Sign On), which allows a user to log on once and then access (and search for) data no matter where it's stored, is generally not a shrink-wrapped feature with most vendors. Several vendors do have strategies to handle this, so it's certainly “do-able”, but depending on your setup it won't be trivial.
Search Analytics and Content Promotion
Most vendors now offer rudimentary to intermediate search analytics. If you're using more than one search vendor, which most larger companies are, you'll want to look at a 3rd party solution. Of course our SearchTrack product offers cross-vendor search analytics. In addition, the reports are integrated with our cross-vendor content promotion system, so when you discover an actual problem, you can take immediate action to fix it. Some traditional web traffic reporting tools can also produce search reports, though you would need to configure them for your specific engine(s). WebTrends is well known, and NetGenesis has some built-in search analytics.
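As an illustration of what even rudimentary search analytics involves, here is a minimal Python sketch. It assumes you can already export your search log as (query, result count) pairs, which is an assumption about your setup rather than a feature of any particular engine, and `summarize_queries` is a hypothetical name. It is not how SearchTrack or any vendor tool works internally.

```python
from collections import Counter


def summarize_queries(log_records, top_n=10):
    """Summarize (query, result_count) pairs taken from a search log."""
    totals = Counter()
    zero_hits = Counter()
    for query, result_count in log_records:
        # Fold case and collapse whitespace so "Vacation  Policy" and
        # "vacation policy" are counted as the same search.
        normalized = " ".join(query.lower().split())
        if not normalized:
            continue  # skip empty searches
        totals[normalized] += 1
        if result_count == 0:
            zero_hits[normalized] += 1
    return {
        "top_queries": totals.most_common(top_n),
        "top_zero_result_queries": zero_hits.most_common(top_n),
    }
```

The two lists answer the first questions most teams ask: what are people searching for most, and which popular searches return nothing at all.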
Meta-Tags and Search Index Data Quality
“Meta-tags” (or meta data) is a loose term that generally refers to additional fielded data associated with each document. The data may or may not be included inside the actual document; for HTML content, the fields may be defined in the header. Most vendors can easily extract and search meta data that is tagged in a standard way, such as in the “head” section of an HTML document. Engines may not include all meta fields by default, so some configuration changes might be needed. The higher-end vendors offer other tools for extracting additional fields from other sources.
When it comes to questions like “did all my documents get indexed?”, “did all the documents get tagged correctly?”, or “how many unique values are there for this particular field?”, you've entered the realm of Search Engine Data Quality. We previously published a fairly long article on Data Quality, which also has a more in-depth discussion of meta data / Meta Tags. Aside from “check your log files”, search engine vendors do not offer tools to assist you in this area. Applying data quality to search indices is a new industry, and as far as we know, our NIE Data Quality tool is the only such tool on the market.
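To show what “tagged in a standard way” and “how many unique values are there for this field?” look like in practice, here is a minimal Python sketch using only the standard library. It works on raw HTML pages rather than on any engine's index, the names `MetaTagParser`, `extract_meta` and `field_report` are hypothetical, and it is not how the NIE Data Quality tool works.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page.

    For simplicity this accepts meta tags anywhere in the page, not only
    inside the head section.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Field names are folded to lower case; values are kept as written.
            self.fields[name.lower()] = content.strip()


def extract_meta(html_text):
    """Return a dict of meta field name -> value for one page."""
    parser = MetaTagParser()
    parser.feed(html_text)
    return parser.fields


def field_report(pages, field):
    """Count the values of one meta field and list pages that lack it."""
    values = Counter()
    missing = []
    for url, html_text in pages.items():
        value = extract_meta(html_text).get(field)
        if value:
            values[value] += 1
        else:
            missing.append(url)
    return values, missing
```

Running `field_report` over a sample of pages before indexing gives a quick sense of how consistently a field such as "author" or "dept" is actually filled in.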
Federated / Brokered Search
Generally, federated or brokered search refers to the ability to search more than one database or search engine at once, and combine the results. For example, with a federated search, users can search the company public web site and Google at the same time, and see the results combined - either grouped by which site produced a given set of results, or intermixed.
Companies use this capability to either offer results from many different places within or outside of the company, or to have multiple search engines running in parallel to allow better scalability. Some mid-level and most high-end vendors offer this capability, but mostly to service their own product lines. If you have engines from multiple vendors, you may need to look at 3rd party solutions.
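The two ways of combining results mentioned above, grouped by source or intermixed, can be sketched in a few lines of Python. This assumes each source has already returned a ranked list of URLs; the function names are hypothetical. The intermixed version interleaves round-robin instead of sorting by score, on the assumption that relevancy scores from different engines are not directly comparable.

```python
from itertools import zip_longest


def grouped(results_by_source):
    """Keep each source's results together, in that source's own order."""
    return [(source, list(hits)) for source, hits in results_by_source.items()]


def intermixed(results_by_source):
    """Interleave round-robin: first hit from each source, then second, etc."""
    merged = []
    seen = set()
    for tier in zip_longest(*results_by_source.values()):
        for hit in tier:
            # zip_longest pads shorter lists with None; also drop any URL
            # that an earlier source has already contributed.
            if hit is not None and hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged
```

A real brokered search also has to deal with slow or unavailable sources, paging and security, none of which this sketch attempts.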
Parametric / Faceted / Taxonomy Search
Faceted search was the hot topic at the recent Enterprise Search Summit in New York, and we provide some background on faceted search, parametric search, and taxonomies elsewhere in this newsletter.
Generally, most vendors provide the ability to use custom taxonomies with their search offering. With respect to the newer faceted or parametric systems, Endeca is probably the early thought leader in the space, with a strong installed base in retail applications and now moving into corporate sites. Verity K2 has supported parametric search for some time, and FAST also supports similar capabilities. Several of these and other vendors give very impressive demos that can really knock your socks off. For large sets of data, we feel Verity and FAST are the best bets for now.
Our only caveat in this area is to remember your installation is not a demo, and you need to be sure the effort to implement the system is in the scope of your project. Ask your vendor lots of questions about the details of implementing all this gee-whiz technology.
Some questions to consider:
- Where does the meta data for the “facets” come from? How does it get into the search engine?
- How does the search engine know which fields to present as facets? How does it know which values to show?
- How does it handle non-finite data types such as floating point values or date-times?
- How much manual configuration is necessary?
- Can values be “normalized” or “aliased” or otherwise “binned”? How easy is this to do?
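To illustrate what “aliased” and “binned” mean in the questions above, here is a minimal Python sketch of facet counting. The alias table, the price ranges and the field names "department" and "price" are all hypothetical, and real engines expose this through their own configuration rather than through code like this.

```python
from collections import Counter

# Hypothetical alias table: raw field values that should share one label.
ALIASES = {"hr": "Human Resources", "human resources": "Human Resources"}

# Hypothetical price bins: (label, inclusive lower bound, exclusive upper bound)
PRICE_BINS = [
    ("under $10", 0.0, 10.0),
    ("$10 to $100", 10.0, 100.0),
    ("$100 and up", 100.0, float("inf")),
]


def normalize(value):
    """Map a raw value to its alias, ignoring case and extra whitespace."""
    key = " ".join(value.lower().split())
    return ALIASES.get(key, value.strip())


def price_bin(price):
    """Place a floating point price into one of the fixed ranges."""
    for label, low, high in PRICE_BINS:
        if low <= price < high:
            return label
    return "unknown"


def facet_counts(documents):
    """Count documents per facet value for the two example fields."""
    departments = Counter()
    prices = Counter()
    for doc in documents:
        departments[normalize(doc["department"])] += 1
        prices[price_bin(doc["price"])] += 1
    return {"department": departments, "price": prices}
```

Without the alias step, "HR" and "Human Resources" would show up as two separate facet values; without binning, nearly every price would be its own facet. Someone has to define and maintain both tables, which is the manual configuration the questions above are probing for.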
Quality of Search Results
This metric, while not a formal part of our matrix, is often a hot topic when we first meet with prospective customers. Sadly, the subject is quite broad and is not as easy as claiming some vendors have quality results while others do not.
Previous issues of our newsletter discuss result quality:
- Top 10 Tips for Better Search Results
- Intelligent Query Pre-Processing
- Poor Data Quality Gives Enterprise Search a Bad Rap - Part 1
- Poor Data Quality Gives Enterprise Search a Bad Rap - Part 2
Some generalizations we can give you include:
- Every vendor claims their search engine “relevancy” or “ranking” algorithm is great. Don't be swept away by specific brand names.
- Properly matched to the application, and properly tuned, most engines will deliver at least decent results.
- There are as many theories on the subject as there are search experts - and that's quite a few. Define your requirements carefully if you are deeply concerned about this issue.
- Relevancy gets more important with larger document sets.
- Ranking strategies that work on the public Internet will likely not work, or work quite differently, inside of an Intranet.
- Relevancy can be highly application specific; if search results quality is a critical business requirement, strongly consider a pilot project or on-site evaluation.
No matter which vendor you choose, use content promotion or 'best bets' to give the 'correct' answer for at least your most frequent 100 searches. Some vendors support this directly; for those that do not, use 3rd party tools like NIE’s SearchTrack.
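The idea behind 'best bets' can be sketched in a few lines of Python. The promotion table and URLs below are hypothetical, and this is only the general technique, not a description of how SearchTrack or any vendor's promotion feature is implemented.

```python
# Hypothetical promotion table: normalized query -> URLs to show first.
BEST_BETS = {
    "vacation policy": ["/hr/policies/vacation.html"],
    "expense report": ["/finance/forms/expense.html"],
}


def with_best_bets(query, engine_results, best_bets=BEST_BETS):
    """Put promoted URLs first, then the engine's results minus duplicates."""
    # Fold case and collapse whitespace so minor variations still match.
    key = " ".join(query.lower().split())
    promoted = best_bets.get(key, [])
    rest = [url for url in engine_results if url not in promoted]
    return promoted + rest
```

In practice the table would be filled in by looking at your most frequent searches and deciding, by hand, what the 'correct' answer for each one is.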
There are plenty of other search vendors out there! Our columns have tended towards vendors we actually encounter in our practice, and vendors that our clients ask about. Nonetheless, there are some big names not yet represented in our matrix, including the following.
Microsoft is probably the biggest vendor not in our official 'columns'. They've had bad to mediocre search for so long that it's hard to take them seriously. However, they are persistent, and keep trying, and now offer search as part of their SharePoint portal. And of course they have MSN Search, Lookout Software, and their upcoming Longhorn technology. Oracle and IBM are also making gains in the marketplace; especially promising is IBM's OmniFind which works within the WebSphere portal.
Gartner's 'Magic Quadrant for Enterprise Search' is quite well known and respected in the industry, with a focus on 'vision' and 'execution' of product lines as a whole, versus our focus on specific key features and deployment. Vendors you will see in their discussions which we have not addressed here include Open Text, Thunderstone, iSys, Hummingbird, ZyLAB, Intelliseek, Mercado Software, Entopia, Recommind, iPhrase, Convera, Kanisa, InQuira and EasyAsk.
With regard to hosted search, we see FreeFind, Google, Atomz and Mondosoft. We've made the choice for our site as we mentioned earlier, but all of these vendors can handle many public web sites. And of course, Google and Mondosoft also offer products that can be used within the intranet, in the form of a hardware box (Google) or software (Mondosoft).
Third Party Tools
With regard to 3rd party vendors, those selling cross-vendor add-on tools targeted at search engines, we've not seen too many vendors yet. If you've read our newsletter in the past, you know that New Idea Engineering offers cross-vendor search analytic and promotion products. To be fair, some of the web log analysis vendors are starting to address search analytics as well.
Off the Scale
Every now and then we get questions about other vendors, either new players, very small companies, or offshore companies that enjoy market share abroad. Some of them make absolutely outrageous claims, although you never know - a few of those claims might actually turn out to be true. We find most enterprise customers look for a more stable corporate track record when they select vendors. Nonetheless, some of these players can be very charismatic and convincing, so if they do manage to chat with one of your pointy-haired bosses, you may need to at least sit through their pitch. When in doubt, insist on local references and full evaluation tests, and bring in an independent business consultant to help you with the selection.
A Note About Lucene
Lucene is an amazing piece of open source search technology created by search guru Doug Cutting. The main site is http://lucene.apache.org, but a search on your favorite engine will return many other links. There is also a related project, Nutch.
However, Lucene is a “toolkit”, not a finished shrink-wrap product. If you or your staff are not Java programmers, this is not the engine for you. Lucene is best suited as an embedded search engine, to be part of another product. It competes with APIs like Verity's VDK, Ultraseek's XPA and Hummingbird's Fulcrum-based APIs. It is not a stand-alone enterprise search solution, at least not at this time; it is only marginally an enterprise search solution, and then only for very specific applications.