
Search 2.0 in the Enterprise: Moving Beyond "Single Shot" Relevancy

Last Updated Feb 2009

By Mark Bennett, New Idea Engineering, Inc. - Volume 3 Number 3 - Spring 2006

Exponential Growth Swamps Results Lists

We've all heard the hype about the exponential growth of the Internet. But there are a few related developments, not so widely covered, that Enterprise Search people need to keep in mind:

Point 1: As the overall number of documents grows, so does the average number of documents returned by each search.

Think about it: if a broad search matches 10% of the documents, it would match 100 documents out of 1,000 total.  If the same search is run later, when there are 2,000 total documents, it would return on the order of 200 documents.  This may not hold for every search; the overall makeup of the documents may change over time, and specific terms may come and go, but on average results list sizes will tend to grow exponentially, generally mirroring overall data growth.

As content grows exponentially over time, so do Search Results
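
To make the arithmetic concrete, here is a tiny sketch of that effect (the 10% match rate and the doubling rate are assumptions chosen for illustration, not measurements): a query that keeps matching a roughly constant fraction of the corpus produces a hit list that grows in step with the corpus itself.

    # Illustrative only: result lists grow with the corpus when the match rate holds steady.
    corpus_size = 1_000
    match_rate = 0.10          # assume this broad query matches ~10% of documents
    annual_growth = 2.0        # assume the corpus doubles every year

    for year in range(5):
        hits = int(corpus_size * match_rate)
        print(f"Year {year}: {corpus_size:>9,} docs -> ~{hits:,} hits for the same query")
        corpus_size = int(corpus_size * annual_growth)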

Point 2: Enterprise data is also growing exponentially.

Of course the growth curve may not be as steep, and will vary from company to company, or agency to agency, but exponential growth at any reasonable rate adds up surprisingly fast. 

You'll recall in the 1990s, when the public Internet crossed certain milestones, the headlines would report that the net had surpassed 10 million pages, 100 million pages, 1 billion pages, and so on.  Many private networks are now as large as the entire public Internet was in the late 1990s; there are private datasets that have crossed, or will soon cross, the 100 million and 1 billion document marks.  This means that the Internet search engine problems of the 1990s are now hitting corporations and other institutions; this is a "big" problem.

The problem with "Single Shot Relevancy": The Problem even Google Can't Fix

 "Single Shot Relevancy" is the idea that a human types in a search, and the search engine returns the "best" most relevant documents on the first page of results.  One query goes in, one amazing results list comes back.  Early vendors' relevancy was based solely on the content of the documents themselves, later vendors added external such as Google's link ranking.

When vendors talk about relevancy, comparing themselves to each other, this is usually what they're referring to.  But because of exponential data growth, at some point they will all fail.

It's not that there aren't any good relevancy algorithms around; there certainly are.  And you can even employ "query cooking" to improve them further.

But imagine two search engines; I'll call them "Average Engine" and "Good Engine".  Give each engine 1,000 documents to index, then run a reasonable search.  Maybe the search matches 5% of the content, returning 50 documents; both engines will probably manage to put a few decent documents on the first page.  But now imagine increasing the docset from 1,000 to 10,000, which raises the result set for the same search to 500 documents.  At that point "Average Engine" will probably mess up the first page of results – it has a good chance of not putting reasonable documents there.  Meanwhile "Good Engine", through better ranking algorithms, still manages to get decent results onto the first page.

But let's keep going with this.  Now think of a similar dataset that is two orders of magnitude bigger – 100 times larger – so we're indexing 1 million documents.  The result set has mushroomed to 50,000 hits; at this point "Good Engine" is going to have trouble populating that first page with relevant results – or at least what the user would think of as relevant.  So for this 1 million document dataset we consider a couple of new vendors, "Cool Engine" and "Great Engine".  Let's assume both of these engines do a better job than "Good Engine", and even with 50,000 hits they can usually put the right documents at the top.

And one more time, because this is important: consider another dataset, 100 times larger again.  That puts the raw data at 100 million documents, and the results list on the order of 5 million documents.  At this lofty number, even "Great Engine" and "Cool Engine" are going to strain; even slight problems with ranking will be greatly amplified.

Even if one of the vendors could miraculously improve their engine's relevancy, those gains would likely be wiped out by an even larger result set.
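
The same point can be made with a purely illustrative simulation (the scoring model below is our own assumption, not any vendor's algorithm): hold the engine's ranking quality constant and watch precision in the top ten erode as the hit list grows.

    import random

    def precision_at_10(num_hits, truly_useful=20, noise=1.0):
        # The handful of truly useful documents score around 1.0, the rest around 0.0,
        # both with the same ranking noise; only the size of the hit list changes.
        scores = [(random.gauss(1.0, noise), True) for _ in range(truly_useful)]
        scores += [(random.gauss(0.0, noise), False) for _ in range(num_hits - truly_useful)]
        top10 = sorted(scores, reverse=True)[:10]
        return sum(1 for _, useful in top10 if useful) / 10

    random.seed(42)
    for hits in (50, 500, 50_000):
        avg = sum(precision_at_10(hits) for _ in range(20)) / 20
        print(f"{hits:>7,} hits -> average precision@10 ~ {avg:.2f}")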

Classic Single Shot Relevancy is doomed when dealing with giant datasets.

 

Don't Be Confused: We're not talking about simple "performance"

A note to the reader: There are two paths this type of discussion usually follows, two distractions, neither of which is pertinent to this discussion of "search v2".

Distraction 1: "OK, so you're talking about search engine performance; how long it takes to index and search documents?"

No.  While that is certainly an important factor, the big players in the industry can scale up, using multiple servers, to handle vast quantities of data.  If you have 100 million documents and a big enough budget, you can get them indexed and searchable.

Distraction 2: "Ah, so you're talking about Relevancy – getting the right documents to show up at the top of the results list?"

No, not exactly.  Better relevancy, or better "document ranking algorithms", will certainly buy you some time; but as the number of matching documents keeps growing, even the best algorithms break down.  This type of relevancy, which we call "Single Shot Relevancy", will eventually fail once the dataset gets large enough – it's only a matter of time.

 

The mistaken "HAL9000" Assumptions about Users and Questions

Before we get into "solutions" and "search v2", we need to examine the base assumption of the old Relevancy game.

The assumption that even a human could identify the most relevant document out of 5 million hits, given enough time, is suspect, let alone asking a machine to do it.

Many early computer scientists were inspired by the computer "HAL9000" in the science fiction epic "2001: A Space Odyssey" from the late 1960s.  Obviously our software isn't there yet, and even "HAL" might struggle a bit with that task.

But some other assumptions that HAL inspired are questionable.  In the movie, the human operators asked well-thought-out questions, and those questions had "correct" answers.  You have all seen the articles citing the average query as being only one to two words long; humans don't usually ask well-thought-out, complete questions.  And even the correctness of answers is suspect; imagine the debate that would ensue when grading the results of the lengthier query "effectiveness of tax cuts to stimulate the economy".  Regardless of your personal stance on the issue, you can imagine that people would vary widely over whether the answers from the RNC or MoveOn.org are more relevant.

The "Patches" to Postpone the Inevitable

As mentioned earlier, search engines did start including external data into their relevancy engine.  Google used "link ranking", taking into account the number of other web sites that linked to a particular page – the assumption being that more links indicated a more authoritative and therefore more relevant page.

Other engines started taking click-through rates into account.  If everybody who enters a particular search clicks on document # 3, then presumably that document is more important, and should therefore be bumped up to the # 1 slot.
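
In rough sketch form, both of these "patches" boil down to blending external signals into a base content score.  Here is a minimal illustration (the weighting scheme is our own assumption, not Google's or any other vendor's formula):

    import math

    def blended_score(content_score, inbound_links, clickthrough_rate,
                      w_content=1.0, w_links=0.2, w_clicks=0.5):
        # The weights are arbitrary assumptions; real engines tune these carefully.
        return (w_content * content_score
                + w_links * math.log1p(inbound_links)
                + w_clicks * clickthrough_rate)

    # A page with a weaker pure-text match can outrank a stronger one if many
    # sites link to it and users keep clicking on it.
    print(blended_score(0.70, inbound_links=0,   clickthrough_rate=0.02))
    print(blended_score(0.65, inbound_links=250, clickthrough_rate=0.30))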

Still other engines worked with "context", trying to look at your previous surfing and search activity, or job function within a corporation.

We laud all these efforts; they have certainly gotten us further along.  But they have not been as effective inside private networks.  Link ranking, Google's "secret sauce", doesn't work as well inside corporations, because intranet links generally follow org charts or subject taxonomies rather than reflecting user-by-user judgments about which pages are best.  And the other big fad, paying for placement, isn't really applicable inside the firewall.

But fundamentally, this is all still "Single Shot" logic.

In recent years, even Google's public portal has struggled to give decent results for many searches.  If what you're looking for doesn't line up with what most folks are interested in, or you can't get your search terms just right, Google can return page after page of garbage.  Exponential growth has overtaken even mighty Google's ranking.

The Internet Search Meltdown of the Late 1990s is Now Visiting Enterprise Search

Over the past 3 years we've seen client after client complain that their enterprise search relevance is unacceptable.  The most common quotes they relay from their users are "we can't find anything", "why can't our search be more like Google?", or simply "our search sucks".

When you run the numbers, many private networks are now as large as the public Internet was in the 1990s; given the size of these private data sets, it was inevitable that the public search crisis of the late 90s would hit the enterprise.

And what technology finally rescued the public Internet?  Many say Google did.  They used the links between web sites to judge which sites were "better"; in effect they overlaid external data to improve results, an early type of social network voting.

 

"Search v2" = "Search v1.5"

To our minds, "v2" sounds revolutionary, whereas much of what we're seeing has been tried before, sometimes with limited success.  That's OK; software often goes through several cycles before it's completely usable.  But if some of this looks familiar, you're not imagining things.

But ironically, even the Enterprise version of Google, their "Google Box", can't use this trick to rescue Enterprise search.  Without that extra link data, it performs much like other engines.  Google's brand name certainly carries over into the enterprise space, but their relevancy advantage doesn't.

So the search meltdown of the 1990s is coming ashore in the enterprise, and the Google link-rank lifeboat is nowhere in sight.  It's OK!  Even the patches to single-shot relevancy were only temporary; it's time to rethink the entire search process.

Search v2: Rethinking the Results List - The Power of Well-Implemented Drill Down Search

Since it's unrealistic for search engines to guarantee the "best" answer at the top of every results list, the best they can do is empower the human user to see what information is available and give them convenient power tools to navigate through the clutter.

[Enterprise Search Results List Block Diagram: Search v1 vs. Search v2: Search v2 offers more choices in the results list]

Author John Battelle said it best in his book The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture when he compared the Google results list to the infamous DIR command from the DOS operating system.  Search results are evolving from simple lists of documents to interactive tools that show the type of data that is available, and allow the user to efficiently navigate to their answer.  In some cases the results list may even answer the question directly. 

Types of Results List Navigation

Here is a quick tour of the types of "Search v2" features enterprise search vendors are starting to offer.

Disclaimers: In the examples that follow, we use the sitting US president and recent events as an example, and show data that you might get back on a public search.  Please don't take our examples to have any political meaning – political figures and world events are standard examples when discussing search engines.  Your users would be searching for data specific to your company, so try to think of financial or technical terms that might be more applicable.  Also keep in mind that all of these examples are mockups; actual technology isn't quite up to performing some of these miraculous tasks yet.  And finally, it's unlikely that any single site would implement all of these – it would be too cluttered.

Textual Drill-Down Navigators:

These are examples of the most common types of iterative search results elements offered today.  When implemented accurately, they can be quite helpful; in the past some bad vendor implementations have frustrated users.  Good quality drill downs often require at least some human supervision (though a majority of the actual work can be automated).

[using textual hyperlinks to navigate results]
Shown here:

  • Entity Extraction based on proper name (upper left)
    Clicking on any of these terms will drill down into articles that include that person's name, along with the query term "bush".  (A small sketch of how this kind of facet drill-down can be built appears after this list.)
  • Entity Extraction based on geographic location (center left)
    Performs similar to the People navigator.
  • Noun Phrase entity extraction (lower left)
    Noun phrases are becoming very popular in the industry.  Engines can use language rules or dictionaries to find them.  A noun phrase is typically defined as a noun with all its surrounding qualifiers, or a sequence of words previously defined to be a single subject of interest.  Vendors have been stuck on one- and two-word phrases in the past, though lately some are experimenting with longer phrases.
  • Directed Results (upper right)
    This feature can also be called "Best Bets" or "Webmaster Suggests", depending on the vendor.  In this case a specific suggestion for a likely web page is being offered; these suggestions are likely to have been created or approved by a human editor.  We believe this is one of the easier "bang for the buck" ways to improve enterprise search; efforts can be spent on the top 100 or top 1,000 searches and related areas.  We have also referred to this type of activity as Behavior Based Taxonomies.
  • Spelling Suggestions (center right)
    This mechanism can also be used to suggest commonly related searches, such as accessories for a product, or related instructional links.  Implementations vary from completely automated to completely manual.
  • Subject Oriented Taxonomy (lower right)
    This may look the most familiar and remind you a bit of Yahoo.  Clicking on any of these links re-issues the user's search against content in that specific area.  If you think about it, "bush" could relate to politics, gardening, beer, or other diverse subject areas.  The biggest challenge with this type of system is getting your content into a well organized taxonomy.  Beware of vendors who promise complete automation!  At a minimum, you should be able to edit or override the system's groupings and rename the node labels.
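
Here is the minimal faceting sketch promised above (the documents and field names are made up; real engines do the entity extraction at index time): the engine tags each hit with extracted people, places or phrases, counts the values across the current result set, and turns the most frequent ones into clickable refinements.

    from collections import Counter

    # Hypothetical hits, already tagged by an entity extractor.
    hits = [
        {"title": "State of the Union address",   "people": ["George W. Bush"], "places": ["United States"]},
        {"title": "Kate Bush releases new album", "people": ["Kate Bush"],      "places": ["London"]},
        {"title": "Pruning a rose bush",          "people": [],                 "places": []},
    ]

    def facet_counts(hits, field, top_n=5):
        # Count how often each extracted value appears across the current hits.
        counts = Counter(value for hit in hits for value in hit[field])
        return counts.most_common(top_n)

    def drill_down(hits, field, value):
        # "Clicking" a navigator link simply re-filters the current result set.
        return [hit for hit in hits if value in hit[field]]

    print(facet_counts(hits, "people"))
    print([h["title"] for h in drill_down(hits, "people", "Kate Bush")])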

Non-Textual Suggestions:

In the first example, three links are offered to multimedia items related to "bush".  The first is a link to a video of a President Bush State of the Union address, the second links to music by the pop artist Kate Bush, and the third links to a gardening show on planting a bush in a back yard.  The second graphic presents polling data (fictitious).  In an enterprise, a graph might show sales or customer satisfaction related to a particular query term.

[returning pictures and video]

Fact Aggregation:

In what may be very close on the horizon, these examples show the system attempting to bring together snippets of data related to the query.  Though this may sound like it borders on AI (Artificial Intelligence), vendors are taking a more mundane approach.  One step beyond entity and noun phrase extraction is simple fact extraction.  If enough articles mention a fact, and some reasonable percentage of those mentions can be normalized down to a common form and tabulated, the system builds evidence to support the fact.  In this case, we're not asking the system to understand every sentence in every document.  Instead, the system is simply breaking documents down into smaller units, normalizing them structurally, and then tabulating statistics.

In the first example, such a system would already have recognized George H. W. Bush, George W. Bush and United States by dictionary matching.  "41st president" and "43rd president" would be recognized as noun phrases, as would "landscaping revenue".  It's not too much of a leap to have pieced together the sentence structure necessary to assemble the facts.  Keep in mind the system may have seen thousands of documents with almost identical information; some of the documents might even have used this exact wording, and other variants were then used as statistically supporting evidence.

In the second example, dates are a well-understood entity that many systems can now extract, even when they appear in the text of a document.  Combining that with the fact process just described, a timeline might be culled after tabulating thousands of documents.

In some cases, the results list will actually contain the answer the user is looking for.  If the "bush" query had been issued to find quick presidential facts, the user might not have needed to click any further; in these cases the results list IS the answer.  While this is nice when it happens, users should not expect it on a regular basis or for more complicated searches.

[aggregating facts into the main results list]

This is different from the "HAL9000" model.  No specific question was asked; the query was just "bush".  And the system didn't need to understand these facts from only one or two documents.  Further, "factoids" that are auto-generated for popular searches could be reviewed by human editors before being published.

What makes these tricks possible, without relying on fancy AI, is that vendors are approaching simple sentence parsing from two angles.  On one hand, instead of single words, they are searching for multi-word phrases using various techniques; these multi-word entities can be used as the seeds of straightforward statistical analysis.  At the same time, they are breaking documents down into smaller chunks, analyzing statistical correlations at the paragraph and sentence level.  None of these simple methods attempts to identify broader concepts; for example, when these systems encounter the pronoun "he", they are generally not backtracking to previous sentences to figure out which specific person "he" refers to; a sentence with "he" or "she" as its subject will probably not contribute much to any simple statistical tabulation of facts.
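
As a sketch of the statistical flavor of this approach (the pattern, sample documents and support threshold below are toy assumptions of ours, not any vendor's engine), imagine breaking documents into sentences, normalizing a simple "Nth president" pattern, and only surfacing facts that enough independent documents agree on:

    import re
    from collections import Counter

    PATTERN = re.compile(
        r"(George (?:H\. W\.|W\.) Bush)\D{0,20}?(\d{2})(?:st|nd|rd|th) president",
        re.IGNORECASE)

    def candidate_facts(text):
        # Break the document into sentences and normalize any matches.
        for sentence in re.split(r"[.!?]", text):
            for person, number in PATTERN.findall(sentence):
                yield (person.title(), f"president #{number}")

    docs = [
        "George W. Bush was the 43rd president of the United States.",
        "George W. Bush, the 43rd president, gave the State of the Union address.",
        "George H. W. Bush served as the 41st president.",
    ]

    tallies = Counter(fact for doc in docs for fact in candidate_facts(doc))
    MIN_SUPPORT = 2   # arbitrary threshold: "when in doubt, leave it out"
    for fact, count in tallies.items():
        if count >= MIN_SUPPORT:
            print(fact, "- supported by", count, "documents")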

Whether this round of implementations will pull this off any better than previous attempts is for the market to decide.  We suspect in some cases some of these fact extraction techniques will yield satisfactory results.

Visualization and Sentiment Extraction:

By far the "sexiest" demos that vendors give use automatically generated graphics to give a high level view of perhaps thousands of matching documents.  These techniques gather data in similar ways to the previous items we described, namely applying statistical methods to extracted words, phrases and entities.  The first mockup, the "floating cloud of interconnected nodes" has been show by many vendors for over 10 years in different forms.  Are they useful?  Do they work at all?  And once the newness wares off, would users ever get in the habit of navigating results in such a novel way?  We just don't have the answer.  We've seen plenty of implementations that were not productive in practice, so we are a bit skeptical.  But lots of programmers have been spending lots of time refining these methods, and we're at a point where potential customers could potentially evaluate different implementations side by side; if real competition heats up in the enterprise space it could spur lots of improvements.  Just as an example, letting editors conveniently adjust these meshes, interactively in real time from the native interface, it would probably make a big difference; improving UIs, while not trivial, doesn't require any AI leaps.

[more radical approaches to gauging overall search results]

We've seen some even more impressive demos of graphics-based visualization of results, but due to NDAs we can't discuss them here.  If you are interested, talk to your vendors and ask for a demo; there are some pretty interesting techniques being packaged for the enterprise.

Other Ideas Bubbling Up (not illustrated)

There are quite a few less sexy, but potentially still quite useful ways of improving search results.

Leveraging "Context":

If a system knows something about you, your job function, searches you've done before, or even the first link in a results list that you clicked on, then perhaps a search can be rapidly tuned.  Vendors are talking a lot about this.

Social Networking / "Voting":

Identifying similar users, or using the activity of other users to gauge relevancy, is another general technique that is being re-implemented in many different ways.

Some of these methods are similar to Google's link ranking; some vendors count click-through as a gauge of relevancy, which may work better inside the enterprise, where titles are at least not intentionally misleading.  Other systems feel more like Amazon's and Netflix's suggestion engines.
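
To give a feel for the "voting" flavor, here is a toy sketch (the file names and click log are invented) of the Amazon/Netflix-style idea: documents that the same users click during the same search sessions are treated as related, and can be suggested alongside one another.

    from collections import defaultdict
    from itertools import combinations

    # Hypothetical click log: the set of documents each user clicked in one session.
    sessions = [
        {"vpn-setup.html", "vpn-troubleshooting.html"},
        {"vpn-setup.html", "vpn-troubleshooting.html", "remote-access-policy.html"},
        {"expense-report-form.html"},
    ]

    co_clicks = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for a, b in combinations(sorted(session), 2):
            co_clicks[a][b] += 1
            co_clicks[b][a] += 1

    def also_clicked(doc, top_n=3):
        # Rank other documents by how often they were clicked alongside 'doc'.
        related = co_clicks[doc]
        return sorted(related, key=related.get, reverse=True)[:top_n]

    print(also_clicked("vpn-setup.html"))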

The User's "Frame of Mind":

Previously we talked about sentiment extraction, trying to gauge the opinions of content authors.  Yahoo has tried something different, allowing folks to push a slider between "Research" and "Purchase", re-weighting results towards pure authoritative information, or towards vendors offering specific products.  We laud this experiment.  In an enterprise we could imagine sliders that move between "marketing" and "technical".  Searches for product names often bring back a mix of press releases and troubleshooting / FAQ articles; this type of slider would help select one type of data vs. the other.  Some vendors would even say that, by moving the slider, the user has provided additional "context".
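
A "frame of mind" slider like the one described above could be as simple as re-weighting two precomputed scores per document (the field names and scores below are made up for illustration; this is a sketch of the idea, not Yahoo's implementation):

    def slider_rank(results, slider):
        # slider = 0.0 leans fully "technical/research", 1.0 leans fully "marketing".
        return sorted(results,
                      key=lambda r: (1 - slider) * r["technical_score"] + slider * r["marketing_score"],
                      reverse=True)

    results = [
        {"title": "WidgetPro press release",       "technical_score": 0.2, "marketing_score": 0.9},
        {"title": "WidgetPro troubleshooting FAQ", "technical_score": 0.9, "marketing_score": 0.1},
    ]
    print([r["title"] for r in slider_rank(results, slider=0.1)])   # FAQ first
    print([r["title"] for r in slider_rank(results, slider=0.9)])   # press release first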

Multiple "Databases":

There is a trend to run searches against multiple data sources and combine the results.  In enterprise software this is often called "federated search".  By itself this only adds to the exponential growth of results lists, but when combined judiciously with other data and presented in a logical, non-intrusive way, it can surface additional facts.  If you think about it, even reports and graphs can be thought of as a "search".

Since each vendor focuses on their own technology, this trend is not being driven by them.  Instead, smart users, programmers and IT folks are driving it.  As an example, for every search issued, the system quickly checks whether the query looks like an employee name.  In 95% of cases it doesn't, and nothing extra is added to the results list.  But in the few cases where there is a high-confidence employee match, the system adds a one-line suggestion like "Satish Jones, QA/Mountain View, x1028, email: statishs".  Or if a query like "sales figures for last quarter" is seen, a callout to a graphing package can be included in the results.  These are guesses: the user might actually have intended a full-text search looking for a document, but there's a chance this was the answer they were looking for.
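
A minimal sketch of that employee-name check might look like the following (the directory contents and confidence cutoff are hypothetical; the point is that the add-in stays silent unless the match is confident):

    import difflib

    # Hypothetical employee directory, for illustration only.
    EMPLOYEE_DIRECTORY = {
        "satish jones": "Satish Jones, QA/Mountain View, x1028, email: statishs",
        "maria lopez":  "Maria Lopez, Sales/Chicago, x2214, email: marial",
    }

    def employee_suggestion(query, cutoff=0.85):
        # Use a deliberately high cutoff: a bad guess is worse than no guess.
        matches = difflib.get_close_matches(query.lower(), list(EMPLOYEE_DIRECTORY),
                                            n=1, cutoff=cutoff)
        return EMPLOYEE_DIRECTORY[matches[0]] if matches else None

    print(employee_suggestion("Satish Jones"))   # confident match -> one-line callout
    print(employee_suggestion("sales figures"))  # no confident match -> add nothing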

Overuse of these features, or really "fuzzy" matching that often turned out to be incorrect, gave these types of add-ins a bad reputation.  This time around, discretion may make a better impression.

The Biggest Pushback

The biggest pushback we've had from clients on interactive search is that their users simply won't use it.  This can lead to a long discussion about usage rates and measurement techniques, etc.  Some customers have even presented reports showing low click through rates for initial drill down links that were offered.

In the end we believe that WELL IMPLEMENTED drill down search will be used, especially by knowledge workers whose livelihood depends on finding answers.

Some factors that affect usage rates of advanced interactive search:

  1. Providing quality suggestions
    Though this seems obvious, we suspect some bad implementations have soured users.
  2. A corollary to item 1: "when in doubt, leave it out"
    If the system doesn't have high confidence on statistically generated suggestions, then consider not displaying them for that search.  Bad suggestions are often WORSE than no suggestions.
  3. Less is sometimes more when it comes to suggestions.
    Extending on item 2, many systems can generate dozens of suggestions, but you should consider only presenting a handful of them at a time.  Adding a small "more…" link will give users the option of getting more help.
  4. Avoid overall clutter
    Don't put too many different types of navigators on the results list at once.
  5. Placement matters
    It's generally accepted that links along the top of results will get more attention than suggestions on the left or right, though items at the top should be kept very compact.
  6. Revisit your measurement of new-feature adoption rates
    When trying to measure the use of these new features, simple click-through percentages may not tell the whole story.  Please contact us if you would like to discuss this further.

But by far our biggest argument in favor of interactive search is that, yes, even Google does it.  When you type in what looks like a zip code, an address, or a well-known personality, Google suggests various links and images at the top of their results.  And their "Did you mean" spell checker is widely known.  Google has deployed these extra features in a conservative, stylish way, and suggests things only when it has fairly high confidence.  We suspect Google would not continue to offer these items if nobody used them.

In Summary

Exponential growth will eventually break even the best search engine's relevancy. Private data sets are now as large as the public Internet was when single shot relevancy failed there, and Google's workaround of link ranking won't work in the enterprise.

Instead of relying on "Single Shot Relevancy", start planning for a well implemented interactive search process.

Be a bit open-minded about new ideas; there are lots of new features headed your way.  But at the same time, don't be afraid to ask for an on-site proof of concept or pilot project using your own data.

To improve the efficiency of expensive knowledge workers any further, you'll probably need to look past the old-school "search dial-tone" functionality. It's a buyers' market for search right now and most vendors have new offerings - check them out!