Enterprise Search 2.0 and the Journey to Relevancy

One of the most common subjects we are asked about by our clients and friends is enterprise search relevance. It matters to people because if users can't find the content they need, it impacts their productivity. A number of analysts have written about the return on investment of search, and invariably they quote huge numbers based on how much time knowledge workers lose every day because they can't find content.

How do you know if you have a relevancy problem? Very rarely do users ever tell you "The relevance is not very good"; what you need to listen for is vague dissatisfaction. "I can't find anything" – "It's not as good as Google" – "It's easier to poke around until I find it".

When a user tells you they are not happy with search but they cannot really give you specific reasons, chances are it's the relevance.

Everyone who manages corporate search wants the silver bullet, the secret answer. Here's the answer: there is no magic solution. Relevancy, Grasshopper, is a journey, not a destination. If you're all packed, let's get started.

Single shot

In the early years of enterprise search, back when it was called "text retrieval", the general wisdom was that you simply entered your question and the software would show you the answer. We call this the fallacy of single shot relevance, which we've written about before. The irony is that text retrieval tended to work pretty well, in part because we were just not searching much content, and the collections we did search tended to be pretty homogeneous.

But now enterprise content has grown even larger than the entire internet was 10 years ago. It's not that uncommon to find millions of documents in a large corporate intranet, and we've even talked to folks who have well over 100 million documents they want to search. The chances that entering a single query will find the right content grow ever smaller - even if the user provides a long and complex query. Oh, and users rarely enter more than a few words at most.

Why do I need to learn our search engine when I was born with the ability to use Google? Google on the public internet is great. It delivers answers using its secret sauce, PageRank. The thing is, just about any search you do on Google will find content from dozens, if not thousands, of sites. If a few sites are missing, it's OK -- some other site has your answer for you. Unfortunately, your company probably has only one Investor Relations site, so your employee stock purchase plan had better show up when requested.

We're not aware of any specific studies, but a number of people have told us that the public Google index works better for their web site than even the Google appliance they run in-house. It makes sense: the in-house system can't take advantage of all the external links that make your content popular on the web.

Tuning

Many search technologies include complex query languages that let you "tune" the search technology's generic relevancy algorithm. In the early days, Verity made quite a stir with its weighted ‘Topic trees’, and a number of other vendors followed suit. And even today, many companies spend months – even years – building the ultimate taxonomy to improve findability. The secret here is that very few of the taxonomy products ever tell you how to take advantage of all that effort in your search engine, so much of the work is in vain.

We've written before about search tuning and we certainly encourage you to make efforts to tune your search engine as well as possible by using all the controls and knobs your search technology may provide. Consider your document structure, vocabulary, and directory layout. Your objective is to generate a query term histogram with a nice "elbow" as illustrated in Figure 1.

[Figure 1: An ideal query term histogram]

The elbow, the point at which the slope of the line turns, shows you where the "short head" ends and the "long tail" begins for your site. (Note: Depending on the structure of your site, you may need to do this kind of analysis for each section of your site – marketing, HR, research, etc. – since the nature of the query terms may differ widely.)

[Note: In the figure above, the vertical axis shows the number of hits generated by each search term in the defined time period. The horizontal axis represents the top queries, where each column represents a query term.]
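
As a rough illustration of the histogram and its elbow, here is a minimal Python sketch. It assumes you can export your search log as a plain list of query strings, and it uses one common heuristic for locating the elbow: the point that falls farthest below a straight line drawn from the first bar to the last. The function names and the heuristic are our assumptions for illustration, not a feature of any particular search product.

```python
from collections import Counter


def query_histogram(queries):
    """Return (query, count) pairs sorted from most to least frequent."""
    # Assumption: queries that differ only in case or padding are the same query.
    normalized = (q.strip().lower() for q in queries)
    return Counter(q for q in normalized if q).most_common()


def find_elbow(counts):
    """Return the index of the bar farthest below the first-to-last chord.

    `counts` must be sorted in descending order. Returns 0 when there are
    too few bars, or no bend at all, to speak of an elbow.
    """
    n = len(counts)
    if n < 3:
        return 0
    first, last = counts[0], counts[-1]
    best_index, best_gap = 0, 0.0
    for i, count in enumerate(counts):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - count
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index
```

With this sketch, the bars to the left of the returned index can be treated as the short head and everything from that index onward as the long tail. Run it separately per site section if the query vocabulary differs widely between them.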

The lesson here is that the overall query tuning algorithm you use will dictate the general shape of the histogram and improve results in general; but it alone won't be the magic bullet. Your journey has just begun.

The Top 20

Once you know the terms that make up the short head for your site (or sub-site), review the results for each of those terms. If the top two or three results for each query term are pretty good, create a new histogram of the remaining query terms. Find the elbow in the new histogram, and repeat the process. Continue doing so until you get into the "noise" category, that part of the long tail with no well-defined elbow.
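
The repeat-until-noise loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the histogram is a list of (term, count) pairs sorted by descending count, the elbow is taken to be the bar farthest below a straight line from the first bar to the last, and "no well-defined elbow" is approximated by a hypothetical threshold, `min_gap_ratio`, that you would need to tune for your own logs.

```python
def elbow(counts):
    """Return (index, gap) of the bar farthest below the first-to-last chord."""
    n = len(counts)
    if n < 3:
        return 0, 0.0
    first, last = counts[0], counts[-1]
    gaps = [first + (last - first) * i / (n - 1) - c for i, c in enumerate(counts)]
    index = max(range(n), key=gaps.__getitem__)
    return index, gaps[index]


def review_batches(histogram, min_gap_ratio=0.25):
    """Split a sorted (term, count) histogram into successive short heads.

    Returns (batches, noise): each batch is one short head to review,
    and noise is the remaining long tail with no clear elbow.
    """
    batches = []
    remaining = list(histogram)
    while remaining:
        counts = [count for _, count in remaining]
        index, gap = elbow(counts)
        # Assumed stopping rule: the bend is too shallow to call an elbow.
        if index == 0 or gap < min_gap_ratio * counts[0]:
            break
        batches.append(remaining[:index])
        remaining = remaining[index:]
    return batches, remaining
```

Each batch is a list of terms whose top results you would review by hand before moving on to the next batch. The code only decides where to draw the lines, not whether the results are good.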

Chances are good that the results will not be great for at least some of your top terms. When this happens, turn to "best bets" or recommended links to solve the problem – remember, the query tuning provides general relevance and shapes the histogram, but it won't solve all of your relevance problems. Work with your content owners to identify one or two results that provide pretty good answers and use your search engine tools to promote those documents.
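
Most search products ship their own best bets feature, so treat the following Python sketch purely as an illustration of the idea, not as any vendor's API. It assumes a hand-maintained table that maps a normalized query to one or two promoted URLs. The table contents and the `intranet.example.com` addresses are hypothetical.

```python
# Hypothetical best bets table, maintained with your content owners.
BEST_BETS = {
    "stock purchase plan": ["https://intranet.example.com/ir/espp"],
    "vacation policy": ["https://intranet.example.com/hr/vacation"],
}


def apply_best_bets(query, organic_results, best_bets=BEST_BETS):
    """Put promoted URLs first, then the engine's results without duplicates."""
    promoted = best_bets.get(query.strip().lower(), [])
    rest = [url for url in organic_results if url not in promoted]
    return promoted + rest
```

Queries with no entry in the table pass through unchanged, so general relevance tuning still does the work for the long tail.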

Good news: this part of the process, this leg of the journey, is complete.

Search as a Journey

We've covered query tuning at a high level, but to paraphrase Arlo Guthrie, that's not what we came here to tell you about. We want to talk about how humans look for things in the real world, and how you can use that understanding to lead your users to better results.

How do you find things on a public web search site like Google or Yahoo? Chances are you enter one or two terms, and look at the results. If you find the answer, great; but if you don't, I bet you don't go to the advanced search page – did you even know that there is an advanced search page on Google? Most likely you add or remove terms and try again. I bet this is how your corporate users surf the web as well. Why not make it easy for them to do the same thing in-house?

The current trend in advanced enterprise search is a technology called facets, dynamic navigators, or even parametric search. These search-driven hyperlinks engage the search user in a conversation by revealing some of the context that can help the user drill down and find the best document. Each time the user clicks another facet, the search technology generates a new set of facets – eliminating some, adding others. Once your user is thus engaged, he or she will probably continue and find some pretty relevant results.

Good news: to your back-end code, facets are the same as pull down options on your advanced search page. But facets guide your users to the answer without presenting a screen of an often overwhelming number of options.
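
To make the mechanics concrete, here is a minimal Python sketch of facet counting and refinement. It assumes each search result is a dictionary carrying metadata fields such as `department` and `type`. Those field names, and both function names, are hypothetical, and a real search engine would compute the counts for you.

```python
from collections import Counter


def facet_counts(results, fields):
    """Count the values of each metadata field across the current results."""
    counts = {}
    for field in fields:
        tally = Counter(doc[field] for doc in results if field in doc)
        # A facet with a single value cannot narrow the results, so drop it.
        if len(tally) > 1:
            counts[field] = dict(tally)
    return counts


def refine(results, field, value):
    """Keep only the results matching the facet the user clicked."""
    # This is the same filter an advanced-search pull-down would apply.
    return [doc for doc in results if doc.get(field) == value]
```

Calling `facet_counts` again on the output of `refine` yields the next set of facets: fields that no longer distinguish the remaining documents drop out, while the others show updated counts.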

A break in the journey

This is the point at which we will stop for now. You've learned that you need to use query tuning tools to generate a good query term histogram for your site, and that you then need to look at your search activity – what we call the Behavior-based Taxonomy for your site – to provide best bets for your top queries. And you've seen that Enterprise Search 2.0 practices begin by engaging the search user in a conversation, a journey, where each step lets you deliver better results.

This process is successful because it provides you and your search technology with something the public search portals want badly. We'll continue our journey towards relevance next month by focusing on that something: context.