Enterprise Search Blog

Poor Data Quality Gives Enterprise Search a Bad Rap - Part 1

Last Updated June 2017

By: Miles Kehoe, New Idea Engineering, Inc.

Find Part Two of this article here

Overview

Users continue to be frustrated with the quality of search results. Many of these problems can be traced back, at least in part, to poor data quality in the search indices. This means companies do not achieve the full return from their investment in enterprise grade search software.

But there is hope! There are metrics that you can use to detect many of these problems and to measure your progress. You can automate many of these tests, some fully and some partially. After all, buying an entirely new search engine may produce better results, but switching vendors can be painful and expensive; and if the data is at fault, you may find your new engine returns poor results as well.

There are two general aspects of Data Quality, as it applies to search engines:

Index Creation Data Quality Issues

  • Coverage: Is all of your data being indexed by your search engine?
  • Processes, logs, errors, etc.: Are you doing basic monitoring of the indexing process?
  • Document Meta Data: Are all the important fields and attributes of your documents being accurately captured?

Search-time Quality Issues

  • Relevancy: Are the results that are being returned relevant?
  • Vocabulary: Are you experiencing a vocabulary disconnect with your users?
  • Performance: Is your search engine up all the time and providing results to users in a reasonable amount of time?

Index Creation

Vocabulary: a "document" is a unit of data indexed and searched by your search engine; typically each document is equivalent to a web page on your site, or perhaps a Microsoft Word or Adobe PDF file, or a record in a database.

Like other information systems, search engines can suffer from "garbage in, garbage out". All serious search engines create highly optimized indices of your documents' contents on disk. These indices tabulate and record all the words and meta data items for all of your documents.

At search time, these compact indices are consulted (rather than re-scanning all of your actual documents). If the original content is needed, for example to view a specific document in the results list, the original document is retrieved at that time.

Basic Site Indexing

Vocabulary: a "spider" is a special type of document indexer that follows links on a web site, to eventually index the entire web site. It goes from web page to web page, via the HTML hyperlinks, until the entire site has been indexed.

Is Your Site Being Spidered?

The most basic question to ask is whether your entire site is being spidered and indexed. How many documents does your search engine report it has found? You should do a spot check and compare this count to other sources.

Public Sites: Using Google as a spot check

Getting a ballpark estimate from another source for a public web site is relatively easy. The quickest check is to just use Google. Issue a search on Google for a very common word on your site (such as part of your company name), and then add the site:yoursite.com filter to the end of your search. Don't use a common word like "the" or "and" as the search term, since Google ignores them. Choose a common word that is nonetheless unique to your site: you might even use a word that appears consistently in your navigation bar.

Example: For our web site (ideaeng.com), I ran the Google search:

home site:ideaeng.com

Google reported 78 documents back. Our own search engine shows only 71. At some point I will need to track down the discrepancy, but we're within 10% or so, and therefore the results seem at least reasonable.

Public Sites: Using a hosted engine like Freefind

Vocabulary: ASP / Application Service Provider: A vendor that offers software or services that are provided completely over the Internet, versus having to install software on your local server. Also called a "hosted" service.

There are many "hosted" search engines, meaning the search engine resides on the vendor's server and you don't need to install any software on your own machine. This is in stark contrast to traditional vendors like Verity, Endeca, FAST and others, which require you to install the software on your own server.

Some of these hosted search engines are actually quite good, especially for simpler web sites. Our web site isn't particularly fancy, so we've used Freefind for quite some time. But even if you use one of the more traditional search vendors, you still might consider signing up for one of the hosted services as well. Some of them are free, or are quite reasonably priced. A service like Freefind can spider and index your site in parallel with your traditional search engine, so you can compare the results of each. A service like this could also act as a "hot backup" for your primary engine.

Private / Intranet Sites

This is not quite as convenient to spot check, but still worth the effort. One idea is to install and use another spider inside your firewall. If you search on Google for open source spider you will find lots of choices. We've heard good things about Nutch.

Some vendors also offer low cost spiders to enterprises. If your data is file system based, you could use the Unix "find" command, something like:

find /vol1/corpdocs -type f | wc -l

This number could be off if some files are in the file system but are NOT linked to by other web pages. If your data resides in a traditional RDBMS, then perhaps SQL can give you an estimate, with something like:

select count(*) from legal_documents

Of course the query you would use will be application specific.

Other Basic things to Check

Are your indexes up to date? We often find clients who are unaware that their indexing jobs have not been running for days or weeks.

Check the file system dates on the search engine index files; the most recent file date should be a good indicator.
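If the index lives on a Unix file system, a quick spot check might look like the sketch below. The path /opt/searchengine/index is only a placeholder; use whatever index directory your vendor documents.

# List index files newest first; a stale date at the top means indexing has stalled
ls -lt /opt/searchengine/index | head

# Or flag any index file that hasn't been touched in over two days
find /opt/searchengine/index -type f -mtime +2 -print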

Are you capturing non-HTML content such as Word, Excel, PDF, etc.? While the public Internet consists mostly of HTML content, Intranet applications typically have a high percentage of other document types. In your search engine, do a search for each mime type, and then compare that to the totals for that mime type found by other means. Checking on a per-mime-type basis provides a much better spot check than just checking the grand total of all documents.
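On the file system side, rough per-type counts are easy to generate for comparison. This sketch assumes your documents live under /vol1/corpdocs, as in the earlier find example:

# Rough per-type counts to compare against per-mime-type searches in your engine
find /vol1/corpdocs -type f -name '*.pdf' | wc -l
find /vol1/corpdocs -type f \( -name '*.doc' -o -name '*.docx' \) | wc -l
find /vol1/corpdocs -type f \( -name '*.xls' -o -name '*.xlsx' \) | wc -l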

Consider a Thorough and Automated Audit

Vocabulary: Vertical Application: In this context, a highly specialized search application, which may be more complex than a "generic" web search application. Examples would include a pharmaceutical research database, legal evidence management and discovery, a corporate or technical documentation library, or managing regulatory and compliance documents.

Vocabulary: Compliance: In this context, ensuring that 100% of data is represented and searchable in a vertical application. For example, making sure that a search for a particular client's name will always reliably bring back all pertinent records.

Vocabulary: Sarbanes-Oxley Act: AKA "SOX": Compliance regulations relating to what information companies must maintain and provide. SOX compliance is often related to Knowledge Management Systems and related search technology. See http://www.sarbanes-oxley-forum.com

Many vertical databases have very stringent data quality standards. Failure to index and retrieve any piece of data may have legal and/or financial consequences. If your application is in this category, then we urge you to create an audit process that can be run automatically on a regular basis, and that reports any non-compliance in a clear and highly visible way.

A manual audit is better than no audit, but processes that are not triggered automatically have a tendency not to be carried out reliably after the first few cycles, and inconsistencies caused by human error can make systemic issues harder to spot. At a minimum, an audit process should compare an exact list of URLs from the search engine with an exact list of document keys obtained by some other method.

In many cases the URLs or keys will need to be "normalized" in order to be compared reliably. As an example, suppose a vertical application is tracking employee records. In the corporate database, the records might look something like this:

Employee_ID    First_Name    Last_Name
11             Abe           Baker
12             Cindy         Dunn
13             Edward        Funk

The URLs returned by the search engine for the employees might look like this:

https://corp.yourcompany.com/hr/cgi-bin/viewEmployee.cgi?employeeID=11
https://corp.yourcompany.com/hr/cgi-bin/viewEmployee.cgi?employeeID=12
https://corp.yourcompany.com/hr/cgi-bin/viewEmployee.cgi?employeeID=13

Obviously you can get SQL to give you a list of just the employee IDs with a query like:

select Employee_ID from employee order by Employee_ID

and you will get back a list like:

Employee_ID
11
12
13

But because the lists are in different formats, human review of the results can be prone to error. The two lists need to be in the same format so you can quickly and accurately compare the results.

Once you have two lists of "normalized" document keys, you can compare them. On Unix the sort and diff commands can be very useful. SQL or Perl are also useful tools.
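As a minimal sketch of the employee example, assuming the search engine can export its indexed URLs to a file (urls_from_engine.txt here) and the SQL output has been saved to ids_from_db.txt, the comparison might look like this; the file names and the URL pattern are just placeholders for your own application:

# Strip each URL down to its bare employee ID and sort numerically
sed 's/.*employeeID=//' urls_from_engine.txt | sort -n > ids_from_engine.txt

# Clean the database list the same way: drop the header row and any whitespace
grep -v Employee_ID ids_from_db.txt | tr -d ' \t' | sort -n > ids_from_db_clean.txt

# Any output here is a record present on one side but missing from the other
diff ids_from_engine.txt ids_from_db_clean.txt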

Checking Process Return Codes and Logs

There are quite a few basic things that can be checked automatically as an indexer or spider runs. We stress automation because, really, how long would it take you to notice if the indexer were failing? Few clients actually check this manually with any frequency. A sketch of a simple automated wrapper follows the checklist below.

Some of the things you might want to check include:

  • Check for process return codes using the mechanism in your operating system. Examples: ERRORLEVEL in Windows batch files, $? in Unix shells such as ksh, etc.
  • Do you check index process logs for errors? Scan your logs looking for signs of trouble, such as errors and warnings. The Unix grep utility can be very useful for spotting error messages.
  • Check for an instance of the process that may still running from yesterday. Make sure your script doesn't "clobber" itself if the instance that was started yesterday is still running. Consider using a "lock file".
  • Check the length of time your indexing normally takes to run. Warn if it suddenly took much LESS time: a spider that mistakenly did nothing may exit quickly without complaint, but will not have accomplished its task. Also warn if it suddenly took much MORE time. This may be OK, since perhaps more content has been added, but if it takes more than 20% longer than average to run, it's probably worth a peek.
  • Check on the URLs listed in the indexer's log files. The system should at least report files or URLs that were NOT properly indexed. Does your spider show you what WAS indexed? That might be worth a spot check every now and then.
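To tie several of these checks together, a nightly wrapper along the following lines can help. It is only a sketch: the indexer command, log path, lock file location, and the five-minute threshold are placeholders to replace with values appropriate to your own environment.

#!/bin/sh
# Hypothetical wrapper around a nightly indexing run

LOCK=/tmp/indexer.lock
LOG=/var/log/indexer/run.log

# Don't clobber a run that is still going from yesterday
if [ -f "$LOCK" ]; then
    echo "Indexer appears to be running already; aborting" >&2
    exit 1
fi
touch "$LOCK"

START=$(date +%s)
/opt/searchengine/bin/run_indexer > "$LOG" 2>&1
STATUS=$?
ELAPSED=$(( $(date +%s) - START ))
rm -f "$LOCK"

# Check the process return code
if [ "$STATUS" -ne 0 ]; then
    echo "Indexer exited with status $STATUS" >&2
fi

# Scan the log for signs of trouble
grep -iE 'error|warn' "$LOG"

# Flag runs that finish suspiciously fast (under five minutes here)
if [ "$ELAPSED" -lt 300 ]; then
    echo "Indexing finished in only $ELAPSED seconds - did it really run?" >&2
fi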