Adjusting Search Engine Relevancy


Improving results list relevancy can greatly reduce visitor frustration and improve overall visitor retention. A well-tuned search engine can also reduce calls to customer service and support, as visitors are able to find answers to questions on their own.

The average Web search entered by surfers is one to two words long (1.4 words), so search engines often have trouble ranking the large number of matching documents. However, many enterprise-class search engines allow site developers to modify visitor searches before they are handed to the underlying search engine. This important but often underutilized feature can be used to improve results list relevancy, returning better results to your customers. The basic idea is to augment the usual one- or two-word visitor searches with additional weighted terms so that important documents are given a higher "score". We will refer to these adjustments as "search tweaking".

This is not an exact science. For any given set of search-engine tweaks, a sharp QA person can probably find edge cases that are not helped by your adjustments, where the "best" web page may actually be pushed down in the results list. If you embark on such a project, be sure to set expectations appropriately. Also, you should have a way to easily enable and disable search tweaking, perhaps even making it a check box on your advanced search page. You might also consider bypassing your search tweaking logic if you detect advanced query syntax; if a visitor is using advanced query syntax, presumably they know what they are doing, and modifying their search may only serve to frustrate them.
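
The on/off switch and the advanced-syntax bypass described above might be sketched as follows in Python. The function name, the tweaking_enabled flag, and the list of operator patterns are assumptions made for illustration; a real check would use the operators your own engine supports.

```python
import re

# Hypothetical patterns that suggest the visitor typed advanced query syntax.
# Adjust these to the operators your own search engine actually supports.
ADVANCED_SYNTAX = re.compile(
    r'"'                        # quoted phrase
    r"|<[A-Za-z/]+>"            # angle-bracket operators such as <in>
    r"|\b(?:AND|OR|NOT)\b"      # upper-case boolean operators
    r"|[()\[\]*?+]"             # grouping, weights, wildcards
)


def should_tweak(query: str, tweaking_enabled: bool = True) -> bool:
    """Return True when the query should get the extra weighted terms."""
    if not tweaking_enabled:    # e.g. the check box on the advanced search page is cleared
        return False
    if not query.strip():       # nothing to augment
        return False
    # Leave the search alone if the visitor appears to know the query language.
    return ADVANCED_SYNTAX.search(query) is None
```

A plain search such as "upgrades" would be tweaked, while a search containing a quoted phrase or an operator would be passed to the engine untouched.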

General areas of Search Tweaking:

Favoring documents or web pages where the search term appears in the title, or in other prominent meta fields, such as "keywords" or headings.
Favoring documents where the matching "word density" is higher
Favoring documents that are frequently clicked on as the result of a search, or that have many links pointing to them
Favoring NEWER documents
Favoring exact matches over "fuzzy" matches
Favoring exact phrase matches over partial phrase matches
Favoring documents found in certain sections of your web site, perhaps by looking at the URL.
Favoring small to midsize documents over large documents
Favoring documents in a certain format; for example if instruction manuals are known to be in PDF.

By default, some search engines already employ some combination of the first three techniques. So before tweaking, you should check your documentation or do some testing. If your engine doesn't implement one of them, it's likely a good place to start! As examples, by default Verity K2 does consider word density in its scoring; Ultraseek favors matches in titles and heading zones; and Google is well known for its popularity-based ranking techniques.

Examples:

The following examples use Verity's query syntax. These are "sub-queries", which would later need to be combined with other sub-queries and the main search terms to form the complete, weighted search. These examples assume a one-word search of "upgrades".

Matching the word in the title:

<word>('upgrades')<in>title

Matching the actual word in the URL:

URL<CONTAINS>'upgrades'

Matching recent documents:

date > today-30

When matching "recent" documents, it's important that the dates of your content be accurate. In last month's issue we discussed how many web servers are misconfigured in this regard, so may give false dates; in such cases, the above query would likely match ALL pages on the site, which would not be very helpful.
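
Before relying on a date-based sub-query, it may be worth checking whether the indexed dates look plausible. The Python sketch below assumes you can export the last-modified date of every indexed page as a list; the function name and the 90% threshold are illustrative choices, not something taken from any particular engine.

```python
from datetime import date, timedelta


def dates_look_suspicious(doc_dates, today, window_days=30, threshold=0.9):
    """Return True if nearly all documents claim to be 'recent'.

    doc_dates   -- iterable of datetime.date, one per indexed page
    today       -- the date to measure from (passed in so the check is repeatable)
    window_days -- what counts as recent, matching the 'today-30' sub-query
    threshold   -- fraction of recent pages that triggers the warning (an assumption)
    """
    doc_dates = list(doc_dates)
    if not doc_dates:
        return False
    cutoff = today - timedelta(days=window_days)
    recent = sum(1 for d in doc_dates if d > cutoff)
    return recent / len(doc_dates) >= threshold
```

If this returns True for your site, the server is probably reporting the time each page was served rather than the time it was last changed, and the date sub-query will not discriminate between pages.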

An alternate workaround for matching recent documents:

URL<CONTAINS>2003

Clearly the above workaround is not going to work as well as actually fixing the dates on your web site. Many URLs don't even have the year in them.

You could even look for the year by itself:

<word>('2003')

Of course the syntax for performing any of these sub-queries will be different for each search engine, so please consult your documentation.
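
As a rough illustration, the sub-queries shown above could be generated from the visitor's search term with a few small helper functions. This Python sketch only reproduces the Verity syntax used in this article; the function names are made up for the example, and the quote handling is a simplistic assumption rather than Verity's documented escaping rules.

```python
def clean_term(term: str) -> str:
    """Drop single quotes that would break the quoted sub-queries (simplistic assumption)."""
    return term.replace("'", "").strip()


def title_subquery(term: str) -> str:
    """Match the word in the title."""
    return f"<word>('{clean_term(term)}')<in>title"


def url_subquery(term: str) -> str:
    """Match the actual word in the URL."""
    return f"URL<CONTAINS>'{clean_term(term)}'"


def recent_subquery(days: int) -> str:
    """Match documents dated within the last `days` days."""
    return f"date > today-{days}"
```

For another engine, only the format strings inside these helpers would need to change.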

Once you have created sub-searches that match on these ancillary qualifiers, it's time to combine them with the main search. Most search engines have syntax for combining sub-searches, and even for giving each sub-search its own weight. For example, the open source Lucene search engine provides a way to "boost" certain sub-searches. In this Verity example, we will use Verity's <accrue> operator, which acts like a traditional "OR" operator but gives additional weight when multiple sub-queries match. The Verity syntax for weighting uses square brackets.

Please note that subtlety in weighting will often provide the best results. The idea is to slightly boost certain documents; if you give the sub-queries too much weight you can actually overwhelm any internal ranking that the engine is trying to provide. Also, remember that the main query will almost always match by itself, so it will already have a rather large score. The idea is to just "add" to that score, not to "replace" it. Our advice: start with small weights first!
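
To see why small weights matter, consider a deliberately simplified model in Python. It is not how Verity or any other engine actually computes scores; it simply assumes the final score is the engine's own score plus a fixed bonus for each ancillary sub-query that matches. The two documents and their scores are invented for the example.

```python
def toy_score(base_score, boosts_matched, boost_weight):
    """Toy additive model: engine's own score plus a fixed bonus per matching boost.

    This is NOT any real engine's formula; it only illustrates how large
    boost weights can drown out the engine's own ranking.
    """
    return base_score + boost_weight * boosts_matched


# Two hypothetical documents: (engine's own score, number of boosts matched)
strong_match = (0.90, 0)   # very relevant, but matches no ancillary sub-queries
weak_match = (0.50, 2)     # marginally relevant, but recent and has the year in its URL


def ranking(boost_weight):
    """Return the document names ordered from highest to lowest toy score."""
    scores = {
        "strong_match": toy_score(*strong_match, boost_weight),
        "weak_match": toy_score(*weak_match, boost_weight),
    }
    return sorted(scores, key=scores.get, reverse=True)
```

In this toy model, a boost weight of 0.05 leaves the strongly matching document on top, while a boost weight of 0.5 lets the weakly matching document overtake it.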

Here is a partially expanded search for "upgrades", showing the Verity syntax for combining the sub-searches. It is displayed on multiple lines for readability, but would normally all be on one line.

<ACCRUE>(
    [0.8]<ACCRUE>(
        [0.90](<word>('upgrades')<in>title),
        [0.60]'upgrades',
        [0.80](URL<CONTAINS>'upgrades')
    ),
    [0.5]<ACCRUE>(
        [.10]size<75000,
        [.10](date > today-7),
        [.05](date > today-30),
        [.20](URL<CONTAINS>2003),
        [.15](URL<CONTAINS>2002),
        [.10](URL<CONTAINS>2001)
    )
)

This search is actually nested into two main branches. The upper branch is concerned more with "direct evidence" and is given an overall weight of 80% under the main accrue operator. The secondary branch is more "extra credit" for items not directly related to the search term. Notice that the product of its efforts is only weighted at 50%. Verity offers many other advanced query language features and operators, as do other enterprise-class search engines.
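
A query like the one above is tedious to assemble by hand for every search. The following Python sketch builds the same nested structure as a single line; the helper names are invented for this example, the weights are simply copied from the query above, and the term is inserted without any escaping, which a real implementation would need to handle.

```python
def weighted(weight: str, subquery: str) -> str:
    """Prefix a sub-query with a square-bracket weight, as in the example above."""
    return f"[{weight}]{subquery}"


def accrue(*parts: str) -> str:
    """Combine already-weighted sub-queries under an <ACCRUE> operator."""
    return "<ACCRUE>(" + ",".join(parts) + ")"


def build_query(term: str) -> str:
    """Build the two-branch weighted query for a one-word search term."""
    # Branch 1: "direct evidence" about the search term itself.
    direct = accrue(
        weighted("0.90", f"(<word>('{term}')<in>title)"),
        weighted("0.60", f"'{term}'"),
        weighted("0.80", f"(URL<CONTAINS>'{term}')"),
    )
    # Branch 2: "extra credit" that does not depend on the search term.
    extra = accrue(
        weighted(".10", "size<75000"),
        weighted(".10", "(date > today-7)"),
        weighted(".05", "(date > today-30)"),
        weighted(".20", "(URL<CONTAINS>2003)"),
        weighted(".15", "(URL<CONTAINS>2002)"),
        weighted(".10", "(URL<CONTAINS>2001)"),
    )
    return accrue(weighted("0.8", direct), weighted("0.5", extra))
```

Calling build_query("upgrades") yields the query shown above with the line breaks and indentation removed. Keeping the weights in one place like this also makes it easier to experiment with smaller or larger values.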

Finally, there are several common mistakes sites can make that will actually break or seriously impair the built-in search ranking algorithms of many engines.

Common items that BREAK search engine relevancy:

Problem:

Common words used in navigation menus; these words appear on every page, such that searches for those terms will bring back most of the pages on the site!

Possible Fixes:

See if your search engine allows for masking out subsections of individual pages. Alternatively, have your server-side scripting look at the HTTP user agent field and remove the offending words when the spider is visiting. Of course, you will still need to provide the hyperlinks themselves!
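
The user-agent approach might look something like the following Python sketch. The spider name, the function name, and the shape of the links list are all assumptions for illustration; the check is meant for your own site-search spider, and you would need to confirm that your spider still follows links that have no visible text.

```python
from html import escape

# Hypothetical user-agent substring sent by your own site-search spider.
SITE_SPIDER_TOKEN = "MySiteSearchSpider"


def render_nav(links, user_agent: str) -> str:
    """Render navigation links; omit the visible words for the site-search spider.

    links -- list of (url, label) pairs
    """
    is_spider = SITE_SPIDER_TOKEN.lower() in (user_agent or "").lower()
    items = []
    for url, label in links:
        # The hyperlink itself is always present so the spider can follow it.
        text = "" if is_spider else escape(label)
        items.append(f'<a href="{escape(url, quote=True)}">{text}</a>')
    return "\n".join(items)
```

Ordinary visitors see the full menu, while the spider receives the same links without the menu words that would otherwise be indexed on every page.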

Problem:

More than one logical document on a single physical web page, AKA "FAQ format" pages. If a visitor has typed a multi-word search, one word might match "faq # 5", while another word might match "faq # 9", but the search engine will mistakenly treat those two matches as logically related and return that page. Also, the titles in the results list will not reflect the specific FAQ that matched.

Fix:

BREAK UP these FAQs into SEPARATE HTML pages. Also, make sure that the main listing of FAQs, which will still contain lots of keyword-rich titles, is NOT included in your site's search index.
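
If your FAQs are already stored as question and answer pairs, splitting them can be automated. This Python sketch assumes such a list is available; the file naming scheme and the bare-bones page layout are placeholders for your own template.

```python
from html import escape


def faq_pages(faqs):
    """Turn (question, answer) pairs into one small HTML page per FAQ.

    Returns a dict mapping a generated file name to the page's HTML.
    The question becomes the <title>, so the results list shows the specific FAQ.
    """
    pages = {}
    for number, (question, answer) in enumerate(faqs, start=1):
        filename = f"faq-{number}.html"
        pages[filename] = (
            "<html><head>"
            f"<title>{escape(question)}</title>"
            "</head><body>"
            f"<h1>{escape(question)}</h1>"
            f"<p>{escape(answer)}</p>"
            "</body></html>"
        )
    return pages
```

Because each page now holds exactly one question and answer, a multi-word search can only match a page when all of the matching words belong to the same FAQ.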

Problem:

"Broken" meta fields. Many sites have stuffed their meta fields with lots of ancillary words to try to attract the attention of Web Portal search engines such as Yahoo and Google. This extra noise can often cause false matches within your own site. Also, many portals have stopped looking at these meta fields anyway, specifically because webmasters were "stuffing" them.

And an even bigger, related problem is that many sites take a meta-bloated template and COPY IT over and over again, as they create new web pages. Since the meta fields are not directly visible in the browser, they often forget to edit them for each individual page. As a result, hundreds of pages on their site may have the exact same meta fields.

Fix:

Review and edit your meta fields. And realize that, in this day and age, bloated meta fields don't guarantee you a good ranking on portal sites anyway.
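
A quick way to find copied templates is to group pages by their meta fields and list any group with more than one page. The Python sketch below uses the standard library's html.parser module; it assumes you already have each page's HTML in memory, keyed by URL, and it only looks at the "keywords" and "description" fields.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the content of <meta name="keywords"> and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in ("keywords", "description"):
            self.meta[name] = (attrs.get("content") or "").strip()


def find_duplicate_meta(pages):
    """pages maps URL -> HTML text. Returns groups of URLs sharing identical meta fields."""
    groups = defaultdict(list)
    for url, html_text in pages.items():
        parser = MetaExtractor()
        parser.feed(html_text)
        key = (parser.meta.get("keywords", ""), parser.meta.get("description", ""))
        if any(key):                      # ignore pages with no meta fields at all
            groups[key].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

Each group in the result is a set of pages that were probably created from the same template and never edited, which makes them good candidates for review.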

Problem:

Over-expanding searches. Many engines offer "fuzzy" matching operators such as word-ending wildcarding ('ed', 's', 'ing', etc.), "soundex" or common misspellings, or even operators like Thesaurus matching. These operators will match pages that may not contain the original search term that was entered. When used in moderation, they can be helpful. But when overused, these operators can bring back thousands of irrelevant pages and swamp your engine's relevancy ranking.

Possible Fixes:

One fix is to offer these query-expansion operators as an option, perhaps a check box on an advanced search form. Another is to offer query expansion to the visitor only if zero documents are returned by the original search, so that they can easily resubmit their search with these options enabled, with a single click. This puts the visitor in control.
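
The zero-results approach could be sketched in Python as follows. Both run_search and expand_query are hypothetical callables standing in for your engine's search call and for whatever expansion syntax it offers; neither is part of any real API.

```python
def search_with_fallback(query, run_search, expand_query):
    """Run the strict search first; only suggest expansion when nothing matches.

    run_search   -- hypothetical callable: query string -> list of results
    expand_query -- hypothetical callable: query string -> expanded query string
    Returns (results, suggested_query); suggested_query is None unless the
    strict search came back empty.
    """
    results = run_search(query)
    if results:
        return results, None
    # Nothing matched: hand back an expanded query the page can offer as a
    # one-click "try a broader search" link, rather than running it silently.
    return [], expand_query(query)
```

The results page would show the normal list when results come back, and a one-click link built from suggested_query when they do not.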

If you want to enable these expansion operators by default, consider ranking them at a much lower weight. You might give "direct evidence" an 80% ranking, whereas query expansion or fuzzy matches are given only an additional 20%.

One final thought, though not technically a "fix", is to consider that bringing back zero documents is sometimes better than bringing back completely irrelevant matches. If a visitor searches for something that really isn't on your site, and you bring back 3,000 bogus matches, the visitor may lose confidence in your site altogether; bringing back zero might instead prompt them to try a different search. Webmasters may be reluctant to ever bring back zero hits, for fear of losing visitors, but there are other strategies for addressing this.