The Art and Science of Query Cooking

What is query cooking and why would you want to use it?

Simply stated, query cooking is rewriting a query the user entered before running it. There are many reasons why you may want to do this. You could modify the query so that hits in more relevant fields are displayed first in the returned results. Security could be added to the query. You may want to optimize the query so it runs faster and uses fewer resources. You could also tune the query so that the results are what the user really wanted.

Improving the Search Experience

Each company has its own ideas of how the user's query should behave. Maybe one field should be given a higher weight than others, or synonyms should be looked up and added to the query. A common problem is searching for names, as in an Author zone. Sometimes names are listed with the last name first, separated by a comma, as in "Fiedler, Arthur". Other times names may include the middle initial, as in "Alfred E Newman". One way to improve the result set returned by Verity when a user searches in the Author zone is to cook the query first. Here is an example of how one customer solved the problem:

(Tony Hillerman) <in> Author

becomes:

<near/3>( Tony , Hillerman) <in> Author

Here is another example, which demonstrates a rule set that is applied to the Author field only when a name has two or more words. Notice that the query cooking has been applied only to "Arthur Conan Doyle":

(Arthur Conan Doyle, Madonna) <in> Author

becomes:

<near/4>(Arthur, Conan, Doyle) <in> Author <or> (Madonna) <in> Author

This looks like a straightforward transformation until you realize that the user may enter a more complicated VQL statement, as in:

(Tony Hillerman) <in> (Author, Title, Description)

or worse yet:

(Hillerman <not> Tony) <in> (Author, Title, Description)

The Title and Description zones may have rules specific to them or no rules at all. The <not> must also be dealt with in a way that is consistent with the user's intent. Raritan Technologies has developed a VQL Parser with hooks for writing field-specific rules that take word occurrence into account and address the above problems.
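For the simple case shown in the first two examples (a comma-separated list of names in a single zone), the rule can be sketched in a few lines of Python. This is only an illustration, not the customer's or Raritan's implementation: the function name cook_author_names is hypothetical, the proximity value (word count plus one) is inferred from the two examples above, and the sketch does not parse full VQL, so it would not handle the multi-zone or <not> queries just described.

```python
def cook_author_names(names, zone="Author"):
    """Cook a comma-separated list of names for a single zone.

    Assumption: multi-word names become <near/N> with N = word count + 1,
    which matches the two examples in the article. Single words pass through.
    """
    clauses = []
    for name in names.split(","):
        words = name.split()
        if not words:
            continue  # skip empty entries such as a trailing comma
        if len(words) >= 2:
            clause = "<near/%d>(%s) <in> %s" % (
                len(words) + 1, ", ".join(words), zone)
        else:
            clause = "(%s) <in> %s" % (words[0], zone)
        clauses.append(clause)
    return " <or> ".join(clauses)


print(cook_author_names("Tony Hillerman"))
print(cook_author_names("Arthur Conan Doyle, Madonna"))
```

Note that this sketch treats every comma as a separator between names, so a last-name-first entry such as "Fiedler, Arthur" would need its own rule.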

Weighting the Results

Often in your Verity collection there is a field or zone whose data is more relevant than the rest. An example is a video store, where a hit in the title field has more relevance than, say, a hit in the full description field.

When a user types in "Ground Hog Day" you want the first result to be the movie itself. Other movies may have descriptions written "... is very similar to Ground Hog Day" or "... also starring Bill Murray who was in the movie Ground Hog Day". So how do you make sure the result referencing the actual movie is returned first? The answer is weighting. Verity has a weighting operator where you can multiply the score for a portion of a query by a factor from 0.1 to 1.0. The syntax is

[weight](query)

Using the example above, our query could look something like:

[1.0]( "Ground Hog Day" <in> Title ) <or> [.7]( "Ground Hog Day" )

Remember that the weighting factor is multiplied by the score Verity calculates. That score comes from a proprietary algorithm that considers the density of hits and other factors. So to be absolutely sure that the record with "Ground Hog Day" in the title is returned first, you can force exact scores onto the returned results. There is a <yesno> operator in Verity which converts any score greater than 0.1 to 1.0. All other scores become 0. Our new query, which guarantees that "Ground Hog Day" in the title will appear first, is:

[1.0]<yesno>( "Ground Hog Day" <in> Title ) <or> [.7]<yesno>( "Ground Hog Day" )

There are other weighting operators (<logsum> and <mult>) which can give you a finer control of weighting results. They are beyond the scope of this article and are well documented in the Verity Query Language manual by Verity.
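A cooking layer would typically build such a weighted query from the phrase the user typed. The following Python sketch shows one way to do that under stated assumptions: the function name weighted_zone_query and its parameters are hypothetical, weights are passed as strings so they appear exactly as written in the examples above, the range check uses the 0.1 to 1.0 range quoted in this article, and the placement of <yesno> between the weight and the parenthesis is an assumption that should be checked against the Verity Query Language manual.

```python
def weighted_zone_query(phrase, zone="Title", zone_weight="1.0",
                        anywhere_weight=".7", force_yesno=False):
    """Build '[w1]( "phrase" <in> zone ) <or> [w2]( "phrase" )'."""
    if '"' in phrase:
        raise ValueError("phrase must not contain double quotes")
    for weight in (zone_weight, anywhere_weight):
        # Range taken from the article's description of the weight operator.
        if not 0.1 <= float(weight) <= 1.0:
            raise ValueError("weight out of range: %s" % weight)
    # Assumption: <yesno> sits between the weight and the parenthesis.
    op = "<yesno>" if force_yesno else ""
    quoted = '"%s"' % phrase
    return "[%s]%s( %s <in> %s ) <or> [%s]%s( %s )" % (
        zone_weight, op, quoted, zone, anywhere_weight, op, quoted)


print(weighted_zone_query("Ground Hog Day"))
print(weighted_zone_query("Ground Hog Day", force_yesno=True))
```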

Optimizing Queries

Most users have little or no knowledge of Verity Query Language syntax or how one query will outperform another. By cooking their query, you have a chance of making it run faster. The Verity documentation states that zone searches are faster than field searches. If your collection has some fields that are also zones, you can take advantage of this knowledge. For example, if a user was searching for "bread" in a Title field and the Title field was also a zone, their query:

Title<contains>bread

could be optimized to:

bread<in>Title
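As a rough illustration, this rewrite can be done with a regular expression when the search term is a single word. The Python sketch below is an assumption-laden example, not a Verity tool: the function name contains_to_in is hypothetical, the caller must supply the list of fields that are also zones, and anything more complex than field<contains>word is left untouched.

```python
import re

# Matches the simple form field<contains>word, with optional spaces.
CONTAINS = re.compile(r"(\w+)\s*<contains>\s*(\w+)", re.IGNORECASE)


def contains_to_in(query, zoned_fields):
    """Rewrite field<contains>word to word<in>field for zoned fields only."""
    zoned = {name.lower() for name in zoned_fields}

    def swap(match):
        field, term = match.group(1), match.group(2)
        if field.lower() in zoned:
            return "%s<in>%s" % (term, field)
        return match.group(0)  # not a zone: leave the clause unchanged

    return CONTAINS.sub(swap, query)


print(contains_to_in("Title<contains>bread", ["Title"]))
```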

You may also monitor the queries users run and build a topic (a compiled Verity query; topics are again beyond the scope of this article) for the most popular ones. If you built a topic called:

{Michael_Jackson}

which is equivalent to:

<near/2>(Michael,Jackson)<in>news

and a user searches for:

Michael Jackson<in>news <and> Oprah

you could optimize the query into:

{Michael_Jackson} <and> Oprah
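A substitution like this can be driven by a lookup table that maps popular query fragments to the topics built for them. The Python sketch below is a minimal illustration: the table contents and the function name substitute_topics are hypothetical, and the match is a literal, case-insensitive one, so a real implementation would need to parse the query rather than compare text.

```python
import re

# Hypothetical table: popular fragment as users type it -> prebuilt topic.
TOPICS = {
    "Michael Jackson<in>news": "{Michael_Jackson}",
}


def substitute_topics(query, topics=TOPICS):
    """Replace each known fragment with its topic (literal, case-insensitive)."""
    for fragment, topic in topics.items():
        pattern = re.compile(re.escape(fragment), re.IGNORECASE)
        # A function replacement avoids backslash handling in the topic name.
        query = pattern.sub(lambda match, t=topic: t, query)
    return query


print(substitute_topics("Michael Jackson<in>news <and> Oprah"))
```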

The best way to implement optimizations like the ones shown is to first keep a search history log showing 1) what the user searched for, 2) how long the query took to run, and possibly 3) the number of results returned. You then have the data to find the areas of poor performance and the queries that take too long to run. Search log analysis will also allow you to build synonym lists (which can be used with the Verity thesaurus or "cooked" into the query) for better quality search results. Optimization is a complex issue, and many factors contribute to how well Verity performs in general. These should all be looked at before embarking on a large query optimization project.
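To make the log analysis concrete, here is a small Python sketch that summarizes such a log. It assumes each log entry is a (query, seconds, hit count) tuple, which is an illustrative format rather than anything Verity produces; the function name summarize_search_log and the two-second threshold are likewise hypothetical.

```python
from collections import defaultdict


def summarize_search_log(entries, slow_seconds=2.0):
    """Summarize (query, seconds, hit_count) log entries.

    Returns the queries ordered by popularity, the queries whose average
    run time reaches slow_seconds, and the queries that returned no hits.
    """
    counts = defaultdict(int)
    total_seconds = defaultdict(float)
    no_results = set()
    for query, seconds, hits in entries:
        key = " ".join(query.lower().split())  # normalize case and spacing
        counts[key] += 1
        total_seconds[key] += seconds
        if hits == 0:
            no_results.add(key)
    popular = sorted(counts, key=lambda q: (-counts[q], q))
    slow = sorted(q for q in counts
                  if total_seconds[q] / counts[q] >= slow_seconds)
    return {"popular": popular, "slow": slow,
            "no_results": sorted(no_results)}


log = [
    ("bread", 0.5, 10),
    ("Bread ", 0.7, 12),
    ("Michael Jackson<in>news", 3.0, 200),
    ("grond hog day", 0.2, 0),
]
summary = summarize_search_log(log)
print(summary["popular"][0], summary["slow"], summary["no_results"])
```

Popular queries are candidates for topics, slow ones for optimization, and queries with no results for synonym lists.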

Security

Verity supports common security schemes such as LDAP. But what if you have a field in your collection like subscription? This is a common issue for content providers who publish content on a subscription basis. You wouldn't want a user who only subscribed to "sports" to be able to search "news", "entertainment", or other sections. One way to restrict access is to use query cooking. Using the previous example, if we have a user who only has a subscription to "sports" and they enter a query:

"Mickey Mantle"

you can cook their query to:

"Mickey Mantle" <and> ( sports <in> subscription )

Obviously you would need to look up the person's subscription level before adding the security clause. Again, topics could be used, as could source queries, both of which are beyond the scope of this article.
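The security clause can be appended mechanically once the user's subscriptions are known. The Python sketch below shows the idea under stated assumptions: the function name add_subscription_clause is hypothetical, the subscription lookup is left to the caller, and the user's query is wrapped in parentheses (a small departure from the example above) so that an <or> typed by the user cannot bypass the restriction.

```python
def add_subscription_clause(query, subscriptions, field="subscription"):
    """AND the user's query with the sections the user may search."""
    if not subscriptions:
        # Safer to refuse than to run an unrestricted query.
        raise ValueError("user has no subscriptions")
    allowed = " <or> ".join(
        "( %s <in> %s )" % (name, field) for name in subscriptions)
    if len(subscriptions) > 1:
        allowed = "( %s )" % allowed
    # Parenthesize the user's query so the restriction applies to all of it.
    return "( %s ) <and> %s" % (query, allowed)


print(add_subscription_clause('"Mickey Mantle"', ["sports"]))
```

A production version would also need to validate the user's query first, since unbalanced parentheses in the input could still defeat a purely textual wrapper.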

Wrapping Up

Query cooking can make the user's search experience much more productive by returning the most relevant and accurate results. Query cooking can add security restrictions for specific users. Query cooking can optimize queries for the fastest performance. If any of these are of interest to you, call Raritan Technologies at (908) 668-8181 to learn about our products, which can make query cooking simple and painless.