« NIE Newsletter

Where did my Verity K2 status messages go after upgrading to K2 6.1? - Ask Dr. Search

Last Updated Mar 2009

By: Mark Bennett, Volume 4 Number 4 - August 2007

A customer writes:

We search customer support problem reports with Verity K2. When we recently upgraded to K2 6.1 we could no longer find some of the status messages in the call logs. These messages are often hexadecimal numbers of the form 0xFF0305. What's up?

Dr Search replies:

This doesn't look too hard, maybe I'll even get home early today! But before I jump into the answer, we'll need to define one term:

Vocabulary: tokenization
Broadly speaking, in search engines this means breaking up text into words. This sounds simple enough: just break on spaces. But what do you do with words that have punctuation in them, or that mix letters and numbers? As an example, would you treat "after-hours" as one word or two? And how would you break up these strings into words: john@acme.com, AT&T, SLEEP_TIMEOUT, X.25 or OS/2?
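To make the question concrete, here is a minimal Python sketch that runs those example strings through two toy tokenizers. Neither one is K2's actual algorithm; the function names and the rules they apply are assumptions chosen purely for illustration.

```python
import re

# The example strings from the vocabulary note above.
SAMPLES = ["after-hours", "john@acme.com", "AT&T", "SLEEP_TIMEOUT", "X.25", "OS/2"]


def split_on_whitespace(text):
    """Toy rule 1 (hypothetical): only spaces separate words."""
    return text.split()


def split_on_punctuation(text):
    """Toy rule 2 (hypothetical): any character that is not an ASCII
    letter or digit separates words."""
    return re.findall(r"[A-Za-z0-9]+", text)


for sample in SAMPLES:
    print(sample, split_on_whitespace(sample), split_on_punctuation(sample))
```

Under the first rule every sample stays a single word; under the second, "AT&T" becomes "AT" and "T", and "X.25" becomes "X" and "25". Which result is "right" depends on what your users type into the search box.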

Each search engine has its own rules for how it will break up strings into words, and how it handles punctuation. The more sophisticated engines even let the administrator configure these policies. When K2 went from V5.x to V6.x, the old configuration file STYLE.LEX was superseded; locales and character set settings became much more important in determining what defined a "word".

To make matters worse, K2 6.1.1 didn't quite work as expected: words that started with a numeric character and had embedded alpha characters were parsed as two distinct words. In your example, "0xFF0305" was parsed as two words: "0" and "xFF0305". (Words that started with alpha characters and contained embedded numbers were parsed properly.)
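The splitting behavior described above can be imitated in a few lines of Python. This is only a sketch of the symptom, under the assumption that the split happens where the leading digits end; it is not K2's code, and the function name is made up for this example.

```python
import re

# Leading digits, followed by a remainder that starts with a letter.
_DIGITS_THEN_ALPHA = re.compile(r"^(\d+)([A-Za-z][A-Za-z0-9]*)$")


def mimic_split(word):
    """Hypothetical imitation of the symptom: a word that starts with
    digits and then contains letters comes back as two words."""
    match = _DIGITS_THEN_ALPHA.match(word)
    if match:
        return [match.group(1), match.group(2)]
    # Anything else, including letter-first words, is left whole.
    return [word]


print(mimic_split("0xFF0305"))  # ['0', 'xFF0305']
print(mimic_split("xFF0305"))   # ['xFF0305']
```

If the index holds "0" and "xFF0305" as separate words, it is easy to see how a search for the full status code could come up empty.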

This was fixed in a later version of K2, "K2 Version 6.1.2 Service Pack 1". You must also set two fields in the Collection Properties section of the Dashboard for this to work (or do the same in your rcadmin scripts). The Locale field has to be set specifically to "uni/en" (universal filters/English), and the CharSet field has to be set to UTF8 (an encoding of Unicode). After taking these steps, you should be able to search on your status codes, or on any other word that mixes numbers and letters.

Not so bad... wonder if I can still beat the traffic.