Vocabulary: Meta Data: Extra attributes of your document, beyond just its raw text.
Examples include Title, Last Modified Date, Author, etc. In HTML content, these are often defined
with the <META> tag in the header of web pages.
Basic Meta Data
Basic meta data is common to almost all documents, and includes basic items like document title,
last modified date, and document author. In some systems, the document summary is also a meta data field.
Here are some things to keep in mind:
Are your Dates Correct?
Quick Check
Do a few simple searches with your search engine and look at the results list.
Assuming dates are displayed in your results, look at them closely:
Signs of trouble:
Blank dates
All have today’s date or other recent date
The first one, blank dates, is pretty easy to spot. The second problem, which is far more common,
often goes unnoticed. If all the dates on your site are very similar, and many
pages have the same date, you probably have a misconfiguration in your system somewhere; is it really
likely that 90+% of your web site was updated yesterday!? By default many web servers return the current
date and time in the HTTP “Last-Modified” header field; many search engine spiders take this field to be
the last modified date of that document. Thus, the dates of the documents incorrectly reflect the day the
spider was last run.
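A minimal sketch of how this symptom could be detected for a single page: compare the Last-Modified value the server sent with the moment the page was fetched. The function name and the five-minute tolerance are assumptions made for illustration; how you obtain the header and the fetch time depends on your spider.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def looks_like_crawl_time(last_modified, fetched_at, tolerance=timedelta(minutes=5)):
    """Return True when a Last-Modified value is missing, unparseable,
    or within `tolerance` of the moment the page was fetched."""
    if not last_modified:
        return True
    try:
        modified = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return True  # not a valid HTTP date
    if modified.tzinfo is None:
        # Assumption: treat a date without a usable zone as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return abs(fetched_at - modified) <= tolerance


fetched = datetime(2024, 5, 2, 10, 0, tzinfo=timezone.utc)
print(looks_like_crawl_time("Thu, 02 May 2024 09:59:30 GMT", fetched))  # True
print(looks_like_crawl_time("Tue, 15 Nov 1994 12:45:26 GMT", fetched))  # False
```

If most pages in a sample come back True, the server is probably stamping pages with the request time rather than the real modification time.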
Fixing this is heavily dependent on your web server, the application itself, and the search engine you are using.
One possible solution, if you can’t fix the HTTP header dates, is to have your application include a
<META> "http-equiv" tag with the correct date – some search engines will pick this up and
use it in place of the HTTP header date.
You can also include a time component in the Last-Modified field, if your application requires that level of precision.
Thorough Audit
As with other aspects of data quality, a thorough audit may be in order. A general strategy is to dump
all dates to a file or database. The procedure for getting your search engine to export its meta data to
a file or database varies by vendor. You might look for an export utility, or use the vendor’s API, or
consider a 3rd party application like NIE’s SearchTrack.
You can do a programmatic scan for quite a few statistics; and since dates can be treated as numbers,
some mathematics can be applied:
Documents with blank dates
Min date / max date / average date
Calendar day with the least and most documents
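The statistics above can be sketched in a few lines once the dates are exported. This example assumes the export has already been loaded into a list of timezone-aware datetime objects, with None standing in for a blank date; the function and key names are illustrative only.

```python
from collections import Counter
from datetime import datetime, timezone


def summarize_dates(dates):
    """Basic statistics for a list of aware datetimes (None = blank date)."""
    present = [d for d in dates if d is not None]
    summary = {"total": len(dates), "blank": len(dates) - len(present)}
    if not present:
        return summary
    # Dates can be treated as numbers: average the POSIX timestamps.
    timestamps = [d.timestamp() for d in present]
    mean_ts = sum(timestamps) / len(timestamps)
    per_day = Counter(d.date() for d in present)
    summary["min"] = min(present)
    summary["max"] = max(present)
    summary["average"] = datetime.fromtimestamp(mean_ts, tz=timezone.utc)
    summary["busiest_day"] = per_day.most_common(1)[0]            # (day, count)
    summary["quietest_day"] = min(per_day.items(), key=lambda kv: kv[1])
    return summary


sample = [
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 4, tzinfo=timezone.utc),
    None,
]
for key, value in summarize_dates(sample).items():
    print(key, value)
```

A max date equal to the last crawl day, or a busiest day that holds most of the collection, points back to the misconfiguration described earlier.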
Going further, you can create a histogram graph of dates; for each date, draw a bar showing how many pages have that date,
sorted by most recent date first.
Good: Graph looks reasonably smooth
Suspect: Graph has 1 or 2 giant spikes or “stair-steps”. Do these correspond to major site updates?
Even if your site had a major overhaul, the content probably didn’t all change. Simple page changes such as
navigation bar changes, layout changes, and even rotating ads can fool some systems into thinking that the
content itself was changed. When you see a spike in this graph related to site changes, ask yourself whether this
was really a content change or not; if the content didn’t really change, then neither should the dates have,
so you may have a configuration issue.
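As a rough sketch, the date histogram and a simple spike check might look like the following. It assumes one calendar date per document; the 50% share used to call a day a spike is an arbitrary starting point, not a recommended value.

```python
from collections import Counter
from datetime import date


def date_histogram(days):
    """Count documents per calendar day, most recent day first."""
    return sorted(Counter(days).items(), reverse=True)


def find_spikes(histogram, share=0.5):
    """Days holding at least `share` of all documents (threshold is a guess)."""
    total = sum(count for _, count in histogram)
    return [(day, count) for day, count in histogram
            if total and count / total >= share]


def print_histogram(histogram, width=40):
    """Plain-text bar chart, one row per day."""
    peak = max((count for _, count in histogram), default=0)
    for day, count in histogram:
        bar = "#" * max(1, round(width * count / peak))
        print(f"{day}  {bar} {count}")


days = [date(2024, 3, 1)] * 9 + [date(2024, 2, 1)]
hist = date_histogram(days)
print_histogram(hist)
print("spikes:", find_spikes(hist))
```

Any day the spike check reports is worth comparing against your record of site changes before trusting the dates.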
How about your Titles?
As with dates, you can do a quick spot check and visually scan the titles in a few results lists.
Some easy things to spot:
No titles
URL or file names as titles (some engines default to this if no real title is found)
Overly short titles such as “info”, “FAQ”, etc.
Titles that are too long
Duplicate titles
Gibberish / control characters in titles
For a thorough audit consider using a script. Don’t forget to normalize titles before doing your analysis.
Trim spaces
Force empty strings to match nulls (both are bad for the user experience)
Lower case (for duplicate detection)
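One way to apply these three normalization steps, shown as a small sketch. The function name is made up for this example, and collapsing internal runs of whitespace goes slightly beyond the list above; drop that step if it does not suit your data.

```python
def normalize_title(title):
    """Trim, collapse whitespace, lower-case; return None for empty or missing titles."""
    if title is None:
        return None
    cleaned = " ".join(title.split()).lower()
    return cleaned or None  # empty string and None are treated the same


print(normalize_title("  Acme   FAQ "))  # acme faq
print(normalize_title("   "))            # None
```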
An automated script can certainly spot some problems automatically and definitively.
Other data issues may or may not be "bad", depending on the context, and will require
some human judgment; however, an audit program can at least flag suspect data and bring it to
your attention for review.
Signs of trouble:
Red flag: How many are null or empty?
Red flag: How many are < 5 characters? (probably invalid)
Red flag: How many non-null duplicate titles do you have?
Yellow flag: strings with 8 bit characters, if you know your site should only have English content
Yellow flag: strings with HTML characters <, > and &
Yellow flag: How many are < 15 characters? (or some limit that you pick)
Yellow flag: How many are > 80 characters? (or some limit that you pick)
Yellow flag: Titles with long common prefixes:
This last item, long common prefixes, needs a bit more explanation. Some companies have practices
where the full name of the company is prepended to every title (perhaps to aid in Google search results).
Or perhaps each page’s title includes the lengthy name of the section of the site the page is on
(perhaps trying to help visitors recognize which page they really want). But taken to the extreme,
this can produce really bad titles.
Example:
Acme Online Customer Support: Frequently Asked Question: Installing the MA555 Module
Acme Online Customer Support: Frequently Asked Question: Calibrating the GS491 Meter
…etc…
As you can see, the first 50+ characters are always the same, pushing the real title off to the side.
Perhaps a shorter prefix for these pages would be more useful, such as:
Acme FAQ: Installing the MA555 Module
Acme FAQ: Calibrating the GS491 Meter
…etc…
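The red and yellow flags listed earlier, including the common-prefix check, could be counted with a script along these lines. It assumes titles are plain strings (or None) that have already been normalized; the key names are invented for this sketch, and the length limits simply mirror the ones suggested above.

```python
import os
from collections import Counter


def audit_titles(titles, short_red=5, short_yellow=15, long_yellow=80):
    """Count suspect titles. `titles` is a list of str or None."""
    present = [t for t in titles if t]
    counts = Counter(present)
    return {
        "red_empty": len(titles) - len(present),
        "red_very_short": sum(1 for t in present if len(t) < short_red),
        "red_duplicates": sum(n for n in counts.values() if n > 1),
        "yellow_non_ascii": sum(1 for t in present if not t.isascii()),
        "yellow_html_chars": sum(1 for t in present if any(c in t for c in "<>&")),
        "yellow_short": sum(1 for t in present if len(t) < short_yellow),
        "yellow_long": sum(1 for t in present if len(t) > long_yellow),
    }


def common_prefix_length(titles):
    """Length of the prefix shared by every non-empty title."""
    present = [t for t in titles if t]
    # os.path.commonprefix compares character by character, not by path segment.
    return len(os.path.commonprefix(present)) if len(present) > 1 else 0


faq_titles = [
    "Acme Online Customer Support: Frequently Asked Question: Installing the MA555 Module",
    "Acme Online Customer Support: Frequently Asked Question: Calibrating the GS491 Meter",
]
print(audit_titles(faq_titles))
print(common_prefix_length(faq_titles))
```

Running the prefix check per site section, rather than over the whole index, is more likely to surface prefixes like the one in the example.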
Application Specific Meta Data
Vocabulary: Vertical Application (repeated from Part I): In this context, a highly specialized search
application, which may be more complex than a “generic” web search application. Examples would include
a pharmaceutical research database, legal evidence management and discovery, a corporate or technical
documentation library, or managing regulatory and compliance documents.
Many specialized applications have Meta data above and beyond the standard "title / date / author" variety.
These vertical applications often contain fields used in reporting and processing, but these same fields
can also be used in searches. Search engines must be configured to look for and record these additional attributes.
Some examples:
A Tech Support Call Tracking system might also include meta fields such as "customer ID",
"customer address", "product name", "product version", "maintenance contract", etc.
A legal documents database would likely include client information, case information, etc.
One way to get this data into search engines is for the main application to generate HTML content
that includes <META> tags for each specific field. Many search engines readily understand this
type of data. With other search engine setups, it may be necessary to configure the
"database gateway" to include these additional fields.
Meta data can be subject to both spot check and automated data quality tests.
Things you can look at with Meta Data:
Compare the complete list of meta data items that SHOULD be there to what is actually being
indexed by your search engine
Then for each of these fields:
Looking at all the documents, are any fields missing? Is this OK for this particular field? (some fields may not be required)
Are ALL the field values missing? This is certainly suspicious.
How many unique values does this field have? For a value-constrained field such as “author” or “color”, this should be much smaller than the total number of records. Sometimes this number is larger than it should be because the same data is entered in more than one form, such as “Firstname Lastname” vs “Lastname, Firstname” – you may want to normalize this data during input, or when it is exported to the search engine.
For fields that should be unique, are there any duplicates?
For “typed” fields, run them through a validator. For example, for dates, have a script try to parse all non-blank dates.
Perhaps visually scan the list of unique values for each field.
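The per-field checks above lend themselves to a small script. This sketch assumes each indexed document has been exported as a dictionary of string values; the field names, the date format, and the function names are all hypothetical and would need to match your own export.

```python
from collections import Counter
from datetime import datetime


def missing_fields(expected, records):
    """Fields that SHOULD be indexed but never appear in any record."""
    seen = set()
    for record in records:
        seen.update(record.keys())
    return sorted(set(expected) - seen)


def audit_field(records, field, unique=False, date_format=None):
    """Missing count, unique-value count, duplicates and parse failures for one field."""
    values = [(r.get(field) or "").strip() for r in records]
    present = [v for v in values if v]
    counts = Counter(present)
    report = {
        "missing": len(values) - len(present),
        "all_missing": bool(values) and not present,
        "unique_values": len(counts),
    }
    if unique:  # the field is supposed to hold a distinct value per record
        report["duplicates"] = sorted(v for v, n in counts.items() if n > 1)
    if date_format:  # "typed" field: try to parse every non-blank value
        bad = []
        for value in present:
            try:
                datetime.strptime(value, date_format)
            except ValueError:
                bad.append(value)
        report["unparseable"] = bad
    return report


records = [
    {"id": "1", "author": "Jane Doe", "modified": "2024-01-05"},
    {"id": "2", "author": "Doe, Jane", "modified": "05/01/2024"},
    {"id": "2", "author": "", "modified": ""},
]
print(missing_fields(["id", "author", "modified", "product"], records))
print(audit_field(records, "author"))
print(audit_field(records, "id", unique=True))
print(audit_field(records, "modified", date_format="%Y-%m-%d"))
```

Note that the two author spellings count as two unique values here, which is exactly the kind of inflation that normalizing names would remove.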
A majority of this search engine data quality series is focused on "data prep",
making sure that your search engine has an accurate and up to date representation of your data.
However, there are a few aspects of data quality that come into play only after data prep,
when a user actually runs a search and looks at results. This type of information can be found by
looking at your search results, and also looking at your search activity logs.
Looking at Search Activity Logs
Search logs are a miraculous resource for understanding not only your search engine, but also your visitors.
Note, we’re not talking about "click-tracking" – you should be looking at reports that directly show the
searches your users typed in, and what they got back for results; old-fashioned click-tracking required
too much guesswork to interpret – search logs show you exactly what visitors were thinking.
Some basic things to look at in search activity reports:
What are the top 100 searches on your site?
Do they all bring back results?
What are the top 100 searches that return no results?
What searches return > 10% of your content?
Who are the most frequent visitors, and what are they searching for?
For each search, which document is most frequently clicked on?
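Several of these reports can be approximated with a short script if your search engine's log can be reduced to one (query, result count) pair per search. That log shape, the function name, and the key names are assumptions; real log formats vary by vendor.

```python
from collections import Counter


def summarize_log(entries, total_docs, top_n=100, broad_share=0.10):
    """entries: iterable of (query, result_count) pairs, one per search."""
    searches = Counter()
    zero_hits = Counter()
    too_broad = set()
    for query, result_count in entries:
        q = " ".join(query.lower().split())  # fold case and stray spaces
        searches[q] += 1
        if result_count == 0:
            zero_hits[q] += 1
        elif result_count > broad_share * total_docs:
            too_broad.add(q)
    return {
        "top_searches": searches.most_common(top_n),
        "top_zero_result": zero_hits.most_common(top_n),
        "too_broad": sorted(too_broad),
    }


log = [("Pricing", 12), ("pricing ", 12), ("ma555 manual", 0), ("the", 900)]
report = summarize_log(log, total_docs=1000)
for name, rows in report.items():
    print(name, rows)
```

The per-visitor and per-click questions need a user identifier and click data in the log, which this sketch does not assume.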
By looking at search logs, and taking appropriate actions based on your discoveries, you can:
Suggest more pertinent pages or adjust document ranking
Suggest related documents or vocabulary
Improve your content
Improve site navigation
Are Results Relevant?
Do a spot check by running your site’s top 10 searches and look at the results:
Do the documents returned seem relevant?
Can you think of better documents to return?
For a more thorough audit, create bar graphs showing the Relevancy Histogram of the first
100 documents for each of the 10 searches. For each segment in the score (for example,
100% down to 95%), the bar represents how many documents were in that range.
Normal: Is there a curve that slopes down as you go to the right?
Normal: Does the curve flatten out as you go to the right, and get steep near the left?
Is the curve smooth, or does it have big "steps"?
Big "steps" indicate a scoring algorithm that might benefit from some tuning.
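A sketch of the relevancy histogram for one search, assuming the engine reports each result's score as a percentage from 0 to 100. The 5-point segments follow the example above; the "biggest drop" helper is just one possible way to put a number on a step, not an established measure.

```python
def score_histogram(scores, bucket=5):
    """Count scores per segment; index 0 is the top segment (just below 100 up to 100)."""
    n_buckets = 100 // bucket
    counts = [0] * n_buckets
    for score in scores:
        index = int((100 - score) // bucket)
        index = max(0, min(index, n_buckets - 1))  # clamp out-of-range scores
        counts[index] += 1
    return counts


def biggest_drop(scores):
    """Largest gap between consecutive results when ranked by score."""
    ranked = sorted(scores, reverse=True)
    return max((a - b for a, b in zip(ranked, ranked[1:])), default=0)


scores = [100, 98, 97, 60, 59, 58]
for i, count in enumerate(score_histogram(scores)):
    if count:
        print(f"{100 - 5 * i}% down to {95 - 5 * i}%: {'#' * count}")
print("biggest drop:", biggest_drop(scores))
```

In this made-up sample the scores fall from 97 straight to 60, the kind of step that would justify a closer look at how the engine scores results.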
Are users clicking on the documents that are returned by searches?
Or are they instead, for example, clicking on the 4th or 5th document down for some searches? Users will show, through their actions, which searches are “working” and which are not.
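If your logs record which result position was clicked, a quick sketch like this can list the searches where users typically skip the top results. The (query, clicked position) log shape and the threshold of 4 are assumptions chosen to match the example above.

```python
from collections import defaultdict
from statistics import median


def queries_with_deep_clicks(clicks, threshold=4):
    """clicks: iterable of (query, clicked_position), positions starting at 1.
    Returns {query: median position} for queries whose median click is at or
    below the `threshold`-th result."""
    positions = defaultdict(list)
    for query, position in clicks:
        positions[query].append(position)
    medians = {q: median(p) for q, p in positions.items()}
    return {q: m for q, m in medians.items() if m >= threshold}


clicks = [
    ("returns", 1), ("returns", 1), ("returns", 2),
    ("warranty", 4), ("warranty", 5), ("warranty", 4),
]
print(queries_with_deep_clicks(clicks))  # {'warranty': 4}
```

Searches that show up here are candidates for ranking adjustments or suggested pages.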