
Nine Tips for Better Document Titles and Summaries

Last Updated Mar 2009

By: Mark Bennett, New Idea Engineering, Inc. - Issue 8 - April / May 2004

  1. Missing Titles
  2. Many Titles are Identical
  3. The first part of many titles looks the same
  4. Titles with invalid characters
Intermission: a briefing on document summaries
  5. No document summaries provided
  6. Identical document summaries
  7. "Stuffed" Summaries
  8. Spurious HTML tags in Summaries Trash the Rest of the Page.
  9. Poor automated summaries

1. Missing titles

Your search returns a list of documents where the title for each document is just the name of the file.

Likely cause: your search vendor or open source search solution can't extract or parse simple titles.

    Where are titles typically extracted from?
  • HTML Documents: typically these are extracted from the <title> element.
  • E-mails and News Postings: they are often the subject line.
  • For binary document formats, such as Microsoft Word, they are taken from the document "meta data" (File menu / Properties).
  • For plain old text files, which have no predefined title meta field, some engines will simply take the first line or block of text as the title.
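As a rough illustration of the sources listed above, here is a minimal Python sketch of a title extractor that falls back to the file name. The function name extract_title and the kind labels ("html", "email", "text") are our own assumptions for this example and do not come from any particular search engine; binary formats such as Microsoft Word are left out because they need a format-specific filter.

```python
from email import message_from_string
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(kind, text, filename):
    """Return a title for the document, or the file name if none is found.

    `kind` is a hypothetical label for the document type.
    """
    title = ""
    if kind == "html":
        parser = TitleParser()
        parser.feed(text)
        title = "".join(parser.parts)
    elif kind == "email":
        # E-mails and news postings: use the subject line.
        title = message_from_string(text)["Subject"] or ""
    elif kind == "text":
        # Plain text: use the first non-blank line.
        title = next((ln for ln in text.splitlines() if ln.strip()), "")
    title = " ".join(title.split())  # collapse runs of whitespace
    return title or filename
```

An engine that only ever shows file names is effectively always taking the last line of this function.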

Fix: it's time to upgrade.

Likely cause: You have a number of Microsoft Office documents in your collection whose title property was never filled in.

Fix: make sure your content creators are checking document properties and entering a reasonable title before submitting the document.

2. Many Titles are Identical

Your search returns a list of documents where the titles for many of the documents are identical.

Likely cause: Use of templates with a boilerplate title.

Fixes:

  • Talk to your document authors.
  • Use a content "tweaker" as part of your spider process to look for better titles in the document body.
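One possible shape for such a "tweaker" is sketched below in Python: when the extracted title is a known boilerplate string, it looks for the first <h1> heading in the body instead. The BOILERPLATE_TITLES set and the choice of <h1> are assumptions made for this example; your own templates will dictate what to look for.

```python
from html.parser import HTMLParser

# Hypothetical boilerplate titles; collect the real ones from your templates.
BOILERPLATE_TITLES = {"untitled document", "new page 1", "acme corporate template"}


class HeadingParser(HTMLParser):
    """Collects the text of the first <h1> element in a page."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)


def better_title(extracted_title, html_body):
    """Keep a real title; replace a boilerplate one with the first heading."""
    if extracted_title.strip().lower() not in BOILERPLATE_TITLES:
        return extracted_title
    parser = HeadingParser()
    parser.feed(html_body)
    heading = " ".join("".join(parser.parts).split())
    return heading or extracted_title
```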

3. The first part of many titles looks the same

Your search returns a list of documents where the titles for many of the documents appear to be identical. Then you look more closely and see that they merely share the same (long) beginning text. For example:

  • Acme's World Wide Web Customer Service Site Frequently Asked Questions: Igniting Your Rocket Propelled Skates
  • Acme's World Wide Web Customer Service Site Frequently Asked Questions: Avoiding Ferrous Materials when Operating the Super Mega Magnet
  • Acme's World Wide Web Customer Service Site Frequently Asked Questions: Invisibility Paint Safety Precautions

The long title prefixes clutter the results list and obscure the real content.

Likely Cause: This is usually also caused by the use of templates, or perhaps a misguided corporate Look and Feel policy.

Fix: modify your templates or policies to encourage shorter prefixes: in the example cited a prefix of "FAQ" or, at most, "Acme FAQs:" would be preferable.
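If the templates cannot be changed right away, a stopgap is to shorten the shared prefix when the results are displayed. This is a different technique from the fix above, and the sketch below is only one way to do it; the function name, the 20-character threshold, and the replacement parameter are assumptions for this example.

```python
import os


def strip_shared_prefix(titles, min_prefix_len=20, replacement=""):
    """Remove a long prefix shared by every title in a results list.

    Short shared prefixes are left alone so ordinary titles are untouched.
    """
    titles = list(titles)
    if len(titles) < 2:
        return titles
    # commonprefix compares character by character, so it works on any strings.
    prefix = os.path.commonprefix(titles)
    # Back up to the last space so a word is never cut in half.
    prefix = prefix[: prefix.rfind(" ") + 1]
    if len(prefix) < min_prefix_len:
        return titles
    return [replacement + t[len(prefix):] if t[len(prefix):] else t for t in titles]
```

Called on the three Acme titles above with replacement="FAQ: ", this yields titles such as "FAQ: Igniting Your Rocket Propelled Skates".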

Long titles, if they are truly unique and informative, are not necessarily bad. We mentioned in "Omit or Truncate the Display of Long URLs in the Search Results" from "Top 10 Tips for Better Search Results" that long URLs in a results list can push out the right edge of the results table. Long titles typically will not do this, because they should word wrap properly within the table. The exception, and a potential problem, is when the titles are bracketed by <nobr> tags: they can then break the formatting of your search results tables. In that case we suggest either removing the <nobr> tags from around the title or truncating the title at some preset limit.
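Truncating at a preset limit can be as simple as the Python sketch below. The 80-character default and the trailing "..." are arbitrary choices for illustration, and the function assumes the limit is larger than three characters.

```python
def truncate_title(title, limit=80):
    """Shorten a title to at most `limit` characters, ending on a word boundary."""
    if len(title) <= limit:
        return title
    budget = limit - 3  # leave room for the trailing "..."
    cut = title.rfind(" ", 0, budget + 1)
    if cut <= 0:
        cut = budget  # no space found: fall back to a hard cut
    return title[:cut].rstrip() + "..."
```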

4. Titles with invalid characters

The titles returned for documents have one or two gibberish characters.

Likely cause: This is often caused by a malfunctioning document filter, bad meta data, or character set encoding issues.

Fixes:

  • Adjust your spider or talk to your search vendor.
  • Put a tweak in your results templates that does a sanity check on title. If it's too short (e.g. less than 5 characters in length) assume that it's invalid and default to a secondary title or perhaps the file name. Using the filename is not great but less ugly than using a handful of control characters.
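The sanity check described above might look like the following Python sketch. The five-character threshold comes from the text; the extra test for control characters and the Unicode replacement character is our own assumption about what "gibberish" usually looks like, and the function names are hypothetical.

```python
import unicodedata


def is_plausible_title(title, min_length=5):
    """Reject titles that are too short or contain control characters."""
    cleaned = " ".join(title.split())  # normalize whitespace first
    if len(cleaned) < min_length:
        return False
    # Category "Cc" is control characters; U+FFFD is what decoders emit
    # in place of bytes they could not decode.
    return not any(
        unicodedata.category(ch) == "Cc" or ch == "\ufffd" for ch in cleaned
    )


def display_title(title, secondary_title, filename):
    """Pick the first plausible title, defaulting to the file name."""
    for candidate in (title, secondary_title):
        if candidate and is_plausible_title(candidate):
            return " ".join(candidate.split())
    return filename
```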

Likely cause: for HTML documents, this can be caused by HTML elements being included in the title.

Fix: Remove extraneous tags.


Intermission: a Briefing on Document Summaries

Next we turn our attention to document summaries. There are three common types of summaries offered by search engines. Don't fret too much about which of these methods your engine uses - any of these methods is typically better than nothing at all. Some engines offer choices - make sure to read up on what you have available and test the various settings.
  1. Explicit summaries
    Summary is specifically stated by the document author. In HTML, this is the description meta tag in the <head> section of the document. For binary documents such as Microsoft Word, this is set as a "document property", often under the File menu. A benefit of this type of summary, when used properly, is that content authors have the control needed to present a precise document summary, which may prove more useful than the automatically derived summaries (described below). However, this does take consistency and discipline to fully realize the benefits.
  2. Index-time derived summaries
    A block of text is extracted from the document when it is being indexed, and used as the summary. When this document matches a search, it will always be displayed with the same summary, regardless of what the search terms were. For HTML documents, most engines will prefer to use the fixed meta tag summary, if it's present, and will only resort to this as a fallback. Vendors often have settings for how much text to extract, measured in characters, words, or sentences. As for which text from the document to use, some engines take text near the top of the document, presuming it's likely to be relevant; this can cause problems in HTML pages if the top (or left edge) of the document contains lots of navigation links that the engine mistakes for pertinent text. A symptom of this is lots of summaries with "Home | Products | Services | About Us ..." type text; if you see it, investigate tuning your settings. Some vendors support embedded tags that help demarcate central document content from extraneous content within each page. Other engines try to extract statistically "important" sections of the document for the summary, where "statistically important" is determined by each vendor's algorithms. Some vendors also allow adjustments to this selection by letting the administrator specify words that are known NOT to be of interest, such as terms that appear in navigational parts of the page.
  3. Search-time derived summaries
    These are the fancy summaries some engines have that show the part of the document containing the key search terms; the document will have a different summary each time it is in the results list, depending on the search terms in each search, and some engines even highlight the actual terms in bold. This is certainly the "sexiest" type of summary. Overall, if this is working well, then you should consider using it.
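To make the first two types concrete, here is a minimal Python sketch of an index-time summary that prefers the explicit description meta tag and otherwise falls back to the first few words of the page. It is an illustration only, not how any particular engine works: the 30-word default and the decision to skip <nav>, <script>, <style> and <title> content are assumptions for this example.

```python
from html.parser import HTMLParser


class SummarySource(HTMLParser):
    """Collects the description meta tag and the visible page text."""

    SKIP = {"script", "style", "title", "nav"}  # assumed non-content elements

    def __init__(self):
        super().__init__()
        self.description = ""
        self.text_parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.text_parts.append(data)


def index_time_summary(html, max_words=30):
    """Explicit summary if the author supplied one, else the leading text."""
    parser = SummarySource()
    parser.feed(html)
    if parser.description:
        return parser.description
    words = " ".join(parser.text_parts).split()
    summary = " ".join(words[:max_words])
    return summary + ("..." if len(words) > max_words else "")
```

Without the skip list, a page whose body starts with a navigation bar would produce exactly the "Home | Products | Services" style of summary described above.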

We now continue with what can go wrong with summaries and how to fix it.


5. No document summaries provided

This has become rarer, but some search engines do not provide summaries: instead they provide a long list of clickable URLs. A few power users might like this, so they can see 50 results on their screen at once, but mere mortals like to see summaries.

Likely cause: Document summaries not turned on.

Fix: Turn them on. If they are not available consider upgrading / changing search engines.

6. Identical document summaries

Likely cause: This is also usually caused by the use of boilerplate templates.

Fix: revisit your templates.

7. "Stuffed" Summaries

In the old days of the Internet, back in the late 1990s, webmasters tried stuffing their summaries with key words to boost their ranking on Internet portals. Portal indexing spiders got wise to this a long time ago, so this practice is essentially futile (in terms of portal rankings). But, if you still have these summaries in place, you probably will succeed in confusing your own enterprise search engine, and provide very poor summaries in your results. If you've got any of this cruft still hanging around, get rid of it.

8. Spurious HTML tags in Summaries Trash the Rest of the Page.

Example: A summary includes an opening <b> tag for bolding, but does not include the closing </b> tag.

Result: the rest of the results are bold too! This can be particularly disturbing if it pushes the "Next Page" link off of the visible page.

A more serious example is when a summary contains an opening table-related element such as <table> or <td>, but does not contain the closing element. This may even prevent the rest of the page from displaying.

Fixes:

  • If the vendor supports it, ask for a summary that does not include HTML tags, but that DOES include entities. Fortunately this seems to be the default for many modern search engines.
  • If the search vendor does not offer this, you can still write a script to remove the tags, based on searching for < and >.
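A script of that kind can be quite short. The Python sketch below removes anything that looks like a tag, including a tag that was cut off at the end of the summary, while leaving entities such as &amp; untouched. The regular expressions are a deliberately simple assumption: they treat "<" followed by a letter as the start of a tag, which is good enough for summaries but is not a full HTML parser.

```python
import re

# A complete tag: "<" or "</", a letter, then anything up to the next ">".
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
# A tag that was opened but truncated before its closing ">".
TRUNCATED_TAG_RE = re.compile(r"</?[A-Za-z][^<>]*$")


def strip_tags(summary):
    """Remove HTML tags from a summary but keep entities like &amp; intact."""
    without_tags = TAG_RE.sub("", summary)
    return TRUNCATED_TAG_RE.sub("", without_tags).strip()
```

The same function can be applied to titles that contain stray HTML elements.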

9. Poor automated summaries

As we mentioned in our briefing on document summaries above, fancy search engines offer a dynamic, keyword-specific summary.

Here are some things to look out for if you use this type of feature:

  • Some vendors don't include enough surrounding terms to give context, so your summaries wind up with strings of ellipses, gaps, and lots of small text fragments - this is not particularly useful in conveying context and is visually unattractive. See if your engine has adjustments for how much text to include.
  • A better option is for the engine to display complete sentences with the term highlighted.
  • Be careful not to let summaries get too long: limit them to two to four sentences.
  • Performance issues: some engines slow down noticeably because they re-fetch and re-analyze a document EACH TIME it is displayed in a results list. They do this in order to calculate an optimal summary, but on some systems this can slow everything down.
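For readers building their own results page, the sketch below shows one way to produce the kind of summary recommended above: whole sentences that contain a search term, with the term in bold, capped at a few sentences. It is a simplified Python illustration under our own assumptions (a naive sentence splitter, whole-word matching, a default of three sentences) and not a description of any vendor's algorithm.

```python
import re
from html import escape


def _highlight(sentence, pattern):
    """Escape the sentence for HTML and wrap each match in <b>...</b>."""
    out, pos = [], 0
    for m in pattern.finditer(sentence):
        out.append(escape(sentence[pos:m.start()]))
        out.append("<b>" + escape(m.group(0)) + "</b>")
        pos = m.end()
    out.append(escape(sentence[pos:]))
    return "".join(out)


def search_time_summary(text, terms, max_sentences=3):
    """Build a summary from complete sentences that mention a search term."""
    # Naive splitter: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    terms = [t for t in terms if t]
    if not terms:
        return " ".join(escape(s) for s in sentences[:max_sentences])
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    hits = [s for s in sentences if pattern.search(s)][:max_sentences]
    if not hits:
        hits = sentences[:max_sentences]  # no match: use the opening sentences
    return " ".join(_highlight(s, pattern) for s in hits)
```

Because this works from text that is already in hand, it also avoids re-fetching the document each time, though whether your engine stores enough text to do so is something to check with your vendor.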