Ask Doctor Search: What's the Role of
"connectors" in Enterprise Search?
Volume 4 Number 2 - April 2007
This month a concerned reader contacted Dr. Search with the following email:
I am trying to get a better understanding on how the
different enterprise search products fit within an
architecture. The most obvious is GSA, in that it
is simply a box that contains a central index of
all searchable content within the enterprise.
That index is built by either crawling or metadata
tagging. Is there a set time that the index is
refreshed or is that customizable?
Now, Autonomy and FAST seem to be a bit different,
as they each have a concept of "connectors" that
are specific to a particular data type. Each connector
"pushes" data to the central index that aggregates
the results and makes them available for user queries.
Does that "push" process occur dynamically when the
user search is initiated? Or does the user search
the central index that is populated by each
individual connector? If that is the case, can
you customize how often the central index is refreshed?
Dr Search replies:
That's a great question! Actually it's three questions,
so I guess no ducking out of the office early today!
- Some vendors talk about "connectors" and others do not.
Is there some big advantage to using one?
- How often do search engines go back to look for new content?
- Does the use of connectors change this?
Let's tackle question 2 first:
Most modern web spiders have similar techniques for dealing with
web content; they differ a bit in which pages they revisit and
when, but these are details. Some of them attempt to use
"incremental" indexing for web content
by using a more elaborate revisit schedule.
Most spiders will use one or more of these revisit techniques:
- Revisit all the pages and reindex everything.
This is the "brute force" way.
- Revisit all the pages, but carefully check to see if they
have changed. Although the same amount of content is downloaded,
the system saves processing time by not reindexing pages that have
not changed. Some vendors call this "incremental indexing".
- Query the web server about each page, to see if it has changed.
Older spiders did this with the HTTP "HEAD" request;
if the page did change, then they re-request it from the server
in a second transaction.
Newer spiders use a conditional request (the HTTP "If-Modified-Since"
header) that effectively says
"give me this page again if it has changed since
this date", and the spider provides the last date it
fetched the page. If nothing has changed, the server sends back a short
"not modified" reply instead of the page.
- Re-download or re-check each page on its own schedule.
The system maintains a database of guesses for when
it believes each page is likely to change,
and then adjusts future guesses for each page when it finds out
how far off it was.
This is what some vendors call "incremental" indexing.
Notice that if a page had been unchanged for many spider visits,
and then is suddenly changed, the spider may take a long time
to notice.
- Re-download or re-check certain parts of the web site
more often than others. For example, check the home page
every hour, but only redo the archives once a week.
- Have a special way for the web server to tell the spider
what has changed. Ultraseek does this with the sitelist.txt
file, and modern blogs do this with "blog pings".
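The "give me this page again if it has changed since this date" request from the list above corresponds to an HTTP conditional GET. Here is a minimal Python sketch, assuming a spider that remembers when it last fetched each page. The function names and the opener parameter are illustrative only, and are not taken from any vendor's spider.

```python
from email.utils import formatdate
import urllib.error
import urllib.request


def build_conditional_request(url, last_fetch_epoch):
    """Build a GET that asks for the page only if it changed after last_fetch_epoch."""
    # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    headers = {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(url, last_fetch_epoch, opener=urllib.request.urlopen):
    """Return the page body if it changed, or None on a 304 Not Modified reply."""
    request = build_conditional_request(url, last_fetch_epoch)
    try:
        with opener(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # the server says our copy is still current
            return None
        raise
```

A spider would call fetch_if_changed with the time of its previous visit and reindex only when it gets a body back. The server must report modification dates honestly for this to work, which is one of the limitations of the approach.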
The administrator can usually adjust how often
a spider revisits pages, though different spiders offer different
levels of control.
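The "database of guesses" technique, where each page gets its own revisit schedule, could be sketched as follows. The halve-or-double rule and the one hour to one week limits are assumptions chosen for illustration, not any vendor's actual algorithm.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=168.0):
    """Guess how long to wait before revisiting one page.

    If the page changed since the last visit, we waited too long, so halve
    the interval. If it did not change, we came back too soon, so double it.
    The result is kept between min_hours and max_hours.
    """
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(min_hours, min(max_hours, proposed))


def simulate(start_hours, change_history):
    """Return the interval chosen after each visit in change_history."""
    intervals = []
    interval = start_hours
    for changed in change_history:
        interval = next_interval(interval, changed)
        intervals.append(interval)
    return intervals
```

Running simulate(24, [False, False, False, False]) shows the weakness noted in the list: after a few unchanged visits the interval climbs to the one week ceiling, so a sudden edit to a long-quiet page can sit unnoticed for days.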
Most of these methods have some serious limitations.
Also notice that different vendors use the word "incremental"
to refer to different techniques. If incremental spidering
is important to your application, make sure each vendor that offers it
clearly explains what they mean.
Now for questions 1 and 3, what's up with these
"Connectors" anyway?
A more fundamental difference, as you suggest, is the "connector"
architecture that some vendors offer. Generally speaking,
native repository connectors offer better integration than
generic "web crawling".
Let's assume I have a CMS (Content Management System) application,
which I'll call "XYZ Super CMS".
There are at least 8 ways I could theoretically search that data!
- Continue to just use the web spider, pretending to be a web browser user.
The spider will see and index the same HTML content
that a user would see.
- Use the proper XYZ connector from the search vendor.
This keeps the search engine index up to date all the time.
- Go after the tables with the generic "database" connector,
since most CMS systems use a database as the back end.
- Export a "dump" of XYZ into XML or some text format,
then have the search engine index those exported files.
- Import the data from the CMS into a search enabled database
such as Oracle.
- Write a custom connector with the search vendor's API to
"inject" data.
- Don't index. Just use "Federated Search".
Use the CMS's built in search capability, and then
just combine those search results with all the other search
results.
- Try to cheat and use the CMS's search engine index directly.
For example, if the CMS's built-in search uses K2, and you also use
K2, then try to attach to the CMS's K2 collection and search it
along with your other mounted collections.
Phew! So many choices!
There are limitations and tradeoffs for all of these methods.
I'll summarize the pros and cons here, but if anybody needs more
details on a particular method, please drop us a line.
Method 1 misses a lot of metadata, including document-level security.
Also, method 1 must re-poll to check if a document has changed.
Method 2, on the other hand, will have access to all the data,
and also know precisely when a document has changed.
And yes, as you suggest, most connectors know precisely
when content has changed, and will immediately
send that change to the search engine.
Technically speaking, some connectors still use "polling"
rather than a direct "push", but it is likely to be a compact,
efficient form of polling, so it is still a vast improvement over
the normal polling a generic web spider would use.
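To illustrate why connector-style polling is so much cheaper than spider polling, here is a small Python sketch. It assumes the repository keeps an ordered change log that the connector can ask about, which is a common design but not something every CMS offers. FakeRepository, poll_once and the dictionary used as an "index" are all hypothetical stand-ins.

```python
class FakeRepository:
    """Stand-in for a CMS that records every change in an ordered log."""

    def __init__(self):
        self.change_log = []  # entries are (sequence_number, doc_id, action)

    def record(self, doc_id, action):
        self.change_log.append((len(self.change_log) + 1, doc_id, action))

    def changes_since(self, checkpoint):
        """Return only the entries newer than the given sequence number."""
        return [entry for entry in self.change_log if entry[0] > checkpoint]


def poll_once(repository, checkpoint, index):
    """Apply changes newer than checkpoint to index and return the new checkpoint."""
    for sequence, doc_id, action in repository.changes_since(checkpoint):
        if action == "delete":
            index.pop(doc_id, None)
        else:
            # A real connector would fetch the content, metadata and ACLs here.
            index[doc_id] = f"content of {doc_id}"
        checkpoint = sequence
    return checkpoint
```

Each poll asks one question ("what changed since entry N?") instead of revisiting every document, and deletions are reported explicitly, something a web spider can only infer from missing pages.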
If available, method 2, using a specific connector,
would be the preferred method.
Method 3, using a generic database connector, will be more complicated to
set up. Do-able, but more complicated.
You would still need to get your search vendor's
database connector; they might also refer to it as their ODBC connector
or database gateway.
Method 3 would also be a potential workaround if the search engine
vendor doesn't offer a specific connector for the XYZ CMS.
See Clinton Allen's article,
also in this issue, for an example of this technique using K2 and ODBC.
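As a rough idea of what the generic database approach involves, here is a Python sketch that reads changed rows from a CMS table and maps them to documents for indexing. It uses an in-memory SQLite database as a stand-in for the ODBC connection, and the articles table, its columns and the document field names are all invented for illustration. A real CMS schema will be more complicated, which is where most of the setup effort goes.

```python
import sqlite3


def make_demo_database():
    """Create a tiny in-memory table standing in for the CMS back end."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE articles "
        "(id INTEGER PRIMARY KEY, title TEXT, body TEXT, modified TEXT)"
    )
    connection.executemany(
        "INSERT INTO articles VALUES (?, ?, ?, ?)",
        [
            (1, "Welcome", "Hello and welcome.", "2007-01-15"),
            (2, "April news", "Connectors explained.", "2007-04-02"),
        ],
    )
    return connection


def fetch_changed_documents(connection, since):
    """Read rows modified after `since` (ISO date) and map them to documents."""
    cursor = connection.execute(
        "SELECT id, title, body, modified FROM articles "
        "WHERE modified > ? ORDER BY modified",
        (since,),
    )
    return [
        {"id": f"article-{row[0]}", "title": row[1], "text": row[2], "modified": row[3]}
        for row in cursor
    ]
```

Note that this sketch only sees what is in the table. Security settings, attachments and deleted rows would each need their own handling, which a product-specific connector normally takes care of for you.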
All of the other methods mentioned above have more serious limitations
or added complexity, and might prove problematic in a production environment.
Methods 4 and 5 are "batch oriented" so would likely be
even less responsive, and
also somewhat complicated to set up.
Methods 6 and 7 are generally complicated and will usually require some
formal programming; therefore they should be considered only when all other
techniques are unfeasible.
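To give a feel for the programming involved in method 7, here is a Python sketch of the merging step in a federated search. It assumes each source is a function returning (title, score) pairs, and it rescales each source's scores against that source's top hit before combining them. Both the interface and the rescaling rule are assumptions for illustration; real engines score on very different scales, and merging them fairly is one of the hard parts.

```python
def federated_search(query, sources, limit=10):
    """Query every source and merge the hits into one ranked list.

    sources maps a source name to a callable that takes the query and
    returns a list of (title, score) pairs on that source's own scale.
    """
    merged = []
    for name, search in sources.items():
        hits = search(query)
        top = max((score for _, score in hits), default=0)
        for title, score in hits:
            # Rescale so that each source's best hit scores 1.0.
            normalized = score / top if top else 0.0
            merged.append((normalized, name, title))
    # Python's sort is stable, so ties keep the order the sources were listed in.
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:limit]


def demo_sources():
    """Two made-up sources whose raw scores use different scales."""
    return {
        "cms": lambda query: [("Press release", 80), ("Archive", 40)],
        "intranet": lambda query: [("Handbook", 0.9), ("Memo", 0.3)],
    }
```

Calling federated_search("connectors", demo_sources()) interleaves the two result lists. Every query is forwarded to every source at search time, so the slowest source sets the response time.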
Method 8 requires the coincidence of you and your CMS vendor both
using the same search engine, and usually the same version.
Even if all that worked out, there may be licensing restrictions
or other technical issues that make it a long shot.
Therefore method 2, using a connector,
would be preferred because:
- It has access to complete metadata, including ACL info.
- It knows precisely when to grab a document
(either via push or very efficient poll methods)
Two issues you might face, however, are:
- The specific connector you need may not exist.
Remember, it needs to be specific
to both the XYZ CMS system and your particular search
engine platform.
Generally you would contact the search vendor, rather than the CMS vendor.
If it doesn't exist, consider method 3, using a more generic
database connector, as a workaround.
- It might be an extra cost option from your search vendor.
For some vendors it could be in the range of $25,000 to $100,000.
On the bright side, some vendors bundle in one connector license with
the base engine, so check your original purchase paperwork to see
if you've already paid for it.
The details of installing connectors vary widely by search engine vendor
and by the specific connector, but a connector would generally be expected to
perform better than the web spider / crawler.
Given the potential complexity, you may want to get some help setting it up.
We hope this has been useful to you; feel free to contact
Dr. Search directly if you have any follow-up or additional questions.
Remember to send your
enterprise search questions to Dr. Search. Every entry
(with name and address) gets a free cup and a pen, and the
thanks of Dr. Search and his readers.