A Cost Effective Model for Database Indexing
IntroductionIt used to be that databases were where companies stored structured data like accounting records or personnel records. To find information, SQL was the language to use, and if you needed help, you talked to the database pros. Unstructured data and documents were maintained in the file systems, and to find information, you use your site search tool. Sounds easy, doesn't it? The problem is, databases began offering support for 'binary long objects', or BLOBs; and it was only a small step from BLOBS to long text data types, and soon database administrators everywhere were loading office format files, PDFs, and other documents into relational tables. A logical storage option perhaps, but even now relational systems typically offer much less rich feature sets for full text retrieval. Enter the database gateway.Nearly every search vendor now supports methods for indexing and retrieving textual data within a relational database. Unfortunately, because large relational database systems cost huge amounts of money, search technology companies often charge a significant premium to index relational data. And while the solutions, often called gateways or connectors, are quite automated, they often limit your flexibility. Write Your Own?We used a database to store large numbers of web pages at a previous company, and we wrote our own gateway for our particular needs. But writing your own code can be complex as well, so when one of our customers wanted to index a relatively small number of database documents, we delivered a simpler solution that might work for you as well. The Customer's ProblemOur customer had several hundred articles stored in an Oracle database which were served by Cold Fusion to generate web content for their users. This presented a problem: the database did not contain the actual HTML page; it only contained HMTL snippets which we expanded at display time by Cold Fusion. Each snippet, however, was stored in a database record with a unique key value. The links from the web site to a given article was of the form: <a href="/cfm/getdb?id=1234>Read Story 1234</a>Unfortunately, very little of this content was linked from existing static pages, so spidering the site would miss hundreds of the articles. A Custom ToolOur solution was to create a page which contained links to every article in the database, and then send the spider to that unique link or "landing" page. This type of solution will work for nearly any textual content stored in a relational database. Extract Database Key FieldsWe chose to implement the solution in Java because of other work we were doing on the system. In fact, we could have used Perl, Python, or nearly any scripting language with access to the database, Oracle in this case. The Oracle database contained articles for several publications, each with a unique Pub_Code. Within that limit, we wanted to capture the article ID for each article. Our plan was to execute a SQL query of the form: SELECT ID from Pubs where Pub_Code='news"If you've used databases, you know that this query will return a list of ID numbers. Once you have this list, you can decide what kind of approach you will use to create the collection. The two methods that are most common are either using a web spider indexer, or to create a "bulk insert" file and use a command line indexer. Which you use may depend on the search engine you are using, how your other collections are created, and on many other variables we won't get into here. Create an HTML Link Page for an Indexing SpiderIf you want to use an indexing spider provided by your vendor, you will need to create an HTML landing page with links to each article by ID. Once you have the list of articles using the above SQL statement, it is an easy task to write code using this logic (though NOT this syntax!): foreach docID in resultList output an HTML link of the form <a href=/cfm/getdb?id=docID>docID</a>By writing this stream to an HTML file, you create a link to every document of interest. And by running an index job with this landing page as the starting point, the standard indexing tools create the collection from live database data. Because our customer elects to re-created their full-text index from scratch nightly, all we needed to do is schedule the customer Java application to run a few minutes before the spider was scheduled to re-create the collection. If you are doing incremental updates, you might need to add additional logic; but you can see how simple the process really is. Even with the custom code, our solution was much less expensive than the database connector and our customer has full control over the indexing process. Maybe you'll find something like this works for you as well! Create bulk load file for Command Line Indexing ToolsIf you want to use a command line utility that supports a bulk load capability, you use a similar logic but the command line syntax is different. And you may need to invoke the command line tool once for each link if your search engine doesn't support an indirect bulk format. Note: Dr Search writes about Autonomy IDX and K2 Bulk Insert Files this month. In such an indirect format, you create links to documents that are to be indexed in a file, and then point the indexer to that file. Assuming you will run the index command once for each link, you use similar logic as above with different output syntax. foreach docID in resultListoutput a command line in valid syntax for your search engine This creates a bulk load file that may be somewhat less efficient than the HTML link (or landing) landing page above, but which may work for your site depending on your implementation. SummaryYou've seen two ways to extract text from a relational database without using a special gateway or connector. It's not difficult, and depending on the complexity of your site architecture, you may find you can save a significant amount of capital expense by simply solving the problem on your own. The gateways and connectors are very nice for some sites, but implementing a full database gateway is complex - and on top of that, we're partial to being frugal! Home | Products | Services | Newsletter | Resources | About Us | Contact Info | Privacy Policy Copyright New Idea Engineering, Inc 1996 - 2008 |