Autonomy K2 v6: Indexing an ODBC Database

« NIE Newsletter

Autonomy K2 v6: Indexing an ODBC Database

This month Clinton gives us a Quick Start for building a K2 collection with the ODBC driver and Vspider. K2 is currently sold by Autonomy, though some people still refer to it as Verity K2.

Step 1: Introduction / Understanding the Process

Indexing a database using ODBC and vspider requires more configuration than most tasks using vspider, and the documentation can be confusing. Here we provide an example to help get something running and to serve as base for further development. To keep it simple we focus on indexing records from the sample database table EMP provided by Oracle. We will only need to make minimum changes to the sample K2 ODBC style files, using default values where ever possible. In this article we're only going show how to index one text column in the table; of course we will also have to index one other column to serve as the primary key. At the end of the article we've included a link to download the full set of edited style files configured to index all of the columns of EMP.

Although the steps listed below may look complex, they actually break down into a few logical tasks:

Configure ODBC to access your database
Tell K2 which database to connect to, and which fields to index
Have K2 vspider fetch the records and index them
Use K2 rcvdk to do a test search, to verify that the data was indexed

Quick Links / Summary of Steps:

Step 1: Introduction / Understanding the Process
Vocabulary Terms
Step 2: Information You Will Need
Sidebar: Two Flavors of Oracle Drivers
Step 3: Tools
Sidebar: DSN vs. SID - What's the Difference?
Step 4: Defining ODBC Data Sources
Step 5: Copy Template Style Files
Step 6: Add Fields to STYLE.SFL and STYLE.UFL
Step 7: Add the ODBC DSN into ODBCMON.CFG
Step 8: Edit VGWODBC.CFG
Step 9: Populating Fields with VGWODBC.VGW
Step 10: Define the Virtual Document in STYLE.DFT
Step 11: Define SQL Statement in FREESQL_1.TXT
Step 12: Setup Database Authentication in CRED.DAT
Step 13: Start the Indexing with the VSPIDER Command Line Tool
Step 14: Check the Results with the RCVDK Command Line Tool
Step 15: Review and Links to Style Files

Two other points to keep in mind:

There is no single place where you enter an entire SQL statement. K2/vspider builds up a SQL statement automatically, from within its own logic, and using some snippets of SQL that you can provide in various files.
When indexing with K2 and vspider, we've seen it output a lot of errors in log files that look scary, but which ultimately can be ignored. Our point is, check the final results of the index with the K2 search tools. If they look correct, don't worry as much about the errors in the log file. In particular, we believe some of these errors are caused by gaps in non-sequential primary keys. Learn to interpret the log file correctly, and separate out the "common" one from the truly unusual ones.

[ back to top ]

Step 2: Information You Will Need

There are several pieces of information needed to configure K2 and ODBC indexing.

You'll need the DSN (Database Source Name). In this article the DSN is "EMP_TABLE"
If a DSN has not been configured, then you'd need to define it. To define a DSN you would need to know the Oracle server's TCP/IP name or address, and the Oracle SID for the demo database.
The name of the table that you want to index. In our case EMP.
The names and types of the columns in the table, including the table's primary key which will map to K2's primary key (VdkVgwKey). For the EMP table the primary key EMP_NO. We'll index one text field, JOB, from the EMP table.
Accessing a database requires a username and password. For a default Oracle installation the demo username is SCOTT, with password TIGER. These will need to go into an encrypted security file created in step 12.

[ back to top ]

Step 3: Tools

There are several useful tools for obtaining this information and trouble shooting.

You'll need a plain text editor for editing the style files. For example, you can use Notepad on Windows. On Unix/Linux you can use vi or Emacs.
The Oracle command-line client SQLPLUS is handy for verifying the connection to the database. If it isn't installed on the machine running K2, then you may need to run it on the server with the Oracle database or from a PC with the Oracle client tools installed. Once connected you can also examine the tables and columns targeted for indexing. You can also compare the results of a SQL query to the one submitted by vspider.
On Windows you will use the ODBC Data Source Administrator (ODSA) to configure and test the ODBC connection. The ODSA has a "test a connection" button to check that the ODBC credentials and drivers On other operating systems such as Solaris you will need to edit the odbc.ini file. (discussed elsewhere) This is how ODBC DSNs are mapped into database specific settings, such as the Oracle SID, port numbers and other client-server interaction and drivers.
The Oracle Enterprise Manager (OEM) is used to examine and change Oracle system settings. This is a graphical tool and must be run on a windowing environment such as Windows, X-Windows or on the Mac. If OEM is not installed on the machine running K2, or if the K2 machine is only accessed in a character mode such as via telnet, it can be run from another machine where Oracle is installed. The OEM allows examination of the table and columns schema and contents. If for instance the spelling of a column name is wrong than the database can't be opened, and vspider gives a rather generic error which can also be caused by the wrong username or DNS.

Also the trace log should be activated in the ODSA, under the Tracing tab, or in the odbc.ini file. This log is useful for showing the complete SQL query and troubleshooting problems with vspider opening the database.
[ back to top ]

[ back to top ]

Step 4: Defining ODBC Data Sources

Step 4a: On Windows

As we mentioned, on Windows there is a tool called ODSA, the ODBC Data Source Administrator. This can be found under:

Start --> Control Panel --> Administrative Tools --> Data Sources (ODBC)

Fill in the blanks from the information you gathered in Step 2.

The ODSA has a "test a connection" button to check the ODBC credentials and drivers, and it allows you to change the SID, port numbers and other settings. On other operating systems such as Solaris this information is set in the odbc.ini file, which is discussed in the next section.

Step 4b: On Unix

On Unix you will need to define your data source in the odbc.ini file, and then setup environment variables to point to it.

Sample Oracle ODBC INI file: In the sample above, the bold text indicates the important lines that would need to be changed (or at least verified). The underlined text is what is likely to be site specific.

Although we've shown the default port for Oracle, this does seem to change for a fair percentage of sites.

The _ssol26 is for Solaris; if you are on Linux or some other flavor of Unix, that string would be different. That is generically referred to as _platform, or "underscore platform", where platform refers to the operating system.

Unix Environment Variables

For Verity to index data properly, there are some environment variables that need to be set correctly:

PATH
LD_LIBRARY_PATH
ODBCINI
- - - Some other variables that you'll often need - - -
VDKHOME
ORACLE_HOME
JAVA_HOME
CLASSPATH

The details of setting all these variables are beyond the scope of this article.

One final warning about Unix Environment Variables

When you login to an interactive shell to setup or debug problems, you may be getting a different set of environment variables than what automatic indexing scripts or web server processes see, because you may be running a different shell. This can happen when scripts are automatically run out of crontab, or started when the system boots or by inetd.

For example, you may use C-Shell with .cshrc to do your editing, but the nightly index scripts might be running in the Bourne shell with .login. We've also see a combination of some scripts running in Bash but the system picking up defaults from an unexpected .cshrc.
[ back to top ]

Step 5: Copy Template Style Files

Make a copy of the shipped ODBC style files to customize. In a standard installation these are in k2\common\styles\vgwodbc. For our example we've copied the vgwodbc directory to a directory emp_style in our sandbox and we'll modify those style files.
[ back to top ]

Step 6: Add Fields to STYLE.SFL and STYLE.UFL

First we add new collection fields to hold the database information. The rules for adding these fields are the same as for any other K2 project. The VdkVgwKey is already in the style.ddd, in step 8 we'll associate it with EMP_NO. So we need to add one field to hold the contents of the DB column JOB. We can just make it an alias of the existing varwidth "Title" field, already defined in the shipped style.sfl. Furthermore it's convenient to give each field the same name as the column. So in the style.sfl we need only to add a new alias directive to the Title definition resulting in:

varwidth: Title _sv
          /alias = FTS_Title
          /alias = dc:Title
          /alias = vdk:Title
          /alias = JOB     # added for ODBC

The other fields required for indexing all of the EMP columns are in the style.ufl of the linked set.
[ back to top ]

Step 7: Add the ODBC DSN into ODBCMON.CFG

The DSN and some parameters that control its use are set in the DsnLimits section of odbcmon.cfg. In our example the DsnName is EMP_TABLE and we want the SecurityField to be public so we've changed the sample section of odbcmon.cfg to:

## Sample DsnLimits Section
DsnLimits:
{
  DsnName: EMP_TABLE
  # MaxConnects: 9
  SecurityField: 0
  #    QuotedId: 0
}

Note "DsnLimits:" and the curly brackets are now uncommented; the other parameters we've left at their default values.
[ back to top ]

Step 8: Edit VGWODBC.CFG

The vgwodbc.cfg is where the indexed table and columns are specified by defining elements "gwview" and "gwfield"

Gwview has two parts:

The DSN
an alternate name used by K2.

We've set it to "EMP_TABLE EMPVIEW" The first gwfield is the primary key EMPNO with a modifier "gwpk" meaning the VdkVgwKey will be populated with it's value. Other needed database fields should be listed also--for now we add only JOB. So the edited lines in vgwodbc.cfg are:

gwview: EMP_TABLE EMPVIEW
{
  ## At least one field definition is required for each section.
  ## Each 'gwfield' becomes available for indexing and/or
  ## mapping to a VDK field.
  ## Setting up a primary key column
  gwfield:  EMPNO
    /gwpk
  gwfield:  JOB   #collection field
    /table=EMP
    /dbcolumn = JOB  #oracle db name
}

Notice we've uncommented the opening curly bracket after the gwview: line. At the bottom of the vgwodbc.cfg there are a few lines concerning the file "freesql_1.txt" Uncomment the ones starting with "freesql:" and the next-to-last closing brace, so the last few lines appear as

## In ODBC gateway 5.0 and beyond an external file holds the
     ## parts of a query required to make a meaningful SQL statement.
     ## This file must be present and configured for each table. Note,
     ## this file is always assumed to be in the style directory.
      freesql: freesql_1.txt
    }
   }
   $$

In step 11 we'll create freesql_1.txt.
[ back to top ]

Step 9: Populating Fields with VGWODBC.VGW

The fields such as JOB were created to store the contents of the database columns and they need to be assigned values. This is done by copy directives to the gateway for each field/column combination added to the vgwodbc.vgw. A copy for the VdkVgwKey isn't needed, and the sample vgwodbc.vgw already contains a line for the _SECURITY collection field which needs to be left in. For JOB we add the one line "copy: JOB JOB." So the changed section of the vgwodbc.vgw looks like this:

copy: _VgwAccessType  _SECURITY
        # oracle column --> collection field
        copy: JOB JOB

A copy statement is required for each field that is populated from the database. In the linked set we've added copy commands for all of EMP.
[ back to top ]

Step 10: Define the Virtual Document in STYLE.DFT

The style.dft is used to compose the virtual document that K2 uses for full-text search and viewing of each record. For a database it typically consists of fields printed in the virtual document.

The style.dft in the shipped vgwodbc style files has all the lines commented out except for matched curly braces, which should be left so. First we uncomment the VdkVgwKey line and using directives "field" and "constant" add JOB and some formatting. Thus the edited part of the style.dft appears as:

## constant keyword to add static strings and
     ## view controls to the virtual document.
     constant: " "

     ## VDK internal field.
     ## This field will be indexed as
     ## part of the document as well.
     field: VdkVgwKey
     constant: "\n"
     ## Database field
     constant: "JOB: "
     field: JOB
     constant: "\n"

The linked set style.dft has all the EMP fields.
[ back to top ]

Step 11: Define SQL Statement in FREESQL_1.TXT

When indexing, vspider submits a SQL statement to the server. The statement is contained in a text file with the default name of 'freesql_1.txt' residing in the source style directory ("emp_style" in our case). When the collection is created, the file is copied to the collection's style directory along with the other style files.

For our example, indexing all the records in EMP, the freesql_1.txt contains only the single line:

FROM EMP

Vspider prepends a SELECT to this. The complete SQL statement can be seen in the trace log. The name of this file is specified in the last line of the vgwodbc.cfg, and the line needs to be uncommented--done in step 9. The SQL statement must be on the first line of the file, a blank line at the beginning of this file results in a rather misleading error message E0-1514 " Cannot initialize table 'EMP_TABLE/EMPVIEW'... "
[ back to top ]

Step 12: Setup Database Authentication in CRED.DAT

Vspider needs the database username and password which it can read from an encrypted file whose default name is "cred.dat" created by the utility "savecred" located in the K2 bin directory.

An example command-line for our case is:

savecred -f cred.dat -t 0x4 -c user:scott  -c password:tiger -c 
		repository:DBMS:EMP_TABLE:0x0dbc0000:0x0dbc:0

(the previous command should be entered all on one line.)

The user name is "scott" and the password is "tiger" The repository name is the same as the DNS name "EMP_TABLE" The hexadecimal characters should be entered exactly as shown. It's convenient to put this command into a shell script or batch command. Cred.dat should then be put into the source style directory used by vspider (emp_style for us), so at collection creation time the cred.dat will be copied into the collection's style directory. To use this file vspider needs the -auth2 option shown in step 13.

Note that savecred will append onto an existing cred.dat so it might save some grief in the development stage to delete an existing cred.dat before running it. Using "savecred -help" gives a usage dump. For changing the location and name of this file see the line 'CredFile: "local cred.dat" ' near the top of the odbcmon.cfg.
[ back to top ]

Step 13: Start the Indexing with the VSPIDER Command Line Tool

The minimum vspider command-line plus verbose logging is:

vspider -verbose -odbc -auth2 "emp_style\cred.dat" -collection empcoll -style emp_style

(the previous command should be entered all on one line.)

This creates the collection "empcoll" in the current working directory using the "emp_style" style files and indexes the records returned by the SQL query. The " start" option isn't needed. Note we are pointing to the auth file in the source style files. This file must be in the collection's style directory for use by the low-level ODBC gateway code, but the vspider command line can point to a copy, say, in the CWD.

You should get a fair amount of output to the console showing each record indexed due to the -verbose option.
[ back to top ]

Step 14: Check the Results with the RCVDK Command Line Tool

When vspider's finished it should give the number of records indexed. To check this and view the virtual documents you can use the utility "rcvdk" in the K2 bin directory.

For detailed information on rcvdk please see the article "Command Line K2: rcvdk" from Vol. 3 No. 4, Summer 2006.

First, at the command-line (for our collection empcoll in the CWD) enter:

rcvdk empcoll

This gives a message about opening the collection and then a prompt. Then enter "s" (without the quotes) to do a null search returning all the records in the collection; for a new collection this should be the same number as the number indexed by vspider.

RC> s (enter, by itself runs a null search, matching all docs)
(returns a document count)
RC> r (enter, to see the first screen of results)

You can also search for a specific word:

RC> s clerk (enter)
(returns a document count)
RC> r (enter, to see the first screen of results)

Then you can look at the matching documents (which may scroll by very fast).

RC> v 1 (enter)
1:              odbc://EMP_TABLE/EMPVIEW/7934
                [ odbc://EMP_TABLE/EMPVIEW/7934 ] JOB: CLERK

The document view should reflect the style.dft and the data in each record as shown by SQLPLUS.

To quit rcvdk just use the q command, followed by enter.

RC> q (enter)

[ back to top ]

Step 15: Review and Links to Style Files

Below we've included links to a complete set of style files. These are a bit different than those used in the article, in that they show how to index all of the fields from Oracle's EMP sample table.

Windows: emp_style_allfields.zip: Zip format, CR-LF, 19.1 KB (19,595 bytes)
Unix: emp_style_allfields.tar: Tar format, LF, 66.0 KB (67,584 bytes)

[ back to top ]