
Using the EXTRACT Utility

Last Updated Mar 2009

By: Mark Bennett & Miles Kehoe, New Idea Engineering, Inc., Volume 2 - Number 5 - April 2005

Do you ever wonder what data is actually stored in your Verity K2 collections? Oh, sure, you probably have a list of the documents you submitted for indexing; and if you're like a few of the customers we work with, you even analyze your index logs to determine which documents did not successfully index. You may even know why they didn't index. But seriously: do you know exactly what content you have in all of the fields of your collections?

Search data quality - SearchDQ - is finally beginning to get the attention it deserves. New Idea Engineering and others offer SearchDQ products and services. But Verity provides a little-known and virtually undocumented tool with K2 that gives you the power to look into your own collections and start doing your own search data quality review.

The tool? Extract.

Extract is a command line tool you will find in the Verity "bin" directory. On Windows, the executable is called "extract.exe"; on Unix, it's simply an executable called "extract". Note that the "extract" utility is not the same as the Extractor application recently introduced by Verity.

With extract, you can - you guessed it - extract any or all of the fields in your Verity collections, with or without the actual document stream. The output format is logical for Verity: extract dumps the data into a standard bulk insert file (BIF), which makes it easy to recreate a collection.

By default, extract dumps every row, or record, in the collection. But you can specify a subset of rows by providing an optional Verity Query Language (VQL) query. Extract also lets you provide credentials so your secure documents remain secure.

Running Extract

The primary syntax for the most common options is:

extract option filename collection

The valid values for option are as follows.

  Option   Meaning
  -x       extract all fields and the document text stream
  -n       extract all fields but not the document text stream
  -v       extract all non-gateway fields and the document text stream

For all three options, the <filename> parameter is the base name of the file that will contain the resulting output. For example, if you specify a <filename> of "colldump", all three options will create a file called colldump.vdk that contains the fields dumped in BIF format. With the two options that also extract the document text stream, you will additionally get a file called colldump.0 which will contain the streamed text from all of the documents concatenated together (more on this shortly). Warning: extract appends records to the .vdk output file every time you run it; so if you run the same extract command twice, you'll find twice as many records as you'd expect in the output file!
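Because of that append behavior, it can be worth clearing out any previous output before each run. Here is a minimal Python sketch of that idea; the collection name "mycoll" and output base name "colldump" are placeholders, and it assumes the extract executable is on your PATH.

import glob
import os
import subprocess

OUTPUT_BASE = "colldump"   # extract writes colldump.vdk, colldump.0, colldump.1, ...
COLLECTION = "mycoll"      # placeholder - point this at your own collection

# Remove stale output first, so this run's records are not appended
# to whatever extract wrote last time.
for stale in glob.glob(OUTPUT_BASE + ".*"):
    os.remove(stale)

# Dump all fields plus the document text stream (-x, from the table above).
subprocess.run(["extract", "-x", OUTPUT_BASE, COLLECTION], check=True)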

The -v option also lets you specify a 'verbosity level' that corresponds to the verbosity level used by other Verity collection tools like mkvdk and vspider. This additional data, dumped to standard error when you run extract, can help you identify which filters and code pages extract is using.

Fielded Data Only

Let's start with the simplest case: using extract to extract fielded data only.

extract -n data mycoll

When you run this command, extract will create a file called data.vdk; when you examine that file in a text editor, you'll see data in the form:

field1: data from 'field1' in document 1 
field2: data from 'field2' in document 1
<<EOD>>
field1: data from 'field1' in document 2
field2: data from 'field2' in document 2
<<EOD>>

In your file, you will actually see all of the system and user fields defined in your collection's style files. The text <<EOD>> is used to separate the logical records, or rows, of the collection; a short example of parsing this format in a script follows Figure 1.

Figure 1 shows the fields of a single record in a typical (small) collection. Your actual fields may differ.

VDKFEATURES: POLICIES, VACATION SCHEDULE, HOLIDAYS 
VDKSUMMARY: HR Policies and procedures for vacation schedules and holiday schedule
for 2005
VDKPBSUMMARYDATA:
VdkVgwKey: docs\hr_2005.pdf
Title: Human Resource Procedures
Subject:
Author: Miles Kehoe
Keywords:
MIME-Type: application/pdf
VLang:
Charset: utf8
To:
NewsGroups:
PageMap: 0x
Size: 0
Created:
Date:
URL:
<<EOD>>

Figure 1: Typical Logical Document/Row
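If you want to analyze a dump programmatically - say, to count how many rows have an empty Title - the BIF layout is easy to parse: each record is a run of "field: value" lines terminated by <<EOD>>. The following Python sketch is one way to read it; it assumes the simple layout shown above (a stray unprefixed line, like the wrapped VDKSUMMARY value in Figure 1, is folded into the previous field) and a dump file named data.vdk.

def read_bif(path):
    """Parse an extract .vdk (BIF) dump into a list of {field: value} dicts."""
    records, current, last_field = [], {}, None
    with open(path, encoding="utf-8", errors="replace") as bif:
        for raw in bif:
            line = raw.rstrip("\n")
            if not line.strip():
                continue                                # skip blank lines
            if line.strip() == "<<EOD>>":               # end of one logical row
                records.append(current)
                current, last_field = {}, None
            elif ":" in line and not line.startswith((" ", "\t")):
                field, _, value = line.partition(":")
                current[field] = value.strip()
                last_field = field
            elif last_field is not None:                # wrapped continuation line
                current[last_field] += " " + line.strip()
    return records

rows = read_bif("data.vdk")
print(len(rows), "rows;",
      sum(1 for r in rows if not r.get("Title")), "have an empty Title")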

Data and the Document Stream

Sometimes - perhaps more rarely - you may want to extract the text of the documents as well as the fields. Why? Well, perhaps some of your documents are image files from which you've asked Verity to extract text; or you may have indexed PDF documents and you want to see the text stream. By the way, this is a very easy way to view the text from your indexed PDF files!

Consider the command line:

extract -x data mycoll

With -x, extract behaves much as it does with the -n option, but you will also see gateway-specific fields that the '-n' option does not show, and you will get a file called data.0 which contains the text of all of your documents merged together into a single file. (If you have a very large set of documents, extract will split the text across files called data.0, data.1, data.2, etc.)

You might wonder why extract puts all of your documents into the same file. How can the collection building tools like mkvdk recreate the collection?

Well, if you use the -x option and look at the data.vdk file, you will see three fields that are rarely used in K2 but which have been part of Verity products since the beginning:

DOC_FN: "data.0"
DOC_OF: 0
DOC_SZ: 8594

These three fields provide the extract data file name (DOC_FN), the offset (DOC_OF) within that file where the text of the original document begins, and the number of bytes (DOC_SZ) of text for the original document. Given these fields, the Verity indexing tools can identify the text associated with each document when you decide to reindex the data using this bulk insert file.
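To see how these three fields fit together, here is a small Python sketch that pulls one document's text back out of the stream file. It is only an illustration: it reuses the read_bif helper sketched after Figure 1 (any parser that yields DOC_FN, DOC_OF, and DOC_SZ would do) and assumes the dump was produced with the -x command shown above.

def document_text(record, encoding="utf-8"):
    """Return the text stream for one extracted row, using DOC_FN/DOC_OF/DOC_SZ."""
    stream_file = record["DOC_FN"].strip('"')   # e.g. data.0 (the value is quoted in the BIF)
    offset = int(record["DOC_OF"])              # byte offset where this document's text begins
    size = int(record["DOC_SZ"])                # number of bytes belonging to this document
    with open(stream_file, "rb") as stream:
        stream.seek(offset)
        return stream.read(size).decode(encoding, errors="replace")

# For example, print the text of the first row of the dump:
# print(document_text(read_bif("data.vdk")[0]))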

Note that you can use these fields yourself in any collection you may create using tools like mkvdk for indexing content such as Unix mail files, newsgroups, and other composite files.
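As a rough illustration of that idea - the exact fields your style files and gateway require may differ, and the file names, keys, offsets, and sizes below are all hypothetical - a hand-built bulk insert file describing two messages stored in a single mail file might look something like this (DOC_FN is shown quoted, following the form extract itself writes):

VdkVgwKey: mail/inbox.mbx.1
DOC_FN: "mail/inbox.mbx"
DOC_OF: 0
DOC_SZ: 2048
Title: First message
<<EOD>>
VdkVgwKey: mail/inbox.mbx.2
DOC_FN: "mail/inbox.mbx"
DOC_OF: 2048
DOC_SZ: 1536
Title: Second message
<<EOD>>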

Extracting Data with Verbosity

A less-used capability of extract is its ability to show you the filters, gateways, document character sets, and other indexer metadata. To see exactly what your documents were indexed with, specify a verbosity level with the '-v' option. The syntax of the command is:

extract -v LEVEL data mycoll

The values for LEVEL, related to OUTLEVEL in the Verity K2 command line indexing tools documentation, are as follows:

  Level   Meaning   Provides
  8       Status    Reports query time
  16      Info      Reports drivers, character sets, language
  64      Debug     Reports complete information

These numbers are cumulative, so to see status and info levels, specify a LEVEL of 24 (status + info).
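For example, to dump the non-gateway fields and the document text stream while reporting both status and info detail, you would run something like:

extract -v 24 data mycoll

Because the verbosity report goes to standard error, as noted earlier, redirect stderr to a file if you want to keep it for later review.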

The '-v' option behaves like the '-x' option in that it creates a file (data.0 in this example) that contains the document text stream; but it shows only the user and system fields, like the '-n' option.

Other Details

If you run extract with the /? option, you will see a usage report that looks very much like those of the other utility tools such as BROWSE and DIDUMP. However, our experience is that command line options such as LOCALE have to be specified as they are for mkvdk rather than as documented in the usage report. For example, if you have created a collection with the new uni locale, you will need to run extract using the option:

extract -locale uni -n data mycoll

Selecting Subset Data

When you don't want to dump all rows of a collection, you can append a valid VQL statement to the end of the extract command. To find those rows authored by someone named Smith, you can use the command:

extract -n smithdata mycoll "Author<contains>Smith"

Note that on Windows, you must enclose the query in double quotes so the VQL syntax doesn't confuse the command shell. On Unix systems you can use single quotes.

You can use as simple or complex a query as you want; and you can even specify zone names and operators in your query.
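If you are scripting these subset dumps, one way to sidestep the quoting differences between Windows and Unix is to avoid the shell altogether and pass the query as a single argument. A minimal Python sketch, reusing the query from the example above and assuming extract is on your PATH:

import subprocess

query = "Author<contains>Smith"   # any valid VQL expression

# Passing the arguments as a list means no shell parses the command line,
# so the angle brackets in the VQL query need no extra quoting on either
# Windows or Unix.
subprocess.run(["extract", "-n", "smithdata", "mycoll", query], check=True)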

Security

If your collections use document-level security, you can specify a user name, password, domain, and, for the Notes or Exchange gateways, a user mailbox. The syntax for dumping secure documents is:

extract -x dump mycoll -credentials miles:xyz

Note that security is used primarily for document access; if you have file system access to a collection, you can dump field data without any access control.

Finishing Up

The extract utility, a little-known part of the product for years, lets you take a look at the data actually stored in your K2 collections. Extract will let you "roll your own" data analysis to identify what has actually been indexed. If you want to look at the role of Search Data Quality in your organization, give us a call at New Idea Engineering at 408-446-3460.