Enterprise Search Blog

Rediscover the Poor Man's Entity Extraction in Verity K2

Last Updated Mar 2009

By: Mark Bennett & Miles Kehoe, New Idea Engineering, Inc., Volume 2 - Number 8 - Fall 2005

Verity has released its Extractor product, which provides capabilities such as identifying and extracting content into specific fields and zones. Old-timers with Verity will remember the 'TDE' capability that provided similar, though less sophisticated, features years ago. For those of you who have forgotten about TDE, and for those who started using Verity in the last few years and have never used it, you'll be glad to hear that the TDE capability is still included in K2; the Collection Reference Guide still documents it, even in K2 6.0.

The bad news is that none of the sample style sets use TDE at all. The good news is that if you want to try a "poor man's entity extraction", this article will show you how.

Background

From its earliest version, Verity featured the capability to identify and extract field values based on regular expressions. Back in those pre-web days, documents often had what we would now call metadata embedded in the text. Consider the simple document shown in Figure 1.

Title:	Meeting next Tuesday
Author: Milo
Dept: Marketing
Date: July 3, 1995

Let's meet next week to discuss the search project. We need to identify
features and budgets. If you cannot make it at 2:00PM July 19 call me.
You can reach me at (408) 828-4592. See you then!
Figure 1: Simple Document

Now some of you old-timers might point out that the zone filter will do much of the extraction you might want when the document is so structured. But even the zone filter won't let you extract the phone number in the document shown in Figure 1. For that, you need to go back to the style.tde features.

Introduction to TDE

The acronym TDE stands for 'Topic Document Entry', the name of a product that processed and extracted fielded data based on regular expressions and would provide an interactive review capability. Now you real old timers will remember that TDE was really a reincarnation of the old "DMV" file, which let you describe patterns using regular expression syntax (regex) to identify metadata to be extracted at index time. (If you are a true old timer and think you remember how the file came to have the DMV extension, email me and tell me your guess. Perhaps I'll publish the most creative - and tell you all the correct answer.)

Style.TDE

Enter Style.TDE. It was first introduced in the 1995 timeframe, and as I have suggested, still works today. Given the text format shown in Figure 1 above, let's assume you want to extract the title, the author, the author's department, the date, and, if present, the phone number. To make it interesting, let's also assume that any document that has no author should not be indexed at all; and let's decide that if there is no department present, the document should be marked as 'Corporate'.

First, we need to identify the fields we want to use in style.sfl, or create new user-defined fields in the style.ufl file. (Granted, this is only a suggested spot to put user fields, but it's nicer to play within the Verity conventions.) I'll assume you know how to create fields called PHONE, EMAIL, and DEPARTMENT, and we'll simply use the standard fields TITLE, AUTHOR, and DATE. Note that using standard fields will have some consequences later on, when we index Office and PDF documents.

Note: You can download a ZIP file containing the files described in this article. Go to http://www.ideaeng.com/pub/files/tde.zip and unzip the files into a working directory.

Next, create a test area where you can build collections without impacting your production environment. As long as the Verity binary directory is in your system path, the test area can live anywhere out of the way. In this example, we create a "TDE" directory in the root of the C: drive, with style and docs directories under it. We'll also create a file called flist that will contain the names of the files we want to index, as shown in Figure 2.

c:\tde
c:\tde\style\
c:\tde\docs\
c:\tde\flist

Figure 2 - Suggested Directory Layout
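The article uses Windows paths, but the same layout can be sketched from a Unix-style shell using a relative tde directory (which works just as well, since the flist entries are relative):

```shell
# Create the suggested working area: style/ for style files,
# docs/ for test documents, and an (empty for now) flist of files to index.
mkdir -p tde/style tde/docs
touch tde/flist
ls tde
```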

Generally, we recommend you copy master style files from the Verity common styles directory into your new working style directory.

Put your test documents in the docs directory, and create an flist file that contains the names of the files to include. In this sample, flist lists the files from the docs directory, as shown in Figure 3.

docs/doc_one.txt
docs/doc_two.txt
docs/doc_three.txt
docs/doc_four.txt
docs/doc_five.txt
docs/doc_six.doc
docs/doc_seven.pdf

Figure 3 - flist Contents

Why not give the fully qualified path? If you specify fully qualified paths, you will always need to keep the documents for this collection in c:\tde\docs. If you use relative paths, you can move the entire tde directory anywhere you want and the collection will still work fine.

Now, in the style directory, create a new text file called style.tde that matches the one shown in Figure 4.

# style.tde to parse title, author, phone numbers and email address

$control: 1
tde:
{
pre-process:
{
datamap:
{
#
# Define macros for three and four byte strings of numbers
define: TRIDIG "[0-9][0-9][0-9]"
define: FOURDIG "[0-9][0-9][0-9][0-9]"
#
# And define the image of a 'phone number'
define: PNum "[(]{TRIDIG}[)][ \t]*{TRIDIG}-{FOURDIG}"

#
# Now define the field patterns
field: Title PATTERN "^Title: <.*>"
field: Author PATTERN "^Author: <.*>"
/required = yes
field: Phone PATTERN "{PNum}"
/which=ALL
/string-between = " , "
field: Email PATTERN "[ \t]*[A-Za-z0-9]+[\@][A-Za-z0-9\.]+"
field: Department PATTERN "Dept: <.*>"
/default="Corporate"
}

}
}
$$

Figure 4 - style.tde

Let's take a look at this style file one section at a time. Once we have defined the syntax requirements so K2 can recognize the file, we define three 'macros'. These macros provide an easy way to name patterns we intend to use throughout our regular expressions. The three we define here are:

# Define macros for three and four byte strings of numbers
define: TRIDIG "[0-9][0-9][0-9]"
define: FOURDIG "[0-9][0-9][0-9][0-9]"
#
# And define the image of a 'phone number'
define: PNum "[(]{TRIDIG}[)][ \t]*{TRIDIG}-{FOURDIG}"

Because we will be looking for phone numbers, we want to be able to easily recognize three- and four-digit strings of numbers. In the macro definitions above, we define in regex terms what a three-digit sequence looks like (TRIDIG) and what a four-digit sequence looks like (FOURDIG).

Furthermore, we put these initial macros together to define yet another macro: the format of a fully qualified (North American) phone number, (xxx) yyy-zzzz. The separator between the area code and the exchange can be any run of spaces or tabs, including none; the hyphen between the exchange and the number is required. The pattern for the fully qualified phone number, which we will refer to as PNum later, is:

[(]       : A single character, a left parenthesis
{TRIDIG}  : Based on our earlier macro, a three digit sequence of numbers
[)]       : A single character, a right parenthesis
[ \t]*    : A single space character or a tab (\t), repeated zero or more times (*)
{TRIDIG}  : A three digit sequence of numbers representing the phone exchange
-         : The required hyphen between exchange and number; and finally
{FOURDIG} : The four digit macro above, representing a sequence of four numeric characters
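To sanity-check the pattern outside of Verity, here is how the same macros translate into a Python regex. This is an approximation: Verity's regex dialect is not identical to Python's re module, but the pattern behaves the same on simple input.

```python
import re

# The TDE macros, translated into Python regex fragments (an approximation;
# Verity's dialect differs in details from Python's re module).
TRIDIG = r"[0-9][0-9][0-9]"
FOURDIG = r"[0-9][0-9][0-9][0-9]"
PNUM = r"[(]" + TRIDIG + r"[)][ \t]*" + TRIDIG + "-" + FOURDIG

# The phone number line from the Figure 1 sample document.
text = "You can reach me at (408) 828-4592. See you then!"
match = re.search(PNUM, text)
print(match.group(0))  # -> (408) 828-4592
```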

Now that we have the macros defined, let's get to work on the fields themselves:

field: Title PATTERN "^Title: <.*>"
field: Author PATTERN "^Author: <.*>"
/required = yes

field: Department PATTERN "Dept: <.*>"
/default="Corporate"

These three fields all use the same general pattern: a line starting with the text label, then a colon; everything after that colon goes into the field. For example, the first line reads as "when a line starts with 'Title:', everything after the colon should be extracted into the field TITLE." Here's the pattern breakdown:


^      : Start of line 'anchor', so the text must start in column 1
Title: : The text to match in the document
<      : Mark the start of data to be extracted
.*     : Zero or more (*) of any character, up to the end of line (CR/LF on Windows, LF on Unix)
>      : Mark the end of data to be extracted

If we omitted the angle brackets < and > from the pattern, we would get everything on the line, including the label - for example, 'Title:' as well as the actual title. Note that the field names are case insensitive here; but the actual string patterns to match are case sensitive. Also, as with DEPARTMENT, the actual pattern does not need to be the same as the field name.

We've provided two TDE qualifiers here as well. We've arbitrarily decided that the AUTHOR field is required, so any documents we process that do not have a value for AUTHOR will be skipped during indexing. We've also decided that any document with no value in the DEPARTMENT field will get a value of "Corporate". This capability helps us handle poor content metadata, at least a bit.
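The required/default logic can be mimicked in a few lines of Python, just to make it concrete. This is a sketch, not Verity's implementation; the extract helper is hypothetical and stands in for the anchored-label TDE patterns above.

```python
import re

# Hypothetical helper mimicking a TDE anchored-label field pattern:
# "^Label: <.*>" -- the capture group plays the role of the <...> markers.
def extract(label, text):
    m = re.search(r"^" + label + r": (.*)$", text, re.MULTILINE)
    return m.group(1) if m else None

# A document like Figure 1, but with no Dept: line.
doc = """Title: Meeting next Tuesday
Author: Milo
Date: July 3, 1995"""

title = extract("Title", doc)
author = extract("Author", doc)                   # /required = yes
department = extract("Dept", doc) or "Corporate"  # /default="Corporate"

if author is None:
    print("document skipped: no AUTHOR")          # required field missing
else:
    print(title, "|", author, "|", department)
```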

Now we're ready for the next line, the one that extracts the phone numbers. It uses the macro we defined earlier to define the pattern:

field: Phone PATTERN "{PNum}"
/which=ALL
/string-between = " , "

What this says is that any string of text matching the pattern defined in the PNum macro should be extracted and stored in the PHONE field. What happens if more than one phone number is present? For that, we use the Verity TDE 'which' qualifier, which in this case says to take all of them, separated by a comma. There are several options for how to process the fields; refer to the Verity Collection Building Guide for more details.
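In Python terms, /which=ALL plus /string-between behaves like findall plus join (a sketch, using the PNum pattern translated from the TDE macros as before):

```python
import re

# PNum translated from the TDE macros; /which=ALL ~ findall,
# /string-between = " , " ~ join.
PNUM = r"[(][0-9]{3}[)][ \t]*[0-9]{3}-[0-9]{4}"
text = "Reach me at (408) 828-4592, or try (408) 828-5555 after 2:00PM."
phones = re.findall(PNUM, text)
print(" , ".join(phones))  # -> (408) 828-4592 , (408) 828-5555
```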

Finally, we look at the pattern for recognizing email addresses. Because it doesn't use any macros, the pattern itself is a bit longer:

field: Email PATTERN "[ \t]*[A-Za-z0-9]+[\@][A-Za-z0-9\.]+"

This looks for an optional run of whitespace, then a string of letters and numbers, followed by an @ sign, followed by a string of letters, numbers, and period characters.

[ \t]*         : A single space character or a tab (\t), repeated zero or more times (*)
[A-Za-z0-9]+   : One or more characters in the ranges A to Z, a to z, and 0 to 9
[\@]           : A single @ character, escaped with the backslash to avoid regex confusion
[A-Za-z0-9\.]+ : One or more letters, numbers, or periods, up until the next character not in this list (likely a space)

This will find strings like 'mk@testcase.com' and 'jw@nie.ab.ca'.
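The same pattern, minus the leading whitespace, drops straight into Python for a quick check (again an approximation; real-world email validation needs a far stricter pattern):

```python
import re

# The TDE email pattern translated to Python; @ and . need no escaping
# inside a character class in Python's re module.
EMAIL = r"[A-Za-z0-9]+@[A-Za-z0-9.]+"
samples = ["mail mk@testcase.com today", "cc jw@nie.ab.ca please"]
emails = [re.search(EMAIL, s).group(0) for s in samples]
print(emails)  # -> ['mk@testcase.com', 'jw@nie.ab.ca']
```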

Bringing It Together

If you set up the files as shown, or download the zip file, you are ready to create your collection. The zip file includes a batch file to build the collection from the included sample documents, and a batch file to run rcvdk with a stored command-line input. To do it all manually, simply enter:

mkvdk -collection phone -create -style style
mkvdk -collection phone -extract @flist

The magic here is the 'extract' option. It tells mkvdk to extract fields as directed in style.tde as it populates the collection from the files listed in flist. When you run this on the zip file documents, you will see that one of the documents is rejected because it has no AUTHOR. To test the collection, use the run.bat file included in the zip file, or run rcvdk manually:

x
fields author 15 email 18 department 10 title 32 doc_fn 20 phone 30
s
r

The output you should see looks something like those shown in Figure 5:

rcvdk  Verity, Inc. Version 5.5.0
Attaching to collection: phone
Successfully attached to 1 collection.
Type 'help' for a list of commands.
RC> x
Expert mode enabled
RC> fields author 15 email 18 department 10 title 32 doc_fn 20 phone 30
RC> s
Search update: finished (100%). Retrieved: 7(7)/7.
RC> r
Number author email department title doc_fn phone
1: No Author Given doc_seven.txt docs/doc_seven.pdf
2: Zack (MS WORD) zt@ideaeng.com Dev docs/doc_six.doc
3: Janet Smith Corporate Test document five docs/doc_five.txt
4: Zack Taylor zt@ideaeng.com Dev Test document four docs/doc_four.txt
5: Milo mbk@ideaeng.com Marketing Test document three docs/doc_three.txt
(408) 828-4592 , (408) 828-5555
6: Jon Wilson jw@ideaeng.com Sales Test document two docs/doc_two.txt
(408) 446-3460
7: Janet Smith js@ideaeng.com Sales Simple document docs/doc_one.txt
(800) 555-1212
RC> Done
Figure 5 - Search Results

Note that document 3, from Janet Smith, has no email and no department, so the default department value "Corporate" was assigned during indexing. Note also that multiple phone numbers were found in document 5, from Milo, so both are listed.

Note in the results shown in Figure 5 that for the Word and PDF documents, the TITLE field is empty and does not show the default value you might expect. This happens because the automatic field extraction for Office and PDF documents overrides our TDE-defined pattern. If it's important that your TDE-defined pattern override the default document behavior, use alternate fields in your style.ufl file rather than the standard ones in style.sfl.

Summary

Well, this is just a hint at the power available to you in the style.tde file. Refer to the collection building documentation for K2, and be aware that not all regex engines are exactly the same, so you may find Verity's patterns a bit different from what you are familiar with. In any case, feel free to email us if you have any questions or problems with style.tde. We're happy to help spread the news!