XPump is NIE's proprietary Content Mining and Entity Extraction toolkit
for compiling structured business data from unstructured documents,
web content and heterogeneous data sources.
XPump can parse specific facts from HTML, PDF, Office documents and
relational databases, and assemble that raw data into actionable
business objects.
NIE has used XPump in a number of consulting projects to solve
complex entity extraction problems on large datasets in a short time,
requirements that traditional ETL tools could not meet.
The technology is Java-based and fully threaded; it can be deployed
in a multi-CPU environment if performance is a concern.
It allows for iterative discovery of entities through the refinement and
evolution of complex entity definitions, validation constraints and
process rule-sets.
The extracted data can then be sent to traditional databases, search engines,
document management systems and other enterprise repositories.
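To make the idea of entity definitions and validation constraints concrete, here is a minimal Java sketch. It is not XPump's API, which is proprietary and not described here. Every name in it (EntityExtractionSketch, EntityDefinition, ExtractionResult, extract) is hypothetical, and it assumes an entity can be described by a regular expression plus a single validation rule. Candidates that fail validation are kept rather than discarded, so they can be reviewed and the definition refined on the next iteration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: these types are hypothetical and are not part of XPump.
public class EntityExtractionSketch {

    /** An entity definition: a name, a pattern that finds candidates, and a validation constraint. */
    record EntityDefinition(String name, Pattern pattern, Predicate<String> constraint) {}

    /** Outcome of one pass: validated values, plus rejected candidates kept for review. */
    record ExtractionResult(List<String> accepted, List<String> rejected) {}

    /** Scans the text for candidates and sorts them by whether they satisfy the constraint. */
    static ExtractionResult extract(String text, EntityDefinition def) {
        List<String> accepted = new ArrayList<>();
        List<String> rejected = new ArrayList<>();
        Matcher m = def.pattern().matcher(text);
        while (m.find()) {
            String candidate = m.group();
            if (def.constraint().test(candidate)) {
                accepted.add(candidate);
            } else {
                rejected.add(candidate);
            }
        }
        return new ExtractionResult(accepted, rejected);
    }

    public static void main(String[] args) {
        // Assumed example entity: five-digit postal codes, with "00000" treated as invalid.
        EntityDefinition zip = new EntityDefinition(
                "us_zip",
                Pattern.compile("\\b\\d{5}\\b"),
                s -> !s.equals("00000"));

        ExtractionResult result = extract("Ship to 94043 or 00000, not 1234.", zip);
        System.out.println("accepted=" + result.accepted());
        System.out.println("rejected=" + result.rejected());
    }
}
```

In a sketch like this, tightening the pattern or the constraint and re-running the pass is what one round of refinement would look like; the accepted list is what would be handed on to a database or search engine.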
Questions about Your Application
This is the type of information our Professional Services Group
needs to design and implement an XPump solution:
What is the problem you are trying to solve? And in what timeframe?
What are the elements or entities that you need to extract? Do you have an existing list of entities or schema? Is this data hierarchical in nature?
How valuable are partial results, as long as we "know what we don't know"? In other words, if we can reliably extract certain flavors of certain elements and tell you where we failed, are you able to have someone review the last 5%, 10% or 20% that we cannot process?
How are you solving the problem now? What can we do to add value to your current solution in a way that's evolutionary?
How many documents? How many are new or change each day, week, or month? Do we need to detect what's changed/updated or can you alert us?
What formats are the documents in? Do they contain any scanned or faxed images, or similar data that will need to be OCR'd before text entities can be identified?
What format do you need the output to take? Is this data going into a database, a search engine, another application, and/or some other repository?
Are there security or login issues (e.g. SSL, cookies, SSO) that we need to be aware of? Please supply appropriate details.
Are you trying to index a dynamic or data-driven site?
Is there form data that needs to be supplied to gain access to sub-sites or records, also known as "deep content"?
Can you identify words, phrases or entities that should not appear in a valid document? Do you have an existing thesaurus of important terms and abbreviations?
How do you plan to verify the output? What assistance or audit information can we supply to help you? Do you have existing data validation rules?
Platform information: what OS (e.g. Windows, Linux, Solaris) is installed and what version of Java is available? What specific database(s), search engine(s) and/or data repositories do you use?
Need more information?
We'd love to understand more about your application
and specific business requirements.