new idea ENGINEERING         Home  | Products  | Services  | Newsletter  | Resources  | About Us | Contact Info | Privacy Policy        

  Specializing in Enterprise Search since 1996 - including FAST, Autonomy IDOL, K2, Ultraseek, OmniFind and Lucene

XPump™ Data Sheet - Version 1.5 - 12/15/2005

Summary of our XPump Technology

XPump is NIE's proprietary Content Mining and Entity Extraction toolkit for compiling structured business data from unstructured documents, web content and heterogeneous data sources. XPump can parse specific facts from HTML, PDF, Office documents and relational databases, and assemble that raw data into actionable business objects. NIE has used XPump in a number of consulting projects to solve complex entity extraction problems from large datasets in a short time, requirements that were not met by traditional ETL tools.

The technology is Java-based and fully threaded; it can be deployed in a multi-CPU environment if performance is a concern. It allows for iterated discovery of entities through the refinement and evolution of complex entity definitions, validation constraints and process rule-sets. This data can then be sent to traditional databases, search engines, document management systems and other enterprise repositories.
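
To make the ideas of an entity definition, a validation constraint and threaded processing concrete, here is a minimal Java sketch. It is not XPump code and does not reflect XPump's API: the class EntitySketch, its EntityDef and Result types, the "ZIP:" pattern and the five-digit rule are all hypothetical names and rules chosen for illustration only. The sketch pairs a pattern with a constraint, runs documents through a thread pool, and keeps the candidates that fail validation so they can be reviewed later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntitySketch {

    // Hypothetical entity definition: a name, a regex with one capture
    // group, and a validation constraint applied to each captured value.
    static final class EntityDef {
        final String name;
        final Pattern pattern;
        final Predicate<String> constraint;

        EntityDef(String name, String regex, Predicate<String> constraint) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
            this.constraint = constraint;
        }
    }

    // Outcome for one document: values that passed validation, plus
    // candidates that matched the pattern but failed the constraint.
    static final class Result {
        final List<String> accepted = new ArrayList<>();
        final List<String> rejected = new ArrayList<>();
    }

    // Applies one entity definition to one document's text.
    static Result extract(EntityDef def, String text) {
        Result result = new Result();
        Matcher m = def.pattern.matcher(text);
        while (m.find()) {
            String value = m.group(1);
            if (def.constraint.test(value)) {
                result.accepted.add(value);
            } else {
                result.rejected.add(value);
            }
        }
        return result;
    }

    // Processes documents on a fixed-size thread pool; results are
    // returned in the same order as the input documents.
    static List<Result> extractAll(EntityDef def, List<String> docs, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Result>> futures = new ArrayList<>();
            for (String doc : docs) {
                Callable<Result> task = () -> extract(def, doc);
                futures.add(pool.submit(task));
            }
            List<Result> results = new ArrayList<>();
            for (Future<Result> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Illustrative rule only: a value after "ZIP:" must be exactly five digits.
        EntityDef zip = new EntityDef("us_zip", "\\bZIP:\\s*(\\d+)", v -> v.length() == 5);
        List<String> docs = Arrays.asList(
                "Office at ZIP: 95014", "Bad record ZIP: 9501", "nothing here");
        List<Result> results = extractAll(zip, docs, 2);
        for (int i = 0; i < results.size(); i++) {
            Result r = results.get(i);
            System.out.println(i + ": accepted=" + r.accepted + " rejected=" + r.rejected);
        }
    }
}
```

Refining the definition in this sketch means changing the regex or the constraint and re-running the batch; the rejected lists show which candidates were seen but not trusted.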

Questions about Your Application

This is the type of information our Professional Services Group needs to design and implement an XPump solution:

  1. What is the problem you are trying to solve? And in what timeframe?

  2. What are the elements or entities that you need to extract? Do you have an existing list of entities or schema? Is this data hierarchical in nature?

  3. How valuable are partial results as long as we "know what we don't know"? In other words, if we can reliably extract certain flavors of certain elements and tell you where we failed, are you able to have someone review the last 5%, 10% or 20% that we cannot process?

  4. How are you solving the problem now? What can we do to add value to your current solution in a way that's evolutionary?

  5. How many documents? How many are new or change each day, week, or month? Do we need to detect what's changed/updated or can you alert us?

  6. What formats are the documents in? Do they contain any scanned or faxed images, or similar data that will need to be OCR'd before text entities can be identified?

  7. What format do you need the output to take? Is this data going into a database, a search engine, another application, and/or some other repository?

  8. Are there security or login issues (e.g. SSL, cookies, SSO) that we need to be aware of? Please supply appropriate details.

  9. Are you trying to index a dynamic or data-driven site? Is there form data that needs to be supplied to gain access to sub-sites or records, AKA "deep content"?

  10. Can you identify words, phrases or entities that should not appear in a valid document? Do you have an existing thesaurus of important terms and abbreviations?

  11. How do you plan to verify the output? What assistance or audit information can we supply to help you? Do you have existing data validation rules?

  12. Platform information: what OS (e.g. Windows, Linux, Solaris) is installed and what version of Java is available? What specific database(s), search engine(s) and/or data repositories do you use?

Need more information?

We'd love to understand more about your application and specific business requirements.

Email: sales@ideaeng.com
Toll Free: (866) IDEA - ENG   ( 866 - 433 - 2364 )
Outside the US: +1 (408) 446 - 3460

 


Copyright New Idea Engineering, Inc 1996 - 2008