
Modern society has, for the most part, moved from paper to electronic documents. This brings a number of advantages: it is much easier to exchange, copy, edit and save documents today. Now it is time to take the next step - extracting the important data that documents contain so that analytical applications can make use of it.

Examples:

  • Financial reports of public companies are available online, but bank analysts can perform precedent-based analysis and forecasting only after the data has been extracted from the Balance Sheet, Income Statement and Cash Flow statement.
  • Real estate leases contain multiple data tables. Data extracted from these tables can be used by large real estate companies for regional price control.
  • Insurance companies receive piles of claims every day. Detailed data extraction is an unavoidable part of claim processing.
  • Every enterprise issues large numbers of documents of different types, which hide valuable data important for optimal production control.

Manual data extraction is slow, error-prone and labor-intensive. This is the main reason why companies quite often leave important data buried in their documents, even when using it could make a huge difference to the business.

We offer an inexpensive solution that takes the burden of data extraction off your shoulders. You only need to tell us which data points you want extracted from each type of document, and then send the documents to us by regular email or upload them using our Web Insert page. Within several hours we will return a structure containing all the extractions. Together with the extractions we save links to the original document, so that a client can always verify an extraction by checking the part of the document surrounding it.

Pricing

Our clients have a choice of 3 plans:
1. Fixed price per data point - best for newcomers who would like to try our service, or for clients who are not certain about the volume of extractions. Prices start from 10c per data point, even when it is extracted from a document several hundred pages long. You can compare this price with what you would pay a human extractor to find the data item and send it to your data repository if you organized the extraction process in-house.
2. Fixed price per document - for clients whose documents contain large amounts of uniformly located data (in tables, for instance). If our experts qualify your documents for this plan, you can save up to 30% compared to the per data point plan.
3. Monthly subscription - for clients who prefer a long-term relationship with us. Our sales department will evaluate your needs and offer you a good monthly rate.
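
To make the comparison between the first two plans concrete, here is a rough sketch in Python. It assumes the lowest quoted price (10c per data point) and the largest quoted saving (30%); the function names and the figure of 200 data points are purely illustrative, and an actual quote may differ.

```python
# Illustrative figures only: 10 cents is the lowest quoted per-point price
# and 30% is the maximum quoted saving, so a real quote may differ.
PRICE_PER_POINT_CENTS = 10
MAX_PER_DOCUMENT_SAVING_PERCENT = 30


def per_point_cost_cents(data_points: int) -> int:
    """Cost of one document on the per data point plan, in cents."""
    return data_points * PRICE_PER_POINT_CENTS


def best_case_per_document_cost_cents(data_points: int) -> float:
    """Lowest possible cost of the same document on the per document plan."""
    base = per_point_cost_cents(data_points)
    return base * (100 - MAX_PER_DOCUMENT_SAVING_PERCENT) / 100


# Hypothetical document with 200 data points sitting in tables.
points = 200
print("Per data point plan:", per_point_cost_cents(points) / 100, "USD")
print("Per document plan, best case:",
      best_case_per_document_cost_cents(points) / 100, "USD")
```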

Technology

The core of our technology is the Data Extraction Platform (DEP), which Evolutionary Software developed recently and which is used by several major data vendors. The system automates 60 to 90% of the extraction work in seconds and provides initial settings and tools for manual data cleanup by our data quality assurance team.

In short, DEP is a collection of unique text mining and document processing solutions, supported by a knowledge base in the form of an ontology of models. The models reflect semantic and formatting dependencies between elements of documents.

The second important element of our service is the Data QA team. We hire well-educated experts in every application area, who perform the final validation and cleanup of the results after automatic extraction.

Workflow

There is no complicated workflow for your staff to learn. A small amount of work is required at the initiation stage, though, when you tell us which data points you need extracted from the documents. You might need to take a couple of sample documents, highlight the data points that you would like extracted, scan them and send them to us.


After that you only need to upload the source documents to an FTP site of your choice. We will take each source document, process it and put the result on the same FTP site. Or, even easier, email a source document to us and we will send you back the result of the data extraction. The result can be in one of the standard formats (XML, Excel, HTML). See samples of the results below:

www.ev-soft.com/doc2struct/xml/xml.xml
www.ev-soft.com/doc2struct/output.pdf
www.ev-soft.com/doc2struct/index/752601/index.html
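
For clients who prefer to script the FTP upload, a minimal sketch using Python's standard ftplib module might look like the following. The host name, credentials and file name in the commented usage are placeholders, not values supplied by us, and the helper name upload_document is an assumption made for illustration.

```python
from ftplib import FTP
from pathlib import Path


def upload_document(ftp, local_path: str) -> str:
    """Upload one source document in binary mode and return its remote name."""
    path = Path(local_path)
    with path.open("rb") as handle:
        # STOR stores the file on the server under the same file name.
        ftp.storbinary(f"STOR {path.name}", handle)
    return path.name


# Usage with placeholder host, credentials and file name:
#
#     with FTP("ftp.example.com") as ftp:
#         ftp.login(user="client", passwd="secret")
#         upload_document(ftp, "annual_report.pdf")
```

Passing the connection in as a parameter keeps the upload step separate from the choice of FTP site, which the workflow above leaves to the client.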

Data Extraction Requirements

This is the only part of the process that requires the client's involvement. We cannot decide for the client what data should be extracted and how to name every piece of it. The initial requirements process goes through several iterations. First, the client manually highlights the data points of interest, as in the chart below:


After you have sent the chart to our representative, we will create an extraction template, which is a tree structure serving as a placeholder for the extracted data.
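
As a rough illustration of what a tree-shaped placeholder can look like, here is a small Python sketch. The class, the node names (FinancialReport, BalanceSheet and so on) and the helper functions are hypothetical and are not the format used by our platform; the sketch only shows named nodes that stay empty until an extraction fills them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TemplateNode:
    """One named slot in the template; leaf nodes hold extracted values."""
    name: str
    value: Optional[str] = None
    children: List["TemplateNode"] = field(default_factory=list)

    def add(self, name: str) -> "TemplateNode":
        """Create a child slot and return it so that trees can be nested."""
        child = TemplateNode(name)
        self.children.append(child)
        return child

    def empty_slots(self, prefix: str = "") -> List[str]:
        """List the paths of all leaf slots that have no value yet."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        if not self.children:
            return [path] if self.value is None else []
        slots: List[str] = []
        for child in self.children:
            slots.extend(child.empty_slots(path))
        return slots


def build_sample_template() -> TemplateNode:
    """Build a hypothetical template for a financial report."""
    root = TemplateNode("FinancialReport")
    balance = root.add("BalanceSheet")
    balance.add("TotalAssets")
    balance.add("TotalLiabilities")
    income = root.add("IncomeStatement")
    income.add("NetIncome")
    return root


template = build_sample_template()
for slot in template.empty_slots():
    print(slot)
```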


Then we make trial extractions and send them back to the client for corrections and approval. After several such iterations the requirements are approved, and we start creating term models for automatic extraction and production processing.

Output Formats

There are several output formats for representing the resulting extractions. The first, the most common and IT-oriented, is XML containing named data extractions, as in the structure below:



Such XML is useful for further automatic processing:

  • Loading extraction results into a database
  • Creating online reports
  • Transferring extraction results between different software systems
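
As an example of the first use, the sketch below loads named extractions from XML into a database using only the Python standard library. The element and attribute names (extractions, extraction, name, document), the sample values and the table layout are assumptions made for illustration; the actual output schema may differ.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real element and attribute names may differ.
SAMPLE_XML = """<extractions document="lease_001.pdf">
  <extraction name="MonthlyRent">2500</extraction>
  <extraction name="LeaseTermMonths">36</extraction>
</extractions>"""


def load_extractions(xml_text: str, connection: sqlite3.Connection) -> int:
    """Insert every named extraction into a table; return the row count."""
    root = ET.fromstring(xml_text)
    document = root.get("document")
    rows = [
        (document, item.get("name"), item.text)
        for item in root.iter("extraction")
    ]
    connection.execute(
        "CREATE TABLE IF NOT EXISTS extraction "
        "(document TEXT, name TEXT, value TEXT)"
    )
    connection.executemany("INSERT INTO extraction VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)


# An in-memory database keeps the example self-contained.
db = sqlite3.connect(":memory:")
print(load_extractions(SAMPLE_XML, db), "extractions loaded")
```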


    One of the important features of our data extraction technology is the ability to link each extraction result to its location in the original document. This allows us to show the user not just an extraction result but also the place in the document that contains the data point. Thus the user does not have to trust our extraction results blindly, but can check every data point just by clicking on it in one of the visual formats.
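
    The idea of linking a value to its source location can be sketched in Python as follows. This is a simplified illustration, not our implementation: it assumes the document is plain text, the location is a pair of character offsets, and the value is found by a literal search, and all names in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """An extracted value plus the character span it came from."""
    name: str
    value: str
    start: int  # offset of the first character of the value
    end: int    # offset one past the last character of the value


def locate(name: str, value: str, document_text: str) -> Extraction:
    """Record where the value first occurs in the document text."""
    start = document_text.find(value)
    if start == -1:
        raise ValueError(f"{value!r} does not occur in the document")
    return Extraction(name, value, start, start + len(value))


def show_in_context(extraction: Extraction, document_text: str,
                    margin: int = 20) -> str:
    """Return the surrounding text with the value marked by brackets."""
    left = max(0, extraction.start - margin)
    right = min(len(document_text), extraction.end + margin)
    return (
        document_text[left:extraction.start]
        + "[" + document_text[extraction.start:extraction.end] + "]"
        + document_text[extraction.end:right]
    )


text = "The monthly rent is 2500 USD, payable in advance."
rent = locate("MonthlyRent", "2500", text)
print(show_in_context(rent, text))
```

    Storing the span next to the value is what lets a viewer scroll to the data point and highlight it instead of asking the reader to search for it.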

    We offer 2 visual formats that can be returned to the client together with the XML structure.

    The first is PDF, similar to the one shown below. The left pane contains the resulting data tree, whereas the right pane shows the original document. The user can click on any data value in the left pane and the right pane will scroll to the extraction point in the document. We also highlight the data location in the right pane to make it easier to spot.



    The second visual format, which supports similar behavior, is HTML. We developed it for users who do not have Adobe Acrobat installed on their workstations. It looks like the following:



    Here again the user can click on a data value in the leftmost pane, and the middle and right panes will scroll to the corresponding data locations and highlight the data in the document text.

    We are not limited to the listed output formats and can add others at a client's request.

    How to take the first step

    Just tell our sales representative about your needs: sales@ev-soft.com


  • webmaster@ev-soft.com
    Copyright © 2000
    Evolutionary Software, Inc.