J4L OCR tools for the Java[TM] Platform

J4L OCR Tools

Introduction
- Requirements
J4L Java wrapper for Tesseract OCR engine 3.0
- Introduction
- How to install the software
- Running the test application
- How to use the classes in your Java programs
PDF to Text converter
- Introduction
- How to use the classes in your Java programs
J4L Document Parser
- Introduction
- How to create a document definition
- How to run the parser
The OCR servlet
The PDF converter servlet
Third party software

Introduction

The J4L OCR tools are a set of components that Java applications can use to recognize text within an image and parse that text. They can be used to process scanned or faxed business documents, such as purchase orders, and extract the data from them. Another use case is archiving images and indexing them with data extracted from their content.

The tools consist of 2 main components:

The text generator, which can be either:
- The Java wrapper for the Tesseract OCR engine, which converts images to text.
- The PDF to Text converter, which converts PDF files to text.
The document parser, which extracts the data from the text provided by the OCR engine or the PDF to Text converter.

Requirements

The Java components require Java 1.5 or later. The Tesseract OCR engine wrapper additionally requires Windows.

J4L Java wrapper for Tesseract OCR engine 3.0

Introduction

Tesseract OCR is a free OCR engine sponsored by Google. The J4L Java wrapper classes are a bridge that allows you to use the engine from your Java application. The current implementation runs on Windows only. A Linux version is possible; let us know if you have such a requirement.

The benefits of the J4L wrapper are:

It abstracts your Java application from the C/C++ details of the OCR engine.
It is very easy to use.

How to install the software

The zip file we distribute can be used directly after unzipping, without additional setup. However, if you use our classes in your own application, you need to take the following into account:

tess3WrapperDLL.dll and leptonlibd.dll must be located in the current working directory or in the system's path.
The subdirectory tessdata must be located in the working directory.
The files lib/jai_codec.jar and lib/jai_core.jar must be in your classpath if you are going to read TIFF files.
If you are going to process files in a language other than English, you must download the corresponding language file from:

http://code.google.com/p/tesseract-ocr/downloads/list

The files are named XXX.traineddata.gz, where XXX is the language code. There is support for eng (English), spa (Spanish), fra (French), deu (German), nld (Dutch), ita (Italian) and many other languages. These files must be unzipped into the tessdata subdirectory of the working directory.

Running the test application

Our distribution includes a runOCRParserTest.bat file which will take the file order.png as input file and will output the content of the file as text after running the OCR engine. The code being executed is the file OCRTest.java.

How to use the classes in your Java programs

The usage of the classes is very simple:

Import the facade class:

import com.java4less.ocr.tess.OCRFacade;
Create the facade object:

OCRFacade facade=new OCRFacade();
Run the OCR by providing an input file and language:

String text=facade.recognizeFile("Report.PNG", "eng");

This returns a text string. If the image is a multipage file, the pages are separated by the FF (form feed) character (ASCII character 12) in the text string.
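
Since multipage results arrive as a single string, you will usually want to split it into pages before further processing. The following is a minimal sketch that uses only the standard Java library; the class name OcrPages and the sample text are our own illustration and are not part of the J4L classes.

```java
import java.util.Arrays;
import java.util.List;

final class OcrPages {
    /** Form feed (ASCII 12), the page separator described above. */
    static final char FORM_FEED = '\f';

    /** Splits recognized text into one string per page. */
    static List<String> splitPages(String text) {
        // The limit -1 keeps trailing empty pages instead of silently dropping them.
        return Arrays.asList(text.split(String.valueOf(FORM_FEED), -1));
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by recognizeFile(...)
        String text = "Page one text" + FORM_FEED + "Page two text";
        List<String> pages = splitPages(text);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

In a real program you would pass the string returned by the facade to splitPages() instead of the hard-coded sample.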

PDF to Text Converter

Introduction

Our component uses Apache PDFBox to parse the PDF files and extract the text content. The extracted text is laid out in a similar way to the text in the PDF file, which means the line and column positions of the values will be similar.

Note, however, that the text conversion works only if the PDF file contains text elements. For example, some fax machines and scanners create PDF files that contain only images of the scanned pages and no text elements. The converter does not work on this kind of file.

How to use the classes in your Java programs

The usage of the classes is very simple:

Import the class:

import com.java4less.pdf.PDFToTextConverter;
Create the converter object:

PDFToTextConverter conv=new PDFToTextConverter();
Run the converter by providing an input file:

String text=conv.convertToString(new FileInputStream("order.pdf"));

This returns a text string. If the PDF is a multipage file, the pages are separated by the FF (form feed) character (ASCII character 12) in the text string.

J4L Document parser

The document parser helps you extract the information from the text returned by the OCR engine. The benefits of using our document parser are:

It provides a clear declarative interface to extract data (XML based).
It saves you from writing a lot of plumbing code.
It also understands labels that are not correctly read by the OCR engine. For example, if you are looking for a label called "Total" but the engine reads "Totai", the document parser will still find the label.

Introduction

The document parser divides the data into sections. Normally a document has a header section, a detail section that can be repeated, and a footer section. The example document has 4 sections:

The header, which starts at the top of the document.
The detail header.
The detail, which can be repeated and has a height of 1 line each.
The footer.

But how does the parser know where the sections are located? It uses 2 rules:

Some sections can be identified because they contain a certain label (we use the terms label and text mark interchangeably). For example, the detail header can be found by looking for the text "Number Article Description". What happens if the OCR engine reads "Nunber Articlo Descripton" instead? The parser will still find it.
Other sections can be identified because they are located after a fixed-length section. For example, we know the detail header has a length of 1 line, and the detail section starts after it.
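
This documentation does not describe how the parser tolerates misread labels. To give an idea of how such tolerant matching can work in general, the sketch below uses the well-known Levenshtein edit distance. It is an illustration only, not the parser's actual algorithm; the class FuzzyLabel and the maxErrors threshold are hypothetical.

```java
final class FuzzyLabel {
    /** Classic Levenshtein edit distance (insertions, deletions, substitutions). */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** True if the text read by the OCR engine is within maxErrors edits of the label. */
    static boolean matches(String label, String ocrText, int maxErrors) {
        return distance(label.toLowerCase(), ocrText.toLowerCase()) <= maxErrors;
    }

    public static void main(String[] args) {
        System.out.println(matches("Total", "Totai", 1));
        System.out.println(matches("Number Article Description", "Nunber Articlo Descripton", 3));
        System.out.println(matches("Total", "Tax:", 1));
    }
}
```

A small threshold keeps short labels such as "Total" from matching unrelated words, while longer labels can afford a few more misread characters.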

The other types of object in the document are labels (or text marks), which are constant values (all documents of the same type have the same label at the same position), and fields, which are variable values, normally located next to labels.

To summarize, labels are used to find fields and sections.

How to create a document definition

The document definition is an XML file that describes the sections of a document and its labels and fields. This XML file is used at runtime to parse the text returned by the OCR engine. The root node of the XML file is called <documentref> and its children are <section> elements. The following XML shows the main structure of the example document used in the introduction:

<documentref>

<section name="header">
</section>

<section name="detailsheader" len="1">
</section>

<section name="detail" repeteable="true" len="1" >
</section>

<section name="footer" mandatory="false">
</section>

</documentref>

This defines the 4 sections of the document. As a general rule, sections are mandatory: they must exist in the document. Non-mandatory sections are allowed only at the end of the document.

Now we have to define how to find the sections:

The header section is the first in the document, so it starts at the top of the document and requires no further definition.
The detail header starts where we find the label "Number Article".
The detail section starts after the detail header, which has a fixed length of 1 line, so no further definition is required.
The footer section starts where we find the text "Tax:".

The tag <startlabel> will be used to define the label which identifies the section:

<documentref>

<section name="header">
</section>

<section name="detailsheader" len="1">
     <startlabel name="detailsheaderlbl">
                <value>Number Article</value>
     </startlabel>
</section>

<section name="detail" repeteable="true" len="1" >
</section>

<section name="footer" mandatory="false">
     <startlabel name="detailsheaderlbl">
                <value>Tax:</value>
     </startlabel>
</section>

</documentref>
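
Because the definition is plain XML, a small mistake such as an unclosed <value> tag is easy to make. Before handing a definition to the parser, you may want to check that it is well-formed and lists the sections you expect. The sketch below uses only the standard Java XML parser (DOM); the class DefinitionCheck is our own helper for illustration and is not part of the J4L classes.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DefinitionCheck {
    /**
     * Returns the section names in document order.
     * Throws IllegalArgumentException if the XML is not well-formed.
     */
    static List<String> sectionNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            NodeList sections = doc.getDocumentElement().getElementsByTagName("section");
            List<String> names = new ArrayList<String>();
            for (int i = 0; i < sections.getLength(); i++) {
                names.add(((Element) sections.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Definition is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened version of the definition above (single quotes are valid in XML)
        String definition =
              "<documentref>"
            + "<section name='header'></section>"
            + "<section name='detailsheader' len='1'>"
            + "<startlabel name='detailsheaderlbl'><value>Number Article</value></startlabel>"
            + "</section>"
            + "</documentref>";
        System.out.println(sectionNames(definition));
    }
}
```

This checks only the XML syntax. Whether the attributes and tags are accepted is still decided by the J4L parser itself.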

Now the structure of the document has been defined. The next step is to define the fields we want to extract and the labels we will use to locate them. In this example we will read 3 fields:

The purchase order number in the header. It is located to the right of the label "Number:".
The article number and quantity from the detail section. The article is the second word in each line, and the quantity is the third word counting from the end of the line.

In each section you use the <label> tag to define labels and the <field> tag to define fields. Each field has 2 positions, x and y (line). Each position has a reference to a label or another field and the direction and distance to follow from that reference to find the field.

<documentref>

<section name="header">
    <label name="numberlbl"> <!-- define the label Number: -->
            <value>Number:</value>
    </label>
    <field name="numberValue" mandatory="true" type="S" format="[0-9]{10}">
        <x>
            <reference>numberlbl</reference> <!-- this field is located next to the numberlbl label -->
            <direction>RIGHT</direction>
            <distance>1</distance>                    <!-- the field is 1 word to the right -->
        </x>
        <y>
            <reference>numberlbl</reference>      <!-- the number label is used to find the line of the field -->
            <direction>UP</direction>
            <distance>0</distance>                      <!-- the field is in the same line as the reference -->
        </y>
    </field>
</section>

<section name="detailsheader" len="1">
     <startlabel name="detailsheaderlbl">
                <value>Number Article</value>
     </startlabel>
</section>

<section name="detail" repeteable="true" len="1" >

    <field name="articleValue">
        <x>
            <reference>BeginOfLine</reference> <!-- this field is the second word from the beginning of the line -->
            <direction>RIGHT</direction>
            <distance>2</distance>
        </x>
        <y>
            <reference>BeginOfSection</reference>      <!-- this field is located in the first line of the section -->
            <direction>DOWN</direction>
            <distance>0</distance>
        </y>
    </field>

    <field name="quantityValue">
        <x>
            <reference>EndOfLine</reference> <!-- this field is the third word from the end of the line -->
            <direction>LEFT</direction>
            <distance>3</distance>
        </x>
        <y>
            <reference>BeginOfSection</reference>      <!-- this field is located in the first line of the section -->
            <direction>DOWN</direction>
            <distance>0</distance>
        </y>
    </field>
</section>

<section name="footer">
     <startlabel name="detailsheaderlbl">
                <value>Tax:</value>
     </startlabel>
</section>

</documentref>
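
To make the meaning of reference, direction and distance more concrete, the sketch below shows how word positions of this kind can be resolved on a single line of text. It is only our reading of the definition above, written with plain Java string handling. The class WordLocator, its method names and the sample detail line are hypothetical, and the real parser may count words differently.

```java
final class WordLocator {
    /** Splits a line into whitespace-separated words. */
    static String[] words(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Reference BeginOfLine, direction RIGHT: distance 1 is the first word. */
    static String fromBeginOfLine(String line, int distance) {
        String[] w = words(line);
        return distance >= 1 && distance <= w.length ? w[distance - 1] : null;
    }

    /** Reference EndOfLine, direction LEFT: distance 1 is the last word. */
    static String fromEndOfLine(String line, int distance) {
        String[] w = words(line);
        return distance >= 1 && distance <= w.length ? w[w.length - distance] : null;
    }

    /** Reference is a single-word label, direction RIGHT: distance 1 is the next word. */
    static String rightOfLabel(String line, String label, int distance) {
        String[] w = words(line);
        for (int i = 0; i < w.length; i++) {
            if (w[i].equals(label)) {
                int target = i + distance;
                return target < w.length ? w[target] : null;
            }
        }
        return null; // label not found on this line
    }

    public static void main(String[] args) {
        String headerLine = "Number: 4500005693";
        String detailLine = "10 R-5000 Widget 111 PC 9.50"; // hypothetical detail line layout
        System.out.println(rightOfLabel(headerLine, "Number:", 1));
        System.out.println(fromBeginOfLine(detailLine, 2));
        System.out.println(fromEndOfLine(detailLine, 3));
    }
}
```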

It is also possible to read a field at a fixed position. In the example below, the delivery date is located in line 3 at column 60:

<field name="DeliveryDate" type="D" format="dd/MM/yyyy" mandatory="false">
    <x>
        <reference>BEGINOFLINE</reference>
        <direction>RIGHT</direction>
        <distance>60</distance>
        <useColumnPosition>true</useColumnPosition>
    </x>
    <y>
        <reference>BEGINOFSECTION</reference>
        <direction>DOWN</direction>
        <distance>3</distance>
    </y>
</field>

If you want to use the absolute position of a field instead of associated labels, you must first convert the PDF or image to text to find out what the position of the field will be. This approach should be used only with PDF files, with the setPreserveSpaces() property of the class PDFToTextConverter set to true.
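
One simple way to find out the position of a field is to search the converted text for a value you already know and print its line and column. The sketch below does this with plain Java. The class FixedPosition is hypothetical, the sample text is made up, and both indexes are counted from 0 here, which is an assumption; check it against the values the parser expects.

```java
final class FixedPosition {
    /** Returns {line, column} (both 0-based) of the first occurrence of value, or null. */
    static int[] locate(String text, String value) {
        String[] lines = text.split("\\r?\\n", -1);
        for (int i = 0; i < lines.length; i++) {
            int col = lines[i].indexOf(value);
            if (col >= 0) {
                return new int[] { i, col };
            }
        }
        return null;
    }

    /** Returns the word that starts at the given line and column, or null if there is none. */
    static String wordAt(String text, int line, int column) {
        String[] lines = text.split("\\r?\\n", -1);
        if (line < 0 || line >= lines.length) return null;
        String s = lines[line];
        if (column < 0 || column >= s.length() || Character.isWhitespace(s.charAt(column))) return null;
        int end = column;
        while (end < s.length() && !Character.isWhitespace(s.charAt(end))) {
            end++;
        }
        return s.substring(column, end);
    }

    public static void main(String[] args) {
        StringBuilder pad = new StringBuilder();
        for (int i = 0; i < 60; i++) pad.append(' ');
        // Stand-in for text converted with preserved spaces
        String text = "PURCHASE ORDER\n\n\n" + pad + "07/02/2001\n";
        int[] pos = locate(text, "07/02/2001");
        System.out.println("line " + pos[0] + ", column " + pos[1]);
        System.out.println(wordAt(text, pos[0], pos[1]));
    }
}
```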

Error handling

The parser can detect the following kinds of errors:

Missing mandatory sections. By default, all sections are mandatory.
Missing mandatory fields. By default, all fields are mandatory.
Field format errors. Each field can have a type, which is string (default), numeric or date.
- For string fields you can use a regular expression to define the expected format, for example, the regular expression [0-9]{10} means a string made of 10 digits. The regular expressions are those supported by java.util.regex.Pattern. For example:
  
  <field name="Number" type="S" format="[0-9]{10}">
- For date fields, the format can be any supported by the class java.text.SimpleDateFormat.
  
  <field name="DeliveryDate" type="D" format="dd/MM/yyyy" mandatory="false">
- For numeric fields, the format can be any supported by the class java.text.DecimalFormat.
  
  <field name="Quantity" type="N" format="####0">

The format attribute is required only if you want the parser to check the format of the value. Otherwise, do not add the format attribute to your field.
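
If you are unsure whether a format string accepts the values in your documents, you can try it out directly against the three standard Java classes named above. The sketch below is our own test helper (the class FormatChecks is hypothetical). It requires the whole value to match and uses strict date parsing, which may or may not be how the parser applies the formats.

```java
import java.text.DecimalFormat;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.regex.Pattern;

final class FormatChecks {
    /** String fields: the whole value must match the regular expression. */
    static boolean matchesString(String format, String value) {
        return Pattern.matches(format, value);
    }

    /** Date fields: strict (non-lenient) parsing, the whole value must be consumed. */
    static boolean matchesDate(String format, String value) {
        SimpleDateFormat f = new SimpleDateFormat(format);
        f.setLenient(false); // reject impossible dates such as 31/02/2001
        ParsePosition pos = new ParsePosition(0);
        return f.parse(value, pos) != null && pos.getIndex() == value.length();
    }

    /** Numeric fields: the whole value must be consumed by the DecimalFormat pattern. */
    static boolean matchesNumber(String format, String value) {
        DecimalFormat f = new DecimalFormat(format);
        ParsePosition pos = new ParsePosition(0);
        return f.parse(value, pos) != null && pos.getIndex() == value.length();
    }

    public static void main(String[] args) {
        System.out.println(matchesString("[0-9]{10}", "4500005693"));
        System.out.println(matchesString("[0-9]{10}", "45000O5693")); // letter O misread for zero
        System.out.println(matchesDate("dd/MM/yyyy", "07/02/2001"));
        System.out.println(matchesNumber("####0", "111"));
    }
}
```

The second call shows a typical OCR error: a letter O read in place of a zero makes the value fail the [0-9]{10} check.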

How to run the parser

In the previous sections you have learnt how to create a document definition in XML format. In the wrapper section we showed how to use the OCR engine to obtain a String out of an image file. The next step is to use our Java classes to parse the obtained String using the document definition XML file.

The steps to do this are:

Create a DocumentDef object and load your XML file as follows:
DocumentDef docDefinition=new DocumentDef();
docDefinition.loadFromXml("purchaseorderDefinition.xml");
Create a Parser object and parse the data String:
Parser parser=new Parser(docDefinition);
DocumentSet docSet=parser.parse(data); // the variable data is the string returned by the OCR engine
Once the data has been parsed, you can obtain the sections and fields:

Document doc=docSet.getDocument(0);
Section header=doc.getSectionByName("header")[0];
String number=header.getField("numberValue");
You can check whether there are errors in the document by calling the method doc.hasError(). If it returns true, use the method doc.getErrors() to get the list of errors, which can be any of these: SectionMissingException, FieldMissingException or FieldFormatException.

Exporting data to XML

The data read by the Parser can be exported to XML by calling documentSet.toXml(). The output will look like this (the comments are annotations and are not part of the output):

<Set> <!-- root element -->
    <PurchaseOrder> <!-- name of the document -->
        <order_header> <!-- section -->
            <Number>4500005693</Number> <!-- field -->
            <DeliveryDate>07/02/2001</DeliveryDate> <!-- field -->
        </order_header>
        <col_header/>
        <items_detail> <!-- section -->
            <Article>R-5000</Article>
            <Quantity>111.0</Quantity>
        </items_detail>
        <items_detail> <!-- section -->
            <Article>R-3456</Article>
            <Quantity>1.0</Quantity>
        </items_detail>
        <order_footer/> <!-- section -->

        <Error field="Number" section="order header" sectionRepetition="1">FieldFormatError</Error>
        <Error field="DeliveryDate" section="order header" sectionRepetition="1">FieldFormatError</Error>
    </PurchaseOrder>
</Set>

The root element is <Set>, followed by the document name element, whose children are the sections. Within each section, the elements are the fields of the section. After the sections there can be one or more <Error> elements, which report missing sections, missing fields or format errors.
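
If another program consumes the exported XML, it can read the fields and errors with any XML parser. The sketch below uses the standard Java DOM API on a shortened copy of the output shown above. The class ExportReader is our own illustration; the element and attribute names are taken from the example output and will differ for your own document definition.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class ExportReader {
    /** Parses the exported XML and returns its root element. */
    static Element parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Export is not well-formed XML", e);
        }
    }

    /** Text of every element with the given name, in document order. */
    static List<String> values(Element root, String tag) {
        NodeList nodes = root.getElementsByTagName(tag);
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent().trim());
        }
        return out;
    }

    /** One entry per <Error> element, formatted as "section/field: message". */
    static List<String> errors(Element root) {
        NodeList nodes = root.getElementsByTagName("Error");
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            out.add(e.getAttribute("section") + "/" + e.getAttribute("field")
                    + ": " + e.getTextContent().trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<Set><PurchaseOrder>"
            + "<order_header><Number>4500005693</Number></order_header>"
            + "<items_detail><Article>R-5000</Article><Quantity>111.0</Quantity></items_detail>"
            + "<items_detail><Article>R-3456</Article><Quantity>1.0</Quantity></items_detail>"
            + "<Error field='Number' section='order header' sectionRepetition='1'>FieldFormatError</Error>"
            + "</PurchaseOrder></Set>";
        Element root = parse(xml);
        System.out.println(values(root, "Article"));
        System.out.println(errors(root));
    }
}
```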

The OCR Servlet

The product includes a servlet which takes as input an image file, runs the OCR engine and parses the text data. The result returned by the servlet will be the XML data of the document.

The servlet has to be installed on Tomcat for Windows like this:

Copy the file J4LOCRServer.war to the tomcatdirectory\webapps directory.
Copy tess3Wrapper.dll and leptonlibd.dll to the tomcatdirectory\bin directory.
Copy the tessdata subdirectory to the tomcatdirectory\bin directory.
Start Tomcat.

The servlet can be tested by opening this URL:

http://localhost:8080/J4LOCRServer/Example.html

This opens a form that lets you upload the file order.png to the servlet.

The servlet URL is /J4LOCRServer/OCRServer and it requires the following parameters as part of the URL:

DEFINITION parameter: sets the document definition file used to parse the document; in our example it is ordedef.xml. This file must be located in the directory webapps\J4LOCRServer\WEB-INF\classes.
DATAFIELD parameter: set DATAFIELD=YES if the image is going to be uploaded using an HTML form. If this parameter is missing, the servlet must be called using the POST method, sending the image data.
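
As an illustration of the second calling style (POST without DATAFIELD), the sketch below builds the servlet URL and sends the raw image bytes with the standard HttpURLConnection class. The class OcrServletClient is hypothetical. The Content-Type header and the UTF-8 decoding of the response are assumptions on our part, since neither is specified above, so treat this as a starting point rather than a reference client.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

final class OcrServletClient {
    /** Builds the servlet URL with the DEFINITION parameter, e.g. for "http://localhost:8080". */
    static String servletUrl(String baseUrl, String definition) {
        try {
            return baseUrl + "/J4LOCRServer/OCRServer?DEFINITION="
                    + URLEncoder.encode(definition, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Sends the image bytes as the POST body and returns the response as a string. */
    static String post(String url, byte[] imageData) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "application/octet-stream"); // assumption
        OutputStream out = con.getOutputStream();
        try {
            out.write(imageData);
        } finally {
            out.close();
        }
        InputStream in = con.getInputStream();
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toString("UTF-8"); // assumption: the XML is returned as UTF-8
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) {
        // Only prints the URL; call post(url, bytes) once the servlet is running.
        System.out.println(servletUrl("http://localhost:8080", "ordedef.xml"));
    }
}
```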

The PDF Converter Servlet

The product includes a servlet which takes a PDF file as input, runs the PDF converter and parses the text data. The result returned by the servlet can be the text or the XML data of the document.

The servlet has to be installed on Tomcat for Windows like this:

Copy the file J4LOCRServer.war to the tomcatdirectory\webapps directory.
Start Tomcat.

The servlet can be tested by opening this URL:

http://localhost:8080/J4LOCRServer/ExamplePDF.html

This opens a form that lets you upload the file order.pdf to the servlet.

The servlet URL is /J4LOCRServer/PDFConvServer and it requires the following parameters as part of the URL:

DEFINITION parameter: sets the document definition file used to parse the document; in our example it is ordedef.xml. This file must be located in the directory webapps\J4LOCRServer\WEB-INF\classes. If this parameter is missing, the plain text will be returned (instead of the parsed content as XML).
DATAFIELD parameter: set DATAFIELD=YES if the PDF file is going to be uploaded using an HTML form. If this parameter is missing, the servlet must be called using the POST method, sending the PDF data.

Third party software

Our component uses the Tesseract OCR engine, Apache Xalan, Apache Xerces and Apache Commons, which are distributed under the Apache 2.0 license, and the Sun Java Advanced Imaging API, which is used to read TIFF files.

The PDF to Text converter uses the Apache PDFBox, Apache JempBox and Apache FontBox libraries.


Copyright © 2000-2018 Java4Less.com. Oracle, APEX, Java, JSP, JDBC, JDK and all Java-based marks are trademarks or registered trademarks of Oracle and/or its affiliates. J4L Components is independent of Oracle.