Third-party OCR engines in the CRAFT system

17/01/2017

Cardinal Kft's investigation into the viability of using third-party OCR engines in the CRAFT system has been completed. As a result, several new ways of exploiting the advantages of the CRAFT system have opened up. These include searching the content of scanned documents, copying sections of text from scanned documents, creating searchable PDF documents from scans, and automatically retrieving information from less well-structured documents.

Introduction

The investigation we started in the middle of last year into the use of third-party OCR engines has been completed. We evaluated products from two suppliers: Nuance's OmniPage Capture, and ABBYY's Recognition Server and FineReader. Besides assessing the accuracy and efficiency of these OCR engines, our primary objective was to establish how third-party OCR engines could be used in the CRAFT system and to identify the new services we could provide if we chose to use them.

Full-text search

The system will provide full-page OCR processing of scanned documents, allowing free-text search across the full text of each document. We found that the OCR engines make only a few errors when recognising text, so for general purposes their output can be relied on without verification or correction. In critical cases, where error-free text is essential, the recognition results can be corrected through the verification/correction interface integrated into CRAFT. This interface supports correction by marking the recognition errors and making it easy to navigate to them.
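As a rough illustration of the two ideas above, the following sketch builds a simple word index over OCR output and lists the words a reviewer might want to check. It does not show how CRAFT or the evaluated engines work internally: the data layout (a list of word and confidence pairs per document), the function names, and the 0.80 confidence threshold are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical OCR output: per document, a list of (word, confidence) pairs.
# Neither this layout nor the 0.80 threshold below comes from CRAFT.
OCR_OUTPUT = {
    "doc-1": [("Invoice", 0.99), ("from", 0.98), ("Cardinal", 0.97), ("Kft", 0.62)],
    "doc-2": [("Delivery", 0.95), ("note", 0.96), ("for", 0.99), ("invoice", 0.71)],
}


def build_index(ocr_output):
    """Map each lower-cased word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, words in ocr_output.items():
        for word, _confidence in words:
            index[word.lower()].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def words_to_verify(ocr_output, threshold=0.80):
    """List (doc_id, position, word) for words below the confidence threshold."""
    return [
        (doc_id, position, word)
        for doc_id, words in ocr_output.items()
        for position, (word, confidence) in enumerate(words)
        if confidence < threshold
    ]


index = build_index(OCR_OUTPUT)
print(search(index, "invoice cardinal"))
print(words_to_verify(OCR_OUTPUT))
```

Here the low-confidence list plays the role of the marked errors: it gives a reviewer the document and word position to jump to, while the search itself runs on the uncorrected text.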

Searchable PDF

The OCR engines make it possible to store and export scanned documents in the so-called searchable PDF format. The advantage of this format is that the user sees the scanned image in full detail, while an additional layer beneath the image holds the recognised text. This layer allows the user to select and copy sections of the text, and to search the content of document images that have been exported to the file system.
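The idea of a text layer beneath the image can be sketched as a list of recognised words, each with the page coordinates of its bounding box. The model below is a simplification for illustration only, not the PDF file format or any engine's API; the `Word` class, the sample coordinates and `select_text` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognised word and its bounding box in page coordinates."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


# Hypothetical text layer for one scanned page (y axis points up, as in PDF).
TEXT_LAYER = [
    Word("Invoice", 50, 700, 110, 715),
    Word("number:", 115, 700, 175, 715),
    Word("2017/0042", 180, 700, 260, 715),
    Word("Total", 50, 100, 90, 115),
]


def select_text(layer, x0, y0, x1, y1):
    """Return the text of all words whose boxes overlap the selection rectangle."""
    hits = [
        w for w in layer
        if w.x0 < x1 and w.x1 > x0 and w.y0 < y1 and w.y1 > y0
    ]
    # Reading order: top of the page first, then left to right.
    hits.sort(key=lambda w: (-w.y0, w.x0))
    return " ".join(w.text for w in hits)


print(select_text(TEXT_LAYER, 100, 690, 300, 720))
```

Because each word carries its position, a selection drawn on the visible image can be mapped to the text lying beneath it, which is what makes copying from a scan possible.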

Processing document content

The proprietary OCR engine of our CRAFT system recognises information located at fixed positions on well-structured documents. One such task is the processing of forms, a CRAFT function used by several of Cardinal Kft's partners. Relying on the performance of the third-party OCR engines described above, we can also process the content of less well-structured documents, in which the location of specific information may vary with the content, or in which some pieces of information can only be identified from their context. Such documents include invoices, for which CRAFT processing is currently under development.
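The difference between the two approaches can be sketched as follows. This is an illustrative example only and not the method used in CRAFT: the field names, label patterns and sample texts are assumptions, and real invoice processing would need far more robust rules.

```python
import re


def extract_fixed(lines, row, col_start, col_end):
    """Fixed-position extraction: read a field from a known row and column range."""
    return lines[row][col_start:col_end].strip()


# Hypothetical label patterns: a field is identified by the text that precedes it.
LABELS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number)\s*:?\s*(\S+)", re.IGNORECASE
    ),
    "total": re.compile(r"total\s*(?:amount)?\s*:?\s*([\d.,]+)", re.IGNORECASE),
}


def extract_by_context(text):
    """Context-based extraction: find each field wherever its label appears."""
    found = {}
    for field, pattern in LABELS.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1)
    return found


# A form: the name always sits in columns 10-25 of the first line.
form = ["Name:     John Smith      ", "Date:     2017-01-17"]
print(extract_fixed(form, 0, 10, 26))

# Two invoices with the same fields in different places and wordings.
invoice_a = "ACME Ltd.\nInvoice no.: 2017/0042\nTotal amount: 1,250.00 EUR"
invoice_b = "Total: 99.50\nThank you!\nInvoice number 77-B"
print(extract_by_context(invoice_a))
print(extract_by_context(invoice_b))
```

The fixed-position function only works while the layout never changes, whereas the label-based one finds the same fields in both invoices even though their order and wording differ.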

 
 