Alfresco OCR integration


Some projects do not require an automated document management solution such as Ephesoft to ingest scanned documents.

For these cases, we have developed an Alfresco 5 addon built on a standard Alfresco OCR transformer. Since text extraction times vary from document to document, we apply the transformation asynchronously, delegating the conversion to searchable PDF to a dedicated queue of background processes. The solution runs on Linux (Ubuntu, CentOS) and Mac OS servers, but is not available for Windows servers.
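The asynchronous approach can be illustrated with a minimal Python sketch. It is not the addon's code (the addon runs inside Alfresco); it only shows the idea of queuing a document and returning immediately while a background worker performs the slow conversion. The function `convert_to_searchable_pdf` and the file names are hypothetical placeholders.

```python
import queue
import threading


def convert_to_searchable_pdf(path):
    # Hypothetical placeholder for the slow OCR conversion.
    # Here it only derives the name of the output file.
    return path.rsplit(".", 1)[0] + "_ocr.pdf"


def start_ocr_worker(jobs, results):
    """Start one background thread that drains the job queue."""
    def run():
        while True:
            path = jobs.get()
            try:
                if path is None:  # sentinel value: stop the worker
                    return
                results[path] = convert_to_searchable_pdf(path)
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


jobs = queue.Queue()
results = {}
worker = start_ocr_worker(jobs, results)

# Submitting a document returns immediately; conversion happens later.
for name in ["invoice.pdf", "contract.pdf"]:
    jobs.put(name)

jobs.join()      # wait until every queued document has been processed
jobs.put(None)   # ask the worker to stop
worker.join()
```

The caller never waits for a single conversion to finish, so a long-running OCR job does not block the upload of the next document.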

To obtain OCR results comparable to those of commercial solutions, our OCR pipeline includes all the classic stages of the process:

  • Page Separation
    • OCR software works best on an image of an individual page
    • As a result of this phase, a set of documents in PDF format is obtained
  • Format identification
    • Depending on the type of the image embedded in the PDF (PNG, TIF, JPG …), different conversion parameters must be applied
  • Conversion to monochrome format (PBM)
    • The classic PBM format makes OCR extraction algorithms more efficient
    • A resolution above 300 dpi is unnecessary: it does not produce better results
    • As a result of this phase, a set of images in PBM format is obtained
  • Horizontal adjustment, noise reduction and edge trimming on each page
    • This operation is still performed on images in PBM format, refining the set of documents obtained at the preceding stage
  • OCR text extraction
    • Using a specialized language corpus will significantly improve results
    • This phase produces a set of text documents that contain the words identified in the PBM images
  • Rendering each page to PDF
    • From the PBM image and the OCR text document, a PDF page is rendered with the correct page size
    • As a result of this phase, a set of documents in searchable PDF format is obtained
  • Final PDF composition
    • PDF pages obtained in the previous phase are inserted in the proper order to build a single searchable PDF document
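
The stages above can be sketched as a sequence of command lines. This is an illustration only: the article does not name the tools involved, so the choice of `pdfseparate`, `pdfimages` and `pdfunite` (Poppler), `convert` (ImageMagick), `unpaper` and `tesseract` is an assumption, as are the file naming scheme and the `eng` language. Note also that Tesseract's `pdf` output mode combines text extraction and page rendering in one step, whereas the list above treats them as separate phases.

```python
from pathlib import Path


def page_commands(page_pdf, lang="eng"):
    """Commands for one single-page PDF (tool choice is an assumption)."""
    stem = Path(page_pdf).with_suffix("")
    pbm = f"{stem}.pbm"
    clean = f"{stem}-clean.pbm"
    return [
        # Format identification: list the images embedded in the page
        ["pdfimages", "-list", page_pdf],
        # Conversion to monochrome PBM at 300 dpi
        ["convert", "-density", "300", page_pdf, "-monochrome", pbm],
        # Deskew, noise reduction and edge trimming
        ["unpaper", pbm, clean],
        # OCR and rendering to a searchable PDF page (<stem>-ocr.pdf)
        ["tesseract", clean, f"{stem}-ocr", "-l", lang, "pdf"],
    ]


def pipeline_commands(source_pdf, page_count, output_pdf, lang="eng"):
    """Build the full command list for a document with page_count pages."""
    # Page separation: one PDF per page, named page-1.pdf, page-2.pdf, ...
    commands = [["pdfseparate", source_pdf, "page-%d.pdf"]]
    ocr_pages = []
    for n in range(1, page_count + 1):
        commands += page_commands(f"page-{n}.pdf", lang)
        ocr_pages.append(f"page-{n}-ocr.pdf")
    # Final composition: merge the pages in their original order
    commands.append(["pdfunite", *ocr_pages, output_pdf])
    return commands


for cmd in pipeline_commands("scan.pdf", 2, "scan-ocr.pdf"):
    print(" ".join(cmd))
```

Each command list could then be executed in order with `subprocess.run(cmd, check=True)`, provided the corresponding tools are installed.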

Thus, despite using only open source tools to build this service, the results are quite satisfactory.

Our Alfresco OCR addon is available at https://addons.alfresco.com/addons/alfresco-simple-ocr-action

Let keensoft show you how to apply intelligent OCR techniques to your organization.

Business unit, keensoft