1
votes

I am using Alfresco community 6.1.

I have thousands of invoices to scan and OCR (with near 100% recognition), and I need to retrieve metadata from them (Partner, Invoice Number, Amount, Units, Currency, ...), all within Alfresco.

Based on the retrieved metadata, I need to perform some operations on the invoices (move them to the appropriate folders, apply workflows, ...).

As a first approach:

  • For the OCR I used Alfresco Simple OCR Action, but the result is not very accurate (far from 100%).

  • To retrieve the results, I convert the OCRed PDF to a plain text file and then search its content using JavaScript with document.content ... But since the OCR is not accurate, I can't tell whether this is the best way to search inside the document.

So my questions are :

  • How can I make the OCR results more accurate?

  • How can I retrieve the important data from the invoice? Is the method I'm using good enough, or too poor for this kind of processing?

I'm using pdfsandwich, and my alfresco-global.properties is:

ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o

ocr.extra.commands=-verbose -lang eng
ocr.server.os=linux

2 Answers

2
votes

I'm afraid this question is off topic: https://stackoverflow.com/help/on-topic

Some input anyway:

  • I highly recommend doing all the OCR/classification/extraction outside of Alfresco, before storing the PDFs there.
  • The technical term for what you're looking for is Document Capture. If you really expect to classify your scanned docs and extract data from inbound documents (whose structure you can't control), the solutions are quite expensive and licensed per page/period. The market leaders in that area are Kofax and Abbyy.
  • If you can control the document structure, or if the structure of the documents is fixed, you could use considerably cheaper solutions based on something like a dynamic template approach (depending on found anchor points, barcodes, regex matches). We use PDFmdx for this to automate qualified extraction.
  • Everything depends on the OCR quality. My personal opinion: the free/open source OCR components can't compete with the commercial solutions if you don't have the time, expertise and resources to train and optimize them. Abbyy has a quite affordable CLI solution for Linux (ABBYY FineReader Engine CLI for Linux), but I'm sure there are others with similar results.
  • There is a nice and simple solution called AutoOCR, a REST/SOAP service that provides a generic, configurable interface for using several OCR engines and configurations as a service. We implemented an Alfresco integration that acts as an Alfresco Transformer, but since the Alfresco Transformer framework is deprecated, I'd recommend doing the whole OCR and recognition work before storing the documents in Alfresco.
  • Finally, if this is a one-time job: try to find a service provider that does at least the OCR, and maybe also the classification/extraction.
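
To give an idea of what the anchor/regex template approach mentioned above looks like in its simplest form, here is a rough Python sketch. It is not how PDFmdx or any other product works internally. The labels ("Invoice No", "Total"), the currency list and the field names are assumptions made up for illustration; real invoices would need patterns per supplier layout and reasonably clean OCR text.

```python
import re

# Hypothetical anchor patterns: each one looks for a label (the anchor)
# and captures the value that follows it. Adjust per supplier layout.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*[:.]?\s*([A-Z0-9][A-Z0-9\-/]*)",
        re.IGNORECASE,
    ),
    "amount": re.compile(
        r"Total\s*(?:Amount)?\s*[:.]?\s*([0-9]+(?:[.,][0-9]+)*)",
        re.IGNORECASE,
    ),
    "currency": re.compile(r"\b(EUR|USD|GBP|CHF)\b"),
}


def extract_fields(text):
    """Return field name -> captured string, or None when the anchor was not found."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


# Made-up OCR output, used only to show the call.
sample = "ACME Ltd\nInvoice No: INV-2019-0042\nTotal: 1,250.00 EUR\n"
fields = extract_fields(sample)
```

A field that comes back as None is a useful signal: such documents can be routed to manual review instead of being filed automatically.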
0
votes

To answer your questions.

To improve OCR results you need to pre-process the image. That includes noise removal, line removal, thresholding, etc. But none of this helps if the OCR engine itself is not accurate. Tesseract, from version 4.0.0 onward, works well enough for most applications.
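
As an illustration of the thresholding step, here is a small pure-Python sketch of Otsu's method, one common way to pick a global threshold. It assumes the page is already available as a flat list of 8-bit grayscale values (0 = black, 255 = white); the function names are made up for this example, and a real pipeline would normally use an image library rather than plain lists.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of gray levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # all pixels are on the dark side, nothing left to split
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Synthetic "scan": dark ink around 30-40, light paper around 200-220.
page = [30] * 50 + [40] * 50 + [200] * 100 + [220] * 100
t = otsu_threshold(page)
clean = binarize(page, t)
```

A single global threshold like this works best on evenly lit scans; pages with shadows or stains usually need an adaptive (local) threshold instead.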

Your approach may work in some cases, but it will not work well on a large set of invoices. I suggest using one of the invoice data extraction services. In that case, you don't need to worry about the preprocessing and extraction itself. You could use:

Using such a service can save you a lot of headaches and time.

Disclaimer: I am one of the creators of typless. Feel free to suggest edits.