I am using Alfresco Community 6.1.
I have thousands of invoices to scan and OCR (with close to 100% recognition), and I need to retrieve the relevant metadata from them (Partner, Invoice Number, Amount, Units, Currency, ...). All of this should happen inside Alfresco.
Based on the retrieved metadata, I need to perform some operations on the invoices (move them to the appropriate folders, apply some workflows, ...).
As a first approach:
For the OCR, I used the Alfresco Simple OCR Action, but the result is not very accurate (far from 100%).
To retrieve the metadata, I convert the OCRed PDF to a plain text file and then search its content in JavaScript using document.content ... But since the OCR is not accurate, I can't tell whether this is the best way to search inside the document.
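A minimal sketch of this kind of search is shown below. The field labels ("Invoice No", "Total"), the regular expressions, the currency list, and the function name extractInvoiceFields are all illustrative assumptions and would have to be adapted to the real invoice layouts. It is written in ES5 style so that it could run in an Alfresco repository script.

```javascript
// Hypothetical patterns: the labels and formats below are assumptions,
// not taken from any real invoice. Each pattern captures the value in group 1.
var FIELD_PATTERNS = {
  invoiceNumber: /Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9\-\/]+)/i,
  amount: /Total\s*:?\s*([0-9][0-9.,]*[0-9])/i,
  currency: /\b(EUR|USD|GBP)\b/
};

// Runs every pattern against the plain text and returns the first match
// for each field, or null when the field was not found.
function extractInvoiceFields(text) {
  var result = {};
  for (var name in FIELD_PATTERNS) {
    if (FIELD_PATTERNS.hasOwnProperty(name)) {
      var match = FIELD_PATTERNS[name].exec(text);
      result[name] = match ? match[1] : null;
    }
  }
  return result;
}

// In a repository script, `document` is the node the script runs against,
// so the call would look like this:
// var fields = extractInvoiceFields(document.content);
```

A field that comes back as null means the pattern did not match, which with inaccurate OCR can happen even when the value is printed on the invoice.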
So my questions are:
How can I make the OCR results more accurate?
How can I retrieve the important data from the invoice? Is the method I'm using good enough, or is it too weak for this kind of processing?
I'm using pdfsandwich, and my alfresco-global.properties is:
ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -lang eng
ocr.server.os=linux