
I am using Tesseract for OCR and have added a few additional words to "fin.user-words" (I would like to avoid creating a new word list and replacing tessdata/fin.word-dawg with it). I got this working from the command prompt:

>tesseract image.png result -l fin TestConfig

where TestConfig (a Tesseract configuration file located under .../tessdata/configs) suppresses the system dictionaries and forces Tesseract to load my words:

load_system_dawg F
load_freq_dawg F
user_words_suffix user-words

ref: http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data
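For reference, the config file described above could also be generated from Java rather than written by hand. This is only a sketch using the standard library: the class name TessConfigWriter and the helper writeConfig are made up for illustration, and a temporary directory stands in for the real tessdata folder, whose location depends on the installation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tesseract or Tess4J.
public class TessConfigWriter {

    // The three settings from the question, one "name value" pair per line.
    static final List<String> CONFIG_LINES = Arrays.asList(
            "load_system_dawg F",
            "load_freq_dawg F",
            "user_words_suffix user-words");

    // Writes <tessdataDir>/configs/<configName> and returns the path of the file.
    static Path writeConfig(Path tessdataDir, String configName) throws IOException {
        Path configsDir = tessdataDir.resolve("configs");
        Files.createDirectories(configsDir); // no-op if the directory already exists
        Path configFile = configsDir.resolve(configName);
        Files.write(configFile, CONFIG_LINES, StandardCharsets.UTF_8);
        return configFile;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temp directory is used here instead of the real tessdata folder.
        Path tessdataDir = Files.createTempDirectory("tessdata");
        Path written = writeConfig(tessdataDir, "TestConfig");
        System.out.println("Wrote " + written);
        for (String line : Files.readAllLines(written, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}
```

To use it against a real installation, pass the actual tessdata directory to writeConfig; the file name ("TestConfig" here) is the name given as the last argument on the command line.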

I am trying to replicate the command-line procedure above in Java, but Tesseract seems to ignore the configuration options. Here is the relevant part of my Java code:

public static void TestTesseract(BufferedImage image) {
        Tesseract instance = Tesseract.getInstance();
        instance.setLanguage("fin");
        instance.setTessVariable("load_system_dawg", "F");
        instance.setTessVariable("load_freq_dawg", "F");
        instance.setTessVariable("user_words_suffix", "user-words");
        try {
            String result = instance.doOCR(image);
            System.out.println(result);         
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
}

Below is the closest question to mine that I could find; however, I could not find a setConfigs method:

instance.setConfigs(Arrays.asList("bazaar"));

Forcing Tesseract to match pattern (four digits in a row)


1 Answer


The setConfigs method was added in Tess4J v1.4 (see doc).

instance.setConfigs(Arrays.asList("TestConfig"));