Unable to extract content directly from scanned PDF using Apache Tika, but it works fine when converted to JPG format

Question

I am unable to extract content from the attached image in its PDF form; however, extraction works fine when I convert it to JPG format. My problem is that I have a large number of scanned PDFs, each with multiple scanned pages inside. I want to know whether there is a direct way to extract the content, without the overhead of converting the PDFs to JPGs and then extracting the text. I followed the solution provided at link

pdf version of doc is pdfversion

My environment: Java version "1.8.0_112", tesseract 3.04.01, leptonica-1.74.1, libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8

pom.xml has

<dependencies>
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.14</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.14</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.5</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.1</version>
    </dependency>
    <dependency>
        <groupId>net.java.dev.jna</groupId>
        <artifactId>jna</artifactId>
        <version>4.2.2</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.11</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>

</dependencies>

Java code:

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
public class Sample {
    public static void main(String[] args)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
        TesseractOCRConfig config = new TesseractOCRConfig();
        config.setTesseractPath("/usr/local/bin/");
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        pdfConfig.setExtractUniqueInlineImagesOnly(false);
        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        parseContext.set(Parser.class, parser);
        Metadata metadata = new Metadata();
        // try-with-resources closes the stream even if parsing throws
        try (FileInputStream stream = new FileInputStream(new File("path2pdf.pdf"))) {
            parser.parse(stream, handler, metadata, parseContext);
        }
        System.out.println(metadata);
        String content = handler.toString();
        System.out.println("===============");
        System.out.println(content);
        System.out.println("Done");
    }
}

But this does not work; please advise if I am doing something wrong here.

@TilmanHausherr [pdf version] (dropbox.com/s/arggwrul27xdsq5/example2.pdf?dl=0) — Trinadh Gupta
Thanks; you are using Tika 1.13. Please try with 1.14. (change all the 1.13 to 1.14 in your pom.xml). According to tika.apache.org the OCR is in 1.14. (I'm not a tika expert; I wanted to look at the PDF to see if there's anything weird - there isn't) — Tilman Hausherr
@TilmanHausherr I tried just now; it didn't work on the PDF, but it works fine on the JPG — Trinadh Gupta

James Fry · Accepted Answer · 2017-02-15T12:25:06

The problem appears to be that Tika invokes tesseract (once it has validated that the binary exists and can be executed) without passing the location of the tessdata directory in the environment, unless the corresponding configuration parameter is set explicitly. This default likely works for some installations, but it did not on my Mac. Both paths can be set explicitly as follows:

      TesseractOCRConfig config = new TesseractOCRConfig();
      config.setTesseractPath("/usr/local/bin");
      config.setTessdataPath("/usr/local/share");

This then yields the expected result (at least on Mac OS X with tesseract installed via Homebrew):

1 An Introduction to Conditional Random Fields for Relational Learning

Charles Sutton

Department of Computer Science University of Massachusetts, USA [email protected] http://www.cs.umass.edu/~casutton

Andrew McCallum

Department of Computer Science University of Massachusetts, USA [email protected] http://www.cs.umass.edu/~mccallum

1.1 Introduction

Relational data has two characteristics: ﬁrst, statistical dependencies exist between the entities we wish to model, and second, each entity often has a rich set of features that can aid classiﬁcation. For example, when classifying Web documents. the page’s text provides much information about the class label. but hyperlinks deﬁne a relationship between pages that can improve classiﬁcation [Taskar et al.. 2002]. Graphical models are a natural formalism for exploiting the dependence structure among entities. Traditionally, graphical models have been used to represent the joint probability distribution p(y, x), where the variables y represent the attributes of the entities that we wish to predict, and the input variables x represent our observed knowledge about the entities. But modeling the joint distribution can lead to difﬁculties when using the rich local features that can occur in relational data. because it requires modeling the distribution p(x), which can include complex dependencies. Modeling these dependencies among inputs can lead to intractable models, but ignoring them can lead to reduced performance.

A solution to this problem is to directly model the conditional distribution p(y]x), which is sufﬁcient for classiﬁcation. This is the approach taken by conditional ran- dom ﬁelds [Lafferty ct al., 2001]. A conditional random ﬁeld is simply a conditional distribution p(ylx) with an associated graphical structure. Because the model is
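Before hard-coding the tessdata location as in the answer above, it may help to check which directory on a given machine actually contains the language files. The following is only a minimal sketch under stated assumptions: it assumes the layout implied by the answer, where the value passed to setTessdataPath is the directory that contains a tessdata folder (as /usr/local/share does on the answerer's Homebrew install), and that the English data file is named eng.traineddata. The class name TessdataLocator, its methods, and every candidate prefix other than /usr/local/share are hypothetical and not part of Tika or tesseract.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Tika: looks for a directory that
// contains tessdata/<language>.traineddata.
final class TessdataLocator {

    // Assumed candidate prefixes; only /usr/local/share comes from the answer.
    static final List<String> CANDIDATE_PREFIXES =
            Arrays.asList("/usr/local/share", "/usr/share", "/opt/local/share");

    // Returns true if <prefix>/tessdata/<language>.traineddata is a regular file.
    static boolean hasLanguageData(Path prefix, String language) {
        Path trained = prefix.resolve("tessdata").resolve(language + ".traineddata");
        return Files.isRegularFile(trained);
    }

    // Returns the first candidate prefix that holds data for the language, if any.
    static Optional<Path> findPrefix(List<String> candidates, String language) {
        for (String candidate : candidates) {
            Path prefix = Paths.get(candidate);
            if (hasLanguageData(prefix, language)) {
                return Optional.of(prefix);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> prefix = findPrefix(CANDIDATE_PREFIXES, "eng");
        if (prefix.isPresent()) {
            System.out.println("Candidate value for setTessdataPath: " + prefix.get());
        } else {
            System.out.println("No tessdata directory found under the candidate prefixes");
        }
    }
}
```

If a prefix is found, it can be passed to config.setTessdataPath(...) in place of the hard-coded string. If none is found, the language data is probably installed somewhere the candidate list does not cover.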
