I am trying to write OCR code using Spark and pytesseract, and I am getting a "No module named 'pytesseract'" error even though the pytesseract module is installed.

import pytesseract
from PIL import Image


path='/XXXX/JupyterLab/notebooks/testdir'
rdd = sc.binaryFiles(path)

rdd.keys().collect()
-->['file:XXX/JupyterLab/notebooks/testdir/copy.png']

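paths = rdd.keys().map(lambda s: s.replace("file:", ""))

def read(x):
    # Import inside the function so the imports run on the executor
    import pytesseract
    from PIL import Image
    image = Image.open(x)
    # image_to_string is the pytesseract call that returns the recognized text
    text = pytesseract.image_to_string(image)
    return text

newRdd = paths.map(read)
newRdd.collect()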

Calling newRdd.collect() produces the following error:

ModuleNotFoundError: No module named 'pytesseract'
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$16.apply(RDD.scala:960)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2111)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:420)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I am not sure how to pass the image path held in rdd.keys() to pytesseract.image_to_string() via Image.open().
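For what it's worth, sc.binaryFiles already yields (path, content) pairs, so one possible approach is to OCR the bytes directly instead of reopening the path on the worker. This is only a rough sketch: to_local_path and ocr_bytes are hypothetical helper names, and it assumes Pillow, pytesseract and the tesseract binary are all available on every executor.

```python
import io
from urllib.parse import urlparse, unquote


def to_local_path(uri):
    """Turn a 'file:' URI, as returned by binaryFiles, into a plain path."""
    return unquote(urlparse(uri).path)


def ocr_bytes(content):
    """OCR raw image bytes; the imports run on the executor, inside the task."""
    from PIL import Image   # assumed to be installed on every worker
    import pytesseract      # assumed to be installed on every worker
    return pytesseract.image_to_string(Image.open(io.BytesIO(content)))


# Hypothetical usage with the rdd from above (needs a live SparkContext `sc`):
# texts = rdd.map(lambda kv: (to_local_path(kv[0]), ocr_bytes(kv[1]))).collect()
```

Reading from the bytes avoids depending on the file being visible at the same path on every worker, which the path-based version quietly assumes.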

Thank you.

What happens if you use only pytesseract and pytesseract.image_to_string() in a new .py file? - jizhihaoSAMA
Hi, I tried that in a Jupyter notebook and it works: the image is converted to a string. I am trying to use pytesseract with Spark to take advantage of its parallelism. - user2844511
So the problem is with Spark. I don't know much about Spark. Maybe you can change the Spark configuration. - jizhihaoSAMA

1 Answer

My error was resolved by adding the following line, which ships the pytesseract module file to the executors:

sc.addPyFile('/pathto........../pytesseract/pytesseract.py')
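If it helps anyone debugging the same thing, here is a small sketch for checking which Python interpreter a process uses and whether it can see a module. probe_module is a hypothetical helper name, not part of Spark or pytesseract, and the executor check in the comment assumes a live SparkContext named sc.

```python
import importlib.util
import sys


def probe_module(name):
    """Return (interpreter path, whether `name` is importable) for this process."""
    return (sys.executable, importlib.util.find_spec(name) is not None)


# On the driver:
print(probe_module("pytesseract"))

# Hypothetical check on the executors (needs a live SparkContext `sc`):
# sc.parallelize(range(8), 8).map(lambda _: probe_module("pytesseract")).distinct().collect()
```

If the driver reports the module as importable but the executors do not, the workers are running a Python environment that does not have pytesseract, which would be consistent with the error above.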