I am trying to write a function that will take a jpg of a floorplan of a house and use OCR to extract the square footage that is written somewhere on the image
import requests
from PIL import Image
import pytesseract
import pandas as pd
import numpy as np
import cv2
import io
def floorplan_ocr(url):
""" a row-wise function to use pytesseract to scrape the word data from the floorplan
images, requires tesseract
to be installed https://github.com/tesseract-ocr/tesseract/wiki"""
if pd.isna(url):
return np.nan
res = ''
response = requests.get(url, stream=True)
if response.status_code == 200:
img = response.raw
img = np.asarray(bytearray(img.read()), dtype="uint8")
img = cv2.imdecode(img, cv2.CV_8UC1)
img = cv2.adaptiveThreshold(img,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,\
cv2.THRESH_BINARY,11,2)
#img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
res = pytesseract.image_to_string(img, lang='eng', config='--remove-background')
del response
del img
else:
return np.nan
#print(res)
return res
However I am not getting much success. Only about 1 in 4 images actually outputs text that contains the square footage.
e.g currently
floorplan_ocr(https://i.imgur.com/9qwozIb.jpg)
outputs 'K\'Fréfiéfimmimmuuéé\n2|; apprnxx 135 max\nGArhaPpmxd1m max\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\nTOTAL APPaux noon AREA 523 so Fr, us. a 50. M )\nav .Wzms him "a! m m... mi unwary mmnmrmm mma y“ mum“;\n‘ wmduw: reams m wuhrmmm mm“ .m nanspmmmmy 3 mm :51\nmm" m mmm m; wan wmumw- mm my and mm mm as m by any\nwfmw PM” rmwm mm m .pwmwm m. mum mud ms nu mum.\n(.5 n: ma undammmw an we Ewen\nM vagw‘m Mewpkeem'
(and takes a long time to do it)
floorplan_ocr(https://i.imgur.com/sjxMpVp.jpg)
outputs ' '
.
I think some of the issues I am facing are:
- text may be greyscale
- Images are low DPI (appears to be some debate if this is actually important or if it the total resolution)
- Text is not formatted consistently
I am stuck and am struggling to improve my results. All I want to extract is 'XXX sq ft' (and all the ways that might be written)
Is there a better way to do this?
Many thanks.