OCR on floorplan screenshots with pytesseract and OpenCV

Question

I am trying to write a function that will take a jpg of a floorplan of a house and use OCR to extract the square footage that is written somewhere on the image

    import requests
    from PIL import Image
    import pytesseract
    import pandas as pd
    import numpy as np
    import cv2
    import io

    def floorplan_ocr(url):
    """ a row-wise function to use pytesseract to scrape the word data from the floorplan
    images, requires tesseract 
    to be installed https://github.com/tesseract-ocr/tesseract/wiki"""

    if pd.isna(url):
        return np.nan

    res = ''
    response = requests.get(url, stream=True)
    if response.status_code == 200:
        img = response.raw
        img = np.asarray(bytearray(img.read()), dtype="uint8")
        img = cv2.imdecode(img, cv2.CV_8UC1)
        img = cv2.adaptiveThreshold(img,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,\
            cv2.THRESH_BINARY,11,2)
        #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
        res = pytesseract.image_to_string(img, lang='eng', config='--remove-background')
        del response
        del img
    else:
        return np.nan

    #print(res)
    return res

However I am not getting much success. Only about 1 in 4 images actually outputs text that contains the square footage.

e.g currently floorplan_ocr(https://i.imgur.com/9qwozIb.jpg) outputs 'K\'Fréﬁéﬁmmimmuuéé\n2|; apprnxx 135 max\nGArhaPpmxd1m max\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\nTOTAL APPaux noon AREA 523 so Fr, us. a 50. M )\nav .Wzms him "a! m m... mi unwary mmnmrmm mma y“ mum“;\n‘ wmduw: reams m wuhrmmm mm“ .m nanspmmmmy 3 mm :51\nmm" m mmm m; wan wmumw- mm my and mm mm as m by any\nwfmw PM” rmwm mm m .pwmwm m. mum mud ms nu mum.\n(.5 n: ma undammmw an we Ewen\nM vagw‘m Mewpkeem' (and takes a long time to do it)

floorplan_ocr(https://i.imgur.com/sjxMpVp.jpg) outputs ' '.

I think some of the issues I am facing are:

text may be greyscale
Images are low DPI (appears to be some debate if this is actually important or if it the total resolution)
Text is not formatted consistently

I am stuck and am struggling to improve my results. All I want to extract is 'XXX sq ft' (and all the ways that might be written)

Is there a better way to do this?

Many thanks.

Might be easier to identify the walls, scale, and units, and just do the computation yourself, no? ;) — Mad Physicist
I don't know why there would be any debate on whether low DPI would be important. It is important. If you look at the quality of the thresholded image, it's a miracle you get any text out of tesseract. Recommend higher DPI if you can get it and preferrably in a lossless format (PNG is often a good choice). For an image like this, lossless compression will still usually give a small file size. — bfris
Are you only trying to extract the "Approximate Gross Internal Area = 50.7 sq m / 546 sq ft" line? — nathancy
@bfris The debate seemed to be between DPI and resolution, as DPI is just a display instruction. I.e resolution is important but DPI is not. — Harvs
@nathancy yes that is the line, or more specifically '546 sq ft' — Harvs

bfris bfris · Accepted Answer · 2019-09-21T21:02:28

All of the pixelation around the text makes it harder for Tesseract to do its thing. I used a simple brightness/contrast algorithm from here to make the dots go away. I didn't do any thresholding/binarization. But I did have to scale the image to get any character recognition.

import pytesseract   
import numpy as np
import cv2

img = cv2.imread('floor_original.jpg', 0) # read as grayscale
img = cv2.resize(img, (0,0), fx=2, fy=2)  # scale image 2X

alpha = 1.2
beta = -20
img = cv2.addWeighted( img, alpha, img, 0, beta)
cv2.imwrite('output.png', img)  

res = pytesseract.image_to_string(img, lang='eng', config='--remove-background')
print(res)

Edit There may be some platform/version dependence on above code. It runs on my Linux machine, but not on my Windows machine. To get it to run on Windows, I modified last two lines to

res = pytesseract.image_to_string(img, lang='eng', config='remove-background')
print(res.encode())

Output from tesseract(bolding added by me to emphasize sq footage):

TT xs?

IN

Approximate Gross Internal Area = 50.7 sq m / 546 sq ft

All dimensions are estimates only and may not be exact meas ent plans are subject lo change The sketches. renderngs graph matenala, lava, apectes

ne developer, the management company, the owners and other affiliates re rng oo all of ma ther sole discrebon and without enor scbioe

jements Araxs are approximate

Image after processing:

OCR on floorplan screenshots with pytesseract and OpenCV

2 Answers