3
votes

I am trying to write a function that will take a JPG of a floorplan of a house and use OCR to extract the square footage that is written somewhere on the image.

    import requests
    from PIL import Image
    import pytesseract
    import pandas as pd
    import numpy as np
    import cv2
    import io

    def floorplan_ocr(url):
        """Row-wise function that uses pytesseract to scrape the word data from
        floorplan images. Requires Tesseract to be installed:
        https://github.com/tesseract-ocr/tesseract/wiki"""
        if pd.isna(url):
            return np.nan

        res = ''
        response = requests.get(url, stream=True)
        if response.status_code == 200:
            img = response.raw
            img = np.asarray(bytearray(img.read()), dtype="uint8")
            img = cv2.imdecode(img, cv2.CV_8UC1)  # decode as 8-bit single channel
            img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                        cv2.THRESH_BINARY, 11, 2)
            #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
            res = pytesseract.image_to_string(img, lang='eng', config='--remove-background')
            del response
            del img
        else:
            return np.nan

        return res

However, I am not getting much success: only about 1 in 4 images actually outputs text that contains the square footage.

e.g. currently floorplan_ocr('https://i.imgur.com/9qwozIb.jpg') outputs 'K\'Fréfiéfimmimmuuéé\n2|; apprnxx 135 max\nGArhaPpmxd1m max\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\nTOTAL APPaux noon AREA 523 so Fr, us. a 50. M )\nav .Wzms him "a! m m... mi unwary mmnmrmm mma y“ mum“;\n‘ wmduw: reams m wuhrmmm mm“ .m nanspmmmmy 3 mm :51\nmm" m mmm m; wan wmumw- mm my and mm mm as m by any\nwfmw PM” rmwm mm m .pwmwm m. mum mud ms nu mum.\n(.5 n: ma undammmw an we Ewen\nM vagw‘m Mewpkeem' (and takes a long time to do it).

floorplan_ocr('https://i.imgur.com/sjxMpVp.jpg') outputs ' '.

I think some of the issues I am facing are:

  1. the text may be greyscale
  2. the images are low DPI (there is some debate over whether DPI itself matters or only the total resolution)
  3. the text is not formatted consistently

I am stuck and struggling to improve my results. All I want to extract is 'XXX sq ft' (in all the ways that might be written).

Is there a better way to do this?

Many thanks.

Might be easier to identify the walls, scale, and units, and just do the computation yourself, no? ;) – Mad Physicist

I don't know why there would be any debate on whether low DPI would be important. It is important. If you look at the quality of the thresholded image, it's a miracle you get any text out of Tesseract. I recommend higher DPI if you can get it, preferably in a lossless format (PNG is often a good choice). For an image like this, lossless compression will still usually give a small file size. – bfris

Are you only trying to extract the "Approximate Gross Internal Area = 50.7 sq m / 546 sq ft" line? – nathancy

@bfris The debate seemed to be between DPI and resolution, as DPI is just a display instruction. I.e. resolution is important but DPI is not. – Harvs

@nathancy Yes, that is the line, or more specifically '546 sq ft'. – Harvs

2 Answers

3
votes

All of the pixelation around the text makes it harder for Tesseract to do its thing. I used a simple brightness/contrast algorithm from here to make the dots go away. I didn't do any thresholding/binarization. But I did have to scale the image to get any character recognition.

import pytesseract   
import numpy as np
import cv2

img = cv2.imread('floor_original.jpg', 0) # read as grayscale
img = cv2.resize(img, (0,0), fx=2, fy=2)  # scale image 2X

    alpha = 1.2   # contrast gain
    beta = -20    # brightness offset
    img = cv2.addWeighted(img, alpha, img, 0, beta)  # out = alpha*img + beta
    cv2.imwrite('output.png', img)

res = pytesseract.image_to_string(img, lang='eng', config='--remove-background')
print(res)

Edit: There may be some platform/version dependence in the above code. It runs on my Linux machine, but not on my Windows machine. To get it to run on Windows, I modified the last two lines to:

res = pytesseract.image_to_string(img, lang='eng', config='remove-background')
print(res.encode())

Output from Tesseract (bolding added by me to emphasize the square footage):

TT xs?

IN

Approximate Gross Internal Area = 50.7 sq m / **546 sq ft**

All dimensions are estimates only and may not be exact meas ent plans are subject lo change The sketches. renderngs graph matenala, lava, apectes

ne developer, the management company, the owners and other affiliates re rng oo all of ma ther sole discrebon and without enor scbioe

jements Araxs are approximate

Image after processing: (image omitted)

2
votes

By applying these few lines to resize and adjust the contrast/brightness of your second image, after cropping the bottom quarter of the image:

    import cv2
    import pytesseract

    img = cv2.imread("download.jpg")
    img = cv2.resize(img, (0, 0), fx=2, fy=2)            # scale 2x
    img = cv2.convertScaleAbs(img, alpha=1.2, beta=-40)  # contrast/brightness
    text = pytesseract.image_to_string(img, config='-l eng --oem 1 --psm 3')

I managed to get this result:

TOTAL APPROX. FLOOR AREA 528 SQ.FT. (49.0 SQ.M.)

Whilst every attempt has been made to ensure the accuracy of the floor plan contained here, measurements: of doors, windows, rooms and any other items are approximate and no responsibility ts taken for any error, omission, or mis-statement. This plan is for @ustrative purposes only and should be used as such by any prospective purchaser. The services, systems and appliances shown have not been tested and no guarantee a8 to the operability or efficiency can be given Made with Metropix ©2019

I did not threshold the image, since your images' structures vary from one another, and since the image is not only text, Otsu thresholding does not find the right value.

To answer everything: Tesseract actually works best with grayscale images (black text on a white background).

About the DPI/resolution question: there is indeed some debate, but there is also some empirical truth. The DPI value doesn't really matter (since text size can vary at the same DPI). For Tesseract OCR to work best, your characters need to be around 30-33 pixels high; a few px smaller can make Tesseract almost useless, and bigger characters actually reduce accuracy, though not significantly. (Source: https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/24JHDYQbBQAJ)

Finally, the text format doesn't really change (at least in your examples), so your main problem here is text size, and the fact that you parse a whole page. If the text line you want is consistently at the bottom of the image, just slice the original image so you only feed Tesseract the relevant data, which will also make it much faster.
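The slicing itself is just NumPy indexing. A minimal sketch, using a synthetic array in place of a loaded image; the "bottom quarter" fraction is an assumption to tune per image set:

```python
import numpy as np

def crop_bottom(img, frac=0.25):
    """Return the bottom `frac` of an (H, W[, C]) image array."""
    h = img.shape[0]
    return img[int(h * (1 - frac)):, ...]

# synthetic stand-in for a loaded floorplan (cv2.imread returns a similar array)
img = np.zeros((400, 300), dtype=np.uint8)
bottom = crop_bottom(img)   # bottom quarter only
print(bottom.shape)         # (100, 300)

# then feed only the crop to Tesseract, e.g.:
# text = pytesseract.image_to_string(bottom, config='-l eng --oem 1 --psm 3')
```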

EDIT: If you are also searching for a way to extract the square footage from your OCR'd text:

text = "some place holder text 5471 square feet some more text"
# store here all the possible way it can be written
sqft_list = ["sq ft", "square feet", "sqft"]
extracted_value = ""

    for sqft in sqft_list:
        if sqft in text:
            start = text.index(sqft) - 1
            end = start + len(sqft) + 1
            # walk back to the start of the number preceding the unit
            while start > 0 and text[start - 1] != " ":
                start -= 1
            extracted_value = text[start:end]
            break

print(extracted_value)

5471 square feet
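As an alternative to the substring walk above, a regular expression can tolerate the punctuation variants Tesseract emits ("SQ.FT.", "sq ft", "sqft"). This is a sketch I am adding, not part of the original answer; the pattern and helper name are my own:

```python
import re

# a number followed by a square-feet unit in several spellings
SQFT_RE = re.compile(
    r"(\d[\d,\.]*)\s*(?:sq\.?\s*ft\.?|square\s+feet)",
    re.IGNORECASE,
)

def extract_sqft(text):
    """Return the number preceding the first sq-ft unit, or None."""
    m = SQFT_RE.search(text)
    return m.group(1) if m else None

print(extract_sqft("TOTAL APPROX. FLOOR AREA 528 SQ.FT. (49.0 SQ.M.)"))        # 528
print(extract_sqft("Approximate Gross Internal Area = 50.7 sq m / 546 sq ft"))  # 546
```

Note that the unit must end in "ft", so "sq m" values in the same line are skipped rather than matched by mistake.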