I have a .jpg containing an image of a table which I am attempting to extract to Excel, using Python.

[Image: source table]

I am following an example from here:

https://towardsdatascience.com/a-table-detection-cell-recognition-and-text-extraction-algorithm-to-convert-tables-to-excel-files-902edcf289ec

I have hit a problem though, where the horizontal rows are not being identified. In the source image (above) you can see that the horizontal rows are much lighter than the vertical columns, but they are visible in the source and I believe they should still be detected.

I have altered the cv2.threshold value in almost every way I can think of, but it has no effect on the returned image (see below):

  • thresh, img_bin = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
  • thresh, img_bin = cv2.threshold(img, 0, 256, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

Both calls result in the same image:

[Image: thresholded output]

import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

# read your file
file = r'venv/images/iiCrop.jpg'
img = cv2.imread(file, 0)
img.shape
# thresholding the image to a binary image
thresh, img_bin = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
# inverting the image
img_bin = 255 - img_bin
cv2.imwrite('venv/images/cv_inverted.png', img_bin)
# Plotting the image to see the output
plotting = plt.imshow(img_bin, cmap='gray')
plt.show()

Is there something obvious, or not so obvious, that I am doing wrong?

1 Answer

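You must drop cv2.THRESH_OTSU to be able to adjust the threshold value manually. When that flag is set, OpenCV computes the threshold itself with Otsu's method and ignores the value you pass in, which is why changing it has no effect. You can also use cv2.THRESH_BINARY_INV to invert the binary image in the same call. Note that some lines are too light to be detected without also picking up JPEG noise.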

thresh, img_bin = cv2.threshold(img, 230, 255, cv2.THRESH_BINARY_INV)

[Image: result]

I'd recommend reading the official OpenCV tutorial on image thresholding.