0
votes

I read tables from a PDF with tabula-py, using the following code:

table = tabula.read_pdf(files[0], pages='all', multiple_tables=True, stream=True)

Sometimes the values from two columns are joined into a single column (separated by a single space), which leaves empty NA columns at the end of the row. For example:

col0  col1   col2  col3   col4  col5  col6  col7
a1    b1 c1  d1    e1 f1  g1    h1    NA    NA
a2    b2     c2    d2     e2    f2    g2    h2

How can I move the values back into the correct columns, to get:

col0 col1 col2 col3 col4 col5 col6 col7
a1 b1 c1 d1 e1 f1 g1 h1
a2 b2 c2 d2 e2 f2 g2 h2

2 Answers

2
votes
  • write the DataFrame out as space-delimited text
  • remove the quotes that step 1 puts around any cell containing a space
  • read the text back in as whitespace-delimited
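import io
import pandas as pd

# Sample frame reproducing the problem: "b1 c1" and "e1 f1" each sit in a single tab-separated cell
df = pd.read_csv(io.StringIO(
    "col0\tcol1\tcol2\tcol3\tcol4\tcol5\tcol6\tcol7\n"
    "a1\tb1 c1\td1\te1 f1\tg1\th1\tNA\tNA\n"
    "a2\tb2\tc2\td2\te2\tf2\tg2\th2"), sep="\t")

# Write space-delimited, strip the quotes around cells containing a space, then re-read on whitespace
df = pd.read_csv(io.StringIO(df.to_csv(sep=" ").replace('"', "")), sep=r"\s+")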

output

col0 col1 col2 col3 col4 col5 col6 col7
  a1   b1   c1   d1   e1   f1   g1   h1
  a2   b2   c2   d2   e2   f2   g2   h2
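
The same round trip could be wrapped in a small helper so it can be applied to each table in turn. This is only a sketch: the name realign is made up for illustration, and it assumes that no cell legitimately contains a space and that the empty padding cells sit at the end of the row, as in the example above.

```python
import io
import pandas as pd

def realign(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: split merged cells by round-tripping through space-delimited text."""
    # Cells containing a space are written in quotes; removing the quotes
    # turns each merged cell into separate whitespace-delimited tokens.
    text = df.to_csv(sep=" ", index=False).replace('"', "")
    # Re-read on runs of whitespace; the empty trailing (NaN) cells vanish.
    return pd.read_csv(io.StringIO(text), sep=r"\s+")

# A frame shaped like the one in the question (col1 and col3 hold merged values).
broken = pd.DataFrame({
    "col0": ["a1", "a2"],
    "col1": ["b1 c1", "b2"],
    "col2": ["d1", "c2"],
    "col3": ["e1 f1", "d2"],
    "col4": ["g1", "e2"],
    "col5": ["h1", "f2"],
    "col6": [None, "g2"],
    "col7": [None, "h2"],
})

fixed = realign(broken)
print(fixed)
```

Since read_pdf with multiple_tables = True returns a list of DataFrames, something like [realign(t) for t in table] would apply it to every table. A row whose token count does not match the header after the round trip will not line up correctly, so the result is worth checking.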
0
votes

Could you try adding guess = False to the call?

table = tabula.read_pdf(files[0], pages='all', multiple_tables=True, guess=False, stream=True)