
I read a PDF file with PDFMiner and get a string with the following structure:

text
text
text

col1
1
2
3
4
5

col2
(1)
(2)
(3)
(7)
(4)

col3
name1
name2
name3
name4
name5

col4
 name
5
45
7
87
8

col5
FAE
EFD
SDE
FEF
RGE

col6
 name
45
7
54
4
130

# col7
16
18
22
17
25

col8

col9
55
30
60
1
185

col10
name

1
7
1
8

text1
text1
text1

col1
6
7
8
9
10

col2
(1)
(2)
(3)
(7)
(4)

col3
name6
name7
name8
name9
name10

col4
 name
54
4
78
8
86

col5
SDE
FFF
EEF
GFE
JHG

col6
 name
6
65
65
45
78

# col7
16
18
22
17
25

col8

col9
55
30
60
1
185

col10
name

1
4
1
54

There are 10 columns, named col1, col2, col3, col4 name, col5, col6 name, # col7, col8, col9 and col10 name. Because the same 10 columns appear on every page, the structure repeats once per page; the column names are always identical. I am not sure how to pull everything into a single dataframe. For example, for col1 the dataframe should contain:

1
2
3
4
5
6
7
8
9
10

Some columns are also empty (col8 in my example), and I am not sure how to handle them.

Any ideas? Thanks!


1 Answer


You can parse the document with a regular expression (regex101). For example, with txt holding the string from the question:

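import re

import pandas as pd

# Column name -> values collected across all pages
d = {}
# Group 1: a header line ("colN" or "# colN", plus an optional "name" line below it).
# Group 2: everything up to the next blank line, the next header, or the end of the text.
pattern = r'\n^((?:#\s)?col\d+(?:\n\s*name\n+)?)(.*?)(?=\n\n|^(?:#\s)?col\d+|\Z)'
for col_name, cols in re.findall(pattern, txt, flags=re.M | re.S):
    d.setdefault(col_name.strip(), []).extend(cols.strip().split('\n'))

# Columns differ in length, so build row-wise and transpose; short columns are padded with None
df = pd.DataFrame.from_dict(d, orient='index').T
print(df)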

Prints:

  col1 col2    col3 col4\n name col5 col6\n name # col7  col8 col9 col10\nname
0    1  (1)   name1           5  FAE          45     16         55           1
1    2  (2)   name2          45  EFD           7     18         30           7
2    3  (3)   name3           7  SDE          54     22  None   60           1
3    4  (7)   name4          87  FEF           4     17  None    1           8
4    5  (4)   name5           8  RGE         130     25  None  185           1
5    6  (1)   name6          54  SDE           6     16  None   55           4
6    7  (2)   name7           4  FFF          65     18  None   30           1
7    8  (3)   name8          78  EEF          65     22  None   60          54
8    9  (7)   name9           8  GFE          45     17  None    1        None
9   10  (4)  name10          86  JHG          78     25  None  185        None
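
Note that in the output above, columns that are short or empty on a page (col8, col10 name) drift out of step with the others, because the values from all pages are appended per column. If rows need to stay aligned page by page, one option is to parse line by line and pad each page separately. Below is a minimal sketch of that idea, not a drop-in replacement for the regex. It assumes that every page lists the same columns, that a repeated header marks a new page, and that a blank line after the values ends a column. The names parse_pages and pages_to_frame and the shortened sample txt are made up for illustration. The text does not say which rows of a short column are missing, so the padding simply goes at the bottom of each page.

```python
import re

import pandas as pd

# Made-up, shortened two-page sample in the same layout as the question.
txt = """text
text

col1
1
2
3

col2
(1)
(2)
(3)

col8

col10
name

1
7

text1

col1
4
5
6

col2
(4)
(5)
(6)

col8

col10
name

8
"""

HEADER = re.compile(r"(?:#\s)?col\d+")


def parse_pages(text):
    """Return one {column name: [values]} dict per page."""
    pages, page, current = [], {}, None
    for line in text.splitlines():
        s = line.strip()
        if HEADER.fullmatch(s):
            if s in page:  # a repeated header means a new page has started
                pages.append(page)
                page = {}
            current = s
            page[current] = []
        elif current is None:
            continue  # free text outside any column
        elif not s:
            if page[current]:  # a blank line after values closes the column
                current = None
        elif s == "name" and not page[current]:
            # "name" directly under a header is the second line of that header
            page[current + " name"] = page.pop(current)
            current += " name"
        else:
            page[current].append(s)
    if page:
        pages.append(page)
    return pages


def pages_to_frame(pages):
    """Pad every column to its page's height, then stack the pages."""
    columns = {}
    for page in pages:
        height = max((len(v) for v in page.values()), default=0)
        for name, values in page.items():
            padded = values + [None] * (height - len(values))
            columns.setdefault(name, []).extend(padded)
    return pd.DataFrame(columns)


df = pages_to_frame(parse_pages(txt))
print(df)
```

On the full string from the question this should give 10 rows, with col8 empty throughout and the gaps in col10 name sitting at the end of each page rather than at the end of the whole frame. All values are still strings; pd.to_numeric can convert the numeric columns afterwards.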