I read a PDF file with PDFMiner and get a string with the following structure:
text
text
text
col1
1
2
3
4
5
col2
(1)
(2)
(3)
(7)
(4)
col3
name1
name2
name3
name4
name5
col4
name
5
45
7
87
8
col5
FAE
EFD
SDE
FEF
RGE
col6
name
45
7
54
4
130
# col7
16
18
22
17
25
col8
col9
55
30
60
1
185
col10
name
1
7
1
8
text1
text1
text1
col1
6
7
8
9
10
col2
(1)
(2)
(3)
(7)
(4)
col3
name6
name7
name8
name9
name10
col4
name
54
4
78
8
86
col5
SDE
FFF
EEF
GFE
JHG
col6
name
6
65
65
45
78
# col7
16
18
22
17
25
col8
col9
55
30
60
1
185
col10
name
1
4
1
54
There are 10 columns, named: col1, col2, col3, col4 name, col5, col6 name, # col7, col8, col9, col10 name. Because the same 10 columns appear on every page, the structure is repeated once per page, and the column names are always the same. I am not sure how to pull everything into a single dataframe. For example, for col1 the dataframe should contain:
1
2
3
4
5
6
7
8
9
10
Some columns are also empty (col8 in my example), and I am not sure how to deal with them.
Any ideas? Thanks!
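
For illustration, here is a minimal sketch of one possible approach, not a definitive solution. It assumes the extracted string has one value per line, that the header lines appear exactly as listed above (with "name" on the line after col4, col6 and col10), and that each page starts at col1. The names `HEADERS`, `NAMES` and `parse_pages` are made up for this example. Each column is capped at the number of rows found in col1 and padded with None when shorter, so an empty column such as col8 becomes all missing values. One known limitation: if the last column of a page is short and free text follows it, some of that text can end up in the column.

```python
import pandas as pd

# Header of each column as a tuple of consecutive lines (taken from the question).
HEADERS = [
    ("col1",), ("col2",), ("col3",), ("col4", "name"), ("col5",),
    ("col6", "name"), ("# col7",), ("col8",), ("col9",), ("col10", "name"),
]
NAMES = [" ".join(h) for h in HEADERS]


def parse_pages(text):
    """Turn the repeated per-page column dump into a single DataFrame."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    pages = []      # one dict (column name -> list of values) per page
    current = None  # name of the column currently being filled
    i = 0
    while i < len(lines):
        # Check whether one of the known headers starts at this line.
        hit = next((h for h in HEADERS if tuple(lines[i:i + len(h)]) == h), None)
        if hit is not None:
            current = " ".join(hit)
            if hit == HEADERS[0]:
                # The first header marks the start of a new page.
                pages.append({name: [] for name in NAMES})
            i += len(hit)
            continue
        # Lines before the first "col1" header are ignored.
        if current is not None and pages:
            pages[-1][current].append(lines[i])
        i += 1

    combined = {name: [] for name in NAMES}
    for page in pages:
        n = len(page[NAMES[0]])  # row count of the page, taken from col1
        for name in NAMES:
            # Cap each column at n values and pad short or empty ones with None.
            combined[name].extend((page[name][:n] + [None] * n)[:n])
    return pd.DataFrame(combined, columns=NAMES)
```

Calling `df = parse_pages(text)` on the string returned by PDFMiner would then give one row per table row across all pages. All values are kept as strings; numeric columns can be converted afterwards, for example with `pd.to_numeric`.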