Here's my input data/file, which I got by converting from docx to html. For this reason the file is not "prettified" HTML.
<p><strong>TABLE 1 EUR A++ AT JON DOE'S PLACE 27-06-2020</strong></p><p> 13.30 1 Jennifer Shea (R) 27-06 29.99 Tea 1/1 Fish and chips or Salad and rice</p><p> 15.00 2 Micheal Beltran (R) 27-06 30.99 Wine 2/2 Vegan super silky puff pastry</p><p> 16.00 3 <strong>DARIUS GALLAGHER (IP)</strong> 27-06 29.99 Wine Premium 4/4 Pear pecan and blue cheese salad</p><p> 18.00 4 Ashanti Fields (R) 27-06 N/A Wine 2/0 </p>
Reading the file/data into Python (with bs4), in line 2 we see:
'\t13.30\t1\tJennifer Shea (R)\t\t27-06\t29.99\tTea\t1/1\tFish and chips or Salad and rice'
The text inside the <p></p> tags is separated by tabs ('\t').
I want to (i) split the content of each <p></p> line into smaller pieces using '\t' as a separator, (ii) change the element tags to table rows and (iii) set (predefined) tags with class names for each item. For example, I want the paragraph line above converted to:
<table>
  <tbody>
    <tr>
      <td colspan="2" class="header-table-no">TABLE 1</td>
      <td colspan="2" class="header-currency">EUR</td>
      <td colspan="2" class="header-class">A++</td>
      <td colspan="2" class="header-place-date">AT JON DOE'S PLACE 27-06-2020</td>
    </tr>
    <tr>
      <td class="cooktable-time">13.30</td>
      <td class="cooktable-no">1</td>
      <td class="cooktable-name">Jennifer Shea (R)</td>
      <td class="cooktable-date">27-06</td>
      <td class="cooktable-price">29.99</td>
      <td class="cooktable-drink">Tea</td>
      <td class="cooktable-ppfood">1/1</td>
      <td class="cooktable-menu">Fish and chips or Salad and rice</td>
    </tr>
  </tbody>
</table>
There are about 3000 more tables to loop over, and the data is consistent. So far I've managed to split every line and add the pieces to a list:
from bs4 import BeautifulSoup

# Parse the converted HTML file
with open("simplesample.html") as f:
    soup = BeautifulSoup(f, features="lxml")

paragraphs = soup.find_all("p")
delimiter = "\t"
total_list = []
for p in paragraphs:
    # get_text() returns plain text only, so inline tags such as <strong> are dropped
    total_list.append(p.get_text().split(delimiter))

for row in total_list:
    print(row)
# Output
['TABLE 1', 'EUR', 'A++', "AT JON DOE'S PLACE 27-06-2020"]
['', '13.30', '1', 'Jennifer Shea (R)', '', '27-06', '29.99', 'Tea', '1/1', 'Fish and chips or Salad and rice']
['', '15.00', '2', 'Micheal Beltran (R)', '', '27-06', '30.99', 'Wine', '2/2', 'Vegan super silky puff pastry']
['', '16.00', '3', 'DARIUS GALLAGHER (IP)', '', '27-06', '29.99', 'Wine Premium', '4/4', 'Pear pecan and blue cheese salad']
['', '18.00', '4', 'Ashanti Fields (R)', '', '27-06', 'N/A', 'Wine', '2/0', '']
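For the tagging step, lists like the ones printed above could be turned into table rows roughly as follows. This is only a sketch: `make_row`, `ROW_CLASSES` and `HEADER_CLASSES` are hypothetical names, the position of the empty strings is assumed to be the same in every row, and `html.parser` is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Class names by position in the split list. None marks the empty strings
# produced by the leading tab and the doubled tab (assumed constant per row).
ROW_CLASSES = [None, "cooktable-time", "cooktable-no", "cooktable-name", None,
               "cooktable-date", "cooktable-price", "cooktable-drink",
               "cooktable-ppfood", "cooktable-menu"]
HEADER_CLASSES = ["header-table-no", "header-currency", "header-class",
                  "header-place-date"]


def make_row(soup, values, classes, colspan=None):
    """Build a <tr> whose <td> cells carry the given class names."""
    tr = soup.new_tag("tr")
    for value, cls in zip(values, classes):
        if cls is None:  # skip filler positions
            continue
        td = soup.new_tag("td")
        if colspan:
            td["colspan"] = str(colspan)
        td["class"] = cls
        td.string = value
        tr.append(td)
    return tr


# Start from an empty document and add the table skeleton
soup = BeautifulSoup("", "html.parser")
table = soup.new_tag("table")
tbody = soup.new_tag("tbody")
table.append(tbody)
soup.append(table)

header = ["TABLE 1", "EUR", "A++", "AT JON DOE'S PLACE 27-06-2020"]
line = ["", "13.30", "1", "Jennifer Shea (R)", "", "27-06", "29.99", "Tea",
        "1/1", "Fish and chips or Salad and rice"]
tbody.append(make_row(soup, header, HEADER_CLASSES, colspan=2))
tbody.append(make_row(soup, line, ROW_CLASSES))
print(soup.prettify())
```

Because `td.string` takes plain text, this version still loses any bold formatting inside a cell.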
EDIT: The problem here is that when I use the str.split() method on the data I got from get_text(), I lose some formatting information. E.g., in the original data, on line 3, 'DARIUS GALLAGHER (IP)' is actually inside a <strong></strong> tag. But get_text() naturally returns only the string value, so the output becomes non-bold.
UPDATE: To clarify, I would like to keep the <strong> tags as they are. Except for the headers, <strong> tags appear randomly in the data.
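One possible way to keep the <strong> tags is to walk the children of each <p> instead of calling get_text(): text nodes are split on the tab, while tag nodes are attached to the current cell unchanged. The sketch below assumes that an inline tag never contains a tab itself (which does not hold for the all-bold header line); `split_nodes` is a hypothetical helper name.

```python
import copy

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def split_nodes(p, sep="\t"):
    """Split the children of <p> into cells on `sep`, keeping tags intact."""
    cells = [[]]
    for child in p.children:
        if isinstance(child, NavigableString):
            parts = str(child).split(sep)
            if parts[0]:
                cells[-1].append(parts[0])  # continues the current cell
            for part in parts[1:]:
                cells.append([part] if part else [])  # each tab opens a new cell
        else:
            # Assumption: an inline tag such as <strong> holds no tab itself.
            # A copy is stored so the source paragraph is left untouched.
            cells[-1].append(copy.copy(child))
    return cells


html = ("<p>\t16.00\t3\t<strong>DARIUS GALLAGHER (IP)</strong>"
        "\t\t27-06\t29.99\tWine Premium\t4/4\tPear pecan and blue cheese salad</p>")
soup = BeautifulSoup(html, "html.parser")

tr = soup.new_tag("tr")
for cell in split_nodes(soup.p):
    td = soup.new_tag("td")
    for node in cell:
        td.append(node)  # strings become text nodes, tags stay tags
    tr.append(td)
print(tr)
```

The cells come out in the same positions as with get_text().split('\t'), including the empty ones, so class names can still be assigned by index.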
My questions are:
- Is there any bs4 method that solves this while keeping the current formatting, instead of splitting the tags as strings and adding them to a list?
- From here, how can I tag the list elements with predetermined tags and class names using BeautifulSoup?
beautifulsoup is good for parsing tags, not plain text - for that, the re module is better. In this case I see it for some combination of both. – Andrej Kesely
[...] .docx format and convert it to .html using some other tools. The original file contains hundreds more records just like the one I provided. I am not very good at regex, what kind of solution do you have in mind? Thanks. – rushas
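One reading of the "combination of both" idea, not necessarily what the commenter had in mind, is to let bs4 find the paragraphs and then split each paragraph's inner HTML with re. The sketch below assumes that no tag spans a tab and that only the doubled tab after the name is ever empty (an empty field in the middle of a row would shift the columns); `row_markup` and `CLASSES` are hypothetical names.

```python
import re

from bs4 import BeautifulSoup

CLASSES = ["cooktable-time", "cooktable-no", "cooktable-name", "cooktable-date",
           "cooktable-price", "cooktable-drink", "cooktable-ppfood",
           "cooktable-menu"]


def row_markup(p):
    """Return <tr> markup for one data paragraph, keeping inline tags."""
    # decode_contents() gives the inner HTML, so <strong> survives as markup
    inner = p.decode_contents().lstrip("\t")
    # A run of tabs counts as one separator (assumption stated above)
    fields = re.split(r"\t+", inner)
    cells = "".join(f'<td class="{cls}">{field}</td>'
                    for cls, field in zip(CLASSES, fields))
    return f"<tr>{cells}</tr>"


html = ("<p>\t16.00\t3\t<strong>DARIUS GALLAGHER (IP)</strong>"
        "\t\t27-06\t29.99\tWine Premium\t4/4\tPear pecan and blue cheese salad</p>")
soup = BeautifulSoup(html, "html.parser")
print(row_markup(soup.p))
```

The header paragraph would need separate handling, since its <strong> tag wraps the tabs themselves.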