
Here's my input file, which I got by converting a .docx document to HTML. For this reason the HTML is not prettified.

<p><strong>TABLE 1  EUR A++ AT JON DOE'S PLACE 27-06-2020</strong></p><p>   13.30   1   Jennifer Shea (R)       27-06   29.99   Tea 1/1 Fish and chips or Salad and rice</p><p> 15.00   2   Micheal Beltran (R)     27-06   30.99   Wine    2/2 Vegan super silky puff pastry</p><p>    16.00   3   <strong>DARIUS GALLAGHER (IP)</strong>      27-06   29.99   Wine Premium    4/4 Pear pecan and blue cheese salad</p><p> 18.00   4   Ashanti Fields (R)      27-06   N/A Wine    2/0 </p>

After reading the file into Python (with bs4), the second paragraph looks like this:

'\t13.30\t1\tJennifer Shea (R)\t\t27-06\t29.99\tTea\t1/1\tFish and chips or Salad and rice'

The text inside the <p></p> tags is separated by tabs ('\t').

I want to (i) split the content of each <p></p> line into smaller pieces using '\t' as the separator, (ii) turn the paragraph elements into table rows, and (iii) wrap each item in a predefined tag with a class name. For example, I want the header and the first data paragraph above converted to:

<table>
    <tbody>
        <tr>
            <td colspan="2" class="header-table-no">TABLE 1</td>
            <td colspan="2" class="header-currency">EUR</td>
            <td colspan="2" class="header-class">A++</td>
            <td colspan="2" class="header-place-date">AT JON DOE'S PLACE 27-06-2020</td>
        </tr>
        <tr>
            <td class="cooktable-time">13.30</td>
            <td class="cooktable-no">1</td>
            <td class="cooktable-name">Jennifer Shea (R)</td>
            <td class="cooktable-date">27-06</td>
            <td class="cooktable-price">29.99</td>
            <td class="cooktable-drink">Tea</td>
            <td class="cooktable-ppfood ">1/1</td>
            <td class="cooktable-menu">Fish and chips or Salad and rice</td>
        </tr>
    </tbody>
</table>

There are about 3000 more tables to loop over, and the data is consistent. So far I've managed to split every line and collect the pieces in a list:

from bs4 import BeautifulSoup

# Parse the converted file; the context manager closes the handle afterwards.
with open("simplesample.html") as f:
    soup = BeautifulSoup(f, features="lxml")

paragraphs = soup.find_all("p")
delimiter = '\t'

# Split the text of every <p> on tabs (this drops inline tags such as <strong>).
total_list = [p.get_text().split(delimiter) for p in paragraphs]

for row in total_list:
    print(row)
# Output                                                                                  
['TABLE 1', 'EUR', 'A++', "AT JON DOE'S PLACE 27-06-2020"]
['', '13.30', '1', 'Jennifer Shea (R)', '', '27-06', '29.99', 'Tea', '1/1', 'Fish and chips or Salad and rice']
['', '15.00', '2', 'Micheal Beltran (R)', '', '27-06', '30.99', 'Wine', '2/2', 'Vegan super silky puff pastry']
['', '16.00', '3', 'DARIUS GALLAGHER (IP)', '', '27-06', '29.99', 'Wine Premium', '4/4', 'Pear pecan and blue cheese salad']
['', '18.00', '4', 'Ashanti Fields (R)', '', '27-06', 'N/A', 'Wine', '2/0', '']

EDIT: The problem is that when I call str.split() on the text returned by get_text(), I lose formatting information. E.g., in the original data 'DARIUS GALLAGHER (IP)' on line 3 is inside a <strong></strong> tag, but get_text() naturally returns only the string value, so the output is no longer bold.

UPDATE: To clarify, I would like to keep the <strong> tags as they are. Except for the headers, <strong> tags appear at random places in the data.

Questions are:

  1. Is there a bs4 method that solves this while keeping the existing formatting, instead of splitting the text into strings and adding them to a list?
  2. From here, how can I wrap the list elements in predetermined tags with class names using BeautifulSoup?
Can you share the URL (if there is one)? Generally, beautifulsoup is good for parsing tags, not plain text; for that, the re module is better. In this case I can see some combination of both. - Andrej Kesely
@AndrejKesely Unfortunately there is no URL. I receive this file in .docx format and convert it to .html using some other tools. The original file contains hundreds more entries just like the one I provided. I am not very good at regex; what kind of solution do you have in mind? Thanks. - rushas
I posted an answer with a suggestion of how it could look. - Andrej Kesely

1 Answer


Without the real data it's hard to come up with a 100% correct solution, but this example can get you started (it uses only the beautifulsoup module and itertools.zip_longest, which fills in missing cooktable-menu entries):

from bs4 import BeautifulSoup
from itertools import zip_longest


txt = '''<p><strong>TABLE 1\tEUR\tA++\tAT JON DOE'S PLACE 27-06-2020</strong></p><p>\t13.30\t1\tJennifer Shea (R)\t\t27-06\t29.99\tTea\t1/1\tFish and chips or Salad and rice</p><p>\t15.00\t2\tMicheal Beltran (R)\t\t27-06\t30.99\tWine\t2/2\tVegan super silky puff pastry</p><p>\t16.00\t3\t<strong>DARIUS GALLAGHER (IP)</strong>\t\t27-06\t29.99\tWine Premium\t4/4\tPear pecan and blue cheese salad</p><p>\t18.00\t4\tAshanti Fields (R)\t\t27-06\tN/A\tWine\t2/0\t</p>'''


template_header = '''        <tr>
            <td colspan="2" class="header-table-no">{table_no}</td>
            <td colspan="2" class="header-currency">{currency}</td>
            <td colspan="2" class="header-class">{class_}</td>
            <td colspan="2" class="header-place-date">{place_date}</td>
        </tr>'''

template_row = '''        <tr>
            <td class="cooktable-time">{time}</td>
            <td class="cooktable-no">{no}</td>
            <td class="cooktable-name">{name}</td>
            <td class="cooktable-date">{date}</td>
            <td class="cooktable-price">{price}</td>
            <td class="cooktable-drink">{drink}</td>
            <td class="cooktable-ppfood ">{ppfood}</td>
            <td class="cooktable-menu">{menu}</td>
        </tr>'''

template = '''<table>
    <tbody>
{header}
{rows}
    </tbody>
</table>'''


soup = BeautifulSoup(txt, 'html.parser')

# remove strong tags:
for strong in soup.select('strong'):
    p = strong.parent
    strong.unwrap()
    p.smooth()

all_p = soup.select('p')

table_no, currency, class_, place_date = all_p[0].get_text(strip=True).split('\t')
h = template_header.format(table_no=table_no, currency=currency, class_=class_, place_date=place_date)

all_data = []
for p in all_p[1:]:
    all_data.append(p.get_text(strip=True).split('\t'))

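r = []
# zip_longest transposes the rows and pads short ones with ''; the outer zip transposes them back.
# The fourth field is the empty column produced by the double tab, so it is discarded as _.
for time, no, name, _, date, price, drink, ppfood, menu in zip(*zip_longest(*all_data, fillvalue='')):
    r.append(template_row.format(time=time, no=no, name=name, date=date, price=price, drink=drink, ppfood=ppfood, menu=menu))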

print(template.format(header=h, rows='\n'.join(r)))

Prints:

<table>
    <tbody>
        <tr>
            <td colspan="2" class="header-table-no">TABLE 1</td>
            <td colspan="2" class="header-currency">EUR</td>
            <td colspan="2" class="header-class">A++</td>
            <td colspan="2" class="header-place-date">AT JON DOE'S PLACE 27-06-2020</td>
        </tr>
        <tr>
            <td class="cooktable-time">13.30</td>
            <td class="cooktable-no">1</td>
            <td class="cooktable-name">Jennifer Shea (R)</td>
            <td class="cooktable-date">27-06</td>
            <td class="cooktable-price">29.99</td>
            <td class="cooktable-drink">Tea</td>
            <td class="cooktable-ppfood ">1/1</td>
            <td class="cooktable-menu">Fish and chips or Salad and rice</td>
        </tr>
        <tr>
            <td class="cooktable-time">15.00</td>
            <td class="cooktable-no">2</td>
            <td class="cooktable-name">Micheal Beltran (R)</td>
            <td class="cooktable-date">27-06</td>
            <td class="cooktable-price">30.99</td>
            <td class="cooktable-drink">Wine</td>
            <td class="cooktable-ppfood ">2/2</td>
            <td class="cooktable-menu">Vegan super silky puff pastry</td>
        </tr>
        <tr>
            <td class="cooktable-time">16.00</td>
            <td class="cooktable-no">3</td>
            <td class="cooktable-name">DARIUS GALLAGHER (IP)</td>
            <td class="cooktable-date">27-06</td>
            <td class="cooktable-price">29.99</td>
            <td class="cooktable-drink">Wine Premium</td>
            <td class="cooktable-ppfood ">4/4</td>
            <td class="cooktable-menu">Pear pecan and blue cheese salad</td>
        </tr>
        <tr>
            <td class="cooktable-time">18.00</td>
            <td class="cooktable-no">4</td>
            <td class="cooktable-name">Ashanti Fields (R)</td>
            <td class="cooktable-date">27-06</td>
            <td class="cooktable-price">N/A</td>
            <td class="cooktable-drink">Wine</td>
            <td class="cooktable-ppfood ">2/0</td>
            <td class="cooktable-menu"></td>
        </tr>
    </tbody>
</table>
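
To see what the zip(*zip_longest(...)) expression in the loop above does, here is the idiom in isolation. The two short rows are made up for illustration and are not taken from the real file; the point is only that rows with missing trailing fields get padded to a common length.

```python
from itertools import zip_longest

# Two illustrative rows of unequal length (hypothetical data).
rows = [
    ["18.00", "4", "Wine"],
    ["13.30", "1", "Tea", "Fish and chips"],
]

# zip_longest(*rows) yields columns, padding missing cells with '';
# zip(*...) turns the padded columns back into rows.
padded = list(zip(*zip_longest(*rows, fillvalue="")))

for row in padded:
    print(row)
```

Note that the padded rows come back as tuples, and that padding only ever happens at the end of a row, so this works when the missing value is the last field (as with the empty menu entry in the sample).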

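EDIT: Version that does not unwrap the <strong> tags first. It reads the text nodes of each paragraph directly, so the generated cells still contain plain text only: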

from bs4 import BeautifulSoup
from itertools import zip_longest, chain


txt = '''<p><strong>TABLE 1\tEUR\tA++\tAT JON DOE'S PLACE 27-06-2020</strong></p><p>\t13.30\t1\tJennifer Shea (R)\t\t27-06\t29.99\tTea\t1/1\tFish and chips or Salad and rice</p><p>\t15.00\t2\tMicheal Beltran (R)\t\t27-06\t30.99\tWine\t2/2\tVegan super silky puff pastry</p><p>\t16.00\t3\t<strong>DARIUS GALLAGHER (IP)</strong>\t\t27-06\t29.99\tWine Premium\t4/4\tPear pecan and blue cheese salad</p><p>\t18.00\t4\tAshanti Fields (R)\t\t27-06\tN/A\tWine\t2/0\t</p>'''

template_header = '''        <tr>
            <td colspan="2" class="header-table-no">{table_no}</td>
            <td colspan="2" class="header-currency">{currency}</td>
            <td colspan="2" class="header-class">{class_}</td>
            <td colspan="2" class="header-place-date">{place_date}</td>
        </tr>'''

template_row = '''        <tr>
            <td class="cooktable-time">{time}</td>
            <td class="cooktable-no">{no}</td>
            <td class="cooktable-name">{name}</td>
            <td class="cooktable-date">{date}</td>
            <td class="cooktable-price">{price}</td>
            <td class="cooktable-drink">{drink}</td>
            <td class="cooktable-ppfood">{ppfood}</td>
            <td class="cooktable-menu">{menu}</td>
        </tr>'''

template = '''<table>
    <tbody>
{header}
{rows}
    </tbody>
</table>'''


soup = BeautifulSoup(txt, 'html.parser')

all_p = soup.select('p')
table_no, currency, class_, place_date = all_p[0].get_text(strip=True).split('\t')
h = template_header.format(table_no=table_no, currency=currency, class_=class_, place_date=place_date)

all_data = []
for p in all_p[1:]:
    # Split every text node on tabs and keep only the non-empty fields.
    fields = chain.from_iterable(s.split('\t') for s in p.find_all(string=True))
    all_data.append([f.strip() for f in fields if f.strip()])

r = []
for time, no, name, date, price, drink, ppfood, menu in zip(*zip_longest(*all_data, fillvalue='')):
    r.append(template_row.format(time=time, no=no, name=name, date=date, price=price, drink=drink, ppfood=ppfood, menu=menu))

print(template.format(header=h, rows='\n'.join(r)))
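
If the <strong> markup itself has to survive in the output, one possible approach is to split the inner HTML of each data paragraph instead of its text. The sketch below is only that, a sketch under assumptions: the helper names row_fields and render_row are made up here, the class list is copied from the question's target markup, and it assumes that a <strong> tag in a data row never spans a tab and that fields 0 and 4 are always the empty ones, as in the sample. The header paragraph is not covered, because its <strong> wraps all fields.

```python
from bs4 import BeautifulSoup

# Class names from the question's target markup, in column order.
ROW_CLASSES = [
    "cooktable-time", "cooktable-no", "cooktable-name", "cooktable-date",
    "cooktable-price", "cooktable-drink", "cooktable-ppfood", "cooktable-menu",
]


def row_fields(p):
    """Split the inner HTML of a <p> on tabs, keeping inline tags intact."""
    inner = p.decode_contents()  # inner markup of the paragraph as a string
    fields = [f.strip() for f in inner.split("\t")]
    # Assumption: positions 0 and 4 are the empty fields seen in the sample.
    fields = [f for i, f in enumerate(fields) if i not in (0, 4)]
    # Pad rows whose trailing fields are missing.
    fields += [""] * (len(ROW_CLASSES) - len(fields))
    return fields


def render_row(p):
    """Render one data paragraph as a <tr> with one classed <td> per field."""
    cells = "\n".join(
        f'            <td class="{cls}">{field}</td>'
        for cls, field in zip(ROW_CLASSES, row_fields(p))
    )
    return f"        <tr>\n{cells}\n        </tr>"


sample = ("<p>\t16.00\t3\t<strong>DARIUS GALLAGHER (IP)</strong>\t\t27-06\t29.99"
          "\tWine Premium\t4/4\tPear pecan and blue cheese salad</p>")
soup = BeautifulSoup(sample, "html.parser")
html_row = render_row(soup.p)
print(html_row)
```

The rendered rows could be joined and passed to the same template string as before. Unlike the version that filters out empty strings, this keeps fields in their positions, so an empty value in the middle of a row does not shift the remaining columns.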