I am receiving information in HTML format and have to store it. Using BeautifulSoup in Python I can extract specific information, but I have to give a class name in the filter, and this table doesn't have any class name. I want a dict like this: {"Product":"Choclate, Honey, Shampoo", "Quantity":"3, 1, 1", "Price":"45 , 32, 16"}

The sample HTML renders like this:

Product   Quantity  Price
Choclate  3         ₹ 45.00
Honey     2         ₹ 32.00
Shampoo   1         ₹ 16.00

The HTML source:
<table align="center" cellspacing="0" cellpadding="6" width="95%" style="border:0;color:#000000;line-height:150%;text-align:left;font:300 14px/30px &#39;Helvetica Neue&#39;,Helvetica,Arial,sans-serif" border=".5px"><thead><tr style="background:#efefef"><th scope="col" width="50%" style="text-align:left;border:1px solid #eee">Product</th> <th scope="col" width="30%" style="text-align:right;border:1px solid #eee">Quantity</th> <th scope="col" width="30%" style="text-align:right;border:1px solid #eee">Price</th> </tr></thead><tbody><tr width="100%"><td width="50%" style="text-align:left;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:0;border-top:0;word-wrap:break-word">Choclate<br><small></small></td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:0;border-top:0">3</td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:1px solid #eee;border-top:0"><span>₹ 45.00<br><small></small></span></td> </tr><tr width="100%"><td width="50%" style="text-align:left;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:0;border-top:0;word-wrap:break-word">Honey<br><small></small></td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:0;border-top:0">2</td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:1px solid #eee;border-top:0"><span>₹ 32.00<br><small></small></span></td> </tr><tr width="100%"><td width="50%" style="text-align:left;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:0;border-top:0;word-wrap:break-word">Shampoo<br><small></small></td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid 
#eee;border-bottom:1px solid #eee;border-right:0;border-top:0">1</td> <td width="30%" style="text-align:right;vertical-align:middle;border-left:1px solid #eee;border-bottom:1px solid #eee;border-right:1px solid #eee;border-top:0"><span>₹ 16.00<br><small></small></span></td> </tr></tbody><tfoot><tr><td scope="col" style="text-align:left;vertical-align:middle;border-left:0;border-bottom:0;border-right:0;border-top:0;word-wrap:break-word"></td

1 Answer

You don't have to give a class name. If it is the only table in the document, simply search for the table tag. Otherwise, look at the surrounding HTML elements along the whole path from the <body> element to that table and check whether any classes, identifiers, or anything else single out this particular table. If all of this fails, you may have to search for a header cell containing the word Product and work your way up to the <table> element from there.
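For the simple cases, where the table is the only one in the document or can be reached through an enclosing element, a minimal sketch could look like the following. The trimmed-down HTML, the wrapping div with the id "order", and the helper name table_to_dict are assumptions made up for illustration; they are not taken from the question's real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the HTML from the question.
# The wrapping <div id="order"> is an assumption added for illustration only.
HTML = """
<div id="order">
<table>
  <thead><tr><th>Product</th><th>Quantity</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Choclate<br><small></small></td><td>3</td><td><span>₹ 45.00</span></td></tr>
    <tr><td>Honey<br><small></small></td><td>2</td><td><span>₹ 32.00</span></td></tr>
  </tbody>
</table>
</div>
"""


def table_to_dict(table):
    """Map each header text to a comma-separated string of its column values."""
    headers = [th.get_text(strip=True) for th in table.thead.find_all('th')]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.tbody.find_all('tr')
    ]
    # zip(*rows) transposes the list of rows into a sequence of columns.
    return {h: ', '.join(col) for h, col in zip(headers, zip(*rows))}


soup = BeautifulSoup(HTML, 'html.parser')

# Case 1: the table is the only one in the document.
only_table = soup.find('table')

# Case 2: single it out through an enclosing element (hypothetical id "order").
same_table = soup.select_one('#order table')

result = table_to_dict(only_table)
print(result)
```

In this sketch the price keeps its currency symbol; splitting each price on whitespace and taking the last part would leave just the number.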

As I don't know the surrounding HTML, I'll show the fallback solution, which searches for the header cell with a specific text value:

#!/usr/bin/env python
from __future__ import absolute_import, division, print_function
from pprint import pprint
from bs4 import BeautifulSoup


def main():
    with open('test.html') as html_file:
        # Name the parser explicitly so the result doesn't depend on what is installed.
        soup = BeautifulSoup(html_file, 'html.parser')

    header_row_node = soup.find('th', text='Product').parent
    headers = list(header_row_node.stripped_strings)
    header2values = dict((h, list()) for h in headers)
    for row_node in header_row_node.find_parent('table').tbody('tr'):
        product, quantity, price = row_node.stripped_strings
        price = price.split()[-1]  # Just take the number part.
        for header, value in zip(headers, [product, quantity, price]):
            header2values[header].append(value)

    # items() works on both Python 2 and 3; iteritems() exists only on Python 2.
    result = dict((h, ', '.join(vs)) for h, vs in header2values.items())
    pprint(result)



if __name__ == '__main__':
    main()

For the given test data (which I slightly corrected/completed before saving it as test.html) this prints:

{u'Price': u'45.00, 32.00, 16.00',
 u'Product': u'Choclate, Honey, Shampoo',
 u'Quantity': u'3, 2, 1'}