I'm a beginner working in Python 2.7. I want to parse and modify an HTML file. I'm using Beautiful Soup for this, and lxml is also an option. The problem: can I modify the HTML so that a piece of text is surrounded by a tag of my choosing? The text sits directly under the 'body' tag, so I want to wrap whatever text is directly under 'body' in my own tag. That way I can parse the file and locate this text easily.

<html><body>
<b>List Price:</b>
<strike>$150.00</strike><br />
<b>Price</b>
$117.80<br />
<b>You Save:</b>
$32.20(21%)<br />
<font size="-1" color="#009900">In Stock</font>
<br />
<a href="/gp/aw/help/id=sss/ref=aw_d_sss_shoes">Free Shipping</a>
<br/>
Ships from and sold by Amazon.com<br />
Gift-wrap available.<br /></body></html>

In this example I want to surround the text '$117.80' and '$32.20' with a tag of my choosing. How can I achieve this with Beautiful Soup or lxml?

1 Answer

I think you want to wrap tail text, and I would choose lxml to handle it. In lxml, text that follows an element's closing tag is stored as that element's tail. The following script iterates over every element and, for each one whose tail looks like a price, creates a new <div> element (choose whichever tag you prefer), moves the tail text into it and inserts it right after the element. A regular expression checks that the text looks like a price, which skips the tail text Ships from and sold by Amazon.com and Gift-wrap available.:

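from lxml import etree
import re

# The sample is well-formed XML, so etree.parse can read it directly.
tree = etree.parse('htmlfile')
root = tree.getroot()

for elem in root.iter('*'):
    # elem.tail is the text that follows elem's closing tag.
    if elem.tail is not None and elem.tail.strip() and re.search(r'\$\d+', elem.tail):
        e = etree.Element('div')
        e.text = elem.tail
        # Clear the tail so the text does not appear twice in the output.
        elem.tail = ''
        elem.addnext(e)

print(etree.tostring(root))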

It yields:

<html><body>
<b>List Price:</b>
<strike>$150.00</strike><br/>
<b>Price</b><div>
$117.80</div><br/>
<b>You Save:</b><div>
$32.20(21%)</div><br/>
<font size="-1" color="#009900">In Stock</font>
<br/>
<a href="/gp/aw/help/id=sss/ref=aw_d_sss_shoes">Free Shipping</a>
<br/>
Ships from and sold by Amazon.com<br/>
Gift-wrap available.<br/></body></html>
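
Since the question also mentions Beautiful Soup, here is a minimal sketch of the same idea with bs4 and the built-in html.parser. It is not part of the script above: the function name wrap_bare_prices, the choice of <span> as the wrapper and the inline copy of the sample markup are assumptions made for illustration. Unlike the lxml version, it strips the surrounding whitespace from the wrapped text.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Inline copy of the sample markup from the question.
HTML = """<html><body>
<b>List Price:</b>
<strike>$150.00</strike><br />
<b>Price</b>
$117.80<br />
<b>You Save:</b>
$32.20(21%)<br />
<font size="-1" color="#009900">In Stock</font>
<br />
<a href="/gp/aw/help/id=sss/ref=aw_d_sss_shoes">Free Shipping</a>
<br/>
Ships from and sold by Amazon.com<br />
Gift-wrap available.<br /></body></html>"""

PRICE = re.compile(r'\$\d+')


def wrap_bare_prices(html, tag_name='span'):
    """Wrap price-like text nodes that are direct children of <body>.

    Hypothetical helper; tag_name is whatever wrapper tag you prefer.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # Copy the children into a list first, because replacing nodes
    # while iterating over soup.body.children would disturb the loop.
    for node in list(soup.body.children):
        if isinstance(node, NavigableString) and PRICE.search(node):
            wrapper = soup.new_tag(tag_name)
            wrapper.string = node.strip()
            node.replace_with(wrapper)
    return soup


soup = wrap_bare_prices(HTML)
# The wrapped prices can now be located by tag name.
prices = [tag.get_text() for tag in soup.find_all('span')]
print(prices)
```

Only text nodes directly under 'body' are considered, so the list price inside <strike> is left alone, and the price regular expression skips the shipping and gift-wrap lines just as in the lxml script.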