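import lxml.html as lh
import urllib2  # Python 2; on Python 3 use urllib.request.urlopen instead

def text_tail(node):
    # Yield the text inside the element, then the text following its end tag.
    yield node.text
    yield node.tail

url='http://bit.ly/bf1T12'
doc=lh.parse(urllib2.urlopen(url))
for elt in doc.iter('td'):
    text=elt.text_content()
    if text.startswith('Additional Info'):
        # Collect every non-empty text fragment from the following sibling cells,
        # skipping cells that contain only a non-breaking space.
        blurb=[text for node in elt.itersiblings('td')
               for subnode in node.iter()
               for text in text_tail(subnode) if text and text!=u'\xa0']
        break
print('\n'.join(blurb))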
yields
For over 65 years, Carl Stirn's Marine
has been setting new standards of
excellence and service for boating
enjoyment. Because we offer quality
merchandise, caring, conscientious,
sales and service, we have been able
to make our customers our good
friends.
Our 26,000 sq. ft. facility includes a
complete parts and accessories
department, full service department
(Merc. Premier dealer with 2 full time
Mercruiser Master Tech's), and new,
used, and brokerage sales.
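To see why text_tail yields both node.text and node.tail, here is a minimal sketch on a made-up cell (not the real page). It assumes the standard library's xml.etree.ElementTree as a stand-in for lxml, since both store text the same way: text after a child's end tag lives in that child's .tail, not in the parent's .text.

```python
import xml.etree.ElementTree as ET

def text_tail(node):
    # Same helper as in the answer: text inside the element,
    # then the text that follows its end tag.
    yield node.text
    yield node.tail

# Hypothetical cell, invented for illustration.
cell = ET.fromstring('<td>First line<br/>Second line<b>bold</b> trailing</td>')

# Walk the cell and all its descendants, keeping non-empty fragments.
pieces = [text for subnode in cell.iter()
          for text in text_tail(subnode) if text]
print(pieces)
```

Here 'Second line' is the tail of the br element and ' trailing' is the tail of the b element, so collecting only .text would miss both.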
Edit: Here is an alternative solution based on Steven D. Majewski's XPath, which handles the OP's comment that the number of tags separating 'Additional Info' from the blurb may be unknown:
import lxml.html as lh
import urllib2
url='http://bit.ly/bf1T12'
doc=lh.parse(urllib2.urlopen(url))
blurb=doc.xpath('//td[child::*[text()="Additional Info"]]/following-sibling::td/text()')
blurb=[text for text in blurb if text != u'\xa0']
print('\n'.join(blurb))
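As a rough illustration of what that XPath selects, here is the same expression run against a small inline table. The markup is hypothetical (the real page is far more deeply nested), and the sketch assumes lxml is installed:

```python
import lxml.html as lh

# Hypothetical markup, not the real page: a label cell wrapped in <b>,
# a spacer cell holding only a non-breaking space, and the blurb cell.
html = (u'<table><tr>'
        u'<td><b>Additional Info</b></td>'
        u'<td>&nbsp;</td>'
        u'<td>First line<br>Second line</td>'
        u'</tr></table>')
doc = lh.fromstring(html)

# Find the td whose child element has the label text, then take the
# text nodes of every td that follows it in the same row.
blurb = doc.xpath('//td[child::*[text()="Additional Info"]]'
                  '/following-sibling::td/text()')
# Drop the spacer cell's non-breaking space.
blurb = [text for text in blurb if text != u'\xa0']
print('\n'.join(blurb))
```

Note that /text() only returns text nodes that are direct children of each following td; text nested inside further child elements would need //text() or the text_tail approach from the first solution.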
/html/body/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td[2]/div/div[2]/table/tbody/tr/td/div/div/table/tbody/tr[8]/td[3]. You'd probably start at //*[@id="BelowTheFold"]. I think the tbodies should be removed. Is the text "additional info" always there? – SiggyF