##### Update: renderContents() instead of contents[0] did the trick. I will still leave this open in case someone can provide a better, more elegant solution!
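For anyone on BeautifulSoup 4 rather than the version used here: `renderContents()` survives as a backwards-compatible alias (returning bytes), and the modern spelling is `decode_contents()`. A minimal sketch, assuming bs4 and the stdlib `html.parser` backend:

```python
from bs4 import BeautifulSoup

# A single cell from the sample table below, parsed standalone.
html = '<td width="33%"><a href="http://google.com"></a></td>'
td = BeautifulSoup(html, "html.parser").td

# decode_contents() returns the cell's inner HTML as a string,
# so the <a> tag is rendered intact instead of printing oddly.
print(td.decode_contents())  # -> <a href="http://google.com"></a>
```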
I am trying to parse a number of web pages for the desired data. The table doesn't have a class/ID attribute, so I have to search for 'Website' in the tr contents.
Problem at hand: printing td.contents works fine for plain text but not for hyperlinks, for some reason. What am I doing wrong? Is there a better way of doing this with BeautifulSoup in Python?
For those suggesting lxml: I have an ongoing thread here; installing lxml on CentOS without admin privileges is proving to be a handful at this time. Hence I am exploring the BeautifulSoup option.
HTML sample:

```html
<table border="2" width="100%">
<tbody><tr>
<td width="33%" class="BoldTD">Website</td>
<td width="33%" class="BoldTD">Last Visited</td>
<td width="34%" class="BoldTD">Last Loaded</td>
</tr>
<tr>
<td width="33%">
<a href="http://google.com"></a>
</td>
<td width="33%">01/14/2011
</td>
<td width="34%">
</td>
</tr>
<tr>
<td width="33%">
stackoverflow.com
</td>
<td width="33%">01/10/2011
</td>
<td width="34%">
</td>
</tr>
<tr>
<td width="33%">
<a href="http://stackoverflow.com"></a>
</td>
<td width="33%">01/10/2011
</td>
<td width="34%">
</td>
</tr>
</tbody></table>
```
Python code so far:

```python
from BeautifulSoup import BeautifulSoup

f1 = open(PATH + "/" + FILE)
pageSource = f1.read()
f1.close()

soup = BeautifulSoup(pageSource)
alltables = soup.findAll("table", {"border": "2", "width": "100%"})
print "Number of tables found : ", len(alltables)

for table in alltables:
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            print td.contents[0]
```
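One alternative worth sketching, assuming BeautifulSoup 4: instead of printing `td.contents[0]` raw, check the first cell for an `<a>` tag and take its `href`, falling back to the cell's text. A trimmed copy of the sample table is inlined so the sketch runs standalone; in the real script `pageSource` would come from the file read shown above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample table, inlined for a self-contained demo.
pageSource = """
<table border="2" width="100%"><tbody>
<tr><td class="BoldTD">Website</td><td class="BoldTD">Last Visited</td>
    <td class="BoldTD">Last Loaded</td></tr>
<tr><td><a href="http://google.com"></a></td><td>01/14/2011</td><td></td></tr>
<tr><td>stackoverflow.com</td><td>01/10/2011</td><td></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(pageSource, "html.parser")

rows = []
for table in soup.find_all("table", border="2", width="100%"):
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = tr.find_all("td")
        link = cells[0].find("a")
        # Prefer the hyperlink's href; fall back to the cell's bare text.
        website = link["href"] if link else cells[0].get_text(strip=True)
        rows.append((website, cells[1].get_text(strip=True)))

print(rows)
```

This sidesteps the `contents[0]` problem entirely: empty `<a>` tags still yield a usable URL via `href`, and plain-text cells come through `get_text()`.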
Comment: do you mean the output is `<a href="http://google.com"</a>` when it should be `<a href="http://google.com"></a>` (i.e. is it missing a `>`?) – unutbu