BeautifulSoup (Python) and parsing HTML table

Question

Update: using renderContents() instead of contents[0] did the trick. I will still leave the question open in case someone can provide a better, more elegant solution.

I am trying to parse a number of web pages for the desired data. The table doesn't have a class or id attribute, so I have to search for 'website' in the tr contents.

The problem at hand: printing td.contents[0] works fine for cells that contain just text, but not for cells that contain hyperlinks. What am I doing wrong? Is there a better way of doing this with BeautifulSoup in Python?

To those suggesting lxml: I have an ongoing thread about that. Installing lxml on CentOS without admin privileges is proving to be a handful at this time, hence I am exploring the BeautifulSoup option.

HTML sample:

                   <table border="2" width="100%">
                      <tbody><tr>
                        <td width="33%" class="BoldTD">Website</td>
                        <td width="33%" class="BoldTD">Last Visited</td>
                        <td width="34%" class="BoldTD">Last Loaded</td>
                      </tr>
                      <tr>
                        <td width="33%">
                          <a href="http://google.com"></a>
                        </td>
                        <td width="33%">01/14/2011
                                </td>
                        <td width="34%">
                                </td>
                      </tr>
                      <tr>
                        <td width="33%">
                          stackoverflow.com
                        </td>
                        <td width="33%">01/10/2011
                                </td>
                        <td width="34%">
                                </td>
                      </tr>
                      <tr>
                        <td width="33%">
                          <a href="http://stackoverflow.com"></a>
                        </td>
                        <td width="33%">01/10/2011
                                </td>
                        <td width="34%">
                                </td>
                      </tr>
                    </tbody></table>

Python code so far:

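        from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

        # PATH and FILE are set elsewhere in the script
        f1 = open(PATH + "/" + FILE)
        pageSource = f1.read()
        f1.close()
        soup = BeautifulSoup(pageSource)
        # The table has no class/id, so match on its border and width attributes
        alltables = soup.findAll("table", {"border": "2", "width": "100%"})
        print "Number of tables found : ", len(alltables)

        for table in alltables:
            rows = table.findAll('tr')
            for tr in rows:
                cols = tr.findAll('td')
                for td in cols:
                    # Prints only the first child node of each cell
                    print td.contents[0]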

Should <a href="http://google.com"</a> be <a href="http://google.com"></a> (i.e. is it missing a >?) — unutbu

Tauquir Tauquir · Accepted Answer · 2011-01-25T18:42:38

I answered a similar question here. I hope it helps.

A layman's solution:

        alltables = soup.findAll("table", {"border": "2", "width": "100%"})

        # Collect every td in the document
        t = [x for x in soup.findAll('td')]

        # Render each cell's inner markup and trim surrounding newlines
        [x.renderContents().strip('\n') for x in t]

Output:

['Website',
 'Last Visited',
 'Last Loaded',
 '<a href="http://google.com"></a>',
 '01/14/2011\n                                ',
 '',
 '                          stackoverflow.com\n                        ',
 '01/10/2011\n                                ',
 '',
 '<a href="http://stackoverflow.com"></a>',
 '01/10/2011\n                                ',
 '']
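
As for why contents[0] misbehaves on the link cells: in the sample, the <a> tag is preceded by a newline and indentation, so the first child of that td is most likely the whitespace text node rather than the link itself. Below is a minimal sketch of one way to get clean values per row. It assumes the newer bs4 package (BeautifulSoup 4) on Python 3 rather than the BeautifulSoup 3 API used above, and the names HTML, cell_value and parse_tables are made up for illustration. The choice to return the href for link cells is also an assumption about what you want out of them.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Trimmed copy of the sample table from the question
HTML = """
<table border="2" width="100%">
  <tbody><tr>
    <td width="33%" class="BoldTD">Website</td>
    <td width="33%" class="BoldTD">Last Visited</td>
    <td width="34%" class="BoldTD">Last Loaded</td>
  </tr>
  <tr>
    <td width="33%">
      <a href="http://google.com"></a>
    </td>
    <td width="33%">01/14/2011
    </td>
    <td width="34%">
    </td>
  </tr>
  <tr>
    <td width="33%">
      stackoverflow.com
    </td>
    <td width="33%">01/10/2011
    </td>
    <td width="34%">
    </td>
  </tr>
</tbody></table>
"""


def cell_value(td):
    """Return the link target if the cell holds a link, else its stripped text."""
    link = td.find("a", href=True)
    if link is not None:
        return link["href"]
    return td.get_text(strip=True)


def parse_tables(html):
    """Return one list of cell values per tr in every matching table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Same attribute match as in the question, since the table has no class/id
    for table in soup.find_all("table", {"border": "2", "width": "100%"}):
        for tr in table.find_all("tr"):
            rows.append([cell_value(td) for td in tr.find_all("td")])
    return rows


rows = parse_tables(HTML)
for row in rows:
    print(row)
```

Keeping the values grouped per row (instead of one flat list of cells) makes it easier to find the header row containing 'Website' and then read the matching columns from the rows below it.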
