Scrape a Wikipedia table using beautifulsoup

Question

I've been trying to scrape a table on Wikipedia using BeautifulSoup, but I've run into some problems.

Page: https://en.wikipedia.org/wiki/New_York_City

Table: "Racial composition"

In the page source, the table seems to start at line 1470.

Here's the code I tried first:

website_url = requests.get('https://en.wikipedia.org/wiki/New_York_City').text
soup = BeautifulSoup(website_url,'lxml')
table = soup.find('table',{'class':'wikitable sortable collapsible'})

headers = [header.text for header in table.find_all('th')]

table_rows = table.find_all('tr')        
rows = []
for row in table_rows:
   td = row.find_all('td')
   row = [row.text for row in td]
   rows.append(row)

with open('NYC_DEMO.csv', 'w') as f:
   writer = csv.writer(f)
   writer.writerow(headers)
   writer.writerows(row for row in rows if row)

And here's the error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-24-e6000bdafe11> in <module>
      3 table = soup.find('table',{'class':'wikitable sortable collapsible'})
      4 
----> 5 headers = [header.text for header in table.find_all('th')]
      6 
      7 table_rows = table.find_all('tr')

AttributeError: 'NoneType' object has no attribute 'find_all'

I suppose this is code from the Wikipedia page that we'd need to get:

<tbody><tr>
<th>Racial composition</th>
<th>2010<sup id="cite_ref-QuickFacts2010_226-1" class="reference"><a href="#cite_note-QuickFacts2010-226">&#91;224&#93;</a></sup></th>
<th>1990<sup id="cite_ref-pop_228-0" class="reference"><a href="#cite_note-pop-228">&#91;226&#93;</a></sup></th>
<th>1970<sup id="cite_ref-pop_228-1" class="reference"><a href="#cite_note-pop-228">&#91;226&#93;</a></sup></th>
<th>1940<sup id="cite_ref-pop_228-2" class="reference"><a href="#cite_note-pop-228">&#91;226&#93;</a></sup>
</th></tr>
<tr>
<td><a href="/wiki/White_American" class="mw-redirect" title="White American">White</a></td>
<td>44.0%</td>
<td>52.3%</td>
<td>76.6%</td>
<td>93.6%
</td></tr>
<tr>
...

I'm guessing it can't locate the right table? There's quite some tables on that page so how do I correctly point towards that table?

Thanks in advance for your help.

If you're only interested in the table contents, why not use pandas read_html? stackoverflow.com/questions/43344580/… (comment by sushanth)

Andrei Mustață · Accepted Answer · 2020-05-29T09:07:31

I'm guessing it can't locate the right table?

That seems to be the case, yes. If you check the value of table, you'll see that it is None, and that's why calling find_all on it fails.
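A minimal sketch of that failure mode, using a made-up HTML snippet rather than the live page (the snippet and the `message` variable are illustrative assumptions). `find` returns `None` when nothing matches, so it is worth checking the result before calling anything on it:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real page.
html = '<table class="wikitable collapsible"><tr><th>Racial composition</th></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# No table carries this exact class string, so find() returns None.
table = soup.find('table', {'class': 'wikitable sortable collapsible'})

if table is None:
    message = 'no matching table found'
else:
    message = f'found {len(table.find_all("th"))} header cells'
print(message)
```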

If you check the table on the page, you'll see that its classes are wikitable collapsible collapsed mw-collapsible mw-made-collapsible, and there's no sortable class in there. This is why your program doesn't find any matching table element.
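To illustrate why the extra `sortable` breaks the match, here is a small sketch on invented markup (the two tables below are assumptions, not the real page). A multi-class string passed to `find` is compared against the whole class attribute, whereas a CSS selector only requires that the listed classes be present:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two tables that share the "wikitable" class.
html = '''
<table class="wikitable sortable"><tr><th>Other</th></tr></table>
<table class="wikitable collapsible collapsed mw-collapsible"><tr><th>Racial composition</th></tr></table>
'''
soup = BeautifulSoup(html, 'html.parser')

# Compared against the full class attribute, so an extra or missing class means no match.
exact = soup.find('table', {'class': 'wikitable sortable collapsible'})

# Matches the first table that carries both classes, whatever else it has.
loose = soup.select_one('table.wikitable.collapsible')

print(exact)
print(loose.th.get_text())
```

On a page with many tables, a class-based selector may still match more than one, so it is probably best combined with another hook.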

There's quite some tables on that page so how do I correctly point towards that table?

First, you can hook onto some unique identifier, such as the element's id, but there's none available in your case. Had it had a thead or a caption of some sort, you could have tried that, but again, that's not the case.
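For comparison, this is roughly what those lookups would look like if the hooks existed. The markup is hypothetical (the real table has neither an id nor a caption), and the id `demographics` is made up for the example:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table with an id, one with a caption.
html = '''
<table id="demographics"><tr><td>a</td></tr></table>
<table><caption>Racial composition</caption><tr><td>b</td></tr></table>
'''
soup = BeautifulSoup(html, 'html.parser')

# Lookup by id: ids are meant to be unique within a page.
by_id = soup.find('table', id='demographics')

# Lookup by caption: find the caption first, then climb to its table.
caption = soup.find('caption', string='Racial composition')
by_caption = caption.find_parent('table') if caption else None

print(by_id.td.get_text(), by_caption.td.get_text())
```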

Then, you need to go further up the DOM tree and check whether any of its parents have unique identifiers, the plan being to add the parent to the selector. Unfortunately, the body of a Wikipedia article seems to be wrapped in just one big element, without semantically separating the sections. This makes it harder to scrape.
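A sketch of the parent-in-the-selector idea, on invented markup where the sections are wrapped separately (which, as noted, is not how the Wikipedia article body is laid out). The section ids are assumptions:

```python
from bs4 import BeautifulSoup

# Hypothetical page where each section has its own identifiable wrapper.
html = '''
<div id="content">
  <section id="demographics"><table><tr><td>target</td></tr></table></section>
  <section id="climate"><table><tr><td>other</td></tr></table></section>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# The descendant selector narrows the search to tables inside one section.
table = soup.select_one('section#demographics table')
print(table.td.get_text())
```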

At this point, I would say you're left with just looking at the page in the browser and thinking about how you would naturally identify the table (non-programmatically). You look at it and see it's got Racial composition in the heading. And you can grab it with something like

table_heading = soup.find('th', string='Racial composition')   # this gives you the `th`
if table_heading:
    # find_parent returns the nearest enclosing <table>; find_parents would return a list
    table = table_heading.find_parent('table')

There might be some other beautifulsoup APIs I don't know, but you can drop this in your code and it should work.
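Putting it together, a minimal end-to-end sketch under stated assumptions: the HTML below is a trimmed-down stand-in for the markup quoted in the question rather than a live request, and the CSV goes to an in-memory buffer instead of a file. It locates the table through its heading, drops the footnote markers, and collects headers and rows:

```python
import csv
import io
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the table markup quoted in the question.
html = '''
<table class="wikitable collapsible">
<tbody><tr>
<th>Racial composition</th>
<th>2010<sup class="reference"><a href="#cite_note-1">[224]</a></sup></th>
<th>1990<sup class="reference"><a href="#cite_note-2">[226]</a></sup>
</th></tr>
<tr>
<td><a href="/wiki/White_American">White</a></td>
<td>44.0%</td>
<td>52.3%
</td></tr>
</tbody></table>
'''
soup = BeautifulSoup(html, 'html.parser')

# Locate the table through its heading cell, then climb to the enclosing <table>.
heading = soup.find('th', string='Racial composition')
table = heading.find_parent('table') if heading else None
if table is None:
    raise ValueError('table not found')

# Drop footnote markers such as [224] so they do not end up in the header text.
for sup in table.find_all('sup', class_='reference'):
    sup.decompose()

headers = [th.get_text(strip=True) for th in table.find_all('th')]
rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row has no <td> cells
        rows.append(cells)

# Written to a buffer here; open('NYC_DEMO.csv', 'w', newline='') works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerows(rows)
print(buffer.getvalue())
```

Against the live page, `html` would come from `requests.get(...).text` as in the question. The real table may differ in details (for example row headers marked up as `th`), so the header and row extraction may need adjusting.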
