0
votes

I've been trying to scrape a table on Wikipedia using BeautifulSoup, but ran into some problems.

Page: https://en.wikipedia.org/wiki/New_York_City

Table: "Racial composition"

In the page source, the table seems to start at line 1470.

Here's the code I tried first:

website_url = requests.get('https://en.wikipedia.org/wiki/New_York_City').text
soup = BeautifulSoup(website_url,'lxml')
table = soup.find('table',{'class':'wikitable sortable collapsible'})

headers = [header.text for header in table.find_all('th')]

table_rows = table.find_all('tr')        
rows = []
for row in table_rows:
   td = row.find_all('td')
   row = [row.text for row in td]
   rows.append(row)

with open('NYC_DEMO.csv', 'w') as f:
   writer = csv.writer(f)
   writer.writerow(headers)
   writer.writerows(row for row in rows if row)

And here's the error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-24-e6000bdafe11> in <module>
      3 table = soup.find('table',{'class':'wikitable sortable collapsible'})
      4 
----> 5 headers = [header.text for header in table.find_all('th')]
      6 
      7 table_rows = table.find_all('tr')

AttributeError: 'NoneType' object has no attribute 'find_all'

I suppose this is code from the Wikipedia page that we'd need to get:

<tbody><tr>
<th>Racial composition</th>
<th>2010<sup id="cite_ref-QuickFacts2010_226-1" class="reference"><a href="#cite_note-QuickFacts2010-226">&#91;224&#93;</a></sup></th>
<th>1990<sup id="cite_ref-pop_228-0" class="reference"><a href="#cite_note-pop-228">&#91;226&#93;</a></sup></th>
<th>1970<sup id="cite_ref-pop_228-1" class="reference"><a href="#cite_note-pop-228">&#91;226&#93;</a></sup></th>
<th>1940<sup id="cite_ref-pop_228-2" class="reference"><a href="#cite_note-pop-228">&#91;226&#93;</a></sup>
</th></tr>
<tr>
<td><a href="/wiki/White_American" class="mw-redirect" title="White American">White</a></td>
<td>44.0%</td>
<td>52.3%</td>
<td>76.6%</td>
<td>93.6%
</td></tr>
<tr>
...

I'm guessing it can't locate the right table? There are quite a few tables on that page, so how do I correctly point towards that table?

Thanks in advance for your help.

2
If you're interested only in the table contents, why not use pandas read_html? stackoverflow.com/questions/43344580/… (sushanth)

2 Answers

1
votes

I'm guessing it can't locate the right table?

That seems to be the case, yes. If you check the value of table, you'll see that it is None, and that's why calling find_all on it fails.

If you check the table on the page, you'll see that its classes are wikitable collapsible collapsed mw-collapsible mw-made-collapsible, and there's no sortable class in there. This is why your program doesn't find any matching table element.
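To see this for yourself, a minimal sketch along the following lines can help. It runs against a small hypothetical HTML snippet (SAMPLE_HTML is a stand-in, not the real Wikipedia markup) and assumes only that bs4 is installed. It shows that a multi-class string is compared against the class attribute as a whole, and prints the classes each table actually carries.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the page (not the real markup).
SAMPLE_HTML = """
<table class="infobox"><tr><th>Other</th></tr></table>
<table class="wikitable collapsible collapsed mw-collapsible mw-made-collapsible">
  <tr><th>Racial composition</th><th>2010</th></tr>
  <tr><td>White</td><td>44.0%</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A multi-class string is matched against the exact class attribute value,
# so this lookup finds nothing: "sortable" is not part of the attribute.
missing = soup.find("table", {"class": "wikitable sortable collapsible"})
print(missing)

# Listing each table's classes shows what is actually available to match on.
table_classes = [t.get("class", []) for t in soup.find_all("table")]
for classes in table_classes:
    print(classes)
```

Running the same loop over the real page should make it clear which class names are safe to search for.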

There are quite a few tables on that page, so how do I correctly point towards that table?

First, you can hook onto some unique identifier, such as an id on the element, but there's none available in your case. Had it had a thead, or a caption of some sort, you could have tried with that, but again, that's not the case.

Then, you need to go further up the DOM tree and check whether any of its parents have a unique identifier, the plan being to add that parent to the selector. Unfortunately, the body of a Wikipedia article seems to be wrapped in just one big element, without semantically separating the sections. This makes it harder to scrape.

At this point, I would say you're left with just looking at the browser page and thinking about how you would naturally identify the table (non-programmatically). You look at it and see it's got Racial composition in the heading. And you can grab it with something like

table_heading = soup.find('th', string='Racial composition')    # this gives you the `th`
if table_heading:
    # find_parent returns a single tag; find_parents would return a list
    table = table_heading.find_parent('table')

There might be some other beautifulsoup APIs I don't know, but you can drop this in your code and it should work.
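For reference, here is a rough end-to-end sketch of that idea. It assumes a small hypothetical snippet (SAMPLE_HTML, modeled on the markup quoted in the question) instead of the live page, and writes the CSV to an in-memory buffer rather than NYC_DEMO.csv. Note that it uses find_parent (singular), which returns one tag rather than a list.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample modeled on the markup quoted in the question.
SAMPLE_HTML = """
<table class="wikitable collapsible collapsed">
<tbody><tr>
<th>Racial composition</th>
<th>2010<sup class="reference"><a href="#cite_note">[224]</a></sup></th>
</tr>
<tr>
<td><a href="/wiki/White_American">White</a></td>
<td>44.0%
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the header cell by its exact text, then climb to the enclosing table.
heading = soup.find("th", string="Racial composition")
table = heading.find_parent("table") if heading else None

headers, rows = [], []
if table is not None:
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so skip it
            rows.append(cells)

# Written to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerows(rows)
print(buffer.getvalue())
```

The explicit None check avoids the AttributeError from the question if the heading ever disappears from the page.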

1
votes

The issue is that .find() will not return a table with class="wikitable sortable collapsible" because that exact class string is not in the HTML. To match tables whose class attribute merely contains a given substring, you would need to use a regex. Secondly, .find() only returns the first element it finds, so unless the table you are trying to grab has a specific, unique attribute to identify it, using .find() won't work. If there are multiple matching elements, you need to use .find_all(), and even then you'd need to iterate through the results to get the table you want.
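As a possible alternative to a regex, the sketch below shows two other ways of matching on part of the class list, followed by the iteration step. It uses a hypothetical two-table snippet (SAMPLE_HTML, not the real page), and the header-text check is just one assumed way of picking out the right candidate.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two "wikitable" tables (not the real markup).
SAMPLE_HTML = """
<table class="wikitable sortable"><tr><th>Borough</th></tr></table>
<table class="wikitable collapsible collapsed mw-collapsible">
  <tr><th>Racial composition</th><th>2010</th></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ with a single class name matches any table that has that class,
# whatever other classes sit next to it.
candidates = soup.find_all("table", class_="wikitable")

# A CSS selector can require several classes at once, in any order.
collapsible = soup.select("table.wikitable.collapsible")

# Neither result is guaranteed to be unique, so check each candidate's headers.
target = next(
    (
        t for t in candidates
        if any(th.get_text(strip=True) == "Racial composition"
               for th in t.find_all("th"))
    ),
    None,
)
print(len(candidates), len(collapsible), target is not None)
```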

As someone stated, you could also use pandas' .read_html(). This returns every table on the page as a list of DataFrames, then it's a matter of finding the index position of the table you want. I provided both options for you:

Using Pandas:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/New_York_City'

df = pd.read_html(url)[9]
df.to_csv('NYC_DEMO.csv',index=False)
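The hard-coded index 9 depends on the page layout at the time of writing, so it may drift if the article changes. One way to make this less fragile, sketched here against a hypothetical two-table snippet (SAMPLE_HTML) rather than the live page, is the match parameter of read_html, which keeps only tables containing text that matches the given string or regex. This assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed.

```python
import io

import pandas as pd

# Hypothetical stand-in for the page; the first table is filler for illustration.
SAMPLE_HTML = """
<table><tr><th>Borough</th><th>Population</th></tr>
<tr><td>Example</td><td>1</td></tr></table>
<table><tr><th>Racial composition</th><th>2010</th></tr>
<tr><td>White</td><td>44.0%</td></tr></table>
"""

# match filters the returned list down to tables containing the given text.
tables = pd.read_html(io.StringIO(SAMPLE_HTML), match="Racial composition")
df = tables[0]
print(df)
```

On the real page, more than one table could still match, so it is worth checking len(tables) before indexing.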

Using BeautifulSoup:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/New_York_City'
website_url = requests.get(url).text
soup = BeautifulSoup(website_url,'html.parser')
tables = soup.find_all('table')
for table in tables:
    if 'Racial composition' in table.text:
        headers = [header.text.strip() for header in table.find_all('th')]
        rows = []
        table_rows = table.find_all('tr')
        for row in table_rows:
            # strip whitespace so trailing newlines don't end up in the cells
            cells = [cell.text.strip() for cell in row.find_all('td')]
            # the header row has no <td> cells, so skip the empty list it yields
            if cells:
                rows.append(cells)

df = pd.DataFrame(rows, columns=headers)

Output:

print (df)
                 Racial composition 2010[224] 1990[226]   1970[226] 1940[226]
0                             White     44.0%     52.3%       76.6%     93.6%
1                     —Non-Hispanic     33.3%     43.2%  62.9%[227]     92.0%
2         Black or African American     25.5%     28.7%       21.1%      6.1%
3  Hispanic or Latino (of any race)     28.6%     24.4%  16.2%[227]      1.6%
4                             Asian     12.7%      7.0%        1.2%         –