2
votes

I'm trying to scrape some numbers from the Risk Statistics table on a Yahoo Finance page using BeautifulSoup and Python 2.7: https://finance.yahoo.com/quote/SHSAX/risk

So far, I've looked at the HTML using https://codebeautify.org and loaded the page like this:

#!/usr/bin/python
from bs4 import BeautifulSoup, Comment
import urllib

riskURL = "https://finance.yahoo.com/quote/SHSAX/risk"
page = urllib.urlopen(riskURL)
content = page.read().decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')

My trouble is actually getting the numbers using soup.find. For example, standard deviation:

    # std should be 13.44
    stdevValue = float(soup.find("span",{"data-reactid":"124","class":"W(39%) Fl(start)"}).text)
    # std of category should be 0.18
    stdevCat = float(soup.find("span",{"data-reactid":"125","class":"W(57%) Mend(5px) Fl(end)"}).text)

Both of these calls to soup.find return None. What am I missing?

2 Answers

4
votes

From what I've read, "data-reactid" is a custom attribute that the React framework uses to reference components (you can read more in "what's data-reactid attribute in html?"). After a couple of tries, I noticed that the data-reactid values are different on every reload of the page, as if they were randomly generated.

I think you should try to find another approach that does not depend on those attributes.

Maybe you can try to find a specific element like the "Standard Deviation" row, and then loop down to gather the data.

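# Locate the row by its visible label instead of the unstable data-reactid
std_span = next(x for x in soup.find_all('span') if x.text == "Standard Deviation")
parent_div = std_span.parent
# find_next_siblings() yields only tags, so bare text nodes are skipped
for sibling in parent_div.find_next_siblings():
    # recursive=False keeps this to direct child tags, like .children
    for child in sibling.find_all(True, recursive=False):
        # do something
        print(child.text)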
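
To try this label-based approach without hitting the live page, here is a minimal sketch run against a small hand-written snippet. The SAMPLE_HTML markup is an assumption made for illustration only; the real Yahoo Finance structure may differ, so the traversal may need adjusting. It is written for Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one row of the Risk Statistics table.
# The real page layout may differ; this only illustrates the traversal.
SAMPLE_HTML = """
<div>
  <div><span>Standard Deviation</span></div>
  <div><span data-reactid="124">13.44</span></div>
  <div><span data-reactid="125">0.18</span></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find the label by its visible text rather than by data-reactid.
std_span = next(s for s in soup.find_all("span") if s.text == "Standard Deviation")

# find_next_siblings() returns only tags, so whitespace text nodes are skipped.
values = [
    float(child.text)
    for sibling in std_span.parent.find_next_siblings()
    for child in sibling.find_all("span", recursive=False)
]
print(values)
```

Because the lookup keys on the visible label, it keeps working if the data-reactid numbers in the snippet are changed.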

Hope it helps.

1
votes
from bs4 import BeautifulSoup
import urllib.request  # Python 3; on Python 2.7 use "import urllib" and urllib.urlopen()


riskURL = "https://finance.yahoo.com/quote/SHSAX/risk"
page = urllib.request.urlopen(riskURL)
content = page.read().decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')
#W(25%) Fl(start) Ta(e)
# find() returns None if no span carries this data-reactid value
results = soup.find("span", {"data-reactid" : "121"})
print(results.text)

Alternatively, you can use a regex and findNext to get the value:

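import re  # needed for re.compile below
from bs4 import BeautifulSoup
import urllib.request  # Python 3; on Python 2.7 use "import urllib" and urllib.urlopen()


riskURL = "https://finance.yahoo.com/quote/SHSAX/risk"
page = urllib.request.urlopen(riskURL)
content = page.read().decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')
# Match the span whose text starts with the label, then read the span after it
for span in soup.find_all('span',text=re.compile('^(Standard Deviation)')):
    print(span.findNext('span').text)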
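
If the values are needed as floats, as in the question, the regex lookup can be wrapped in a small helper. This is only a sketch under assumptions: read_stat and SAMPLE_HTML are hypothetical names introduced here, and the sample markup is hand-written rather than copied from Yahoo Finance. It uses find_next, the current spelling of findNext, and targets Python 3.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the live page may be laid out differently.
SAMPLE_HTML = (
    "<div><span>Standard Deviation</span>"
    "<span>13.44</span><span>0.18</span></div>"
)


def read_stat(soup, label, count=2):
    """Return up to `count` floats from the spans following `label`, or None."""
    label_span = soup.find("span", string=re.compile("^" + re.escape(label)))
    if label_span is None:
        # Label not present, e.g. the page layout changed.
        return None
    values = []
    span = label_span
    for _ in range(count):
        span = span.find_next("span")
        if span is None:
            break
        try:
            values.append(float(span.get_text(strip=True)))
        except ValueError:
            # Stop at the first non-numeric cell instead of raising.
            break
    return values


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(read_stat(soup, "Standard Deviation"))
print(read_stat(soup, "Sharpe Ratio"))
```

Returning None for a missing label makes the failure explicit, instead of the AttributeError that results from calling .text on a failed find.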