
I'm trying to extract the value of "Next 5 Years (per annum)" for the stock BABA from the Yahoo Finance "Analysis" tab: https://finance.yahoo.com/quote/BABA/analysis?p=BABA. (It is 2.85%, in the second row from the bottom.)

I have been trying to follow these questions:

Scrape Yahoo Finance Financial Ratios

Scrape Yahoo Finance Income Statement with Python

But I can't even extract the data from the page.

I tried this website as well:

https://hackernoon.com/scraping-yahoo-finance-data-using-python-ayu3zyl

This is the code I wrote to get the web page data.

First import the packages:

import requests
from bs4 import BeautifulSoup

Then I try to extract the data from the page:

Url= "https://finance.yahoo.com/quote/BABA/analysis?p=BABA"
r = requests.get(Url)
data = r.text
soup = BeautifulSoup(data,features="lxml")

When looking at the types of the "data" and "soup" objects, I see that

type(data)
<class 'str'>

I can somehow extract the needed data from the ">Next 5 Years" row using regular expressions.

But when looking at

type(soup)
<class 'bs4.BeautifulSoup'>

the data in it does not seem to be relevant to the page, for some reason.

It looks like this (I copied only a small part of what is in the soup object):

soup
<!DOCTYPE html>
<html class="NoJs featurephone" id="atomic" lang="en-US"><head prefix="og: 
http://ogp.me/ns#"><script>window.performance && window.performance.mark &&  
window.performance.mark('PageStart');</script><meta charset="utf-8"/> 
<title>Alibaba Group Holding Limited (BABA) Analyst Ratings, Estimates &amp; 
Forecasts - Yahoo Finance</title><meta con 
tent="recommendation,analyst,analyst 
rating,strong buy,strong 
sell,hold,buy,sell,overweight,underweight,upgrade,downgrade,price target,EPS 
estimate,revenue estimate,growth estimate,p/e 
estimate,recommendation,analyst,analyst rating,strong buy,strong 
sell,hold,buy,sell,overweight,underweight,upgrade,downgrade,price target,EPS 
estimate,revenue estimate,growth estimate,p/e estimate" name="keywords"/> 
<meta   content="on" http-equiv="x-dns-prefetch-control"/><meta content="on" 
property="twitter:dnt"/><meta content="90376669494" property="fb:app_id"/> 
<meta content="#400090" name="theme-color"/><meta content="width=device- 
width, 
  1. Is there any other way to extract the needed data from the data object that does NOT use regular expressions?
  2. How does the soup object help me extract the data? (I see it is used a lot, but I'm not sure how to make it useful.)

Thanks in advance.

You can use the soup object to find the respective <div> element on your page. Follow the steps described in this tutorial: link. - GinTonic
I tried the link, but my soup object doesn't look like the one in the link you added (I edited the question so you can see how it looks). The soup object doesn't seem to have the data from the page, while for some reason the data object does contain the information from the page - TaL

1 Answer


One solution is to use a regex to extract the value from the JSON data embedded in the page's JavaScript. The JSON data is assigned to the following variable:

root.App.main = { .... };

Example :

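import json
import re

import requests

r = requests.get("https://finance.yahoo.com/quote/BABA/analysis?p=BABA")

# Capture the JSON object assigned to root.App.main in the page's inline script
# (raw string, so the regex escapes are not treated as string escapes)
data = json.loads(re.search(r'root\.App\.main\s*=\s*(.*);', r.text).group(1))

# Pick the earnings-trend entry for the "+5y" period
trend = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]["earningsTrend"]["trend"]
field = [t for t in trend if t["period"] == "+5y"][0]

print(field)
print("Next 5 Years (per annum) : " + field["growth"]["fmt"])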
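
If you want to try the extraction step without hitting the network, here is a minimal offline sketch of the same idea. SAMPLE_HTML is a made-up, heavily trimmed stand-in for the page source (the real page is much larger), and extract_growth is a hypothetical helper name, not something from Yahoo or any library. It assumes the JSON sits on a single line assigned to root.App.main, as in the example above, and returns None instead of raising if the variable or the expected keys are missing.

```python
import json
import re

# Made-up payload mimicking only the keys used in the answer above.
# The values are illustrative (2.85% is the figure quoted in the question).
payload = {
    "context": {"dispatcher": {"stores": {"QuoteSummaryStore": {"earningsTrend": {
        "trend": [
            {"period": "0q", "growth": {"raw": 0.1, "fmt": "10.00%"}},
            {"period": "+5y", "growth": {"raw": 0.0285, "fmt": "2.85%"}},
        ]
    }}}}}
}

# Hypothetical stand-in for r.text: the JSON is assigned on one line.
SAMPLE_HTML = (
    "<html><body><script>\n"
    "(function (root) {\n"
    "root.App.main = " + json.dumps(payload) + ";\n"
    "}(this));\n"
    "</script></body></html>"
)

# Same pattern as in the answer, restricted to a {...} object.
APP_MAIN_RE = re.compile(r"root\.App\.main\s*=\s*(\{.*\});")


def extract_growth(html, period="+5y"):
    """Return the formatted growth estimate for `period`, or None if absent."""
    match = APP_MAIN_RE.search(html)
    if match is None:
        return None  # root.App.main is not present in this page source
    data = json.loads(match.group(1))
    try:
        stores = data["context"]["dispatcher"]["stores"]
        trend = stores["QuoteSummaryStore"]["earningsTrend"]["trend"]
    except KeyError:
        return None  # the JSON does not have the expected layout
    for entry in trend:
        if entry.get("period") == period:
            return entry.get("growth", {}).get("fmt")
    return None


print(extract_growth(SAMPLE_HTML))
```

With the real page you would pass r.text instead of SAMPLE_HTML. A None result then tells you the page source no longer has the layout this answer relies on, rather than failing with an AttributeError or KeyError.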