1
votes

I am new to web scraping and I'm trying to scrape the "statistics" page of yahoo finance for AAPL. Here's the link: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL

Here is the code I have so far...

from bs4 import BeautifulSoup
from requests import get


url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

stock_data = soup.find_all("table")

for stock in stock_data:
    print(stock.text)

When I run that, I return all of the table data on the page. However, I only want specific data from each table (e.g. "Market Cap", "Revenue", "Beta").

I tried messing around with the code by doing print(stock[1].text) to see if I could limit the amount of data returned to just the second value in each table but that returned an error message. Am I on the right track by using BeautifulSoup or do I need to use a completely different library? What would I have to do in order to only return particular data and not all of the table data on the page?

2

2 Answers

2
votes

Examining the HTML-code gives you the best idea of how BeautifulSoup will handle what it sees.

The web page seems to contain several tables, which in turn contain the information you are after. The tables follow a certain logic.

First scrape all the tables on the web page, then find all the table rows (<tr>) and the table data (<td>) that those rows contain.

Below is one way of achieving this. I even threw in a function to print only a specific measurement.

from bs4 import BeautifulSoup
from requests import get


url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

stock_data = soup.find_all("table")
# stock_data will contain multiple tables, next we examine each table one by one

for table in stock_data:
    # Scrape all table rows into variable trs
    trs = table.find_all('tr')
    for tr in trs:
        # Scrape all table data tags into variable tds
        tds = tr.find_all('td')
        # Index 0 of tds will contain the measurement
        print("Measure: {}".format(tds[0].get_text()))
        # Index 1 of tds will contain the value
        print("Value: {}".format(tds[1].get_text()))
        print("")


def get_measurement(table_array, measurement):
    for table in table_array:
        trs = table.find_all('tr')
        for tr in trs:
            tds = tr.find_all('td')
            if measurement.lower() in tds[0].get_text().lower():
                return(tds[1].get_text())


# print only one measurement, e.g. operating cash flow
print(get_measurement(stock_data, "operating cash flow"))
0
votes

Although this isn't Yahoo Finance, you can do something very similar like this...

import requests
from bs4 import BeautifulSoup

base_url = 'https://finviz.com/screener.ashx?v=152&o=price&t=MSFT,AAPL,SBUX,S,GOOG&o=price&c=0,1,2,3,4,5,6,7,8,9,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})

light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")

data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

#   sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))

import pandas
pandas.DataFrame(data).to_csv("C:\\your_path\\AAA.csv", header=False)

This is a nice substitute in case Yahoo decided to depreciate more of the functionality of their API. I know they cut out a lot of things (mostly historical quotes) a couple years ago. It was sad to see that go away.

enter image description here

enter image description here