0 votes

I have some Python code which scrapes the game logs of NBA players for a given season (for instance: the data here) into a CSV file, using Beautiful Soup. I am aware that there is an option to get a CSV version by clicking a link on the website, but since I am adding something to each line, scraping line by line seems like the easiest option. The goal is to eventually write code that does this for every season of every player.

The code looks like this:

from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
from bs4 import BeautifulSoup

def getData(url):
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")

    file = open('/Users/Mika/Desktop/a_players.csv', 'a')
    for table in soup.find_all("pre", class_=""):
        dataline = table.get_text()  # get_text() must be called, not just referenced
        player_id = player_season_url[47:-14]
        file.write(player_id + ',' + dataline + '\n')
    file.close()

player_season_url = "https://www.basketball-reference.com/players/a/abdelal01/gamelog/1991/"
getData(player_season_url)

The problem is this: as you can see by inspecting the page's HTML, some cells in the table have empty values.

<td class="right " data-stat="fg3_pct"></td>

(this is an example of a good cell with a value ("1") in it that is properly scraped):

<th scope="row" class="right " data-stat="ranker" csk="1">1</th>

When scraping, the rows come out uneven, skipping over the empty values and creating a CSV file with the values out of place. Is there a way to ensure that those empty values get replaced with " " in the CSV file?
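For what it's worth, the misalignment only happens when empty cells are skipped entirely; if each empty cell is written as an empty string, the columns stay in place. A minimal sketch with the standard-library csv module (the values here are made up for illustration):

```python
import csv
import io

# Writing '' for an empty cell produces an empty field (two adjacent
# commas), so later columns keep their positions.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['1', '.500', ''])      # empty fg3_pct cell kept as empty field
writer.writerow(['2', '.400', '.333'])  # fully populated row

print(buf.getvalue())
```

Both rows end up with three fields each, so a spreadsheet reads them into the same columns.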

You should add a few lines of HTML to your question to make your minimal reproducible example complete, formatted as code. The example HTML should have a mixture of good cells and bad cells. (wwii)

And you can't add to the csv after it is downloaded by link? (QHarr)

1 Answer

1 vote

For writing CSV files, Python has built-in support (the csv module). To grab the whole table from the page, you could use a script like this:

import requests
from bs4 import BeautifulSoup
import csv
import re

def getData(url):
    soup = BeautifulSoup(requests.get(url).text, 'lxml')

    # Pull the player id (e.g. "abdelal01") out of the URL.
    player_id = re.findall(r'(?:/[^/]/)(.*?)(?:/gamelog)', url)[0]

    # newline='' prevents extra blank lines in the CSV on Windows.
    with open('%s.csv' % player_id, 'w', newline='') as f:
        csvwriter = csv.writer(f, delimiter=',', quotechar='"')
        table = soup.find('div', id='all_pgl_basic')
        # An empty <td> yields '' from .text, so columns stay aligned.
        rows = [[td.text for td in tr.find_all('td')]
                for tr in table.find_all('tr') if tr.find_all('td')]
        for row in rows:
            csvwriter.writerow([player_id] + row)

player_season_url = "https://www.basketball-reference.com/players/a/abdelal01/gamelog/1991/"
getData(player_season_url)

Output is in the CSV file (screenshot from LibreOffice):


Edit:

  • extracted player_id from the URL
  • file is saved as {player_id}.csv
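The id extraction can be checked in isolation; this sketch applies the answer's regex to the example URL from the question:

```python
import re

# The regex captures everything between a single-character directory
# (here /a/) and the following /gamelog segment of the URL.
url = "https://www.basketball-reference.com/players/a/abdelal01/gamelog/1991/"
player_id = re.findall(r'(?:/[^/]/)(.*?)(?:/gamelog)', url)[0]
print(player_id)  # abdelal01
```

Because `re.findall` returns only the contents of capturing groups, the two `(?:...)` non-capturing groups act purely as anchors and the result is just the id.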