I'm working on a project where I want to scrape NBA match statistics for the 2019/20 season from https://www.basketball-reference.com/leagues/NBA_2020_games.html for the months of October to August.

I'm interested only in match outcomes for the home and away teams, not in detailed player or team stats, so I need the "Basic Box Score Stats" tables from every match's box score page.

Problem: When scraping the box scores, I only manage to gather the data for away teams. The away team's table is the first one on the box score page, so I can simply select it with index [0] (it's static). For the home team, the table index seems to change depending on whether the game went to overtime (OT), and sometimes for other reasons I haven't identified (it's somewhat dynamic).

Question: How can I best use a loop to gather box scores for both Away and Home teams in every month? Or, how do I collect data for the Home team in each box score?

Example of a box score page for a match without overtime: https://www.basketball-reference.com/boxscores/201910220LAC.html

Example of a box score page for a match with overtime: https://www.basketball-reference.com/boxscores/201910220TOR.html

As the latter example shows, the table index for the home team depends on how many tables precede it (for example, tables with overtime data). It's usually the 8th table when there is no overtime; with overtime, it's different.

My code that successfully (and consistently) gets the data for Away teams is the following:

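import pandas as pd

# Keep the URLs in a list: looping over a plain string would iterate its characters
box_score_urls = ['http://www.basketball-reference.com/boxscores/201910230POR.html']
dfbox = []
for each_box in box_score_urls:
    dfz = pd.read_html(each_box)  # all tables on the page
    dfbox.append(dfz[0])          # first table = away team's basic box score

boxbox_awayteam = pd.concat(dfbox)
boxbox_awayteam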

I'm out of ideas for this one, since no table seems to have a specific id or class in the HTML code. This is my first web scraping project and my first question on Stack Overflow, so bear with me.

1 Answer

You can use BeautifulSoup with the CSS selector [id$="-game-basic"] table, which matches only the two basic box score tables regardless of how many overtime tables sit between them, and then load those tables with pd.read_html():

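from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.basketball-reference.com/boxscores/201910220TOR.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Tables nested inside an element whose id ends in "-game-basic"
my_tables = soup.select('[id$="-game-basic"] table')

# StringIO wrapper: recent pandas versions deprecate passing a raw HTML string
df_1 = pd.read_html(StringIO(str(my_tables[0])))[0].droplevel(0, axis=1)  # away team
df_2 = pd.read_html(StringIO(str(my_tables[1])))[0].droplevel(0, axis=1)  # home team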

print(df_1)
print(df_2)

Prints:

                    Starters            MP  ...           PTS           +/-
0               Jrue Holiday         41:05  ...            13           -14
1             Brandon Ingram         35:06  ...            22           -19
2                J.J. Redick         27:03  ...            16           -14
3                 Lonzo Ball         24:50  ...             8            -7
4             Derrick Favors         20:46  ...             6           -12
5                   Reserves            MP  ...           PTS           +/-
6                  Josh Hart         28:10  ...            15            -1
7               Nicolò Melli         19:37  ...            14           +11
8           Kenrich Williams         18:02  ...             3           +11
9              Frank Jackson         13:51  ...             9            +7
10             Jahlil Okafor         12:29  ...             8            -7
11             E'Twaun Moore         12:06  ...             5            -1
12  Nickeil Alexander-Walker         11:55  ...             3            +6
13              Jaxson Hayes  Did Not Play  ...  Did Not Play  Did Not Play
14               Team Totals           265  ...           122           NaN

[15 rows x 21 columns]
           Starters            MP  ...           PTS           +/-
0        Kyle Lowry         44:59  ...            22            -1
1     Fred VanVleet         44:21  ...            34           +18
2     Pascal Siakam         38:09  ...            34            +5
3        OG Anunoby         35:48  ...            11           +12
4        Marc Gasol         31:55  ...             6            -2
5          Reserves            MP  ...           PTS           +/-
6     Norman Powell         28:38  ...             5            +2
7       Serge Ibaka         26:00  ...            13            +6
8     Terence Davis         15:10  ...             5             0
9       Matt Thomas  Did Not Play  ...  Did Not Play  Did Not Play
10    Chris Boucher  Did Not Play  ...  Did Not Play  Did Not Play
11  Stanley Johnson  Did Not Play  ...  Did Not Play  Did Not Play
12   Malcolm Miller  Did Not Play  ...  Did Not Play  Did Not Play
13  Dewan Hernandez  Did Not Play  ...  Did Not Play  Did Not Play
14      Team Totals           265  ...           130           NaN

[15 rows x 21 columns]
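
Since the question is ultimately about match outcomes, one possible next step is to pull the "Team Totals" row out of each table. The following is only a minimal sketch and not part of the original answer: the helper names team_totals and match_outcome are hypothetical, and it assumes that the first table is the away team and the second the home team (as in the output above) and that the first column is named Starters after droplevel. The two small DataFrames stand in for df_1 and df_2 so the example runs without a network request.

```python
import pandas as pd


def team_totals(box: pd.DataFrame) -> pd.Series:
    """Return the 'Team Totals' row of one basic box score table (hypothetical helper)."""
    # Assumption: the first column is called 'Starters' and holds the row labels.
    row = box.loc[box['Starters'] == 'Team Totals']
    if row.empty:
        raise ValueError("no 'Team Totals' row found")
    return row.iloc[0]


def match_outcome(away: pd.DataFrame, home: pd.DataFrame) -> dict:
    """Compare team point totals and report the winner (hypothetical helper)."""
    # Scraped values are often strings, so convert before comparing.
    away_pts = int(float(team_totals(away)['PTS']))
    home_pts = int(float(team_totals(home)['PTS']))
    winner = 'home' if home_pts > away_pts else 'away'
    return {'away_pts': away_pts, 'home_pts': home_pts, 'winner': winner}


# Stand-ins for df_1 / df_2, using the totals from the output above.
away = pd.DataFrame({'Starters': ['Jrue Holiday', 'Team Totals'], 'PTS': ['13', '122']})
home = pd.DataFrame({'Starters': ['Kyle Lowry', 'Team Totals'], 'PTS': ['22', '130']})

print(match_outcome(away, home))
# {'away_pts': 122, 'home_pts': 130, 'winner': 'home'}
```

With the real tables, the call would be match_outcome(df_1, df_2).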

EDIT: To run this for every game, you can wrap the logic in a function and loop over the month pages and their box score links, as in this example:

from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/leagues/NBA_2020_games.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

def get_tables(url):
    """Return the two basic box score tables (away, home) for one box score page."""
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    my_tables = soup.select('[id$="-game-basic"] table')

    # StringIO wrapper: recent pandas versions deprecate passing a raw HTML string
    df_1 = pd.read_html(StringIO(str(my_tables[0])))[0].droplevel(0, axis=1)
    df_2 = pd.read_html(StringIO(str(my_tables[1])))[0].droplevel(0, axis=1)

    return df_1, df_2

for a in soup.select('.filter a'):
    u = 'https://www.basketball-reference.com' + a['href']
    print(u)
    soup2 = BeautifulSoup(requests.get(u).content, 'html.parser')
    for a2 in soup2.select('td a[href^="/boxscores/"]'):
        u2 = 'https://www.basketball-reference.com' + a2['href']
        t1, t2 = get_tables(u2)
        print(u2)
        print(t1)
        print(t2)
        print('-' * 80)

Prints:

https://www.basketball-reference.com/leagues/NBA_2020_games-october.html
https://www.basketball-reference.com/boxscores/201910220TOR.html
                    Starters            MP  ...           PTS           +/-
0               Jrue Holiday         41:05  ...            13           -14
1             Brandon Ingram         35:06  ...            22           -19
2                J.J. Redick         27:03  ...            16           -14
3                 Lonzo Ball         24:50  ...             8            -7
4             Derrick Favors         20:46  ...             6           -12
5                   Reserves            MP  ...           PTS           +/-
6                  Josh Hart         28:10  ...            15            -1
7               Nicolò Melli         19:37  ...            14           +11
8           Kenrich Williams         18:02  ...             3           +11
9              Frank Jackson         13:51  ...             9            +7
10             Jahlil Okafor         12:29  ...             8            -7
11             E'Twaun Moore         12:06  ...             5            -1
12  Nickeil Alexander-Walker         11:55  ...             3            +6
13              Jaxson Hayes  Did Not Play  ...  Did Not Play  Did Not Play
14               Team Totals           265  ...           122           NaN

[15 rows x 21 columns]
           Starters            MP  ...           PTS           +/-
0        Kyle Lowry         44:59  ...            22            -1
1     Fred VanVleet         44:21  ...            34           +18
2     Pascal Siakam         38:09  ...            34            +5
3        OG Anunoby         35:48  ...            11           +12
4        Marc Gasol         31:55  ...             6            -2
5          Reserves            MP  ...           PTS           +/-
6     Norman Powell         28:38  ...             5            +2
7       Serge Ibaka         26:00  ...            13            +6
8     Terence Davis         15:10  ...             5             0
9       Matt Thomas  Did Not Play  ...  Did Not Play  Did Not Play
10    Chris Boucher  Did Not Play  ...  Did Not Play  Did Not Play
11  Stanley Johnson  Did Not Play  ...  Did Not Play  Did Not Play
12   Malcolm Miller  Did Not Play  ...  Did Not Play  Did Not Play
13  Dewan Hernandez  Did Not Play  ...  Did Not Play  Did Not Play
14      Team Totals           265  ...           130           NaN

[15 rows x 21 columns]
--------------------------------------------------------------------------------
https://www.basketball-reference.com/boxscores/201910220LAC.html
                    Starters            MP  ...           PTS           +/-
0              Anthony Davis         37:22  ...            25            +3
1               LeBron James         36:00  ...            18            -8
2                Danny Green         32:20  ...            28            +7


...and so on.
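
Instead of printing each table, you may prefer to collect everything into a single DataFrame. The following is a rough sketch under stated assumptions, not part of the original answer: game_id_from_url and stack_box_scores are hypothetical helper names, the tables are assumed to arrive as (url, away, home) tuples such as (u2, t1, t2) from the loop above, and the first column is assumed to be named Starters. Tiny stand-in DataFrames are used so the sketch runs offline.

```python
import pandas as pd


def game_id_from_url(url: str) -> str:
    """'.../boxscores/201910220TOR.html' -> '201910220TOR' (hypothetical helper)."""
    return url.rsplit('/', 1)[-1].split('.')[0]


def stack_box_scores(games) -> pd.DataFrame:
    """Stack (url, away_df, home_df) tuples into one labelled DataFrame."""
    frames = []
    for url, away, home in games:
        for side, df in (('away', away), ('home', home)):
            # Drop the repeated header row that separates starters from reserves.
            df = df[df['Starters'] != 'Reserves'].copy()
            df.insert(0, 'side', side)
            df.insert(0, 'game_id', game_id_from_url(url))
            frames.append(df)
    return pd.concat(frames, ignore_index=True)


# Stand-in data shaped like the tables printed above.
away = pd.DataFrame({'Starters': ['Jrue Holiday', 'Reserves', 'Team Totals'],
                     'PTS': ['13', 'PTS', '122']})
home = pd.DataFrame({'Starters': ['Kyle Lowry', 'Reserves', 'Team Totals'],
                     'PTS': ['22', 'PTS', '130']})
url = 'https://www.basketball-reference.com/boxscores/201910220TOR.html'

all_games = stack_box_scores([(url, away, home)])
print(all_games)
```

In the loop above, this would mean appending (u2, t1, t2) to a list rather than printing, then calling stack_box_scores once at the end. Since the loop requests a page for every game of the season, pausing between requests (for example with time.sleep) is probably a sensible precaution.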