I am trying to use BeautifulSoup to build a scraper that pulls box scores from www.basketball-reference.com. An example box score page would be this. The box score tables that I want are table tags whose id contains the word 'basic' (this distinguishes them from the advanced stats tables). I figured a function would be the best way to pick out this distinction. The HTML looks like this.

My code:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.basketball-reference.com/boxscores/202003110ATL.html").content
soup = BeautifulSoup(r, 'lxml')

def get_boxscore_basic_table(tag):
    return ('basic' in tag.attrs['id']) and ('sortable' in tag.attrs['class'])

tables = soup.find_all(get_boxscore_basic_table)

This throws KeyError: 'id', and I am confused about how to fix it. I've checked the keys by grabbing just the first instance using .find():

table = soup.find('table')
print(table.attrs)

And the key 'id' is there. Why can't it locate my specific request when searching through the whole HTML, and how can I fix this?

2
It would appear that the tag does not have an attribute named id. When you do a dictionary lookup for a key that does not exist, you get a KeyError. If you expect to handle cases where the tag does not have an id, use try/except. - Sri
@Sri I guess I am confused about how BeautifulSoup's Tag object works when using a function in the .find_all() method. Why wouldn't it be able to locate the specific tags with an id attribute that contains the word 'basic'? I thought that is why they made this functionality, so it could be specific in its search. - Jacob Garwin
Oh I see, you can do that. Can you help me by pointing out which element on the page you are trying to select? For example, give me the full id of the tag. - Sri
@Sri id="box-NYK-game-basic" is the full id attribute. The full table tag is in the hyperlinked photo in the question description. I chose to look for 'basic' within the id attributes because the team in the id changes from one box score to the next. - Jacob Garwin
I posted a solution using css selector, which should suit your needs. - Sri

2 Answers

0
votes

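You were quite close! The issue is that some elements don't have an id or a class, which raises a KeyError when you index the missing attribute directly. find_all() calls your function on every tag in the document, not just on the tables, so the first tag without an id triggers the error.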

This should work correctly:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.basketball-reference.com/boxscores/202003110ATL.html")
soup = BeautifulSoup(r.content, 'lxml')


def valid_boxscore_basic_table_elem(tag):
    tag_id = tag.get("id")
    tag_class = tag.get("class")
    return (tag_id and tag_class) and ("basic" in tag_id and "sortable" in tag_class)


tables = soup.find_all(valid_boxscore_basic_table_elem)

print(tables)
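
To see the difference without hitting the site, here is a minimal offline sketch. The HTML snippet is made up for illustration (only the id box-NYK-game-basic comes from the question), the names strict_lookup and safe_lookup are hypothetical, and it uses the built-in html.parser so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the id "box-NYK-game-basic" comes from the question.
HTML = """
<div class="wrapper">
  <table id="box-NYK-game-basic" class="sortable stats_table"></table>
  <table id="box-NYK-game-advanced" class="sortable stats_table"></table>
  <table class="sortable"></table>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def strict_lookup(tag):
    # Same logic as the question: raises KeyError on any tag without an id.
    return "basic" in tag.attrs["id"] and "sortable" in tag.attrs["class"]


def safe_lookup(tag):
    # .get() returns None for a missing attribute instead of raising.
    tag_id = tag.get("id")
    tag_class = tag.get("class")
    return bool(tag_id and tag_class and "basic" in tag_id and "sortable" in tag_class)


try:
    soup.find_all(strict_lookup)
    strict_error = None
except KeyError as exc:
    # The <div> is visited before any table and has no id attribute.
    strict_error = exc

matching_ids = [t["id"] for t in soup.find_all(safe_lookup)]
print(repr(strict_error), matching_ids)
```

The strict version fails on the wrapping div before it ever reaches a table, which is why checking only the first table with .find() did not reveal the problem.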

Be careful when using in, though: on strings it is a substring test, so "cat" in "caterpillar" is True.
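
The caveat affects the two checks differently, which a quick sketch makes concrete. The values are illustrative ("box-NYK-game-basics" is invented), and the list form assumes the usual Beautiful Soup behaviour of returning class as a list of separate class names.

```python
# An id is a plain string, so `in` is a substring test.
tag_id = "box-NYK-game-basic"
substring_hit = "basic" in tag_id                 # matches, as intended
loose_hit = "basic" in "box-NYK-game-basics"      # also matches: substring, not whole word

# class is a list of names, so `in` is an exact-membership test.
tag_class = ["sortable", "stats_table"]
exact_hit = "sortable" in tag_class               # an element equals "sortable"
partial_hit = "sort" in tag_class                 # no element equals "sort"

# Anchoring on the suffix is stricter than a bare substring test.
suffix_hit = tag_id.endswith("-basic")
loose_suffix_hit = "box-NYK-game-basics".endswith("-basic")

print(substring_hit, loose_hit, exact_hit, partial_hit, suffix_hit, loose_suffix_hit)
```

So the id check is the one that can over-match; the class check is already exact as long as class is a list.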


The code can be simplified and made more versatile through the use of some basic regex:

import re

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.basketball-reference.com/boxscores/202003110ATL.html")
soup = BeautifulSoup(r.content, 'lxml')

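valid_id_re = re.compile(r"-basic$")
# Beautiful Soup tests each class name separately, so anchor on the whole name.
valid_class_re = re.compile(r"^sortable$")

# Pass the compiled patterns themselves rather than their bound .search methods:
# a function filter can be called with None for a tag that lacks the attribute.
tables = soup.find_all("table", attrs={"id": valid_id_re, "class": valid_class_re})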
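
As a rough offline check of the regex approach, the sketch below runs it against made-up HTML (only the id box-NYK-game-basic comes from the question). It passes compiled patterns as the attribute filters, which Beautiful Soup accepts, and it assumes class values are matched one name at a time.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup covering the cases the filters must tell apart.
HTML = """
<table id="box-NYK-game-basic" class="sortable stats_table"></table>
<table id="box-NYK-game-advanced" class="sortable stats_table"></table>
<table id="box-ATL-game-basic" class="stats_table"></table>
<table class="sortable"></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

valid_id_re = re.compile(r"-basic$")        # id must end in "-basic"
valid_class_re = re.compile(r"^sortable$")  # one class name must be exactly "sortable"

tables = soup.find_all("table", attrs={"id": valid_id_re, "class": valid_class_re})
regex_ids = [t["id"] for t in tables]
print(regex_ids)
```

The table with no id and the table without the sortable class are both skipped, and no exception is raised for the missing attribute.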
0
votes

You can try this. It uses a CSS selector to find tables whose id contains basic and whose class attribute contains sortable:

import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.basketball-reference.com/boxscores/202003110ATL.html').content
soup = BeautifulSoup(r, 'html.parser')
print(soup.select('table[id*="basic"][class*="sortable"]'))
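
Note that [class*="sortable"] is a substring match on the whole class attribute. If that turns out to be too loose, a stricter variant is sketched below against made-up HTML (the unsortable class and the ATL id are invented to show the difference); .sortable matches a whole class name and [id$="-basic"] anchors on the end of the id.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the third table has a class that merely contains "sortable".
HTML = """
<table id="box-NYK-game-basic" class="sortable stats_table"></table>
<table id="box-NYK-game-advanced" class="sortable stats_table"></table>
<table id="box-ATL-game-basic" class="unsortable"></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Substring matches on both attributes, as in the answer above.
loose_ids = [t["id"] for t in soup.select('table[id*="basic"][class*="sortable"]')]

# Whole class name plus an id that ends in "-basic".
strict_ids = [t["id"] for t in soup.select('table.sortable[id$="-basic"]')]

print(loose_ids)
print(strict_ids)
```

Which variant is appropriate depends on how the real page names its classes; the loose form is fine if no other class contains the text sortable.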