7 votes

I am currently trying to practice with the requests and BeautifulSoup Modules in Python 3.6 and have run into an issue that I can't seem to find any info on in other questions and answers.

It seems that at some point in the page, BeautifulSoup stops recognizing tags and IDs. I am trying to pull play-by-play data from a page like this:

http://www.pro-football-reference.com/boxscores/201609080den.htm

import requests, bs4

source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: '+source_url)

soup = bs4.BeautifulSoup(res.text,'html.parser')

#this works
all_pbp = soup.findAll('div', {'id' : 'all_pbp'})
print(len(all_pbp))

#this doesn't
table = soup.findAll('table', {'id' : 'pbp'})
print(len(table))

Using the inspector in Chrome, I can see that the table definitely exists. I have also tried it on 'div's and 'tr's in the latter half of the HTML, and it doesn't seem to work. I have tried the standard 'html.parser' as well as lxml and html5lib, but nothing seems to work.

Am I doing something wrong here, or is there something in the HTML or its formatting that prevents BeautifulSoup from correctly finding the later tags? I have run into issues with similar pages run by this company (hockey-reference.com, basketball-reference.com), but have been able to use these tools properly on other sites.

If it is something with the HTML, is there any better tool/library for helping to extract this info out there?

Thank you for your help, BF

What precisely do you want to parse from that table? The whole table? Only certain columns? Certain cells? – Dmitriy Fialkovskiy
Your statement table = soup.findAll('table', {'id' : 'pbp'}) isn't working; it simply doesn't find table elements with id = pbp. – Dmitriy Fialkovskiy
@DmitriyFialkovskiy I am ultimately trying to create an Excel file of the plays from particular games. Once I can zero the soup in on that table, I am confident I can loop through the tr and td tags to get the text out of it and use openpyxl to get it into Excel. I guess ultimately my question is why bs4 doesn't find the tag in the HTML. It seems bs4 can find any tags before the comment in the HTML but not after. Does the comment impact the parsing? Is there any way to pull tags from after that comment accurately? – Big Fore
If so, the JavaScript on the page will need to load first prior to scraping. This post seems to have a method of doing so: stackoverflow.com/questions/8049520/…. – qwertyuip9
@qwertyuiop9 Thank you! That is exactly what I was looking for. I didn't realize the soup might not contain all of the HTML I was viewing through the browser. I will play around with Selenium or Dryscrape and see what I can figure out. Thanks again. – Big Fore

2 Answers

3 votes

BS4 won't be able to execute the JavaScript of a web page after doing the GET request for a URL. I think the table in question is loaded asynchronously by client-side JavaScript.

As a result, the client-side JavaScript will need to run first, before scraping the HTML. This post describes how to do so!
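The claim above can be demonstrated offline with a tiny made-up page (a sketch, assuming only bs4 is installed): the script that would inject the table is stored as plain text in the tree and never executed, so the lookup returns nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical page whose table only exists after JavaScript runs.
html = """
<div id="all_pbp"></div>
<script>
  document.getElementById('all_pbp').innerHTML = '<table id="pbp"></table>';
</script>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div', {'id': 'all_pbp'}))  # present in the static HTML
print(soup.find('table', {'id': 'pbp'}))    # None: the script never executed
```

A headless browser (Selenium, Dryscrape) runs the script first and hands you the post-JavaScript HTML, which you can then feed to BeautifulSoup as usual.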

0 votes

OK, I see what the problem is: you're trying to parse a comment, not an ordinary HTML element. For such cases you should use Comment from BeautifulSoup, like this:

import requests
from bs4 import BeautifulSoup, Comment

source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: ' + source_url)

soup = BeautifulSoup(res.content, 'html.parser')

# The play-by-play table is embedded inside an HTML comment, so it is
# stored as a Comment string rather than parsed into the tag tree.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

play_to_play = None
for comment in comments:
    # Re-parse the comment's text as HTML and search for the table there.
    inner = BeautifulSoup(str(comment), 'html.parser')
    search_play = inner.find('table', {'id': 'pbp'})
    if search_play:
        play_to_play = search_play
        break
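The same trick can be checked without hitting the network. Below is a sketch with a made-up snippet standing in for the real page: the table hidden inside an HTML comment is invisible to a normal find, but reappears once the comment's text is re-parsed.

```python
from bs4 import BeautifulSoup, Comment

# Made-up snippet mimicking the pro-football-reference layout:
# the table markup lives inside an HTML comment.
html = """
<div id="all_pbp">
<!-- <table id="pbp"><tr><td>Kickoff</td></tr></table> -->
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('table', {'id': 'pbp'}))  # None: it's only comment text

# Re-parse each comment's text as HTML and search again.
table = None
for c in soup.find_all(string=lambda t: isinstance(t, Comment)):
    inner = BeautifulSoup(str(c), 'html.parser')
    table = inner.find('table', {'id': 'pbp'})
    if table:
        print(table.td.get_text())  # the cell text is now reachable
        break
```

Sites like this comment out the table markup server-side and uncomment it with JavaScript, which is why the div wrapper is findable but the table itself is not.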