
I am trying to scrape data from this webpage:

https://www.premierleague.com/players/4330/player/stats?co=1&se=79

Specifically, I want the four middle numbers (appearances, clean sheets, ...) for each season separately (see the dropdown). However, so far I only get the numbers aggregated over all seasons. I use Selenium because BeautifulSoup alone could not do it, but Selenium does not seem to do it either. This is the relevant part of the code (it runs in a for loop, taking URLs from a CSV file):

browser = webdriver.Chrome(r'C:\chromedriver.exe')  # raw string keeps the backslash literal
browser.get('https://www.premierleague.com/players/4330/player/stats?co=1&se=79')

wait = WebDriverWait(browser, 10)
wait.until(
    EC.element_to_be_clickable(
        (By.XPATH, "//*[@role='button'][text()='2017/18']")))

html = browser.page_source
soup = bs(html, 'lxml')

The printed tree only has the "all seasons" numbers, although the loaded page in Chrome shows only the 2017/18 season. Does anybody have an idea why? The scraping visibly happens after the dropdown is loaded, but it still returns the values from before it was loaded.

The dropdown looks like this:

(screenshot of the season dropdown; image not included)

Is Selenium not able to do that? The dropdown looks like this: <ul class="dropdownList" ... and the options like this: <li role="option" tabindex="0" data-option-name="All Seasons" data-option-id="-1" data-option-index="-1">All Seasons</li> - Michal A.
Selenium could do it, but Scrapy + Splash, possibly with BeautifulSoup, is the state of the art. - Lore

1 Answer


You're getting the page_source the moment the URL is fetched, which means you'll likely see only and exactly what the server sends to the browser—no more and no less. That initial source includes the following HTML snippet:

<span class="stat">
  Appearances
  <span class="allStatContainer statappearances" data-stat="appearances">230</span>
</span>

It isn't until a few moments later, after some JavaScript has downloaded and executed, that it changes to the following:

<span class="stat">
  Appearances
  <span class="allStatContainer statappearances" data-stat="appearances">30</span>
</span>

In order to get that data, then, you'll need to wait for it. That means you'll need to wait for some indication that the necessary JavaScript has executed. If you can find something that (1) consistently appears after the JavaScript has executed and (2) is a constant, predictable value, you can use Selenium's WebDriverWait() to wait for it. Then you'll know that it's safe to fetch the data you want.
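As a rough sketch of that idea: WebDriverWait.until() accepts any callable that takes the driver and returns a truthy value once the condition holds. The helper below, stat_changed_from, is a hypothetical name, and both the CSS selector and the stale value "230" are assumptions taken from the HTML snippets above, so treat it as an illustration rather than a tested recipe for this site.

```python
def stat_changed_from(css_selector, stale_text):
    """Build a wait condition for WebDriverWait.until().

    The returned callable yields the element's text once it is non-empty
    and differs from stale_text; until then it returns False, so the
    wait keeps polling.
    """
    def condition(driver):
        # "css selector" is the string value behind By.CSS_SELECTOR.
        element = driver.find_element("css selector", css_selector)
        text = element.text.strip()
        return text if text and text != stale_text else False

    return condition


# Intended use with a real browser (assumes Selenium is installed and
# `driver` has already loaded the player page):
#
#   from selenium.webdriver.support.wait import WebDriverWait
#   appearances = WebDriverWait(driver, 10).until(
#       stat_changed_from("span.statappearances", "230"))
```

The obvious weakness is that the stale value has to be known in advance, and a season whose figure happens to equal the all-seasons figure would never satisfy the condition.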

In your case, it looks like you want to wait until the "Filter by Season" dropdown has appeared and is populated and its target button is displaying the "2017/18" season:

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

# `driver` is the WebDriver instance (called `browser` in the question).
wait = WebDriverWait(driver, 10)
wait.until(
    EC.element_to_be_clickable(
        (By.XPATH, "//*[@role='button'][text()='2017/18']")))
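
Once that wait returns, the page source has to be read again, since a copy taken earlier would still hold the all-seasons values. Here is a minimal sketch of the read-out step. StatParser and parse_stats are hypothetical names, and it assumes every stat sits in an element with a data-stat attribute, as in the snippets above. It uses only the standard library; with BeautifulSoup, soup.find(attrs={"data-stat": "appearances"}) would be a shorter route.

```python
from html.parser import HTMLParser


class StatParser(HTMLParser):
    """Collect the text of every element that carries a data-stat attribute."""

    def __init__(self):
        super().__init__()
        self.stats = {}
        self._current = None  # name of the stat whose text is being read

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("data-stat")
        if name:
            self._current = name
            self.stats[name] = ""

    def handle_data(self, data):
        if self._current:
            self.stats[self._current] += data.strip()

    def handle_endtag(self, tag):
        # Assumes stat elements contain plain text only (no nested tags).
        self._current = None


def parse_stats(html):
    """Return a dict mapping each data-stat name to its text, e.g. 'appearances'."""
    parser = StatParser()
    parser.feed(html)
    parser.close()
    return parser.stats


# Intended use after the wait has succeeded:
#
#   stats = parse_stats(driver.page_source)
#   print(stats.get("appearances"))
```

The values come back as strings, so convert them with int() or float() as needed.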