Web scraping with Python and beautifulsoup: What is saved by the BeautifulSoup function?

Question

This question follows this previous question. I want to scrape data from a betting site using Python. I first tried to follow this tutorial, but the problem is that the site tipico is not available from Switzerland. I thus chose another betting site: Winamax. In the tutorial, the webpage tipico is first inspected, in order to find where the betting rates are located in the html file. In the tipico webpage, they were stored in buttons of class “c_but_base c_but". By writing the following lines, the rates could therefore be saved and printed using the Beautiful soup module:

from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://www.tipico.de/de/live-wetten/"

try:
 page = urllib.request.urlopen(url)
except:
 print(“An error occured.”)

soup = BeautifulSoup(page, ‘html.parser’)

regex = re.compile(‘c_but_base c_but’)
content_lis = soup.find_all(‘button’, attrs={‘class’: regex})
print(content_lis)

I thus tried to do the same with the webpage Winamax. I inspected the page and found that the betting rates were stored in buttons of class "ui-touchlink-needsclick price odd-price". See the code below:

from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"

try:
    page = urllib.request.urlopen(url)
except Exception as e:
    print(f"An error occurred: {e}")

soup = BeautifulSoup(page, 'html.parser')

regex = re.compile('ui-touchlink-needsclick price odd-price')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)

The problem is that it prints nothing: Python does not find elements of such class (right?). I thus tried to print the soup object in order to see what the BeautifulSoup function was exactly doing. I added this line

print(soup)

When printing it (I do not show it the print of soup because it is too long), I notice that this is not the same text as what appears when I do a right click "inspect" of the Winamax webpage. So what is the BeautifulSoup function exactly doing? How can I store the betting rates from the Winamax website using BeautifulSoup?

EDIT: I have never coded in html and I'm a beginner in Python, so some terminology might be wrong, that's why some parts are in italics.

"I notice that this is not the same text as what appears when I do a right click "inspect" of the Winamax webpage." This is your actual problem, and is unrelated to BeautifulSoup. It happens often when building scrapers: the website does not send the same response to bots. Especially a betting site, they certainly must have anti-bot protection. The webpage you see from your browser is probably edited after the fact with Javascript. — LotB

Ananth Ananth · Accepted Answer · 2020-12-30T16:19:55

That's because the website is using JavaScript to display these details and BeautifulSoup does not interact with JS on it's own.

First try to find out if the element you want to scrape is present in the page source, if so you can scrape, pretty much everything! In your case the button/span tag's were not in the page source(meaning hidden or it's pulled through a script)

No <button> tag in the page source :

So I suggest using Selenium as the solution, and I tried a basic scrape of the website.

Here is the code I used :

from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.binary_location = r'Your chrome.exe file path'

browser = webdriver.Chrome(executable_path=r'Your chromedriver.exe file path', options=option)

browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")

span_tags = browser.find_elements_by_tag_name('span')
for span_tag in span_tags:
    print(span_tag.text)

browser.quit()

This is the output:

There are some junk data present in this output, but that's for you to figure out what you need and what you don't!

Web scraping with Python and beautifulsoup: What is saved by the BeautifulSoup function?

1 Answers