0
votes

I need to scrape http://www.vintagetoday.be/fr/montres but it has dynamic content.

How can I do this?

My code:

import requests
from bs4 import BeautifulSoup

t = requests.get("http://www.vintagetoday.be/fr/catalogue.awp").text
print(len(BeautifulSoup(t, "lxml").findAll("td", {"class": "Lien2"})))

The result is 16, but there are 430 articles.

2
Why without Selenium? – G_M
It shouldn't be done with Selenium, please. – Gayan Jeewantha
Which exact data do you want to scrape, and what have you already tried? Share your current code along with a problem description/exception log. – Andersson
I need links (watches) like these: [link]vintagetoday.be/fr/…, [link]vintagetoday.be/fr/…. My code is: import requests from bs4 import BeautifulSoup t = requests.get("vintagetoday.be/fr/catalogue.awp").text print(len(BeautifulSoup(t, "lxml").findAll("td", {"class":"Lien2"}))). The result is 16, but there are 405 articles. – Gayan Jeewantha

2 Answers

0
votes

It's normal that you're getting just 16 links instead of 430. When the page first loads, it only contains the first 16 watches (links); to get more, you need to scroll down the page so that additional watches appear. You can achieve this with Selenium.

A better method would be to reverse-engineer the AJAX call they use to load the watches (pagination) and make that call directly in your code. A quick look shows that they POST to the following URL to load more watches:

http://www.vintagetoday.be/fr/montres?AWPIDD9BBA1F0=27045E7B002DF1FE7C1BA8D48193FD1E54B2AAEB

I don't see any parameter that indicates the pagination, though, which suggests it's stored in the session. They also send some query-string parameters in the request body, so you need to check that as well.

The return value seems to be XML, so extracting the URLs from it will be straightforward.
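As a rough sketch of this approach (note: the `AWPID…` token in the URL is session-specific, the POST payload must be copied from your browser's network tab, and the sample XML below is made up for illustration):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical: this token expires per session; grab a fresh one from the
# browser's network tab, along with the request-body parameters.
AJAX_URL = ("http://www.vintagetoday.be/fr/montres"
            "?AWPIDD9BBA1F0=27045E7B002DF1FE7C1BA8D48193FD1E54B2AAEB")

def fetch_more_watches(session, payload):
    """POST the pagination request and return the raw XML response body."""
    return session.post(AJAX_URL, data=payload).text

def extract_links(xml_text):
    """Pull the href of every <a> tag out of the returned markup."""
    soup = BeautifulSoup(xml_text, "lxml")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]

# Local demonstration on a made-up fragment (no network call needed):
sample = '<rows><a href="/fr/montre-1">w1</a><a href="/fr/montre-2">w2</a></rows>'
print(extract_links(sample))  # → ['/fr/montre-1', '/fr/montre-2']
```

In a real run you would create a `requests.Session()` (so the pagination state the site keeps server-side stays attached to your cookies), call `fetch_more_watches` in a loop, and feed each response to `extract_links`.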

0
votes

I'm definitely NOT an expert with this stuff, but I think this is what you want.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Fetch the page. Note: this only returns the initially served HTML,
# so links added later by JavaScript will not be included.
req = Request("http://www.vintagetoday.be/fr/montres")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "lxml")

# Collect the href attribute of every <a> tag on the page
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
print(links)

See the two links below for more info.

https://pythonspot.com/extract-links-from-webpage-beautifulsoup/

https://pythonprogramminglanguage.com/get-links-from-webpage/