
I am trying to scrape projects' URLs from the Kickstarter webpage using Beautiful Soup. I am using the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.kickstarter.com/discover/advanced?category_id=28&staff_picks=1&sort=newest&seed=2639586&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

project_name_list = soup.find(class_='grid-row flex flex-wrap')

project_name_list_items = project_name_list.find_all('a')
print(project_name_list_items)

for project_name in project_name_list_items:
    links = project_name.get('href')
    print(links)

But this is what I get as output:

[<a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>, <a class="block img-placeholder w100p"><div class="img-placeholder bg-grey-400 absolute t0 w100p"></div></a>]
None
None
None
None
None
None

I tried several ways, such as:

for link in soup.find_all('a'):
    print(link.get('href'))

But still no results. Also, the page I am scraping has a "Load more" button at the bottom. How can I get the URLs from that part as well? I appreciate your help.


1 Answer


The data is not embedded in the HTML markup itself; it is stored as JSON in an attribute called data-project. One solution is to call find_all("div") and keep only the elements that have that attribute.

Also, while the project URL is present in the JSON, there is a query parameter named ref that is stored in another attribute called data-ref. The following gets all the links for page 1:

import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.kickstarter.com/discover/advanced?category_id=28&staff_picks=1&sort=newest&seed=2639586&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Each project card is a <div> whose data-project attribute holds the
# project JSON; data-ref holds the value for the ref query parameter.
data = [
    (json.loads(i["data-project"]), i["data-ref"])
    for i in soup.find_all("div")
    if i.get("data-project")
]

for i in data:
    print(f'{i[0]["urls"]["web"]["project"]}?ref={i[1]}')

Then you can iterate over the pages (what the "Load more" button does) by simply incrementing the page query parameter.
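A minimal sketch of that pagination loop, building on the snippet above (the helper names and the stopping condition are my own assumptions: I treat a page with no data-project divs as the end of the results):

```python
import json

import requests
from bs4 import BeautifulSoup


def page_url(page):
    # Rebuild the discover URL for a given page number
    # (same query string as above, only the page parameter changes).
    return ('https://www.kickstarter.com/discover/advanced'
            '?category_id=28&staff_picks=1&sort=newest&seed=2639586'
            f'&page={page}')


def scrape_all(max_pages=50):
    # Hypothetical helper: collects project links across pages,
    # stopping when a page yields no project cards.
    links = []
    for page in range(1, max_pages + 1):
        soup = BeautifulSoup(requests.get(page_url(page)).text, 'html.parser')
        divs = [d for d in soup.find_all('div') if d.get('data-project')]
        if not divs:  # no more projects on this page: stop paging
            break
        for d in divs:
            project = json.loads(d['data-project'])
            links.append(f'{project["urls"]["web"]["project"]}?ref={d["data-ref"]}')
    return links
```

You may also want to add a short delay between requests so you do not hammer the server.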