When I used BeautifulSoup and the requests module to scrape the src of each img, every src was empty, so I assume the src values are generated by JavaScript. I therefore tried the requests_html module instead. However, when I scrape the same information after the response is rendered, only two of the img elements have a src value and the rest are still empty. When I inspect the page with the browser's developer tools, the other img elements do have a src value. What is the problem here?

Code for bs4 and requests:

from bs4 import BeautifulSoup
import requests

biliweb = requests.get('https://www.bilibili.com/ranking/bangumi/13/0/3').text

bilisoup = BeautifulSoup(biliweb,'lxml')

for item in bilisoup.find_all('div', class_='lazy-img'):
    image_html = item.find('img')
    print(image_html)

Code for requests_html:

from requests_html import HTMLSession

session = HTMLSession()

biliweb = session.get('https://www.bilibili.com/ranking/bangumi/13/0/3')
# Run the page's JavaScript in headless Chromium before querying the DOM
biliweb.html.render()

for item in biliweb.html.find('.lazy-img.cover > img'):
    print(item.html)

I am only showing the first five results because the list is quite long.

With BeautifulSoup and requests:

<img alt="Re:从零开始的异世界生活 第二季" src=""/>
<img alt="刀剑神域 爱丽丝篇 异界战争 -终章-" src=""/>
<img alt="没落要塞 / DECA-DENCE" src=""/>
<img alt="某科学的超电磁炮T" src=""/>
<img alt="宇崎学妹想要玩!" src=""/>

With requests_html:

<img alt="Re:从零开始的异世界生活 第二季" src="https://i0.hdslb.com/bfs/bangumi/image/f2425cbdb07cc93bd0d3ba1c0099bfe78f5dc58a.png@90w_120h.webp"/>
<img alt="刀剑神域 爱丽丝篇 异界战争 -终章-" src="https://i0.hdslb.com/bfs/bangumi/image/54d9ca94ca84225934e0108417c2a1cc16be38fb.png@90w_120h.webp"/>
<img alt="没落要塞 / DECA-DENCE" src=""/>
<img alt="某科学的超电磁炮T" src=""/>
<img alt="宇崎学妹想要玩!" src=""/>

1 Answer

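All the data is stored in a JavaScript variable called __INITIAL_STATE__, which is already present in a script tag of the HTML that requests returns, so no rendering is needed.

The following script saves the data to a JSON file. Once you have it, you can easily download the images.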


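import json

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.bilibili.com/ranking/bangumi/13/0/3')
soup = BeautifulSoup(page.content, 'html.parser')

# Find the script tag whose text mentions __INITIAL_STATE__
script = None
for s in soup.find_all("script"):
    if "__INITIAL_STATE__" in s.text:
        script = s.get_text(strip=True)
        break

# Fail with a clear message instead of an AttributeError on None
if script is None:
    raise RuntimeError("No script containing __INITIAL_STATE__ was found")

# The JSON object starts at the first '{'; the slice stops two characters
# before the first 'function', dropping the separator before the trailing code
data = json.loads(script[script.index('{'):script.index('function')-2])

with open("data.json", "w") as f:
    json.dump(data, f)

print(data)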
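
As a side note, slicing up to the first 'function' depends on what happens to follow the JSON in the script. A possible alternative, shown here only as a sketch, is to let the standard library's json.JSONDecoder().raw_decode parse a single JSON value and ignore whatever comes after it. The helper name extract_initial_state and the sample string are made up for illustration; the real script text may be laid out differently.

```python
import json


def extract_initial_state(script_text, marker="__INITIAL_STATE__"):
    """Return the JSON object that follows `marker` in a script's text.

    Hypothetical helper: it assumes the first '{' after the marker opens
    the JSON object assigned to the variable.
    """
    start = script_text.index("{", script_text.index(marker))
    # raw_decode parses exactly one JSON value beginning at `start` and
    # returns it with the index where parsing stopped; trailing JavaScript
    # after the object is never looked at.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Made-up sample that only mimics the general shape of such a script.
sample = (
    'window.__INITIAL_STATE__={"rankList":[{"rank":1,"title":"demo"}]};'
    '(function(){var s;}());'
)
state = extract_initial_state(sample)
print(state["rankList"][0]["title"])
```

With this approach the 'function' offset arithmetic is no longer needed, at the cost of assuming the object starts at the first '{' after the variable name.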

Output of print(data) (truncated):

{'rankList': [{'badge': '会员抢先', 'badge_info': {'bg_color': '#FB7299', 'bg_color_night': '#BB5B76', 'text': '会员抢先'}, 'badge_type': 0, 'copyright': 'bilibili', 'cover': 'http://i0.hdslb.com/bfs/bangumi/image/f2425cbdb07cc93bd0d3ba1c0099bfe78f5dc58a.png', 'new_ep': {'cover': 'http://i0.hdslb.com/bfs/archive/2f5bf4840747fc7c09932d2793e96a178cd05905.jpg', 'index_show': '更新至第5话'}, 'pts': 1903981, 'rank': 1, 'season_id': 33802, 'stat': {'danmaku': 814356, 'follow': 7135303, 'series_follow': 7267882, 'view': 33685387}, 'title': 'Re:从零开始的异世界生活 第二季', 'url': 'https://www.bilibili.com/bangumi/play/ss33802', 'pic': 'http://i0.hdslb.com/bfs/bangumi/image/f2425cbdb07cc93bd0d3ba1c0099bfe78f5dc58a.png', 'play': 33685387, 'video_review': 814356}, {'badge': '会员抢先', 'badge_info': {'bg_color': '#FB7299', 'bg_color_night': '#BB5B76', 'text': '会员抢先'}, 'badge_type': 0, 'copyright': 'bilibili', 'cover': 'http://i0.hdslb.com/bfs/bangumi/image/54d9ca94ca84225934e0108417c2a1cc16be38fb.png', 'new_ep': {'cover': 'http://i0.hdslb.com/bfs/archive/a772451f1f031ee1a3b78e31e4fb0b851517817f.jpg', 'index_show': '更新至第16话'}, 'pts': 483317, 'rank': 2, 'season_id': 32781, 'stat': {'danmaku': 514174, 'follow': 6195736, 'series_follow': 6733547, 'view': 36351270}, 'title': '刀剑神域 爱丽丝篇 异界战争 -终章-', 'url': 'https://www.bilibili.com/bangumi/play/ss32781', 'pic': 'http://i0.hdslb.com/bfs/bangumi/image/54d9ca94ca84225934e0108417c2a1cc16be38fb.png', 'play': 36351270, 'video_review': 514174}, {'badge': '会员抢先', 'badge_info': {'bg_color': '#FB7299', 'bg_color_night': '#BB5B76', 'text': '会员抢先'}, 'badge_type': 0, 'copyright': 'bilibili', 'cover': 'http://i0.hdslb.com/bfs/bangumi/image/d5d7441c20614dc5ddc69f333f1906a09eddcee2.png', 'new_ep': {'cover': 'http://i0.hdslb.com/bfs/archive/fe191e9ffa2422103bffcd8615446f5885074c0b.jpg', 'index_show': '更新至第5话'}, 'pts': 455170, 'rank': 3, 'season_id': 33803, 'stat': ....
...
...
...
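
To get from this dictionary to the image addresses, something like the following minimal sketch could be used. It relies only on the 'rankList', 'rank', 'title' and 'cover' keys visible in the output above; the function name cover_urls is made up for illustration, and the sample holds just the first entry, trimmed to those keys.

```python
def cover_urls(data):
    """Return (rank, title, cover URL) tuples from the parsed state.

    Hypothetical helper; missing keys yield None rather than raising.
    """
    results = []
    for entry in data.get("rankList", []):
        results.append((entry.get("rank"), entry.get("title"), entry.get("cover")))
    return results


# First entry of the output above, reduced to the keys used here.
sample = {
    "rankList": [
        {
            "rank": 1,
            "title": "Re:从零开始的异世界生活 第二季",
            "cover": "http://i0.hdslb.com/bfs/bangumi/image/f2425cbdb07cc93bd0d3ba1c0099bfe78f5dc58a.png",
        }
    ]
}

for rank, title, url in cover_urls(sample):
    print(rank, title, url)
```

Each URL can then be fetched with requests.get(url).content and written to a file opened in binary mode.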