I am trying to scrape the IMDb Top 250 chart at https://www.imdb.com/chart/top/?ref_=nv_mv_250. I want to write a loop that visits each film page by collecting all the href attributes. However, the HTML returned by urlopen contains truncated href attributes: everything after the question mark is missing. Here are my code and the result. Thank you so much in advance.
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html.parser')

# The chart is a table; each <tr> in the body is one film
table = bs.find('tbody', {'class': 'lister-list'})
rows = table.find_all('tr')
for row in rows:
    # The title cell holds the <a> tag that links to the film page
    link = row.find('td', {'class': 'titleColumn'}).find('a')['href']
    print(link)
The result I get looks like this (everything after the question mark is missing):
/usr/local/Caskroom/miniconda/base/envs/web_scraping/bin/python /Users/gracezhou/Python/Python-Projects/scraping/imdb/test.py
/title/tt0111161/
/title/tt0068646/
/title/tt0071562/
/title/tt0468569/
/title/tt0050083/
/title/tt0108052/
/title/tt0167260/
/title/tt0110912/
/title/tt0060196/
/title/tt0120737/
/title/tt0137523/
/title/tt0109830/
I want to receive something like this:
/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=CPK54FS6SPX9EDAPBSJT&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1
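To make the difference concrete, here is a minimal sketch that compares the two forms using only the standard library. The helper name `describe_href` and the constant `BASE_URL` are my own, made up for illustration. It resolves each href against the chart URL and reports whether a query string is present. It does not explain why the query string is missing from the HTML I receive. It only shows that both forms resolve to the same path on the site.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: hrefs are resolved relative to the chart page URL
BASE_URL = 'https://www.imdb.com/chart/top/'


def describe_href(href, base=BASE_URL):
    """Return (absolute URL, query string); the query is '' when absent."""
    absolute = urljoin(base, href)
    return absolute, urlsplit(absolute).query


# The href I actually get, and the one I expected (both copied from above)
received = '/title/tt0111161/'
expected = ('/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL'
            '&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1'
            '&pf_rd_r=CPK54FS6SPX9EDAPBSJT&pf_rd_s=center-1'
            '&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1')

for href in (received, expected):
    absolute, query = describe_href(href)
    print(urlsplit(absolute).path, '| has query string:', bool(query))
```

Both hrefs share the path /title/tt0111161/, and only the second one carries a query string.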