Link = https://www.imdb.com/search/title/?title_type=video_game&sort=user_rating,desc&after=1&ref_=adv_nxt

My Goal

I need to collect the title, genre, description, type, and release year of every video game on every page.

My Problem

https://www.imdb.com/search/title/?title_type=video_game&sort=user_rating,desc&start=9951&ref_=adv_nxt

total_games = 26,215

On the next page, the "start=9951" parameter is replaced by "after=WzUuNSwidHQ4NjcxMDM2IiwxMDAwMV0%3D".

I was originally going to loop over pages = np.arange(1, total_games, 50), i.e. request every 50th start offset from 1 to 26,215, but the switch from "start" to "after" breaks that approach.

HTML: <a href="/search/title/?title_type=video_game&sort=user_rating,desc&after=WzUuNSwidHQxODAxMDU0IiwxMDA1MV0%3D&ref_=adv_nxt" class="lister-page-next next-page">Next »</a>

How do I extract that portion of the href and append it to the base link so that I can loop over the pages?

Desired outcome:

"https://www.imdb.com/search/title/?title_type=video_game&sort=user_rating,desc&" + "after=WzUuNSwidHQ4NjcxMDM2IiwxMDAwMV0%3D" + "&ref_=adv_nxt"

The "after=..." value is the part of the href that changes on each page, and it is the part I want to grab to move on to the next page.

Any solutions would be greatly appreciated!

1 Answer

You can save yourself the headache and simply check whether the "Next" button exists in the HTML. If it does, extract its href and follow the link; otherwise, you've reached the last page.

Assuming you're using BeautifulSoup and you've prepared your soup:

next_link_tag = soup.find('a', {'class': 'next-page'})  # Find the <a> tag with the class "next-page"
if next_link_tag:  # A "Next" link exists
    # The href is relative and already starts with '/', so prepend the host without a trailing slash
    next_link = 'https://www.imdb.com' + next_link_tag.get('href')
else:
    next_link = None  # There's no next page. Act accordingly
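
To turn that check into a full loop, here is a minimal sketch. It assumes BeautifulSoup (bs4) is installed and that the "Next" link still carries the next-page class shown in the question. The names next_page_url, crawl, after_token, and the fetch callable are hypothetical, introduced only for illustration.

```python
from urllib.parse import parse_qs, urljoin, urlparse

from bs4 import BeautifulSoup

BASE_URL = "https://www.imdb.com"


def next_page_url(html, base_url=BASE_URL):
    """Return the absolute URL of the next results page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("a", class_="next-page")
    if tag is None or not tag.get("href"):
        return None
    # urljoin resolves the relative href against the site root
    return urljoin(base_url, tag["href"])


def crawl(start_url, fetch):
    """Yield (url, html) for each page, following "Next" links until none is left.

    fetch is a hypothetical callable that takes a URL and returns the page HTML,
    for example: lambda url: requests.get(url).text
    """
    url = start_url
    while url:
        html = fetch(url)
        yield url, html
        url = next_page_url(html)


def after_token(url):
    """Return the percent-decoded value of the "after" parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("after")
    return values[0] if values else None
```

Because each page supplies the complete link to the next one, there is usually no need to cut out the "after" value and rebuild the URL by hand. after_token is included only in case you want to log or store the token; note that it returns the decoded value, so it would need to be re-encoded before being placed back into a URL.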