
I would like to extract at least 20 user reviews for each movie, but I don't know how to loop from the IMDb search results into each movie's title page and then into its user reviews page with BeautifulSoup.

start link = "https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250"

title_link(1) = "https://www.imdb.com/title/tt7131622/?ref_=adv_li_tt"

user_reviews_link_movie1 = "https://www.imdb.com/title/tt7131622/reviews?ref_=tt_ov_rt"

From the static list page I am able to extract the title, year, IMDb rating and Metascore of each movie.

# Import packages and set url

from requests import get
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250'
response = get(url)
print(response.text[:500])

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

# Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from each individual movie container
for container in movie_containers:
    # If the movie has a Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:
        # The name
        name = container.h3.a.text
        names.append(name)
        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)
        # The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        # The Metascore
        m_score = container.find('span', class_ = 'metascore').text
        metascores.append(int(m_score))

test_df = pd.DataFrame({'movie': names, 'year': years, 'imdb': imdb_ratings, 'metascore': metascores})
test_df
  1. Actual results:

        movie                               year    imdb  metascore
        Once Upon a Time... in Hollywood    (2019)  8.1   83
        Scary Stories                       (2019)  6.5   61
        Fast & Furious: Hobbs & Shaw        (2019)  6.8   60
        Avengers: Endgame                   (2019)  8.6   78

  2. Expected:

    movie1 year1 imdb1 metascore1 review1

    movie1 year1 imdb1 metascore1 review2

    ...

    movie1 year1 imdb1 metascore1 review20

    movie2 year2 imdb2 metascore2 review1

    ...

    movie2 year2 imdb2 metascore2 review20

    ...

    movie250 year250 imdb250 metascore250 review20

Why would you want to repeat movie1 year1 imdb1 metascore1 20 times? – Jack Fleeting

To get 20 reviews for each film – Man81

Yes, I get that, but it doesn't mean you have to repeat 20 items for 250 movies. I'm not a database management expert, but you should probably think about doing it with two DFs: one for movies only and one for reviews only, with the two related by a common key such as the movie name (if they are all unique) or a movie ID you assign to each and include in both DFs. – Jack Fleeting

So, taking into account the comment above, will it still be acceptable for you to repeat each film's name and other characteristics 20 times in the result dataframe? – Dmitriy Fialkovskiy
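
For reference, a minimal sketch of the two-DataFrame layout suggested in the comments above. The movie_id surrogate key and all values here are made up purely for illustration; they are not part of the original code:

import pandas as pd

# One row per movie; movie_id is an illustrative surrogate key (values are made up)
movies_df = pd.DataFrame({
    'movie_id': [1, 2],
    'movie': ['Avengers: Endgame', 'Scary Stories'],
    'year': ['(2019)', '(2019)'],
    'imdb': [8.6, 6.5],
    'metascore': [78, 61],
})

# One row per review, linked back to movies_df through movie_id
reviews_df = pd.DataFrame({
    'movie_id': [1, 1, 2],
    'review_title': ['Title A', 'Title B', 'Title C'],
    'review_body': ['Body A...', 'Body B...', 'Body C...'],
})

# The flat layout from the question is then just a join
flat_df = movies_df.merge(reviews_df, on='movie_id')

This avoids storing each movie's attributes 20 times: the flat table is produced on demand by the merge rather than kept around.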

1 Answer


Assuming that the answer to my question in the comments is "yes".

Below is a solution to your initial request. It checks whether a particular film really has 20 reviews; if there are fewer, it gathers all the available ones.

Technically the parsing process is correct; I checked it by assigning movie_containers = movie_containers[:3]. Gathering all the data will take some time.

UPDATE: I just finished collecting info on all 250 films. Everything scraped without errors, so the block after the solution itself is just FYI.

Also, if you want to go further with your parsing, that is, collect data for the next 250 films and so on, you can add one more looping level to this parser, as sketched below. The process is similar to the one in the "Reviews extracting" section.
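
A minimal sketch of that extra level, assuming the search page accepts the 1-based start parameter that the old IMDb search interface used (the page range is illustrative; verify against the "Next" link on the live page):

from requests import get
from bs4 import BeautifulSoup

base_url = ('https://www.imdb.com/search/title/'
            '?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250')

# Outer loop over result pages: with count=250, page 2 starts at start=251, etc.
for start in range(1, 1001, 250):  # first four pages, i.e. up to 1000 films
    page_response = get(base_url + '&start=' + str(start))
    page_soup = BeautifulSoup(page_response.text, 'html.parser')
    movie_containers = page_soup.find_all('div', class_ = 'lister-item mode-advanced')
    if not movie_containers:  # no more results
        break
    # ... run the per-movie loop from the solution below on movie_containers ...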

# Import packages and set urls

from requests import get
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250'
url_header_for_reviews = 'https://www.imdb.com'
url_tail_for_reviews = 'reviews?ref_=tt_urv'
base_response = get(base_url)
html_soup = BeautifulSoup(base_response.text, 'html.parser')

movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')

partial_dfs = []

# Extract data from each individual movie container
for container in movie_containers:
    # If the movie has a Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:

        # Reviews extracting
        num_reviews = 20
        # Getting the last piece of the link puzzle for a movie's reviews link
        url_middle_for_reviews = container.find('a')['href']
        # Opening the reviews page of a concrete movie
        response_reviews = get(url_header_for_reviews + url_middle_for_reviews + url_tail_for_reviews)
        reviews_soup = BeautifulSoup(response_reviews.text, 'html.parser')
        # Searching all reviews
        reviews_containers = reviews_soup.find_all('div', class_ = 'imdb-user-review')
        # Check if the actual number of reviews is less than the target one
        if len(reviews_containers) < num_reviews:
            num_reviews = len(reviews_containers)
        # Skip movies with no reviews at all
        if num_reviews == 0:
            continue
        # Looping through each review and extracting title and body
        reviews_titles = []
        reviews_bodies = []
        for review_index in range(num_reviews):
            review_container = reviews_containers[review_index]
            review_title = review_container.find('a', class_ = 'title').text.strip()
            review_body = review_container.find('div', class_ = 'text').text.strip()
            reviews_titles.append(review_title)
            reviews_bodies.append(review_body)
        # The name, repeated once per review row
        name = container.h3.a.text
        names = [name for i in range(num_reviews)]
        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years = [year for i in range(num_reviews)]
        # The IMDB rating
        imdb_rating = float(container.strong.text)
        imdb_ratings = [imdb_rating for i in range(num_reviews)]
        # The Metascore
        metascore = container.find('span', class_ = 'metascore').text
        metascores = [metascore for i in range(num_reviews)]

        # Gathering up the scraped data for this movie
        partial_dfs.append(pd.DataFrame({'movie': names, 'year': years, 'imdb': imdb_ratings,
                                         'metascore': metascores, 'review_title': reviews_titles,
                                         'review_body': reviews_bodies}))

# Concatenating the per-movie frames (DataFrame.append was removed in pandas 2.0)
result_df = pd.concat(partial_dfs, ignore_index=True)

Btw, I'm not sure that IMDb will let you gather data for all the films in a loop as is. There's a possibility that you'll get a captcha or a redirection to some other page. If these issues appear, I'd go with a simple solution: pauses between requests and/or changing user agents.

Pause (sleep) can be implemented as follows:

import time
import numpy as np

time.sleep((30 - 5) * np.random.random() + 5)  # random pause from 5 to 30 seconds
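
In the solution above, a natural place for such a pause is inside the movie loop, just before each request to a reviews page.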

Inserting a user agent into a request can be done as follows:

import requests
from bs4 import BeautifulSoup

url = 'http://www.link_you_want_to_make_request_on.com/bla_bla'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

Google some other user-agent variants, make a list of them, and change them from time to time across your requests. Watch out, though, which user agents you use: some of them indicate mobile or tablet devices, and for those a site (not only IMDb) can serve pages in a format that differs from the desktop one (different markup, different design, etc.). So in general the algorithm above works only for the desktop version of the pages.
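
A minimal sketch of such rotation, assuming a hand-made list (the strings below are illustrative examples; collect and verify your own desktop user agents):

import random
import requests

# Illustrative desktop user-agent strings; build and verify your own list
user_agents = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0',
]

# Pick a fresh user agent for each request
headers = {'User-Agent': random.choice(user_agents)}
response = requests.get('https://www.imdb.com', headers=headers)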