1
votes

I'm using BeautifulSoup to scrape an IMDb search page (https://www.imdb.com/search/title/?release_date=2017&sort=num_votes,desc&page=1). I've successfully scraped the name, year, intro, votes, director, etc., but I'm having difficulty scraping "gross" and "actors".


<p class="sort-num_votes-visible">
                <span class="text-muted">Votes:</span>
                <span name="nv" data-value="591671">591,671</span>
    <span class="ghost">|</span>                <span class="text-muted">Gross:</span>
                <span name="nv" data-value="226,277,068">$226.28M</span>
        </p>

<p class="">
    Director:
<a href="/name/nm0003506/?ref_=adv_li_dr_0">James Mangold</a>
                 <span class="ghost">|</span> 
    Stars:
<a href="/name/nm0413168/?ref_=adv_li_st_0">Hugh Jackman</a>, 
<a href="/name/nm0001772/?ref_=adv_li_st_1">Patrick Stewart</a>, 
<a href="/name/nm6748436/?ref_=adv_li_st_2">Dafne Keen</a>, 
<a href="/name/nm2933542/?ref_=adv_li_st_3">Boyd Holbrook</a>
    </p>


Below is the code I used:

import requests
from bs4 import BeautifulSoup

directors=[]
actors=[]
votes=[]
grosses=[]

res_movie = requests.get('http://www.imdb.com/search/title/?release_date='+'2018'+'&sort=num_votes,desc&page='+'1')
bs_movie = BeautifulSoup(res_movie.text,'html.parser')
movies=bs_movie.find_all('div', class_='lister-item mode-advanced')

for movie in movies:

    director=movie.find('p',class_='').find_all('a')[0].text
    directors.append(director)

    actors.append(movie.find('p',class_='').find_all('a')[1:].text) 

    vote=movie.find_all('span', attrs = {'name':'nv'})[0].text
    votes.append(vote)

    gross=movie.find_all('span', attrs = {'name':'nv'})[1].text
    grosses.append(gross)

The error I'm getting from actors:

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)
<ipython-input-70-a969b9a65fa7> in <module>
     60     directors.append(director)
     61 
---> 62     actors.append(movie.find('p',class_='').find_all('a')[:1].text)
     63 
     64 

AttributeError: 'list' object has no attribute 'text'

The error I'm getting from gross:

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)
<ipython-input-69-bd813766e1ca> in <module>
     74     votes.append(vote)
     75 
---> 76     gross=movie.find_all('span', attrs = {'name':'nv'})[1].text
     77     grosses.append(gross)
     78 # print(directors)

IndexError: list index out of range

I was hoping to use the list's index to get the element I desired. I would love to learn the proper method to obtain the element. Thanks so much in advance!!

1
Be aware scraping is against T&C. – QHarr

1 Answer

2
votes

Error on Actors:

find_all() returns a list of the matching elements, and a list has no .text attribute. Iterate over the list and read .text from each element.
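
To make this concrete, here is a minimal sketch that parses the director/stars paragraph quoted in the question from a string instead of fetching the live page. The variable names (html, links, stars) are only illustrative, and the paragraph is located with a plain find("p") because the fragment contains a single <p>.

```python
from bs4 import BeautifulSoup

# Markup copied from the question (one movie's director/stars paragraph).
html = """
<p class="">
    Director:
<a href="/name/nm0003506/?ref_=adv_li_dr_0">James Mangold</a>
    <span class="ghost">|</span>
    Stars:
<a href="/name/nm0413168/?ref_=adv_li_st_0">Hugh Jackman</a>,
<a href="/name/nm0001772/?ref_=adv_li_st_1">Patrick Stewart</a>,
<a href="/name/nm6748436/?ref_=adv_li_st_2">Dafne Keen</a>,
<a href="/name/nm2933542/?ref_=adv_li_st_3">Boyd Holbrook</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find("p").find_all("a")  # a list-like collection of <a> tags

# Slicing gives a plain list, which has no .text attribute.
try:
    links[1:].text
except AttributeError as exc:
    print(exc)  # 'list' object has no attribute 'text'

# Read .text from each element instead.
director = links[0].text
stars = [a.text for a in links[1:]]
print(director)  # James Mangold
print(stars)     # ['Hugh Jackman', 'Patrick Stewart', 'Dafne Keen', 'Boyd Holbrook']
```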

Error on Gross:

For some movies the Gross value is missing, so find_all() returns only one name="nv" span and index [1] is out of range. Check that the second element exists before reading it.
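
A small sketch of that check, run against two fragments: the first is the markup from the question, the second is a made-up variant with the Gross spans removed (its vote count is invented for illustration). The helper name votes_and_gross and the "-" placeholder are assumptions, not part of any library.

```python
from bs4 import BeautifulSoup

# Markup from the question: both Votes and Gross are present.
with_gross = """
<p class="sort-num_votes-visible">
    <span class="text-muted">Votes:</span>
    <span name="nv" data-value="591671">591,671</span>
    <span class="ghost">|</span>
    <span class="text-muted">Gross:</span>
    <span name="nv" data-value="226,277,068">$226.28M</span>
</p>
"""

# Hypothetical movie without a Gross entry (values invented).
without_gross = """
<p class="sort-num_votes-visible">
    <span class="text-muted">Votes:</span>
    <span name="nv" data-value="12345">12,345</span>
</p>
"""

def votes_and_gross(fragment, missing="-"):
    """Return (votes, gross); either falls back to `missing` when absent."""
    soup = BeautifulSoup(fragment, "html.parser")
    nv = soup.find_all("span", attrs={"name": "nv"})
    votes = nv[0].text if nv else missing
    gross = nv[1].text if len(nv) > 1 else missing
    return votes, gross

print(votes_and_gross(with_gross))     # ('591,671', '$226.28M')
print(votes_and_gross(without_gross))  # ('12,345', '-')
```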


Fixed version:

import requests
from bs4 import BeautifulSoup

directors=[]
actors=[]
votes=[]
grosses=[]

url = 'https://www.imdb.com/search/title/?release_date=2018&sort=num_votes,desc&page=1'
res_movie = requests.get(url)
bs_movie = BeautifulSoup(res_movie.text,'html.parser')
movies=bs_movie.find_all('div', class_='lister-item mode-advanced')

for movie in movies:
    # The first <a> in the unclassed <p> is taken as the director and the rest
    # as the stars, so a co-director ends up at the front of the actors list.
    links = movie.find('p', class_='').find_all('a')
    directors.append(links[0].text)
    actors.append([a.text for a in links[1:]])    # <-- using list comprehension

    # Votes and (when present) Gross are both <span name="nv"> elements.
    nv = movie.find_all('span', attrs={'name': 'nv'})
    votes.append(nv[0].text)

    gross = nv[1].text if len(nv) > 1 else '-'    # <-- check if Gross revenue exists for the movie
    grosses.append(gross)

# print the values:
for d, a, v, g in zip(directors, actors, votes, grosses):
    print('{:<22} {!s:<120} {:<12} {}'.format(d, a, v, g))

Prints:

Anthony Russo          ['Joe Russo', 'Robert Downey Jr.', 'Chris Hemsworth', 'Mark Ruffalo', 'Chris Evans']                                     734,642      $678.82M
Ryan Coogler           ['Chadwick Boseman', 'Michael B. Jordan', "Lupita Nyong'o", 'Danai Gurira']                                              557,058      $700.06M
David Leitch           ['Ryan Reynolds', 'Josh Brolin', 'Morena Baccarin', 'Julian Dennison']                                                   429,727      $324.59M
Bryan Singer           ['Rami Malek', 'Lucy Boynton', 'Gwilym Lee', 'Ben Hardy']                                                                398,775      $216.43M
John Krasinski         ['Emily Blunt', 'John Krasinski', 'Millicent Simmonds', 'Noah Jupe']                                                     339,291      $188.02M
Steven Spielberg       ['Tye Sheridan', 'Olivia Cooke', 'Ben Mendelsohn', 'Lena Waithe']                                                        324,204      $137.69M
James Wan              ['Jason Momoa', 'Amber Heard', 'Willem Dafoe', 'Patrick Wilson']                                                         317,403      $335.06M
Ruben Fleischer        ['Tom Hardy', 'Michelle Williams', 'Riz Ahmed', 'Scott Haze']                                                            316,446      $213.52M

...and so on.
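
As an alternative to positional indexing, one could look up each value by its label. The sketch below assumes the markup shown in the question, where every value span directly follows a "Votes:" or "Gross:" label span inside the same <p>; the helper name value_after_label is hypothetical.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
fragment = """
<p class="sort-num_votes-visible">
    <span class="text-muted">Votes:</span>
    <span name="nv" data-value="591671">591,671</span>
    <span class="ghost">|</span>
    <span class="text-muted">Gross:</span>
    <span name="nv" data-value="226,277,068">$226.28M</span>
</p>
"""

def value_after_label(soup, label, missing="-"):
    """Return the text of the name="nv" span that follows the given label span."""
    label_span = soup.find("span", string=label)
    if label_span is None:
        return missing  # the label is not on this movie's card
    value_span = label_span.find_next_sibling("span", attrs={"name": "nv"})
    return value_span.text if value_span is not None else missing

soup = BeautifulSoup(fragment, "html.parser")
print(value_after_label(soup, "Votes:"))   # 591,671
print(value_after_label(soup, "Gross:"))   # $226.28M
print(value_after_label(soup, "Budget:"))  # -
```

This does not depend on how many name="nv" spans a movie has, at the cost of depending on the exact label text.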