BeautifulSoup find_all('href') returns only part of the value

Question

I'm attempting to scrape actor/actress IDs from an IMDB movie page. I only want actors and actresses (not any of the crew), and this question is specifically about getting each person's internal ID. I already have people's names, so I don't need help getting those. I'm starting with this webpage (https://www.imdb.com/title/tt0084726/fullcredits?ref_=tt_cl_sm#cast) as a hard-coded URL to get the code right.

On examining the links, I found that the ones for the actors look like this:

<a href="/name/nm0000638/?ref_=ttfc_fc_cl_t1"> William Shatner</a>
<a href="/name/nm0000559/?ref_=ttfc_fc_cl_t2"> Leonard Nimoy</a>
<a href="/name/nm0346415/?ref_=ttfc_fc_cl_t17"> Nicholas Guest</a>

The ones for other contributors look like this:

<a href="/name/nm0583292/?ref_=ttfc_fc_dr1"> Nicholas Meyer </a>
<a href="/name/nm0734472/?ref_=ttfc_fc_wr1"> Gene Roddenberry</a>

This should let me tell actors and actresses apart from crew such as the director or writer by checking whether the href ends in "t" followed by digits (the regex "t[0-9]+$"), rather than in "dr" or "wr" followed by digits.

Here's the code I'm running.

import urllib.request
from bs4 import BeautifulSoup
import re

movieNumber = 'tt0084726'
url = 'https://www.imdb.com/title/' + movieNumber + '/fullcredits?ref_=tt_cl_sm#cast'

def clearLists(n):
    return [[] for _ in range(n)]

def getSoupObject(urlInput):
    page = urllib.request.urlopen(urlInput).read()
    soup = BeautifulSoup(page, features="html.parser")
    return(soup)

def getPeopleForMovie(soupObject):
    listOfPeopleNames, listOfPeopleIDs, listOfMovieIDs = clearLists(3)

    #get all the tags with links in them
    link_tags = soupObject.find_all('a')

    #get the ids of people
    for linkTag in link_tags:
        link = str(linkTag.get('href'))
        #print(link)
        p = re.compile('t[0-9]+$')
        q = p.search(link)
        if link.startswith('/name/') and q != None:
            id = link[6:15]
            #print(id)
            listOfPeopleIDs.append(id)

    #return the names and IDs
    return listOfPeopleNames, listOfPeopleIDs

newSoupObject = getSoupObject(url)
pNames, pIds = getPeopleForMovie(newSoupObject)

The above code returns an empty list for the IDs, and if you uncomment the print statement you can see that it's because the value that gets put in the "link" variable ends up being what's below (with variations for the specific people)

/name/nm0583292/
/name/nm0000638/

That won't do. I want the IDs only for the actors and actresses so that I can use them later. I've looked for other answers on Stack Overflow but haven't been able to find this particular issue.

This question (Beautifulsoup: parsing html – get part of href) is close to what I want to do, but it gets the info from the text part between tags rather than from the href part in the tag attribute.

How can I make sure I get only the name IDs that I want (just the actor ones) from the page? (Also, feel free to offer suggestions to tighten up the code)

There are some comments to be made about the code, but most importantly, the HTML loaded by your code does not match the HTML rendered in the browser: it does not include the query parameters you are trying to match, so /name/nm0000638/?ref_=ttfc_fc_cl_t1 just looks like /name/nm0000638/. You may need to find another way of matching the actors, for example by only getting links in the cast section. BS should make that rather straightforward. — Grismar

Grismar · Accepted Answer · 2020-06-01T00:52:13

It appears that the links you are trying to match have either been modified by JavaScript after loading, or perhaps get loaded differently based on variables other than the URL alone (like cookies or headers).
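As a rough illustration of why the original filter comes back empty, the same check can be run against both forms of a link. This is only a sketch: the two hrefs are the ones quoted in the question, and the helper name `looks_like_cast` is made up for the example.

```python
import re

# Same pattern as in the question: a "t" followed by digits at the very end
cast_suffix = re.compile(r't[0-9]+$')

# The href as seen in the browser, and as it arrives via urllib (per the question)
browser_href = '/name/nm0000638/?ref_=ttfc_fc_cl_t1'
fetched_href = '/name/nm0000638/'


def looks_like_cast(href):
    # Hypothetical helper mirroring the `if` condition in the question's loop
    return href.startswith('/name/') and cast_suffix.search(href) is not None


print(looks_like_cast(browser_href))  # True
print(looks_like_cast(fetched_href))  # False
```

Cast and crew links both lose their distinguishing suffix in the fetched HTML, so no pattern on the href alone can separate them.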

However, since you're only after links of people in the cast, an easier way would be to simply collect the IDs of people in the cast section. This is actually fairly straightforward, since they are all in a single element, <table class="cast_list">.

So:

import urllib.request
from bs4 import BeautifulSoup
import re

# it's Python, so use Python conventions: no uppercase in function or variable names
movie_number = 'tt0084726'
# an f-string is often more readable than + concatenation
url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'


# this is overly fancy for something as simple as initialising some variables
# how about:
# a, b, c = [], [], []
# def clearLists(n):
#     return [[] for _ in range(n)]


# everything in Python is an object, so "Object" in a name like getSoupObject adds nothing
def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    soup = BeautifulSoup(page, features="html.parser")
    # removed needless parentheses - arguably, even `soup` is superfluous:
    # return BeautifulSoup(page, features="html.parser")
    return soup


# keep two empty lines between functions, it's standard and for good reason
# it's easier to spot where a function starts and stops
# try using an editor or IDE that highlights your PEP8 mistakes, like PyCharm
# (that's just my opinion there, other IDEs than PyCharm will do as well)
def get_people_for_movie(soup_object):
    # removed unused variables, also 'list_of_people_ids' is needlessly verbose
    # since they go together, why not return people as a list of tuples, or a dictionary?
    # I'd prefer a dictionary as it automatically gets rid of duplicates as well
    people = {}

    # (put a space at the start of your comments!)
    # get all the anchor tags inside the `cast_list` table
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')

    # the whole point of compiling the regex is to only have to do it once, 
    # so outside the loop
    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of people
    for link_tag in link_tags:
        # the href attribute is a string, so casting with str() serves no purpose;
        # a default of '' covers anchors that have no href at all
        href = link_tag.get('href', '')
        # matching and extracting part of the match can all be done in one step:
        match = id_regex.search(href)
        if match:
            # don't shadow Python built-ins like `id` with variable names!
            identifier = match.group(1)
            name = link_tag.text.strip()
            # just ignore the ones with no text, they're the thumbs
            if name:
                people[identifier] = name

    # return the names and IDs
    return people


def main():
    # don't do stuff globally, it'll just cause problems when reusing names in functions
    soup = get_soup(url)
    people = get_people_for_movie(soup)
    print(people)


# not needed here, but a good habit, allows you to import stuff without running the main
if __name__ == '__main__':
    main()

Result:

{'0000638': 'William Shatner', '0000559': 'Leonard Nimoy', '0001420': 'DeForest Kelley', etc.

And the code with a few more tweaks and without the commentary on your code:

import urllib.request
from bs4 import BeautifulSoup
import re


def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    return BeautifulSoup(page, features="html.parser")


def get_people_for_movie(soup_object):
    people = {}

    link_tags = soup_object.find('table', class_='cast_list').find_all('a')

    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of the cast
    for link_tag in link_tags:
        match = id_regex.search(link_tag.get('href', ''))
        if match:
            name = link_tag.text.strip()
            if name:
                people[match.group(1)] = name

    return people


def main():
    movie_number = 'tt0084726'
    url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'

    people = get_people_for_movie(get_soup(url))
    print(people)


if __name__ == '__main__':
    main()
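
If you want to check the scoping logic without hitting IMDb, something like the following can be run against a hand-written fragment. Treat it as a sketch under assumptions: the markup is a simplified stand-in made up for this example (only the `cast_list` class and the href shapes come from the real page), and `cast_from_html` is a hypothetical helper. Unlike the code above, it returns an empty dict instead of raising when the table is missing.

```python
import re
from bs4 import BeautifulSoup

# Made-up, heavily simplified fragment; NOT IMDb's real markup
SAMPLE_HTML = """
<table class="simpleTable"><tr>
  <td><a href="/name/nm0583292/"> Nicholas Meyer </a></td>
</tr></table>
<table class="cast_list"><tr>
  <td><a href="/name/nm0000638/"><img alt="thumb"></a></td>
  <td><a href="/name/nm0000638/"> William Shatner</a></td>
</tr><tr>
  <td><a href="/name/nm0000559/"> Leonard Nimoy</a></td>
  <td><a href="/title/tt0084726/characters/nm0000559"> Spock</a></td>
</tr></table>
"""

ID_REGEX = re.compile(r'/name/nm(\d+)/')


def cast_from_html(html):
    soup = BeautifulSoup(html, features='html.parser')
    cast_table = soup.find('table', class_='cast_list')
    if cast_table is None:
        # no cast table in this document, so nothing to collect
        return {}

    people = {}
    # passing a compiled regex as the href filter keeps only matching anchors
    for link_tag in cast_table.find_all('a', href=ID_REGEX):
        name = link_tag.get_text(strip=True)
        if name:  # thumbnail links have no text, skip them
            people[ID_REGEX.search(link_tag['href']).group(1)] = name
    return people


print(cast_from_html(SAMPLE_HTML))
# {'0000638': 'William Shatner', '0000559': 'Leonard Nimoy'}
```

The director's link sits in a different table and the character link does not start with /name/, so neither ends up in the result.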
