I'm attempting to scrape actor/actress IDs from an IMDB movie page. I only want actors and actresses (I don't want any of the crew), and this question is specifically about getting each person's internal ID. I already have people's names, so I don't need help getting those. I'm starting with this webpage (https://www.imdb.com/title/tt0084726/fullcredits?ref_=tt_cl_sm#cast) as a hard-coded URL to get the code right.
On examining the links, I found that the ones for the actors look like this:
<a href="/name/nm0000638/?ref_=ttfc_fc_cl_t1"> William Shatner</a>
<a href="/name/nm0000559/?ref_=ttfc_fc_cl_t2"> Leonard Nimoy</a>
<a href="/name/nm0346415/?ref_=ttfc_fc_cl_t17"> Nicholas Guest</a>
while the ones for other contributors look like this
<a href="/name/nm0583292/?ref_=ttfc_fc_dr1"> Nicholas Meyer </a>
<a href="/name/nm0734472/?ref_=ttfc_fc_wr1"> Gene Roddenberry</a>
This should allow me to differentiate actors/actresses from crew like the director or writer by checking whether the href ends with a match for "t[0-9]+$" rather than ending in "dr" or "wr" followed by digits.
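As a quick check of that idea, here is a minimal sketch that applies the same pattern to the five sample hrefs listed above. The helper name `is_cast_href` is made up for illustration, and the sketch only shows that the pattern separates these sample strings; it says nothing about what the fetched page actually contains.

```python
import re

# Sample hrefs copied from the links shown above.
hrefs = [
    "/name/nm0000638/?ref_=ttfc_fc_cl_t1",
    "/name/nm0000559/?ref_=ttfc_fc_cl_t2",
    "/name/nm0346415/?ref_=ttfc_fc_cl_t17",
    "/name/nm0583292/?ref_=ttfc_fc_dr1",
    "/name/nm0734472/?ref_=ttfc_fc_wr1",
]

# Same pattern as in the question: a "t" followed by digits at the very end.
CAST_SUFFIX = re.compile(r"t[0-9]+$")


def is_cast_href(href):
    """Hypothetical helper: True if the href looks like a cast-list link."""
    return href.startswith("/name/") and CAST_SUFFIX.search(href) is not None


cast_hrefs = [h for h in hrefs if is_cast_href(h)]
crew_hrefs = [h for h in hrefs if not is_cast_href(h)]
```

With these inputs, the three "cl_t" links end up in `cast_hrefs` and the "dr1" and "wr1" links end up in `crew_hrefs`.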
Here's the code I'm running.
import urllib.request
from bs4 import BeautifulSoup
import re

movieNumber = 'tt0084726'
url = 'https://www.imdb.com/title/' + movieNumber + '/fullcredits?ref_=tt_cl_sm#cast'

def clearLists(n):
    return [[] for _ in range(n)]

def getSoupObject(urlInput):
    page = urllib.request.urlopen(urlInput).read()
    soup = BeautifulSoup(page, features="html.parser")
    return(soup)

def getPeopleForMovie(soupObject):
    listOfPeopleNames, listOfPeopleIDs, listOfMovieIDs = clearLists(3)
    #get all the tags with links in them
    link_tags = soupObject.find_all('a')
    #get the ids of people
    for linkTag in link_tags:
        link = str(linkTag.get('href'))
        #print(link)
        p = re.compile('t[0-9]+$')
        q = p.search(link)
        if link.startswith('/name/') and q != None:
            id = link[6:15]
            #print(id)
            listOfPeopleIDs.append(id)
    #return the names and IDs
    return listOfPeopleNames, listOfPeopleIDs

newSoupObject = getSoupObject(url)
pNames, pIds = getPeopleForMovie(newSoupObject)
The above code returns an empty list for the IDs. If you uncomment the print statement, you can see why: the value that gets put in the "link" variable ends up looking like what's below (with variations for the specific people), without the "?ref_=..." part that the regex relies on.
/name/nm0583292/
/name/nm0000638/
That won't do. I want the IDs only for the actors and actresses so that I can use those IDs later. I've looked for other answers on Stack Overflow, but I haven't been able to find this particular issue.
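Separately from the filtering problem, the ID itself can be pulled out of either form of the href with a capture group instead of the fixed slice `link[6:15]`. This is a minimal sketch; the helper name `extract_name_id` is hypothetical, and the pattern assumes the ID is "nm" followed by digits, as in the examples above.

```python
import re

# "/name/", then the ID ("nm" plus digits), then "/". Anything after is ignored,
# so the href matches with or without a trailing "?ref_=..." query string.
NAME_ID = re.compile(r"^/name/(nm[0-9]+)/")


def extract_name_id(href):
    """Hypothetical helper: return the person ID from a /name/ href, or None."""
    match = NAME_ID.match(href)
    return match.group(1) if match else None


short_form = extract_name_id("/name/nm0000638/")
long_form = extract_name_id("/name/nm0000638/?ref_=ttfc_fc_cl_t1")
not_a_person = extract_name_id("/title/tt0084726/")
```

Because the capture group stops at the next slash, this does not depend on the ID having a particular number of digits.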
This question (Beautifulsoup: parsing html – get part of href) is close to what I want to do, but it gets the info from the text part between tags rather than from the href part in the tag attribute.
How can I make sure I get only the name IDs that I want (just the actor ones) from the page? (Also, feel free to offer suggestions to tighten up the code)
/name/nm0000638/?ref_=ttfc_fc_cl_t1 just looks like /name/nm0000638/. You may need to think of another way of matching the actors, for example simply getting only the links in the cast section? BS should make that rather straightforward. – Grismar
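The suggestion in the comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the inline `SAMPLE_HTML` is made up for the example, and the selector `table.cast_list` is a guess at how the cast section is marked up, so it should be checked against the real page source. The function name `get_cast_ids` is likewise hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup standing in for the credits page (an assumption, not real IMDB HTML).
SAMPLE_HTML = """
<h4 id="director">Directed by</h4>
<table class="simpleTable">
  <tr><td><a href="/name/nm0583292/"> Nicholas Meyer </a></td></tr>
</table>
<h4 id="cast">Cast</h4>
<table class="cast_list">
  <tr><td><a href="/name/nm0000638/"> William Shatner</a></td></tr>
  <tr><td><a href="/name/nm0000559/"> Leonard Nimoy</a></td></tr>
</table>
"""

NAME_ID = re.compile(r"^/name/(nm[0-9]+)/")


def get_cast_ids(soup, cast_selector="table.cast_list"):
    """Collect person IDs only from links inside the cast section.

    cast_selector is an assumed CSS selector for the cast table.
    """
    cast_section = soup.select_one(cast_selector)
    if cast_section is None:
        return []
    ids = []
    for link_tag in cast_section.find_all("a", href=True):
        match = NAME_ID.match(link_tag["href"])
        # Skip non-person links and avoid listing the same person twice.
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cast_ids = get_cast_ids(soup)
```

Restricting the search to one section means the filter no longer depends on the "?ref_=..." suffix, which is the part missing from the hrefs in the fetched HTML.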