I'm using BeautifulSoup to scrape some basic information from a set of Wikipedia pages. The program runs, but slowly (roughly 20 minutes for 650 pages). I'm trying to use multiprocessing to speed it up, but it isn't behaving as expected: it either hangs without making progress, or it scrapes only the first letter of each page's name.
The scraping code I'm using is:
import requests
from bs4 import BeautifulSoup
import wikipedia

# dict where key is person's name and value is proper wikipedia url formatting
all_wikis = {'Adam Ferrara': 'Adam_Ferrara',
             'Adam Hartle': 'Adam_Hartle',
             'Adam Ray': 'Adam_Ray_(comedian)',
             'Adam Sandler': 'Adam_Sandler',
             'Adele Givens': 'Adele_Givens'}
bios = {}

def scrape(dictionary):
    for key in dictionary:
        # search each page
        page = requests.get("https://en.wikipedia.org/wiki/" + str(key))
        data = page.text
        soup = BeautifulSoup(data, "html.parser")
        # get data
        try:
            bday = soup.find('span', attrs={'class': 'bday'}).text
        except:
            bday = 'Birthday Unknown'
        try:
            birthplace = soup.find('div', attrs={'class': 'birthplace'}).text
        except:
            birthplace = 'Birthplace Unknown'
        try:
            death_date = (soup.find('span', attrs={'style': "display:none"}).text
                          .replace("(", "")
                          .replace(")", ""))
            living_status = 'Deceased'
        except:
            living_status = 'Alive'
            death_date = 'N/A'  # otherwise death_date is undefined (or stale) for living people
        try:
            summary = wikipedia.summary(dictionary[key].replace("_", " "))
        except:
            summary = "No Summary"
        bios[key] = {}
        bios[key]['birthday'] = bday
        bios[key]['home_town'] = birthplace
        bios[key]['summary'] = summary
        bios[key]['living_status'] = living_status
        bios[key]['passed_away'] = death_date
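The function is currently driven by a single serial call over the whole dict, something like:

scrape(all_wikis)  # serial run: one process walks every page, ~20 minutes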
I've tried adding multiprocessing to the end of the script using the code below, but it either doesn't work or pulls only the first letter of each page (for example, if the page I'm searching for is Bruce Lee, it instead pulls up the Wikipedia page for the letter B and then throws a bunch of errors).
from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    pool = Pool(cpu_count())
    results = pool.map(func=scrape, iterable=all_wikis)
    pool.close()
    pool.join()
Is there a better way to structure my script to allow for multiprocessing?
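For what it's worth, my current theory is that pool.map iterates over the dict's keys, so each worker receives a bare name string, and the for key in dictionary loop inside scrape then walks that string one character at a time (which would explain the letter-B page). Here is a minimal sketch of the restructure I've been considering, assuming a hypothetical scrape_one that takes a single (name, url) pair and returns its result, since the module-level bios dict isn't shared between worker processes (only the birthday lookup is shown):

import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool, cpu_count

all_wikis = {'Adam Ferrara': 'Adam_Ferrara',
             'Adam Hartle': 'Adam_Hartle'}  # trimmed copy of the dict above

def scrape_one(item):
    # hypothetical per-page version of scrape: one (name, url) pair in,
    # one (name, bio) pair out, instead of mutating a global dict
    name, url_title = item
    page = requests.get("https://en.wikipedia.org/wiki/" + url_title)
    soup = BeautifulSoup(page.text, "html.parser")
    try:
        bday = soup.find('span', attrs={'class': 'bday'}).text
    except AttributeError:
        bday = 'Birthday Unknown'
    # ... the other try/except blocks from scrape would carry over unchanged ...
    return name, {'birthday': bday}

if __name__ == '__main__':
    with Pool(cpu_count()) as pool:
        # .items() hands each worker a full (name, url) pair; mapping over
        # the dict itself passes bare key strings, whose characters the
        # original for-loop then treats as page names
        results = pool.map(scrape_one, all_wikis.items())
    bios = dict(results)

Is that roughly the right shape, or is there a cleaner pattern for this?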
page = requests.get(("https://en.wikipedia.org/wiki/" + str(key)))
to
page = requests.get(("https://en.wikipedia.org/wiki/" + str(dictionary[key])))
– sam