
I am using BeautifulSoup to scrape some basic information off of a bunch of Wikipedia pages. The program runs, but slowly (approximately 20 minutes for 650 pages). I'm trying to use multiprocessing to speed this up, but it's not working as expected: it either seems to hang and do nothing, or it scrapes only the first letter of each page's name.

The scraping code I'm using is:

#dict where key is person's name and value is proper wikipedia url formatting
all_wikis = { 'Adam Ferrara': 'Adam_Ferrara',
              'Adam Hartle': 'Adam_Hartle',
              'Adam Ray': 'Adam_Ray_(comedian)',
              'Adam Sandler': 'Adam_Sandler',
              'Adele Givens': 'Adele_Givens'}
bios = {}
def scrape(dictionary):
    for key in dictionary:
        #search each page
        page = requests.get(("https://en.wikipedia.org/wiki/" + str(key)))
        data = page.text
        soup = BeautifulSoup(data, "html.parser")
        #get data
        try:
            bday = soup.find('span', attrs={'class' : 'bday'}).text
        except:
            bday = 'Birthday Unknown'
        try:
            birthplace = soup.find('div', attrs={'class' : 'birthplace'}).text
        except:
            birthplace = 'Birthplace Unknown'
        try:
            death_date = (soup.find('span', attrs={'style' : "display:none"}).text
                                                                            .replace("(", "")
                                                                            .replace(")", ""))
            living_status = 'Deceased'
        except:
            living_status = 'Alive'
        try:
            summary = wikipedia.summary(dictionary[key].replace("_", " "))
        except:
            summary = "No Summary"
        bios[key] = {}
        bios[key]['birthday'] = bday
        bios[key]['home_town'] = birthplace
        bios[key]['summary'] = summary
        bios[key]['living_status'] = living_status
        bios[key]['passed_away'] = death_date

I've tried to add multiprocessing to the end of the script using the code below, but it doesn't work, or it only pulls the page for the first letter of each name (for example, if the page I'm searching for is Bruce Lee, it instead pulls up the Wikipedia page for the letter B and then throws a bunch of errors).

from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    pool = Pool(cpu_count())
    results = pool.map(func=scrape, iterable=all_wikis)
    pool.close()
    pool.join()

Is there a better way to structure my script to allow for multiprocessing?

Welcome to Stack Overflow! I think you need to update page = requests.get(("https://en.wikipedia.org/wiki/" + str(key))) to page = requests.get(("https://en.wikipedia.org/wiki/" + str(dictionary[key]))). – sam
Can you post your expected output structure for this example dict? The web scrape itself doesn't appear to be pulling the correct data, and I can offer a fix for that as well (my answer only discusses the multiprocessing problems). Thanks. – ggorlen

1 Answer


There are a few issues here:

  • dictionary is each string key in the all_wikis dict, not the whole dict: pool.map iterates over the dict's keys and passes a single name to each call of scrape. When you then iterate over that string with for key in dictionary:, you're iterating over its characters, so your first request is to https://en.wikipedia.org/wiki/A, which is not the desired result.
  • str(key) isn't really helpful even if dictionary were a single name. You need to look up the correct URL suffix with all_wikis[name]. As an aside, avoid generic variable names like dictionary.
  • Since you're multiprocessing, each worker runs in its own process, so module-level data like bios isn't shared between them. The easiest fix is to use the return value of the map function, which collects the return values of all of the worker calls.
  • There are logic issues with your scraping: wikipedia.summary is undefined, and, without being sure of your exact desired outcome, the death check reports Adam Sandler as deceased. I'll leave this as an exercise for the reader since this question is mainly about multiprocessing.
  • I'm not sure multiprocessing is as desirable here as multithreading. Since your process will be blocked waiting on network requests 99% of the time, I bet you can gain efficiency by using many more threads (or processes) than the number of cores you have. Multiprocessing is better suited to CPU-bound work, which is not the case here; very little time is actually spent in the Python process itself. I'd recommend testing the code by increasing the number of processes (or threads, if you refactor for that) beyond the number of cores until you stop seeing improvements; see the thread-based sketch at the end of this answer.

Here's some code to get you started. I stuck with multiprocessing per your example and didn't adjust the web scraping logic.

import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool, cpu_count

all_wikis = {'Adam Ferrara': 'Adam_Ferrara',
             'Adam Hartle': 'Adam_Hartle',
             'Adam Ray': 'Adam_Ray_(comedian)',
             'Adam Sandler': 'Adam_Sandler',
             'Adele Givens': 'Adele_Givens'}

def scrape(name):
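    # look up the proper article title in all_wikis and fetch the page HTML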
    data = requests.get("https://en.wikipedia.org/wiki/" + all_wikis[name]).text
    soup = BeautifulSoup(data, "html.parser")
    bio = {}

    try:
        bio['birthday'] = soup.find('span', attrs={'class': 'bday'}).text
    except:
        bio['birthday'] = 'Birthday Unknown'

    try:
        bio['home_town'] = soup.find('div', attrs={'class': 'birthplace'}).text
    except:
        bio['home_town'] = 'Birthplace Unknown'

    try:
        bio['passed_away'] = (soup.find('span', attrs={'style': "display:none"}).text
                                                                        .replace("(", "")
                                                                        .replace(")", ""))
        bio['living_status'] = 'Deceased'
    except:
        bio['living_status'] = 'Alive'

    bio['summary'] = "No Summary"
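    # return a (name, bio) pair so the caller can rebuild the bios dict with dict()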
    return name, bio


if __name__ == '__main__':
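    # one worker process per core; since the work is I/O-bound, a larger pool may help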
    pool = Pool(cpu_count())
    bios = dict(pool.map(func=scrape, iterable=all_wikis))
    pool.close()
    pool.join()
    print(bios)
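
Since the bottleneck is network I/O rather than CPU, you may get even more out of a thread pool than a process pool. Below is a minimal sketch using concurrent.futures.ThreadPoolExecutor that reuses the all_wikis dict and scrape function from the code above; the max_workers value is just a starting point to tune, not a recommendation.

from concurrent.futures import ThreadPoolExecutor

# reuses all_wikis and scrape() from the multiprocessing example above

if __name__ == '__main__':
    # threads share memory, so arguments and results don't have to be
    # pickled between processes; max_workers=32 is an arbitrary starting
    # point -- raise or lower it until throughput stops improving
    with ThreadPoolExecutor(max_workers=32) as executor:
        bios = dict(executor.map(scrape, all_wikis))

    print(bios)

executor.map iterates over the dict's keys just like pool.map does and preserves input order, so dict() builds the same name-to-bio mapping as the multiprocessing version.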