0
votes

Is there any way to extract

Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: ˈalbɐt ˈaɪnʃtaɪn; 14 March 1879– 18 April 1955) was a German-born theoretical physicist who developed the theory of general relativity, effecting a revolution in physics. ............. with over 150 non-scientific works. [6][8] His great intelligence and originality have made the word "Einstein" synonymous with genius. [9]

(The whole output of the main paragraph, visible if the code is run)

automatically from the output of the following code, even when the output comes from a different Wikipedia page?

import urllib2
import re, sys
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def stripHTMLTags(html):
    html = re.sub(r'<br\s*/?>', '\n', html)  # turn <br> and <br /> into newlines
    s = MLStripper()
    s.feed(html)
    text = s.get_data()
    if "External links" in text:
        text, sep, tail = text.partition('External links')
    if "External Links" in text:
        text, sep, tail = text.partition('External Links')
    text = text.replace("See also","\n\n See Also - \n")
    text = text.replace("*","- ")
    text = text.replace(".", ". ")
    text = text.replace("  "," ")
    text = text.replace("""   /
 / ""","")
    return text

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
print stripHTMLTags(page)

Please excuse my poor formatting and code (and possibly indentation). I'm using a 3" display right now and haven't had a chance to go over my own code :P.

Thanks also to the people whose posts have helped me to get this working :)

3 Answers

3
votes

I'd strongly advise against HTML scraping for any site.

It's painful to do, it breaks easily, and a lot of site owners don't like it.

Use python-wikitools to interface with the Wikipedia API instead; it's your best choice in the long run.
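
To give an idea of what the API route can look like, here is a minimal sketch. It does not use python-wikitools; it assumes Python 3 and only the standard library, and it assumes the wiki exposes the TextExtracts query (`prop=extracts` with `exintro` and `explaintext`), which should return the lead section as plain text. The function names and the User-Agent string are made up for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_intro_url(title):
    """Build a query URL asking for the plain-text lead section of a page."""
    params = {
        "action": "query",
        "prop": "extracts",   # assumption: the TextExtracts extension is enabled
        "exintro": 1,         # only the text before the first section heading
        "explaintext": 1,     # plain text instead of HTML
        "redirects": 1,       # follow redirects to the target page
        "titles": title,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_intro(payload):
    """Pull the extract out of the JSON reply (pages are keyed by page id)."""
    pages = json.loads(payload)["query"]["pages"]
    for page in pages.values():
        return page.get("extract", "")
    return ""


def fetch_intro(title):
    """Download and return the lead section; needs network access."""
    request = Request(
        build_intro_url(title),
        headers={"User-Agent": "intro-example/0.1"},  # placeholder value
    )
    with urlopen(request) as response:
        return parse_intro(response.read().decode("utf-8"))
```

Calling `fetch_intro("Albert Einstein")` should then give the introduction without any tag stripping, and the same call should work for other page titles.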

-1
votes

I'm leaving my answer here because it does directly what the OP asked for. The proper way to do this is to use python-wikitools as suggested in the answer by @ChristophD below.


I have slightly modified the code in your question to use BeautifulSoup. Other options exist. You may also want to try lxml.

import urllib2
import re, sys
from HTMLParser import HTMLParser

# EDIT 1: import the package
from BeautifulSoup import BeautifulSoup

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def stripHTMLTags(html):
    html = re.sub(r'<br\s*/?>', '\n', html)  # turn <br> and <br /> into newlines
    s = MLStripper()
    s.feed(html)
    text = s.get_data()
    if "External links" in text:
        text, sep, tail = text.partition('External links')
    if "External Links" in text:
        text, sep, tail = text.partition('External Links')
    text = text.replace("See also","\n\n See Also - \n")
    text = text.replace("*","- ")
    text = text.replace(".", ". ")
    text = text.replace("  "," ")
    text = text.replace("""   /
 / ""","")
    return text

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()

# EDIT 2: convert the page and extract text from the first <p> tag
soup = BeautifulSoup(page)
para = soup.findAll("p", limit=1)[0].text

print stripHTMLTags(para)
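
One caveat with taking the very first <p>: on some pages it may be empty or may not be the lead paragraph. The sketch below is one way to skip empty paragraphs. It is not the BeautifulSoup approach above; it assumes Python 3 and uses only the standard library's `html.parser`, and the names `FirstParagraph` and `first_paragraph` are made up for the example.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first <p> element that contains visible text."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False  # True while inside a candidate <p>
        self.chunks = []           # text pieces of the current paragraph
        self.result = None         # set once a non-empty paragraph is closed

    def handle_starttag(self, tag, attrs):
        # Start collecting only while no paragraph has been accepted yet.
        if tag == "p" and self.result is None:
            self.in_paragraph = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            text = "".join(self.chunks).strip()
            if text:  # whitespace-only paragraphs are skipped
                self.result = text

    def handle_data(self, data):
        if self.in_paragraph:
            self.chunks.append(data)


def first_paragraph(html):
    """Return the text of the first non-empty paragraph, or '' if none."""
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return parser.result or ""
```

Text inside nested tags such as <b> or <a> is kept because only the <p> boundaries are tracked. Footnote markers like [1] are kept too, so they would still need stripping afterwards.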