3
votes

I'm writing Java code to perform NLP tasks on texts from Wikipedia. How can I use Jsoup to extract all the text of a Wikipedia article (for example, all the text in http://en.wikipedia.org/wiki/Boston)?

3
Is parsing the text with jsoup part of the interesting problem? Because if not, you should just use the action=raw parameter to get the source for each page. e.g. en.wikipedia.org/w/index.php?title=Elephant&action=raw - beerbajay
That returns the Wiki markup. - Hauke Ingmar Schmidt
Use this, it's more robust and easier on the Wikipedia servers too: trulymadlywordly.blogspot.com/2011/03/… - Maarten
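A minimal, dependency-free sketch of the action=raw approach mentioned in the comments — it just builds the raw-markup URL for a given title (the rawUrl helper name is my own):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RawWikiUrl {
    // Build the index.php?action=raw URL that returns the wiki markup
    // (not HTML) for a given article title.
    static String rawUrl(String title) {
        return "https://en.wikipedia.org/w/index.php?title="
                + URLEncoder.encode(title, StandardCharsets.UTF_8)
                + "&action=raw";
    }

    public static void main(String[] args) {
        System.out.println(rawUrl("Boston"));
        // prints: https://en.wikipedia.org/w/index.php?title=Boston&action=raw
    }
}
```

Note that, as the comment below points out, the response is wiki markup rather than HTML, so jsoup would not be the right tool to parse it.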

3 Answers

3
votes
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").get();
Element contentDiv = doc.select("div[id=content]").first();
String html = contentDiv.toString(); // the formatted HTML of the content div

This retrieves the formatted HTML content. If you want the plain text instead, you can filter the result with Jsoup.clean or simply call contentDiv.text().
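As a quick offline illustration of the toString() vs. text() difference, parsing a tiny inline HTML string instead of fetching the Boston page:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ContentText {
    public static void main(String[] args) {
        // Parse an inline snippet shaped like the page's content div
        // (no network access needed).
        Document doc = Jsoup.parse("<div id=content><p>Boston is a <b>city</b>.</p></div>");
        Element contentDiv = doc.select("div[id=content]").first();

        // text() strips all markup and returns the plain text.
        System.out.println(contentDiv.text()); // prints: Boston is a city.

        // contentDiv.toString() would instead return the formatted HTML,
        // including the <p> and <b> tags.
    }
}
```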

3
votes
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");

for (Element p : paragraphs) {
    System.out.println(p.text());
}
0
votes
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000).get();

Elements elementsById = doc.select("#iamID"); // select by the id of the intended tag

System.out.println(elementsById.toString());

OR

Elements elementsByClass = doc.select(".iamCLASS"); // select by the class of the intended tag

System.out.println(elementsByClass.toString());