Extract information from HTML using Mathematica

Question

Is there an easy way to extract data from specific HTML tables using Mathematica? Import seems to be pretty powerful, and Mathematica appears to be capable of handling formats such as XML pretty well.

Here's an example: http://en.wikipedia.org/wiki/Unemployment_by_country

IMO, if you're using version 8, JSON is the way to go. There are tons of API's in the wild (typically slinging XML or JSON your way). I wouldn't recommend killing time ripping unemployment data from a Wiki. Find the primary source for what you're interested in and it'll probably have an API. If you just want to rip something quickly, you might also try linked cells in Excel---then you can import into MMA. (Disregard all of this if you just wanna have fun and learn. In that case, parse away!!) :D — telefunkenvf14

Mike Honeychurch Mike Honeychurch · Accepted Answer · 2012-01-10T21:19:07

For general examples of this there are these How tos:

How to | Clean Up Data Imported from a ZIP File
How to | Clean Up Data Imported from a Website

For this specific example just import it

tmp = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Data"]

Cleaning it up is fairly straight forward with this import. The table is 3 columns so extract it from the rest of the stuff:

tmp1 = Cases[tmp, {_, _?NumberQ, _}, \[Infinity]]

You will presumably want to remove the square bracket references (??):

tmp1[[All, 3]] = Flatten[If[StringQ[#], 
StringCases[#, x__ ~~ Whitespace ~~ "[" ~~ __ :> x], #] & /@ tmp1[[All, 3]]]

Grid[tmp1, Frame -> All]

Note also you can add the header back if you want it in your table, which you probably do

Grid[Join[{{"Country / Region", "Unemployment rate (%)", 
   "Source / date of information"}}, tmp1], Frame -> All]

purists might object to the last step but when you are scraping data generally you just want to get the job done and each site is a case by case prospect. So some manual inspection and flexibility gets you the fastest overall result.

Edit

if you wanted the flags you could also get them from CountryData. Some further cleaning up is needed otherwise a lot of misses will occur. The cleanup involves removing the reference to the "sovereign country" in parenthesis. e.g. "Guam ( United States )" -> "Gaum".

tmp2 = Flatten[
  If[StringMatchQ[#, __ ~~ "(" ~~ __], 
     StringCases[#, 
      z__ ~~ Shortest["(" ~~ __ ~~ ")" ~~ EndOfString] :> 
       StringTrim@z], StringTrim[#]] & /@ tmp1[[All, 1]]]

This will still produce some output that CountryData does not recognize.

flags = CountryData[#, "Flag"] & /@ tmp2;
Cases[flags, _CountryData]

6 misses out of 190. Remove those misses from the output:

flags = If[Head[#] === CountryData, {""}, {#}] & /@ flags; (*much faster than rule replacement*)
tmp2 = Join[flags, tmp1, 2];
Grid[tmp2, Frame -> All]

Note that this takes a while to render.

enter image description here

You can obviously style the Grid as desired using Grid options and also resize the images if needed.

Extract information from HTML using Mathematica

6 Answers