18
votes

Is there an easy way to extract data from specific HTML tables using Mathematica? Import seems to be pretty powerful, and Mathematica appears to be capable of handling formats such as XML pretty well.

Here's an example: http://en.wikipedia.org/wiki/Unemployment_by_country

6
IMO, if you're using version 8, JSON is the way to go. There are tons of API's in the wild (typically slinging XML or JSON your way). I wouldn't recommend killing time ripping unemployment data from a Wiki. Find the primary source for what you're interested in and it'll probably have an API. If you just want to rip something quickly, you might also try linked cells in Excel---then you can import into MMA. (Disregard all of this if you just wanna have fun and learn. In that case, parse away!!) :D - telefunkenvf14

6 Answers

13
votes

For general examples of this there are these How tos:

For this specific example just import it

tmp = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Data"]

Cleaning it up is fairly straight forward with this import. The table is 3 columns so extract it from the rest of the stuff:

tmp1 = Cases[tmp, {_, _?NumberQ, _}, \[Infinity]]

You will presumably want to remove the square bracket references (??):

tmp1[[All, 3]] = Flatten[If[StringQ[#], 
StringCases[#, x__ ~~ Whitespace ~~ "[" ~~ __ :> x], #] & /@ tmp1[[All, 3]]]

Grid[tmp1, Frame -> All]

Note also you can add the header back if you want it in your table, which you probably do

Grid[Join[{{"Country / Region", "Unemployment rate (%)", 
   "Source / date of information"}}, tmp1], Frame -> All]

purists might object to the last step but when you are scraping data generally you just want to get the job done and each site is a case by case prospect. So some manual inspection and flexibility gets you the fastest overall result.

Edit

if you wanted the flags you could also get them from CountryData. Some further cleaning up is needed otherwise a lot of misses will occur. The cleanup involves removing the reference to the "sovereign country" in parenthesis. e.g. "Guam ( United States )" -> "Gaum".

tmp2 = Flatten[
  If[StringMatchQ[#, __ ~~ "(" ~~ __], 
     StringCases[#, 
      z__ ~~ Shortest["(" ~~ __ ~~ ")" ~~ EndOfString] :> 
       StringTrim@z], StringTrim[#]] & /@ tmp1[[All, 1]]]

This will still produce some output that CountryData does not recognize.

flags = CountryData[#, "Flag"] & /@ tmp2;
Cases[flags, _CountryData]

6 misses out of 190. Remove those misses from the output:

flags = If[Head[#] === CountryData, {""}, {#}] & /@ flags; (*much faster than rule replacement*)
tmp2 = Join[flags, tmp1, 2];
Grid[tmp2, Frame -> All]

Note that this takes a while to render.

enter image description here

You can obviously style the Grid as desired using Grid options and also resize the images if needed.

6
votes

While the use of Import is probably a better and more robust way, I found that, at least for this particular problem, my own HTML parser (published in this thread), works fine with a small amount of post-processing. If you take the code from there and execute it, augmenting it with this function:

Clear[findAndParseTables];
findAndParseTables[text_String] :=
  Module[{parsed = postProcess@parseText[text]},
    DeleteCases[
      Cases[parsed, _tableContainer, Infinity],
      _attribContainer | _spanContainer, Infinity
    ] //.
    {(supContainer | tdContainer | trContainer | thContainer)[x___] :> {x},
        iContainer[x___] :> x,
        aContainer[x_] :> x,
        "\n" :> Sequence[],
       divContainer[] | ulContainer[] | liContainer[] | aContainer[] :> Sequence[]}];

Then you get, I think, a pretty much complete data by this code:

text = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Text"];
myData = First@findAndParseTables[text];

Here is how the result looks:

In[92]:= Short[myData,5]
Out[92]//Short= 
tableContainer[{{Country / Region},{Unemployment rate (%)},{Source / date of information}},
{{Afghanistan},{35.0},{2008,{3}}},{{Albania},{13.49},{2010 (Q4),{4}}},
{{Algeria},{10.0},{2010 (September),{5}}},<<188>>,{{West Bank},{17.2},{2010,{43}}},
{{Yemen},{35.0},{2009 (June),{128}}},{{Zambia},{16.0},{2005,{129}}},{{Zimbabwe},{97.0},{2009}}]

What I like about this approach (as opposed to say, Import->XMLObject) is that, since I convert the web page into Mathematica expression with minimal syntax (unlike e.g. XML objects), it is often very easy to establish a set of replacement rules which does the right post-processing in each given case. A final disclaimer is that my parser is not robust and does for sure contain a number of bugs, so be warned.

6
votes

Not a direct answer to how to import HTML (which others have explained nicely), but getting data from HTML tables is precisely why I originally made my table paste palette.

If your aim is to just get the data, this is probably going to be easier and faster than trying to parse the page.

Instructions on using the palette

  1. Evaluate the expression that creates the palette, go to Palettes -> Install Palette... and save it permanently for later use (if you wish).

  2. Select a part of the table on the webpage. If you are working with Firefox, hold down CTRL to select any rectangular section of the table (very useful!) Copy it.

  3. If you are using Firefox or Chrome, press the TSV button on the palette to paste the data into the notebook at the current insertion point. I'm not sure if other browsers also separate items with tabs when copying.

The result will look like this:

{{"Afghanistan", 35.`, "2008[3]"}, {"Albania", 13.49`, 
  "2010 (Q4)[4]"}, {"Algeria", 10.`, 
  "2010 (September)[5]"}, {"American Samoa (United States)", 23.8`, 
  "2010[3]"}, {"Andorra", 2.9`, 2009}}

As you can see, some post-processing is needed to convert years to a proper format (string or integer?)


This is the old palette code. I realize it's in need of cleanup, but it works as it is, and I haven't had time to fix it up yet. Report any issues in comments below.

CreatePalette@Column@{Button["TSV",
    Module[{data, strip},
     data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
     strip[s_String] := 
      StringReplace[s, RegularExpression["^\\s*(.*?)\\s*$"] -> "$1"];
     strip[e_] := e;
     If[Head[data] === String,
      NotebookWrite[InputNotebook[],
       ToBoxes@Map[strip, ImportString[data, "TSV"], {2}]]
      ]
     ]
    ],
   Button["CSV",
    Module[{data, strip},
     data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
     strip[s_String] := 
      StringReplace[s, RegularExpression["^\\s*(.*?)\\s*$"] -> "$1"];
     strip[e_] := e;
     If[Head[data] === String,
      NotebookWrite[InputNotebook[],
       ToBoxes@Map[strip, ImportString[data, "CSV"], {2}]]
      ]
     ]
    ],
   Button["Table",
    Module[{data},
     data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
     If[Head[data] === String,
      NotebookWrite[InputNotebook[],
       ToBoxes@ImportString[data, "Table"]]
      ]
     ]
    ]}
5
votes
Import[
  "http://en.wikipedia.org/wiki/Unemployment_by_country",
  "Data"]

Of course, the result will frequently need further processing. How do you want to visualize it?

You can find all Import types using

Import[
  "http://en.wikipedia.org/wiki/Unemployment_by_country",
  "Elements"]
4
votes

If you want to go the Import[ ... , "XMLObject" ] route, here is an outline of what you can do.

First, get the page:

page = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "XMLObject"];

Next, get the table of interest (in this case the big table also happens to be the first of seven tables on this page):

table = Cases[page, XMLElement["table", ___], \[Infinity]][[1]]

Next, get a row from the table, I picked the fourth row which corresponds with Algeria:

row = Cases[table, XMLElement["tr", ___], [Infinity]][[4]]

Next, extract the table data elements () from this row:

data = Cases[row, XMLElement["td", ___], \[Infinity]]

Out of those elements, you can pick for example the country flag thumbnail, like so:

image = Cases[data, XMLElement["img", {___, "src" -> src_, ___}, _] :> src, \[Infinity]]

Finally import that image thumbnail (it needed "http:" prepended for some reason):

Import["http:" <> image]

This is what the notebook looks like (the thumbnail, plus the other inputs):

Mathematica graphics

3
votes

For certain values of 'easy', yes. See here: HTML Import documentation for Mathematica 8.

You can import from tables using the "Data" format option, e.g. Import["file.hml", "Data"]. That's a start, but your link is a whole DOM-tree's worth of tables, divs and other things. It's documented, but thinly, and you'd have to experiment. It does work with URLs though.

This actually works. With a bit of cleaning you could use the data here:

Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Data"]