1
votes

I'm attempting to scrape political endorsement data from Wikipedia tables (a pretty generic scraping task), and the usual process of running rvest on the CSS path identified by SelectorGadget is failing.

The wiki page is here, and the CSS path .jquery-tablesorter:nth-child(11) td seems to select the right part of the page (screenshot: the right part of the wikitable is selected).

Armed with the CSS, I would normally just use rvest to access these data directly, as follows:

"https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012" %>% 
   html %>% 
   html_nodes(".jquery-tablesorter:nth-child(11) td")

but this returns:

list()
attr(,"class")
[1] "XMLNodeSet"

Do you have any ideas?

2
What part of the page are you actually trying to get? - Ciarán Tobin
The table, from the "Former President" column to "Notes" - tomw

2 Answers

3
votes

This might help:

library(rvest)

URL <- "https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012"
tab <- URL %>%
  read_html() %>%
  html_node("table.wikitable:nth-child(11)") %>%
  html_table()

This code stores the table you requested as a data frame in the variable tab.

> View(tab)

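If the :nth-child(11) position ever shifts as the article is edited, a less positional variant (not from the original answer) is to pull every wikitable and pick the one you need by index; the index 4 below is an assumption you would confirm by inspecting the list:

library(rvest)

URL <- "https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012"

# Collect every wikitable on the page, then keep the endorsements table by
# position. The [[4]] index is an assumption; check length(tables) and inspect
# a few entries to confirm which table you want.
tables <- URL %>% read_html() %>% html_nodes("table.wikitable")
tab <- tables[[4]] %>% html_table(fill = TRUE)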

1
votes

I find that if I use the XPath suggestion from Chrome, it works.

Chrome suggests an XPath of //*[@id="mw-content-text"]/table[4]

I can then run it as follows:

library(rvest)

URL <- "https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012"
tab <- URL %>%
  read_html() %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[4]') %>%
  html_table()
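
As a quick sanity check (not part of the original answer), you can confirm the scrape produced the expected data frame before working with it:

dim(tab)    # number of rows and columns in the scraped table
head(tab)   # first few endorsement rows, with column names taken from the table header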