
I’m totally new to web scraping and I’m exploring the capabilities of the rvest library in R.

I’m trying to scrape a table on wellbeing in Italian provinces from the following website:

install.packages('rvest') 

library('rvest')

url <- 'http://www.ilsole24ore.com/speciali/qvita_2017_dati/home.shtml'

webpage <- read_html(url)

but I’m unable to identify the XPath of the table.

This site has a JavaScript-generated table, so you won't be able to scrape it unless you use an external tool like PhantomJS. Check the following site: datacamp.com/community/tutorials/… (comment by Hackerman)

1 Answer


Even with the following, you have quite a bit of work to do. The HTML is in terrible shape.

library(rvest)
library(stringi)
library(tidyverse)

read_html("http://www.ilsole24ore.com/speciali/qvita_2017_dati/home.shtml") %>%  # get the main site
  html_node(xpath=".//script[contains(., 'goToDefaultPage')]") %>%               # find the <script> block that dynamically loads the page
  html_text() %>%
  stri_match_first_regex("goToDefaultPage\\('(.*)'\\)") %>%                      # extract the page link
  .[,2] %>% 
  sprintf("http://www.ilsole24ore.com/speciali/qvita_2017_dati/%s", .) %>%       # prepend the URL prefix
  read_html() -> actual_page                                                     # get the dynamic page

tab <- html_nodes(actual_page, xpath=".//table")[[2]]                            # find the actual data table

Once you do that, you have an HTML <table>. It's in terrible, awful, pathetic shape, and that site should really be ashamed of how it abuses HTML.
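The link-extraction step in that pipeline can be checked in isolation. This is only a sketch: the script text and the file name "example_page.shtml" are made up for illustration, and the real page's <script> block will look different.

```r
library(stringi)

# Hypothetical <script> body; "example_page.shtml" is an invented file name
script_text <- "function init() { goToDefaultPage('example_page.shtml'); }"

# Column 1 of the match matrix is the full match, column 2 is the capture group
m <- stri_match_first_regex(script_text, "goToDefaultPage\\('(.*)'\\)")
page_link <- m[, 2]

# Prepend the same URL prefix used in the pipeline
full_url <- sprintf("http://www.ilsole24ore.com/speciali/qvita_2017_dati/%s", page_link)
print(full_url)
```

If the pattern does not match, the capture group comes back as NA, so it is worth checking page_link before calling read_html() on the result.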

Go ahead and try html_table(). The markup is so bad that html_table() chokes on it.

We need to attack it by row, and we'll need a helper function so as not to have the R code look horrible:

`%|0%` <- function(x, y) { if (length(x) == 0) y else x }

That helper lets us fill in zero-length (NULL-like) content with a blank "".
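A quick way to see what the helper does, using made-up inputs rather than the scraped rows:

```r
# Same helper as above: return y when x has length zero, otherwise x
`%|0%` <- function(x, y) { if (length(x) == 0) y else x }

empty_case <- character(0) %|0% ""   # nothing matched, so fall back to ""
null_case  <- NULL %|0% ""           # NULL also has length zero
kept_case  <- "Belluno" %|0% ""      # non-empty input passes through unchanged

print(list(empty_case, null_case, kept_case))
```

This matters for map_df(), which needs every row's list to have a value for each column; a zero-length element would otherwise make the row inconsistent.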

Now we go row by row, extracting the <td> values we need. This does not get all of them, since I don't need this data, and it needs cleaning, as we'll see in a bit:

html_nodes(tab, "tr") %>% 
  map_df(~{
    list(
      posizione = html_text(html_nodes(.x, xpath=".//td[2]"), trim=TRUE) %|0% "",
      diff_pos = html_text(html_nodes(.x, xpath=".//td[5]"), trim=TRUE) %|0% "",
      provincia = html_text(html_nodes(.x, xpath=".//td[8]"), trim=TRUE) %|0% "",
      punti = html_text(html_nodes(.x, xpath=".//td[11]"), trim=TRUE) %|0% "",
      box1 = html_text(html_nodes(.x, xpath=".//td[14]"), trim=TRUE) %|0% "",
      box2 = html_text(html_nodes(.x, xpath=".//td[17]"), trim=TRUE) %|0% "",
      box3 = html_text(html_nodes(.x, xpath=".//td[20]"), trim=TRUE) %|0% ""
    )
  })
## # A tibble: 113 x 7
##     posizione             diff_pos            provincia                 punti  box1  box2  box3
##         <chr>                <chr>                <chr>                 <chr> <chr> <chr> <chr>
##  1            Lavoro e Innovazione                      Giustizia e Sicurezza                  
##  2 Diff. pos.                                                                                  
##  3          1                    3              Belluno                   583                  
##  4          2                   -1                Aosta                   578     9    63    22
##  5          3                    2              Sondrio                   574     4    75     1
##  6          4                    3              Bolzano                   572     2     4     7
##  7          5                   -2               Trento                   567     8    11    15
##  8          6                    4              Trieste                   563     6    10     2
##  9          7                    9 Verbano-Cusio-Ossola                   548    18    73    25
## 10          8                   -6               Milano                   544     1     2    10
## # ... with 103 more rows

As you can see, it misses some things and has some junk in the header, but you're further along than you were before.
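One possible cleanup pass, sketched in base R on a hand-typed copy of the first four rows printed above (the name raw is an assumption for illustration; in practice you would apply this to the tibble returned by map_df()). It keeps only rows whose posizione is a whole number and converts the numeric columns, leaving blanks as NA.

```r
# First four rows of the printed output, typed in by hand for illustration
raw <- data.frame(
  posizione = c("", "Diff. pos.", "1", "2"),
  diff_pos  = c("Lavoro e Innovazione", "", "3", "-1"),
  provincia = c("", "", "Belluno", "Aosta"),
  punti     = c("Giustizia e Sicurezza", "", "583", "578"),
  box1      = c("", "", "", "9"),
  box2      = c("", "", "", "63"),
  box3      = c("", "", "", "22"),
  stringsAsFactors = FALSE
)

# Drop the header junk: real data rows have a purely numeric rank
clean <- raw[grepl("^[0-9]+$", raw$posizione), ]

# Convert every column except the province name to integer; "" becomes NA
num_cols <- setdiff(names(clean), "provincia")
clean[num_cols] <- lapply(clean[num_cols], as.integer)
rownames(clean) <- NULL

print(clean)
```

This does not recover the values the row-wise extraction missed (such as the empty boxes for Belluno); those would need different <td> positions for the affected rows.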