Even with the following, you have quite a bit of work to do. The HTML is in terrible shape.
library(rvest)
library(stringi)
library(tidyverse)
read_html("http://www.ilsole24ore.com/speciali/qvita_2017_dati/home.shtml") %>% # get the main site
  html_node(xpath=".//script[contains(., 'goToDefaultPage')]") %>% # find the <script> block that dynamically loads the page
  html_text() %>%
  stri_match_first_regex("goToDefaultPage\\('(.*)'\\)") %>% # extract the page link
  .[,2] %>%
  sprintf("http://www.ilsole24ore.com/speciali/qvita_2017_dati/%s", .) %>% # prepend the URL prefix
  read_html() -> actual_page # get the dynamic page
tab <- html_nodes(actual_page, xpath=".//table")[[2]] # find the actual data table
Once you do ^^ you have an HTML <table>. It's in terrible, awful, pathetic shape and that site should really be ashamed of how it abuses HTML.
Go ahead and try html_table(). It's so bad it breaks rvest.
We need to attack it by row, and we'll need a helper function so the R code doesn't look horrible:
`%|0%` <- function(x, y) { if (length(x) == 0) y else x }
^^ will help us fill in NULL-like content with a blank "".
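To make the helper's behaviour concrete, here is a small sketch in base R only. It assumes nothing beyond the definition above: a lookup for a <td> that a row doesn't have gives a zero-length result, and the helper swaps that for the fallback.

```r
# Same helper as above: return y when x is zero-length, otherwise x
`%|0%` <- function(x, y) { if (length(x) == 0) y else x }

missing_cell <- character(0) %|0% ""  # zero-length input, so the fallback "" is used
present_cell <- "Belluno" %|0% ""     # non-empty input is returned unchanged
null_cell    <- NULL %|0% ""          # NULL has length 0 too, so the fallback is used
```

Every field then has exactly one value per row, which is presumably what keeps the row-wise binding further down from choking on rows that lack some cells.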
Now we go row by row, extracting the <td> values we need. This does not get all of them, since I don't need this data, and the result needs cleaning, as we'll see in a bit:
html_nodes(tab, "tr") %>%
  map_df(~{
    list(
      posizione = html_text(html_nodes(.x, xpath=".//td[2]"), trim=TRUE) %|0% "",
      diff_pos = html_text(html_nodes(.x, xpath=".//td[5]"), trim=TRUE) %|0% "",
      provincia = html_text(html_nodes(.x, xpath=".//td[8]"), trim=TRUE) %|0% "",
      punti = html_text(html_nodes(.x, xpath=".//td[11]"), trim=TRUE) %|0% "",
      box1 = html_text(html_nodes(.x, xpath=".//td[14]"), trim=TRUE) %|0% "",
      box2 = html_text(html_nodes(.x, xpath=".//td[17]"), trim=TRUE) %|0% "",
      box3 = html_text(html_nodes(.x, xpath=".//td[20]"), trim=TRUE) %|0% ""
    )
  })
## # A tibble: 113 x 7
## posizione diff_pos provincia punti box1 box2 box3
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Lavoro e Innovazione Giustizia e Sicurezza
## 2 Diff. pos.
## 3 1 3 Belluno 583
## 4 2 -1 Aosta 578 9 63 22
## 5 3 2 Sondrio 574 4 75 1
## 6 4 3 Bolzano 572 2 4 7
## 7 5 -2 Trento 567 8 11 15
## 8 6 4 Trieste 563 6 10 2
## 9 7 9 Verbano-Cusio-Ossola 548 18 73 25
## 10 8 -6 Milano 544 1 2 10
## # ... with 103 more rows
As you can see, it misses some things and has some junk in the header, but you're further along than you were before.
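If it helps, one possible first cleaning pass is sketched below. It isn't something I ran against the real page: `scraped` and `cleaned` are hypothetical names, the toy tibble only imitates the output above (data rows copied from it, with one made-up junk row whose column placement is a guess), and it assumes every real data row has a purely numeric posizione.

```r
library(dplyr)

# Toy stand-in for the scraped result (hypothetical; see the caveats above)
scraped <- tibble(
  posizione = c("", "1", "2", "3"),
  diff_pos  = c("Diff. pos.", "3", "-1", "2"),
  provincia = c("", "Belluno", "Aosta", "Sondrio"),
  punti     = c("", "583", "578", "574"),
  box1      = c("", "", "9", "4"),
  box2      = c("", "", "63", "75"),
  box3      = c("", "", "22", "1")
)

cleaned <- scraped %>%
  filter(grepl("^[0-9]+$", posizione)) %>%          # keep only rows that start with a rank
  mutate(across(-provincia, ~ na_if(.x, ""))) %>%   # blank cells become NA
  mutate(across(-provincia, as.integer))            # ranks and scores become integers
```

That drops the header junk and gives you typed columns, but it does nothing about the cells the XPath positions miss (like the empty boxes on the Belluno row), so you'd still have to go back to the HTML for those.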