RVest: Scraping the text of a website with limited access

Question

I'm currently scraping a news site using rvest. The scraper works, but I only have limited access to the exclusive articles listed on the site. Hence I need a working loop that doesn't stop when certain selectors are not available.

On top of that, I can't find the proper selector to scrape the whole text. Hopefully you can help me with my problem.

library(rvest)
sz_webp <- read_html ("https://www.sueddeutsche.de/news?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&all%5B%5D=time")

# TITLE

title <- sz_webp %>% 
  html_nodes("a em") %>%   
  html_text()

df <- data.frame(title)

# TIME

time <- sz_webp %>% 
  html_nodes("div time") %>%   
  html_text() 

df$time <- time

url <- sz_webp %>% 
  html_nodes("a") %>% html_attr('href')

url <- url[which(regexpr('https://www.sueddeutsche.de/', url) >= 1)]
N <- 58
n_url <- tail(url, -N)

n_url <- head(n_url,-17)

View(n_url)

df$url <- n_url

# LOOP THAT DOESN'T WORK (wrong selector, and it aborts when it hits the problem)

results_df <- lapply(n_url, function(u) { 
  message(u) 

  aktuellerlink <- read_html(u) # reads the respective URL

  text <- aktuellerlink %>% # reads out the year of construction
    html_nodes("div p") %>%
    html_text()

  } %>%

bind_rows()
)
df$text <- results_df

View(df)

Thanks a lot in advance.

jazzurro · Accepted Answer · 2020-02-08T16:25:06

I am not familiar with the website, and I cannot read German. As far as I can tell from your code, you are scraping titles, times, and URLs from sz_webp and then trying to scrape the text behind each URL. I think you can improve your code by targeting specific elements on the page. If you look at the page source, you can identify where they are. There are three specific elements you need to scrape.

library(rvest)
library(tidyverse)

map_dfc(.x = c("em.entrylist__title", "time.entrylist__time"),
        .f = function(x) {read_html("https://www.sueddeutsche.de/news?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&all%5B%5D=time") %>% 
                          html_nodes(x) %>% 
                          html_text()}) %>% 
bind_cols(url = read_html("https://www.sueddeutsche.de/news?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&all%5B%5D=time") %>% 
                html_nodes("a.entrylist__link") %>% 
                html_attr("href")) %>% 
setNames(nm = c("title", "time", "url")) -> temp

temp looks like this. The time column is still messy (the values are padded with newlines and spaces), so you may want to clean it up.

   title                                       time              url                                                                  
   <chr>                                       <chr>             <chr>                                                                
 1 "Immer mehr Corona-Infektionen in China"    "\n    13:23\n"   https://www.sueddeutsche.de/politik/immer-mehr-corona-infektionen-in~
 2 "US-Amerikaner an Corona-Virus gestorben"   "\n    08:59\n"   https://www.sueddeutsche.de/panorama/virus-infektion-us-amerikaner-a~
 3 "Frau eines weiteren Webasto-Mitarbeiters ~ "\n    07.02.202~ https://www.sueddeutsche.de/bayern/coronavirus-bayern-newsblog-muenc~
 4 "Digitale Revolte"                          "\n    07.02.202~ https://www.sueddeutsche.de/politik/china-digitale-revolte-1.4788941 
 5 "Nachrichten kompakt - die Übersicht für E~ "\n    07.02.202~ https://www.sueddeutsche.de/politik/nachrichten-thueringen-kemmerich~
 6 "\"Ich würde mir wünschen, dass die Mensch~ "\n    07.02.202~ https://www.sueddeutsche.de/wirtschaft/webasto-coronavirus-bayern-in~
 7 "Deutschland will weitere Bürger zurückhol~ "\n    07.02.202~ https://www.sueddeutsche.de/politik/coronavirus-deutschland-will-wei~
 8 "Peking wird wegenzur \"Geisterstadt\""     "\n    07.02.202~ https://www.sueddeutsche.de/panorama/angst-vor-corona-peking-wird-we~
 9 "Was bedeutet die Corona-Epidemie für Chin~ "\n    07.02.202~ https://www.sueddeutsche.de/politik/coronavirus-wuhan-li-wenliang-1.~
10 "Virus des Widerstands"                     "\n    07.02.202~ https://www.sueddeutsche.de/politik/china-coronavirus-arzt-1.4788564
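
One possible way to tidy the time column, as a minimal sketch. It assumes the values only carry surrounding newlines and spaces, as in the first two rows of the output above. The small times vector is copied from those rows and stands in for temp$time.

```r
# Sample values copied from the output above, standing in for temp$time
times <- c("\n    13:23\n", "\n    08:59\n")

# trimws() from base R strips leading and trailing whitespace, including newlines
clean_times <- trimws(times)
clean_times
```

On the real data this would be temp$time <- trimws(temp$time). Entries that also contain a date would still need to be parsed separately.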

Then, for each URL, you want to scrape the text. I am not sure how this site works, but I inspected a few pages and found that a single link can show several articles. Is that right? The content sits in div.sz-article__body. Within it, select the <p> elements whose class does not include sz-teaser__summary. That gives you the content you are probably looking for. Here I looped through three links. The first one does not return any text. Maybe this is the case you are describing: content that is not accessible. I hope this is enough for you to make further progress.

map_df(.x = temp$url[1:3],
       .f = function(x){tibble(url = x,
                        text = read_html(x) %>% 
                                html_nodes("div.sz-article__body") %>% 
                                html_nodes("p:not(.sz-teaser__summary)") %>% 
                                html_text() %>% 
                                list
                        )}) %>% 
unnest(text) -> foo

foo

   url                                                        text                                                                    
   <chr>                                                      <chr>                                                                   
 1 https://www.sueddeutsche.de/panorama/virus-infektion-us-a~ "In Wuhan ist ein Amerikaner an einer Corona-Infektion gestorben. Wie d~
 2 https://www.sueddeutsche.de/panorama/virus-infektion-us-a~ "Auch ein Japaner starb nach Einschätzung des Tokioter Außenministerium~
 3 https://www.sueddeutsche.de/panorama/virus-infektion-us-a~ "Bisher sind außerhalb Festland-Chinas zwei Todesfälle infolge einer Co~
 4 https://www.sueddeutsche.de/panorama/virus-infektion-us-a~ "Damit könnte sie in Kürze die weltweit offiziell registrierten 774 Tod~
 5 https://www.sueddeutsche.de/panorama/virus-infektion-us-a~ "Coronavirus"                                                           
 6 https://www.sueddeutsche.de/bayern/coronavirus-bayern-new~ "Freitag, 7. Februar, 19.37 Uhr In Bayern gibt es einen weiteren Corona~
 7 https://www.sueddeutsche.de/bayern/coronavirus-bayern-new~ "Freitag, 7. Februar, 18.19 Uhr: Der Coronavirus-Ausbruch hat den bayer~
 8 https://www.sueddeutsche.de/bayern/coronavirus-bayern-new~ "Freitag, 7. Februar, 15.05 Uhr: Der Verdacht, der bayerische Coronavir~
 9 https://www.sueddeutsche.de/bayern/coronavirus-bayern-new~ "Die bayerischen Fälle gehen alle auf betriebsinterne Schulungen in der~
10 https://www.sueddeutsche.de/bayern/coronavirus-bayern-new~ "Donnerstag, 6. Februar, 13.35 Uhr: In Bayern hat sich eine weitere Fra~
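
If the loop should keep going when a page cannot be read or the selectors match nothing, one option is to wrap the per-page step in purrr::possibly(). The following is only a sketch under stated assumptions: get_paragraphs and safe_get_paragraphs are hypothetical helper names, and the inline HTML strings are made-up stand-ins for real article pages so that the example runs offline.

```r
library(rvest)
library(tidyverse)

# Hypothetical helper: extract the article paragraphs from one page.
# `x` can be a URL or, as in this sketch, a literal HTML string.
get_paragraphs <- function(x) {
  read_html(x) %>%
    html_nodes("div.sz-article__body") %>%
    html_nodes("p:not(.sz-teaser__summary)") %>%
    html_text()
}

# possibly() returns the fallback value instead of stopping on an error
safe_get_paragraphs <- possibly(get_paragraphs, otherwise = NA_character_)

# Made-up inputs: a readable page, a page without the article body,
# and a path that does not exist (read_html() raises an error for it)
pages <- c(
  open   = '<div class="sz-article__body"><p class="sz-teaser__summary">Teaser</p><p>First paragraph.</p><p>Second paragraph.</p></div>',
  locked = '<div class="paywall">Subscribers only</div>',
  broken = "this-file-does-not-exist.html"
)

results <- map_dfr(names(pages), function(id) {
  paragraphs <- safe_get_paragraphs(pages[[id]])
  # No matching nodes gives character(0); store NA so the page keeps a row
  if (length(paragraphs) == 0) paragraphs <- NA_character_
  tibble(page = id, text = paragraphs)
})

results
```

With real data you would iterate over temp$url instead of pages. Pages that fail to load or have no accessible body then show up as rows with NA in text rather than aborting the whole run.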
