I'm new to web scraping with R. I'm trying to scrape the table generated by this link: https://gd.eppo.int/search?k=saperda+tridentata. In this case the table contains just one record, but there could be more. (I'm mainly interested in the first column, but the whole table is fine.)

I tried to follow the suggestion by Allan Cameron given here (rvest, table with thead and tbody tags), as the issue seems to be exactly the same, but with no success, probably because of my limited knowledge of how web pages work. I always get a "no data" table. Maybe I am not correctly following the suggested step "# Get the JSON as plain text from the link generated by Javascript on the page". Where can I find this link? In this case I used "https://gd.eppo.int/media/js/application/zzsearch.js?7"; is this the right one?

My code is below. Thank you in advance!

library(httr)
library(rlist)
library(rvest)
library(jsonlite)
library(dplyr)
library(magrittr)  # provides extract(), used in the pipeline below

pest.name <- "saperda+tridentata"

url <- paste("https://gd.eppo.int/search?k=",pest.name, sep="")
resp <- GET(url) %>% content("text") 

json_url <- "https://gd.eppo.int/media/js/application/zzsearch.js?7"
JSON <- GET(json_url) %>% content("text", encoding = "utf8") 

table_contents <- JSON     %>%
  {gsub("\\\\n", "\n", .)}  %>%
  {gsub("\\\\/", "/", .)}   %>%
  {gsub("\\\\\"", "\"", .)} %>%
  strsplit("html\":\"")    %>%
  unlist                   %>%
  extract(2)               %>%
  substr(1, nchar(.) -2)   %>% 
  paste0("</tbody>")

new_page <- gsub("</tbody>", table_contents, resp)

read_html(new_page)   %>%
  html_nodes("table") %>%
  html_table()

1 Answer

The data comes from a separate endpoint, which you can see in the browser's network tab when you refresh the page. Send a request to that endpoint with your search phrase in the query parameters, then extract the JSON you need from the response.

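library(httr)
library(rvest)     # read_html(), html_node(), html_text() and the %>% pipe
library(jsonlite)

# Query parameters for the search endpoint; 'k' is the search phrase
params <- list(k = 'saperda tridentata', s = 1, m = 1, t = 0)
r <- httr::GET(url = 'https://gd.eppo.int/ajax/search', query = params)

# Parsing the response as HTML leaves the JSON text inside a <p> node
data <- r %>% read_html() %>% html_node('p') %>% html_text() %>% jsonlite::parse_json()
print(data[[1]]$e)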
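
If the search returns more than one record, something along these lines might collect the same field from every record rather than only the first. This is only a sketch: it assumes the parsed JSON is a list of records that each carry the e field printed above, and sample_json with its CODE_A / CODE_B values is a made-up stand-in for the real response text, so the actual payload may be shaped differently.

```r
library(jsonlite)

# Hypothetical stand-in for the JSON text extracted from the response;
# the field values are placeholders, not real search results.
sample_json <- '[{"e": "CODE_A"}, {"e": "CODE_B"}]'

data <- jsonlite::parse_json(sample_json)

# Pull the e field out of every record. vapply() checks that each
# value is a single character string and stops with an error otherwise.
first_column <- vapply(data, function(rec) rec$e, character(1))
print(first_column)
```

With the real response, data would come from the request above instead of sample_json.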