1
votes

I'd like to scrape Amazon customer reviews. My code works fine when no information is missing, but converting the scraped data to a data frame fails as soon as parts of the data are missing (error: arguments imply differing number of rows).

Here is an example:

library(rvest) 

url <- read_html("https://www.amazon.de/product-reviews/3980710688/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=42&sortBy=recent")

get_reviews <- function(url) {

  title <- url %>%
    html_nodes("#cm_cr-review_list .a-color-base") %>%
    html_text()

  author <- url %>%
    html_nodes(".author") %>%
    html_text()

  df <- data.frame(title, author, stringsAsFactors = FALSE)

  return(df)
} 

results <- get_reviews(url)

In this case, "missing" means that several customer reviews have no author information ("Ein Kunde" simply means "A customer" in German), so the author vector ends up shorter than the title vector.
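
To make the failure concrete, here is a minimal sketch with made-up vectors (the titles and authors are placeholders, not scraped data). It only illustrates why data.frame() stops when one column is shorter than the other:

```r
# Made-up vectors: three titles but only two authors, which is what
# happens when one review on the page has no .author element.
title  <- c("Great book", "Not for me", "Okay")
author <- c("Anna", "Ben")

# data.frame() refuses to combine columns whose lengths do not match,
# so capture the error message instead of stopping the script.
msg <- tryCatch(
  data.frame(title, author, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The scraped vectors carry no record of which review each author belongs to, so padding the shorter vector afterwards would not restore the correct pairing.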

Does anyone have an idea on how to fix this? Any help is appreciated. Thanks in advance!

2 Answers

1
votes

I would say the answer to your question is here (link).

Iterate over each 'div[id*=customer_review]' node and, for each one, check whether it contains a value for the author or not.
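
A minimal sketch of that idea, using rvest's html_node() (singular), which returns one result per input node and gives NA where nothing matches. The inline HTML and the .review-title class are assumptions made up for illustration; the real page uses different markup and selectors:

```r
library(rvest)

# Hypothetical markup standing in for the real review page.
# The second review has no .author element.
page <- read_html('
  <div id="customer_review-1">
    <a class="review-title">Great book</a>
    <a class="author">Anna</a>
  </div>
  <div id="customer_review-2">
    <a class="review-title">Not for me</a>
    <span>Ein Kunde</span>
  </div>
')

# One node per review.
reviews <- html_nodes(page, "div[id*=customer_review]")

# html_node() keeps one entry per review, so a review without an
# author yields NA instead of silently shortening the vector.
df <- data.frame(
  title  = html_text(html_node(reviews, ".review-title")),
  author = html_text(html_node(reviews, ".author")),
  stringsAsFactors = FALSE
)
print(df)
```

Because both columns are derived from the same set of review nodes, they always have the same length and each author stays on the row of its own review.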

0
votes

Adapting an approach from the link Nardack provided, I was able to scrape the data with the following code:

library(dplyr)
library(rvest)

get_reviews <- function(node) {

  # Look up title and author inside a single review node
  r.title <- html_nodes(node, ".a-color-base") %>%
    html_text()

  r.author <- html_nodes(node, ".author") %>%
    html_text()

  # Use NA when a field is missing so every review yields exactly one row
  df <- data.frame(
    title = if (length(r.title) == 0) NA_character_ else r.title[1],
    author = if (length(r.author) == 0) NA_character_ else r.author[1],
    stringsAsFactors = FALSE)

  return(df)
}

# Select one node per review, then extract the fields review by review
reviews <- read_html("https://www.amazon.de/product-reviews/3980710688/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=42&sortBy=recent") %>%
  html_nodes("div[id*=customer_review]")
out <- lapply(reviews, get_reviews) %>% bind_rows()