I'm scraping a number of webpages, and I noticed that rvest (read_html followed by html_text) and RSelenium (getPageSource()) return different results.

More specifically, when dropdown menus are involved, html_text only gives you the names of the choices, while RSelenium also gives you the URL of the page that each choice leads to.

My questions are: (1) why do the results differ, and what exactly is the nature of the difference? and (2) is there a way to extract the same source text as RSelenium does, but with something faster, such as the rvest package?

I have tried webdriver, a PhantomJS implementation, as suggested in "rvest vs RSelenium results for text extracting", and its getSource function does return the same result as RSelenium. However, while it is faster than RSelenium, it is still much slower than rvest.

library(rvest)
library(RSelenium)
library(webdriver)
library(tictoc)
library(robotstxt)

test_url <- "https://www.bea.gov"
robotstxt::paths_allowed(test_url)

# rvest
tictoc::tic()
resultA <- html_text(read_html(test_url))
tictoc::toc()

# RSelenium
tictoc::tic()
remDr <- remoteDriver(port = 4445L, browserName = "firefox")
remDr$open()

remDr$navigate(test_url)
# getPageSource() takes no URL; it returns a list whose first element is the HTML
resultB <- remDr$getPageSource()[[1]]
tictoc::toc()

# webdriver
tictoc::tic()
pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)

ses$go(test_url)
resultC <- ses$getSource()
tictoc::toc()

You can see that resultA is different from resultB and resultC. My focus is the part from the word "Tools" onwards, which is the dropdown menu for choosing among the tools that this website provides.

Showing just a small chunk, the "BEARFACTS" entry in rvest is:

BEARFACTS\n                                    \n                                                \n                                    

while in RSelenium it is something like the following:

<li class=\"expanded dropdown\">\n                    <a href=\"https://apps.bea.gov/regional/bearfacts/\">BEARFACTS</a>\n  

Please always check whether scraping a page is permitted by its owner. robotstxt::paths_allowed(test_url) yields FALSE, so you should not use it as an example. - Thomas K
Thanks for the heads up! It's a useful package that I will definitely use. I have changed the example accordingly. - Hong

1 Answer

The difference between RSelenium and rvest is:

  • RSelenium drives a real web browser, so it executes any JavaScript contained in the webpage (JavaScript is often used to load additional HTML elements or data after the initial HTML has loaded).
  • rvest does not run JavaScript, and therefore retrieves the page HTML faster, but it will miss any elements that JavaScript adds after the initial page load.

Some useful tips:

  • When the content you need is already in the initial HTML (that is, it is not loaded by JavaScript), use rvest.
  • When you must use RSelenium, try running the browser headless to improve speed (it loads the page just like a normal browser, but it does not draw any of the graphical elements, so it is faster).
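
Note also that html_text() returns only the text content of the nodes, so attributes such as href never show up in it, even when they are present in the static HTML. The following is a minimal sketch of pulling link targets with rvest alone. It assumes rvest 1.0 or later (for html_elements and html_text2), and the HTML fragment is a hypothetical stand-in modelled on the markup quoted in the question, not the live page. Whether a given site's menu is in the static HTML or is added by JavaScript has to be checked per site.

```r
library(rvest)

# Hypothetical static fragment modelled on the menu markup quoted in the question
page <- minimal_html('
  <ul>
    <li class="expanded dropdown">
      <a href="https://apps.bea.gov/regional/bearfacts/">BEARFACTS</a>
    </li>
  </ul>
')

# Select the anchors inside dropdown items, then read both text and href
links <- html_elements(page, "li.dropdown a")
menu <- data.frame(
  label = html_text2(links),        # visible text only
  url   = html_attr(links, "href"), # attribute that html_text() drops
  stringsAsFactors = FALSE
)
print(menu)
```

If the same selector returns nothing on the real page fetched with read_html(url), the menu is most likely built by JavaScript, and a browser-based tool is needed.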

Example of using RSelenium headless:

eCaps <- list(chromeOptions = list(
  args = c('--headless', '--disable-gpu', '--window-size=1280,800')
))

rD <- rsDriver(
  browser = "chrome",
  verbose = TRUE,
  chromever = "78.0.3904.105",
  port = 4447L,
  extraCapabilities = eCaps
)
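
One way to keep the slow part small is to use the headless browser only to render the page, then hand the resulting HTML to rvest for parsing. The sketch below wraps that in a helper; fetch_rendered_page() is a hypothetical name, not part of RSelenium or rvest, and it assumes Chrome and a matching chromedriver are available on the machine. Extra arguments such as chromever are passed through to rsDriver.

```r
# Sketch only: fetch_rendered_page() is a hypothetical helper, not a library function.
fetch_rendered_page <- function(url, port = 4447L, ...) {
  eCaps <- list(chromeOptions = list(
    args = c("--headless", "--disable-gpu", "--window-size=1280,800")
  ))
  rD <- RSelenium::rsDriver(
    browser = "chrome",
    port = port,
    verbose = FALSE,
    extraCapabilities = eCaps,
    ...
  )
  # Close the browser and stop the Selenium server even if a step below fails
  on.exit({
    rD$client$close()
    rD$server$stop()
  }, add = TRUE)

  rD$client$navigate(url)
  # getPageSource() returns a list; the first element is the rendered HTML
  html <- rD$client$getPageSource()[[1]]
  rvest::read_html(html)
}

# Usage (needs a local Chrome installation, so it is left commented out):
# page <- fetch_rendered_page("https://example.com")
# rvest::html_attr(rvest::html_elements(page, "a"), "href")
```

Starting a browser for every page is expensive, so when scraping many pages it is probably better to create the driver once and reuse rD$client for each navigate call.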