I'm scraping a number of webpages, and I noticed that rvest (read_html, then html_text) and RSelenium (getPageSource()) return different results for the same page.
More specifically, when dropdown menus are involved, html_text only gives the names of the choices, while the RSelenium page source also contains the URL of the page each choice leads to.
My questions are: (1) why do the results differ, and what exactly is the nature of the difference? and (2) is there a way to get the same source text that RSelenium returns, but with something faster, such as the rvest package?
I have tried webdriver, a PhantomJS implementation, per a suggestion from "rvest vs RSelenium results for text extracting", and its getSource function does give the same results as RSelenium. However, while it is faster than RSelenium, it is still much slower than rvest.
library(rvest)
library(RSelenium)
library(webdriver)
library(tictoc)
library(robotstxt)
test_url <- "https://www.bea.gov"
robotstxt::paths_allowed(test_url)
# rvest
tictoc::tic()
resultA <- html_text(read_html(test_url))
tictoc::toc()
# RSelenium
tictoc::tic()
remDr <- remoteDriver(port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(test_url)
# getPageSource() takes no arguments and returns a list; the source is its first element
resultB <- remDr$getPageSource()[[1]]
tictoc::toc()
# webdriver
tictoc::tic()
pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)
ses$go(test_url)
resultC <- ses$getSource()
tictoc::toc()
You can see that resultA differs from resultB and resultC. My focus is the part from the word "Tools" onwards, which is where the website's dropdown menu for choosing among its "Tools" tabs appears.
Showing just a small chunk, the "BEARFACTS" entry in the rvest result is:
BEARFACTS\n \n \n
while in the RSelenium result it looks something like this:
<li class=\"expanded dropdown\">\n <a href=\"https://apps.bea.gov/regional/bearfacts/\">BEARFACTS</a>\n
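To make the difference between the two chunks concrete, here is a minimal sketch that runs on an inline HTML string rather than the live site. The snippet variable is a hypothetical stand-in modelled on the fragment above, the CSS selector is my own guess, and I am assuming rvest 1.0.0 or later for html_elements (older versions would use html_nodes). It only illustrates that html_text drops the markup while html_attr and as.character keep it; it would not help if the menu is built by JavaScript after the page loads, which I have not verified for this site.

```r
library(rvest)

# Hypothetical inline fragment modelled on the dropdown markup shown above;
# nothing is fetched from the network.
snippet <- paste0(
  '<ul><li class="expanded dropdown">',
  '<a href="https://apps.bea.gov/regional/bearfacts/">BEARFACTS</a>',
  '</li></ul>'
)
page <- read_html(snippet)

# Select the anchor inside the dropdown list item (selector is an assumption).
links <- html_elements(page, "li.dropdown a")

# html_text() keeps only the text nodes, so the link target is lost.
link_text <- html_text(links)

# html_attr() reads the attribute that holds the destination URL.
link_href <- html_attr(links, "href")

# as.character() serializes the parsed document back to markup,
# which is closer to what a page-source call returns.
markup <- as.character(page)

print(link_text)
print(link_href)
```

If the links are present in the HTML the server sends, the same html_attr call on read_html(test_url) should recover them without a browser; if they are not, only a JavaScript-capable driver will see them.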
robotstxt::paths_allowed(test_url) yields FALSE, you should therefore not use it as an example. – Thomas K