
There is a database at the Rwanda Cooperative Board; it has about 155 pages of data I'd like to access (without paging through the entire website by hand). I'm having trouble identifying the correct XPath or CSS selector using the rvest package in R. I'm also using the SelectorGadget tool to help identify the correct nodes.

My problem is that I'm getting character(0) back, suggesting that I'm not scraping the correct data.

library(rvest)

url <- 'http://www.rca.gov.rw/wemis/registration/all.php?start=0&status=approved'

html <- read_html(url)

rank_data_html <- html_nodes(html, css = '.primary td')

rank_data <- html_text(rank_data_html)

head(rank_data)

Is there a way to change the code so it cycles through the pages and scrapes the data?

At first glance it looks like you're using the wrong CSS selector. I'm not entirely familiar with the html_nodes function, so I suggest double-checking whether you're making the right selection. For example, my selector for the first cell of the Certificate column is table.primary > tbody:nth-child(1) > tr:nth-child(2) > td:nth-child(4), which suggests that you might be more successful if you use table.primary instead of .primary. - Chris Hartgerink

1 Answer


This has nothing to do with using the wrong selector. The site you're scraping does something super interesting on first access:

[screenshot: captured requests showing the first response setting a cookie before the page reloads]

When you hit the page it sets a cookie then refreshes the page (this is one of the stupidest ways I've seen to force a "session", ever).

Unless you use a proxy server to capture the web requests, you'd never really see this, even in the network tab of your browser's developer tools. You can also see it, though, by looking at what came back from the initial read_html() call you made: it contains just the JavaScript and the redirect.
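For example (purely illustrative, reusing the url from the question), dumping that first response makes the trick visible:

cat(as.character(xml2::read_html(url)))

All you get back is the script that sets the cookie and reloads the page, which is why the original html_nodes() call returned character(0).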

Neither read_html() nor httr::GET() can help you with this directly, since the cookie is set via JavaScript.

BUT! All hope is not lost, and no silly third-party dependency like Selenium or Splash is required (I'm shocked that wasn't already suggested by the resident experts, as it seems to be the default response these days).

Let's get the cookie. Make sure you do this in a FRESH, RESTARTED, NEW R session, since libcurl (which the curl package uses, which httr::GET() uses in turn, and which read_html() ultimately relies on) maintains cookies. We'll lean on that behavior to keep scraping pages, but if anything goes awry you will likely need to start over with a fresh session.
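If restarting R is inconvenient, one option (an assumption on my part, not something the original workflow relies on) is to reset httr's cached handle for the site, which drops its cookies:

httr::handle_reset("http://www.rca.gov.rw")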

library(xml2)
library(httr)
library(rvest)
library(janitor)

# Get access cookie

httr::GET(
  url = "http://www.rca.gov.rw/wemis/registration/all.php",
  query = list(
    start = "0",
    status = "approved"
  )
) -> res

# The body of that first response is just the JavaScript that sets the
# _accessKey2 cookie; pull the value out of the "document.cookie = ..." line
ckie <- httr::content(res, as = "text", encoding = "UTF-8")
ckie <- unlist(strsplit(ckie, "\r\n"))
ckie <- grep("cookie", ckie, value = TRUE)
ckie <- gsub("^document.cookie = '_accessKey2=|'$", "", ckie)

Now, we're going to set the cookie and get our PHP session cookie, both of which will persist afterwards:

httr::GET(
  url = "http://www.rca.gov.rw/wemis/registration/all.php",
  httr::set_cookies(`_accessKey2` = ckie),
  query = list(
    start = "0",
    status = "approved"
  )
) -> res
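A quick sanity check at this point (my own sketch, not part of the original answer) is to parse the table straight out of that response; with the cookie in place, it should come through:

pg1 <- httr::content(res, as = "parsed", encoding = "UTF-8")

rvest::html_table(rvest::html_node(pg1, "table.primary"), header = TRUE, trim = TRUE)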

Now, there are over 400 pages, so we're going to cache the raw HTML in case you parse something wrong and need to re-process the pages. That way you can iterate over local files instead of hitting the site again. To do this we'll make a directory for them:

dir.create("rca-temp-scrape-dir")

Now, create the pagination start numbers:

pgs <- seq(0L, 8920L, 20) # start= offsets: 0, 20, 40, ..., 8920 (20 rows per page)

And, iterate over them. NOTE: I don't need all 400+ pages so I just did 10. Remove the [1:10] to get them all. Also, unless you like hurting other people, please keep the sleep in since you don't pay for the cpu/bandwidth and that site is likely very fragile.

lapply(pgs[1:10], function(pg) { 

  Sys.sleep(5) # Please don't hammer servers you don't pay for

  httr::GET(
    url = "http://www.rca.gov.rw/wemis/registration/all.php",
    query = list(
      start = pg,
      status = "approved"
    )
  ) -> res

  # YOU SHOULD USE httr FUNCTIONS TO CHECK FOR STATUS
  # SINCE THERE CAN BE HTTR ERRORS THAT YOU MAY NEED TO 
  # HANDLE TO AVOID CRASHING THE ITERATION
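  # One minimal way to do that (my addition; adapt the error handling as needed):
  httr::stop_for_status(res)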

  out <- httr::content(res, as = "text", encoding = "UTF-8")

  # THIS CACHES THE RAW HTML SO YOU CAN RE-SCRAPE IT FROM DISK IF NECESSARY  

  writeLines(out, file.path("rca-temp-scrape-dir", sprintf("rca-page-%s.html", pg)))

  out <- xml2::read_html(out)
  out <- rvest::html_node(out, "table.primary")
  out <- rvest::html_table(out, header = TRUE, trim = TRUE)

  janitor::clean_names(out) # makes better column names

}) -> recs

Finally, we'll combine those 10 data frames (one per page, 20 rows each) into one:

recs <- do.call(rbind.data.frame, recs)

str(recs)
## 'data.frame': 200 obs. of  9 variables:
##  $ s_no       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ code       : chr  "BUG0416" "RBV0494" "GAS0575" "RSZ0375" ...
##  $ name       : chr  "URUMURI RWA NGERUKA" "BADUKANA IBAKWE NYAKIRIBA" "UBUDASA COOPERATIVE" "KODUKB" ...
##  $ certificate: chr  "RCA/0734/2018" "RCA/0733/2018" "RCA/0732/2018" "RCA/0731/2018" ...
##  $ reg_date   : chr  "10.12.2018" "-" "10.12.2018" "07.12.2018" ...
##  $ province   : chr  "East" "West" "Mvk" "West" ...
##  $ district   : chr  "Bugesera" "Rubavu" "Gasabo" "Rusizi" ...
##  $ sector     : chr  "Ngeruka" "Nyakiliba" "Remera" "Bweyeye" ...
##  $ activity   : chr  "ubuhinzi (Ibigori, Ibishyimbo)" "ubuhinzi (Imboga)" "transformation (Amasabuni)" "ubworozi (Amafi)" ...

If you're a tidyverse user you can also just do:

purrr::map_df(pgs[1:10], ~{

  Sys.sleep(5) 

  httr::GET(
    url = "http://www.rca.gov.rw/wemis/registration/all.php",
    httr::set_cookies(`_accessKey2` = ckie),
    query = list(
      start = .x,
      status = "approved"
    )
  ) -> res

  out <- httr::content(res, as = "text", encoding = "UTF-8")
  writeLines(out, file.path("rca-temp-scrape-dir", sprintf("rca-page-%s.html", .x)))

  out <- xml2::read_html(out)
  out <- rvest::html_node(out, "table.primary")
  out <- rvest::html_table(out, header = TRUE, trim = TRUE)

  janitor::clean_names(out)

}) -> recs

instead of the lapply/do.call/rbind.data.frame approach; purrr::map_df() row-binds the per-page data frames for you.
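And since the raw HTML is cached, you can re-parse everything from disk later without touching the site again. A sketch, assuming the files written above are still sitting in rca-temp-scrape-dir:

cached <- list.files("rca-temp-scrape-dir", pattern = "\\.html$", full.names = TRUE)

lapply(cached, function(f) {

  out <- xml2::read_html(f)
  out <- rvest::html_node(out, "table.primary")
  out <- rvest::html_table(out, header = TRUE, trim = TRUE)

  janitor::clean_names(out)

}) -> recs_from_cache

recs_from_cache <- do.call(rbind.data.frame, recs_from_cache)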