Webscraping content across multiple pages using rvest package

Question

I am a very novice R programmer, but I have been attempting to do some webscraping off of the website of an online university using the rvest package. The first table of information I scraped from the webpage was a listing of all of the doctoral level program offered. Here is my code:

library(xml2)
library(httr)
library(rvest)
library(selectr)

Scraping Capella Doctoral

fileUrl <- read_html("http://www.capella.edu/online-phd-programs/")

Using the selector gadget tool in chrome, I was able to select the content on the site I wanted to extract. In this case, I am selecting all doctoral level programs.

Degrees <- fileUrl %>%
html_nodes(".accordianparsys a") %>%
html_text() 
Degrees

Next, I created a data frame of the doctoral level degrees.

Capella_Doctoral = data.frame(Degrees)

Below I am creating another column that flags these programs as coming from Capella.

Capella_Doctoral$SchoolFlag <- "Capella" 
View(Capella_Doctoral)

Everything seems to work great in my code above. However, the next type of information I would like to scrape is tuition cost and credit hours per doctoral program. This information exists on each individual doctoral program's page. For example, the PhD in Leadership program will contain the tuition and credit hour information on this page "http://www.capella.edu/online-degrees/phd-leadership/". The DBA in Accounting program will contain tuition and credit hour information on this page "http://www.capella.edu/online-degrees/dba-accounting/". The common theme among the various pages is that it includes the name of the program after "online-degrees/".

In order to create a list of the various web pages I need (those that include the doctoral program names), I developed the code below.

Formatting the doctoral degrees into lowercase, removing any leading and trailing whitespace, and then replacing any spaces with dashes

Lowercase <- tolower(Capella_Doctoral$Degrees) 
Lowercase

Removing leading and trailing whitespace

trim <- function (x) gsub("^\\s+|\\s+$", "", x)
Trim <- trim(Lowercase)
Trim

replacing spaces with dashes

Dashes <- gsub(" ", "-", Trim)
Dashes
Dashes2 <- gsub("---", "-", Dashes)
Dashes2

Next, I add the reformatted doctoral degrees to the end of the below url to get a listing of all of the possible urls I need to scrape information from about the tuition and credits hours for each program

urls <- rbindlist(sapply(Dashes2, function(x) {
    url <- paste("http://www.capella.edu/online-degrees/",x,"/", sep="")
    data.frame(url)
}), fill=TRUE)
Spec_URLs <- data.frame(urls)
View(Spec_URLs)

Now that I have a listing of all of the urls I need to scrape information from, I need to know how I can use the below function for each of the urls. The code below is only extracting tuition and credit hour info for one of the URLs. How do I get it to loop through all of the URLS? My end goal is to get a table of all of the tuition and credit hour information for each doctoral program into a data frame.

fileUrl <- read_html("http://www.capella.edu/online-degrees/phd-leadership/")

Tuition <- fileUrl %>%
   html_nodes("p:nth-child(4) strong , .tooltip~ strong") %>%
   html_text() 
Tuition

Results: Tuition [1] "120 Credits" "$4,665 per quarter"

Carl Boneri Carl Boneri · Accepted Answer · 2016-03-30T16:18:55

This is a quick and dirty....and I'm hoping it won't create more questions than answers. Essentially this function grabs all of the individual urls linking to departments...and then performs the same series on each returning one aggregated data object. In our case a data frame with 82 rows. If you wanted to clean this up you could reformat the columns and clean the NA's up a bit. Hope it works for you.

library(rvest)
library(stringi)
library(htmltools)
library(plyr)
library(dplyr)
library(DT)


# This is a helper function I threw on top..
txt.safe_text <- function(x){
  str_in <- iconv(x, "latin1", "ASCII", sub="")  %>%  stri_enc_toutf8()
  str_in %>%
    stri_replace_all_fixed('<U+0080><U+0093>',"'\\-'") %>%
    stri_enc_toascii %>% htmlEscape %>%
    stri_unescape_unicode %>%
    stri_replace_all_regex("\\032\\032\\032","-")%>%
    stri_replace_all_regex("\n","")
}




# Heres the iterator. I gave it zero args for purposes of the concept but you
# could add varible urls or filtering functions

parse.apella <- function(){


  # html() was deprecated but I use the older version of rvest so set the new name
  # to an alias for reproduction.
  read_html <- html


  # This is our index table. We are going to use this as a key to then qry all
  # other site info but keep a backref to the school variable and url
  idx_df <-
    lapply(read_html("http://www.capella.edu/online-phd-programs/") %>%
             html_nodes(".accordianparsys a"),function(i)
               data.frame(focus = html_text(i),
                          link = paste0("http://www.capella.edu", html_attr(i,"href"))
                          )) %>% rbind.pages

  # Threw this in for use case later with rendering a datatable and then being able to
  # jump straight to the site you are referencing.

  idx_df$html_output <- sapply(1:nrow(idx_df),function(i)
    htmltools::HTML(paste0(sprintf('<a href="%s">%s</a>',idx_df[i,2],idx_df[i,1]))))


  # Ok...so... for every index in our idx_df table above we are going to:
  # read site > parse the p html tags > pass a text cleaning function >
  # replace the leftovers eg:'\t' > split the string on the new line '\n'
  # character for easier user in building a data frame later > filter out all
  # returned data that has a character length of less than  or equal to 2 >
  # create a data frame with a filtering column in our loop.

  # Note: this is going to get the data for I think 84 websites..so give it a second
  # to run.

  A <- llply(1:nrow(idx_df),function(ii)
    lapply(read_html(idx_df[[2]][[ii]]) %>%
             html_nodes(".gernic_large_text > p") %>%
             html_text %>% txt.safe_text %>%
             stri_replace_all_regex("\t","\n") %>%
             strsplit("\n"),function(i)
               stri_split_regex(i,"  ") %>% unlist %>%
             data.frame(raw_txt = .) %>% filter(nchar(raw_txt)>2) %>%
             mutate(df_idx = 1:length(raw_txt),
                    school_name = idx_df[[1]][[ii]],
                    html_link = idx_df[[3]][[ii]])
    )
  )


  # Above we built a list of data frames...and the rule we know is that any information
  # we are interested in would produce at least two rows of data as we split
  # our raw html on the new line character. This means any data frame in our list
  # with 1 row is non-imporant but was easier to filter out than parse out earlier.
  # So we remove all those data frames with only 1 row.
  CC <- lapply(1:length(A),function(i)A[[i]][mapply(nrow,A[[i]]) == 2] %>% rbind.pages)


  # Helper function for looping through. I shouldn't have used numbers for the column names
  # but i'm just slapping this together.
  # This is going to essentially go through our data frames and transpose the structure
  # so that our final product is a wide data structure rather than a long.

  trans_df <- function(df_in = NULL,i){
    tmp_d <-
      as.data.frame(
        t(c(df_in[[i]][df_in[[i]][[2]] == 2,4][[1]],
            df_in[[i]][df_in[[i]][[2]] == 2,3][[1]],
            df_in[[i]][df_in[[i]][[2]] == 2,1]))
      )

    colnames(tmp_d) <-  c('html_link','school name',df_in[[i]][df_in[[i]][[2]] == 1,1])
    tmp_d
  }


  #  For every index in our list we're going to transpose our structures
  # And do some text cleaning and splitting
  all_dat <- ldply(1:length(CC),function(i)trans_df(df_in = CC,i)) %>%
    mutate(short_name = stri_extract_first_words(`school name`),
           Cost =
             ifelse(!is.na(Cost),
                    stri_extract_first_words(Cost),
                    'Not Listed')
           ) %>% mutate(program =
                   stri_replace_all_regex(
                     `school name`,
                     paste0('(',short_name,'| - )'),"") %>%
                   stri_trim_both) %>%
    mutate(next_session = as.Date(strptime(`Next Start Date`,"%b. %d,%Y"))) %>%
    mutate(Cost = as.numeric(gsub(",","",Cost))) %>% 
  select(html_link,
         short_name,
         program,
         cost = Cost,
         credit_hours = `Transfer Credits`,
         next_session,
         total_credits = `Total Quarter Credits`,
         session_length = `Course Length`)

  # Quick thing I noticed on the credit hours. Loop back over and
  # grab only the numeric values
  all_dat$credit_hours <-
    lapply(all_dat$credit_hours,function(i)
      stri_extract_all_regex(i,"[[:digit:]]") %>%
        unlist %>% paste0(collapse = "") %>% as.numeric) %>%
    unlist


  # Should be done
  return(all_dat)
}



rock.apella <- parse.apella()

str(rock.apella)
# 'data.frame':  82 obs. of  8 variables:
# $ html_link     : chr  "<a href=\"http://www.capella.edu/online-degrees/phd-leadership\">PHD - Leadership </a>"| __truncated__ ...
# $ short_name    : chr  "PHD" "PHD" "PHD" "PHD" ...
# $ program       : chr  "Leadership" "Information Technology Education" "General Information Technology" "Information Assurance and Security" ...
# $ cost          : num  4665 4665 4665 4665 4665 ...
# $ credit_hours  : num  32 32 48 32 32 32 32 32 48 32 ...
# $ next_session  : Date, format: "2016-04-11" "2016-04-11" "2016-04-11" "2016-04-11" ...
# $ total_credits : chr  "120 Credits" "120 Credits" "120 Credits" "120 Credits" ...
# $ session_length: chr  "10 weeks" "10 weeks" "10 weeks" "10 weeks" ...

DT::datatable(rock.apella,escape = F, options = list(searchHighlight = TRUE), filter = 'top')

Here's our final output

And the output in jsfiddle https://jsfiddle.net/cbfas/0x37vudv/1/

Webscraping content across multiple pages using rvest package

1 Answers