I am a very novice R programmer, but I have been attempting to do some web scraping from the website of an online university using the rvest package. The first table of information I scraped from the webpage was a listing of all of the doctoral-level programs offered. Here is my code:
library(xml2)
library(httr)
library(rvest)
library(selectr)
# Scraping Capella Doctoral
fileUrl <- read_html("http://www.capella.edu/online-phd-programs/")
Using the SelectorGadget tool in Chrome, I was able to select the content on the site I wanted to extract. In this case, I am selecting all doctoral-level programs.
Degrees <- fileUrl %>%
html_nodes(".accordianparsys a") %>%
html_text()
Degrees
Next, I created a data frame of the doctoral level degrees.
Capella_Doctoral = data.frame(Degrees)
Below I am creating another column that flags these programs as coming from Capella.
Capella_Doctoral$SchoolFlag <- "Capella"
View(Capella_Doctoral)
Everything seems to work great in my code above. However, the next type of information I would like to scrape is the tuition cost and credit hours for each doctoral program. This information exists on each individual doctoral program's page. For example, the PhD in Leadership program has its tuition and credit hour information on this page: "http://www.capella.edu/online-degrees/phd-leadership/". The DBA in Accounting program has its tuition and credit hour information on this page: "http://www.capella.edu/online-degrees/dba-accounting/". The common theme among the various pages is that each URL includes the name of the program after "online-degrees/".
In order to create a list of the various web pages I need (those that include the doctoral program names), I developed the code below.
# Formatting the doctoral degrees into lowercase, removing any leading and trailing whitespace, and then replacing any spaces with dashes
Lowercase <- tolower(Capella_Doctoral$Degrees)
Lowercase
# Removing leading and trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
Trim <- trim(Lowercase)
Trim
# Replacing spaces with dashes
Dashes <- gsub(" ", "-", Trim)
Dashes
Dashes2 <- gsub("---", "-", Dashes)
Dashes2
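The formatting steps above could also be folded into a single helper. This is only a sketch: `make_slug` is a hypothetical name of my own, and the slug it produces is not guaranteed to match the address the site actually uses for a given program.

```r
# Hypothetical helper: turn a program name into a URL-style slug.
make_slug <- function(x) {
  x <- tolower(trimws(x))     # lowercase and strip leading/trailing whitespace
  x <- gsub("\\s+", "-", x)   # replace each run of whitespace with a dash
  gsub("-+", "-", x)          # collapse repeated dashes such as "---"
}

make_slug(c("  PhD in Leadership ", "DBA - Accounting"))
# [1] "phd-in-leadership" "dba-accounting"
```

Because every step is vectorised, the helper can be applied to the whole Degrees column at once, e.g. `make_slug(Capella_Doctoral$Degrees)`.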
Next, I add the reformatted doctoral degrees to the end of the URL below to get a listing of all of the possible URLs I need to scrape for each program's tuition and credit hours.
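library(data.table)  # provides rbindlist()
# lapply() keeps each result as a one-row data frame, which is what rbindlist() expects
urls <- rbindlist(lapply(Dashes2, function(x) {
  url <- paste("http://www.capella.edu/online-degrees/", x, "/", sep = "")
  data.frame(url)
}), fill = TRUE)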
Spec_URLs <- data.frame(urls)
View(Spec_URLs)
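Since `paste0()` is vectorised, the same listing could probably be built without a loop at all. A minimal sketch, where `build_urls` and its `base` argument are hypothetical names introduced here for illustration:

```r
# Hypothetical helper: build one program URL per slug in a single vectorised call.
build_urls <- function(slugs, base = "http://www.capella.edu/online-degrees/") {
  data.frame(url = paste0(base, slugs, "/"), stringsAsFactors = FALSE)
}

build_urls(c("phd-leadership", "dba-accounting"))
```

This returns a data frame with one `url` column, so something like `Spec_URLs <- build_urls(Dashes2)` would stand in for the steps above.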
Now that I have a listing of all of the URLs I need to scrape, I need to know how to apply the code below to each of them. As written, it only extracts the tuition and credit hour information for one URL. How do I get it to loop through all of the URLs? My end goal is a single data frame containing the tuition and credit hour information for every doctoral program.
fileUrl <- read_html("http://www.capella.edu/online-degrees/phd-leadership/")
Tuition <- fileUrl %>%
html_nodes("p:nth-child(4) strong , .tooltip~ strong") %>%
html_text()
Tuition
Results:
Tuition
[1] "120 Credits" "$4,665 per quarter"
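For reference, here is a minimal sketch of the kind of loop I have in mind. It makes several assumptions that I have not verified: that the same CSS selector works on every program page, that the credits come back first and the tuition second (as in the result above), and that some of the generated URLs may not exist. The names `scrape_program` and `scrape_all` are hypothetical.

```r
library(rvest)

# Hypothetical helper: scrape the credit and tuition strings from one program page.
# Assumes the selector used for phd-leadership also applies to the other pages.
scrape_program <- function(url) {
  tryCatch({
    values <- read_html(url) %>%
      html_nodes("p:nth-child(4) strong , .tooltip~ strong") %>%
      html_text()
    data.frame(url = url,
               Credits = if (length(values) >= 1) values[1] else NA_character_,
               Tuition = if (length(values) >= 2) values[2] else NA_character_,
               stringsAsFactors = FALSE)
  }, error = function(e) {
    # A generated URL may not exist; keep the row but leave the values missing.
    data.frame(url = url, Credits = NA_character_, Tuition = NA_character_,
               stringsAsFactors = FALSE)
  })
}

# Apply the helper to every URL and stack the one-row data frames.
scrape_all <- function(urls) {
  do.call(rbind, lapply(urls, scrape_program))
}

# Tuition_Table <- scrape_all(as.character(Spec_URLs$url))
```

Wrapping each request in `tryCatch()` means one bad URL produces a row of NA values instead of stopping the whole run.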