0
votes

I am trying to scrape fantasy football player salary data from rotoguru1.com. A sample page is here: http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2014&game=dk&scsv=1. The data is conveniently available in scsv (semicolon-separated) format on each page, inside the HTML "pre" tag.

I first use a for loop to generate all the URLs I want to scrape. I then use a second for loop to iterate over the URLs, call read_html() on each page, and extract the data of interest with html_nodes('pre') %>% html_text().

The problem is that, as written, this returns the entire scsv for each page as a single string, rather than as rows in a table with individual columns (week, year, gid, name, pos, team, h/a, opt, dk points, dk salary). What I want is one data table with these columns, containing the data from all of the pages. I do not have much experience with web scraping and do not know how to resolve this. Any help would be greatly appreciated. Below is the code I have written so far:

library(purrr) 
library(rvest)
library(data.table)
library(stringr)
library(tidyr)


#Declare variables and empty data tables
path1<-("http://rotoguru1.com/cgi-bin/fyday.pl?week=")
seasons<-c("2014", "2015", "2016","2017","2018","2019","2020")
weeks<-1:17
result<-NULL
temp<-NULL

#Use nested for loops to get the url, season, and week for each webpage of interest, store in result data table
for(s in 1:length(seasons)){
  for(w in 1:length(weeks)){
    temp<- paste0(path1, as.character(w),"&year=",seasons[s],"&game=dk&scsv=1")
    result<-rbind(result,temp)
  }
}

#Get rid of any potential empty values from result
result<-compact(result) 

final<-data.table()
#Create final data table with all salary information
for (i in 1:length(result)){
  page<-read_html(result[i])
  data<-page%>%html_nodes("pre")%>%html_text()
  final<-rbind(data,final)
  
}


2 Answers

0
votes

I believe your entire code, from the first for-loop onward, can be replaced with the following (mostly data.table) solution:

result <- CJ(seasons, weeks)[, paste0(path1, weeks, "&year=", seasons, "&game=dk&scsv=1") ]
#loop over result
final <- data.table::rbindlist(
  lapply( result, function(x) {
    read_html(x) %>%
      html_nodes("pre") %>% 
      html_text() %>%
      data.table::fread( sep = ";" ) # <-- parse the semicolon-separated text into a table
    } ),
  use.names = TRUE, fill = TRUE )
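
To see what the fread( sep = ";" ) step does to the single string returned by html_text(), here is a small offline sketch. The sample text is made up for illustration: the header only mimics the columns listed in the question, the player rows are invented, and parse_scsv is a hypothetical helper name rather than anything from rvest or data.table.

```r
library(data.table)

# Made-up sample mimicking the semicolon-separated text inside <pre>.
# The header follows the columns listed in the question; the rows are invented.
sample_text <- paste(
  "Week;Year;GID;Name;Pos;Team;h/a;Oppt;DK points;DK salary",
  "1;2014;1001;Player, A;QB;aaa;h;bbb;20.5;7000",
  "1;2014;1002;Player, B;RB;bbb;a;aaa;12.3;5500",
  sep = "\n"
)

# Hypothetical helper: parse one page's text into a data.table.
parse_scsv <- function(txt) {
  fread(text = txt, sep = ";", header = TRUE)
}

# Stand-in for the text scraped from several URLs.
pages <- list(sample_text, sample_text)

# Parse each page and stack the results into one table.
final <- rbindlist(lapply(pages, parse_scsv), use.names = TRUE, fill = TRUE)
print(dim(final))
```

In the real loop, each element of pages would be the html_text() result for one URL. Passing the string through the text argument makes the intent explicit instead of relying on fread to detect that its input is data rather than a file name.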
0
votes

The page can also return the data as an HTML table, so instead of "&game=dk&scsv=1" in your loop you can use "&game=dk".

Then just parse the table with html_table().

Here is an example for one page:

page<-read_html(result[1])

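# the data is in the ninth <table> node on the page
x<-data.frame(page%>%html_nodes("table") %>% `[`(9) %>% html_table(header = TRUE))
# promote the first row to column names, then drop it
colnames(x)  <- as.character(x[1,])
x <- x[-1,]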
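
The header-promotion step above can be tried offline on a small stand-in. This is only a sketch: the data frame is invented (it is not the real table from the site), and the final type.convert() call is an optional extra that is not part of the answer.

```r
# Invented stand-in for a scraped table whose header arrived as the first data row.
x <- data.frame(
  X1 = c("Name", "Player, A", "Player, B"),
  X2 = c("Salary", "7000", "5500"),
  stringsAsFactors = FALSE
)

# Use the first row as column names, then drop it.
colnames(x) <- as.character(unlist(x[1, ]))
x <- x[-1, ]
rownames(x) <- NULL

# Optional: every column is still character at this point; convert where possible.
x[] <- lapply(x, type.convert, as.is = TRUE)
str(x)
```

Using unlist() before as.character() keeps the column names correct even if a column happens to be stored as a factor.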