0
votes

I'm trying to scrape HTML tables for different football teams. Below is the table I want to scrape for one team. I want to scrape the same table for every team and ultimately create a single CSV file containing the player names and their data.

http://www.pro-football-reference.com/teams/tam/2016_draft.htm

# teams
teams <- c("ATL", "TAM", "NOR", "CAR", "GNB", "DET", "MIN", "CHI", "SEA", "CRD", "RAM", "NWE", "MIA", "BUF", "NYJ", "KAN", "RAI", "DEN", "SDG", "PIT", "RAV", "SFO", "CIN", "CLE", "HTX", "OTI", "CLT", "JAX", "DAL", "NYG", "WAS", "PHI")

# loop
for(i in teams) {
  url <-paste0("http://www.pro-football-reference.com/teams/", i,"/2016-snap-counts.htm#snap_counts::none", sep="")
  webpage <- read_html(url)

  # grab table
  sb_table <- html_nodes(webpage, 'table')
  html_table(sb_table)
  head(sb_table)
  # bind to dataframe
  df <- rbind(df, sb_table)
}

I'm getting an error saying that I should use CSS or XPath, not both, but I can't figure out exactly where the problem is (I suspect the html_nodes command). Can anyone help me fix this?

2
Where is df from? - Cyrus Mohammadian
Based on your example URL, should the abbreviations in teams not be lower-case? - neilfws
You will need to define df<-data.frame() outside your loop or you will overwrite it on each iteration. - Dave2e
"Except as specifically provided in this paragraph, you agree not to use or launch any automated system, including without limitation, robots, spiders, offline readers, or like devices, that accesses the Site in a manner which sends more request messages to the Site server in any given period of time than a typical human would normally produce in the same period by using a conventional on-line Web browser to read, view, and submit materials. SRL reserves the right to revoke the exceptions granted in this paragraph" - hrbrmstr

2 Answers

1
votes

I think your URLs are built incorrectly and, in addition, that the team names in the URL are case-sensitive. Could you try something like this instead?

library(rvest)
library(magrittr)

# teams
teams <- c("ATL", "TAM", "NOR", "CAR", "GNB", "DET", "MIN", "CHI", "SEA", "CRD", "RAM", "NWE", "MIA", "BUF", "NYJ", "KAN", "RAI", "DEN", "SDG", "PIT", "RAV", "SFO", "CIN", "CLE", "HTX", "OTI", "CLT", "JAX", "DAL", "NYG", "WAS", "PHI")

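tables <- list()
for (team in teams) {
  # try() lets the loop continue if one page fails to download or parse
  try({
    url <- paste0("http://www.pro-football-reference.com/teams/", tolower(team), "/2016_draft.htm")
    # html_table() on a whole document returns a list of data frames, one per <table>
    page_tables <- url %>%
      read_html() %>%
      html_table(fill = TRUE)

    # keep the first table on the page, stored under the team code
    tables[[team]] <- page_tables[[1]]
  })
}

# stack the per-team tables into one data frame
df <- do.call("rbind", tables)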
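Since the goal is a single CSV, it may help to tag each table with its team before binding, so the rows can still be told apart afterwards. Here is a minimal sketch in base R that uses small made-up data frames instead of the scraped ones. The column names `Player` and `Pos` and the file name `draft_2016.csv` are assumptions for illustration, not taken from the site:

```r
# Stand-in for the list built by the loop: one data frame per team.
# The columns and values here are invented for illustration only.
tables <- list(
  tam = data.frame(Player = c("A", "B"), Pos = c("CB", "DE"), stringsAsFactors = FALSE),
  atl = data.frame(Player = "C", Pos = "S", stringsAsFactors = FALSE)
)

# Add a Team column to each table, taken from the list names
tagged <- Map(function(tbl, team) {
  tbl$Team <- team
  tbl
}, tables, names(tables))

# Stack everything and drop the automatic row names
df <- do.call(rbind, tagged)
rownames(df) <- NULL

# Write the combined data to a single CSV file (hypothetical file name)
out_file <- file.path(tempdir(), "draft_2016.csv")
write.csv(df, out_file, row.names = FALSE)
```

This assumes the list is named by team. If it is not, the team codes would have to be supplied separately, for example from the `teams` vector.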

PS: I do not understand why this question is downvoted. It seems well formulated ...

0
votes

I think the appropriate CSS selector in this case is #snap_counts. Also if there is one table per page, you can use html_node() (singular, not nodes):

url %>% 
  read_html() %>% 
  html_node("#snap_counts") %>% 
  html_table(header = FALSE)

Since the table has two header rows and some header cells span columns, it's probably best to use header = FALSE. The first 2 rows of the data frame will contain the headers and you can clean up manually (create your own column names).
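The manual cleanup could look roughly like the sketch below. It uses a small invented data frame in place of the real `html_table(header = FALSE)` result, so the values and the group label `Off.` are assumptions. The idea is to paste the two header rows together into one name per column and then drop those rows:

```r
# Stand-in for what html_table(header = FALSE) might return:
# rows 1 and 2 hold the two header rows, the rest is data.
# All values are invented for illustration.
raw <- data.frame(
  X1 = c("", "Player", "A", "B"),
  X2 = c("", "Pos", "QB", "WR"),
  X3 = c("Off.", "Num", "60", "45"),
  X4 = c("Off.", "Pct", "95%", "71%"),
  stringsAsFactors = FALSE
)

# Combine the two header rows into one name per column,
# e.g. "Off." + "Num" becomes "Off._Num"; empty group labels are skipped
top <- unlist(raw[1, ], use.names = FALSE)
bottom <- unlist(raw[2, ], use.names = FALSE)
new_names <- ifelse(top == "", bottom, paste(top, bottom, sep = "_"))

# Drop the header rows and apply the new names
clean <- raw[-(1:2), ]
names(clean) <- new_names
rownames(clean) <- NULL
```

All columns are still character at this point, so numeric columns would need an explicit conversion (for example with `as.numeric()`) once any non-numeric characters such as `%` are removed.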