I'm trying to scrape some high school sports scores from tables on this site but the rvest html_table() function returns nothing...just an empty list. The data seems to clearly sit within a table tag so I thought this would be pretty straightforward, but not so.

html_data <- read_html("https://highschoolsports.nj.com/boysbasketball/schedule/2020/01/09")
html_data %>% html_table(html_data)

Any help or advice as to how to extract this table would be greatly appreciated!

You're missing a step or two in that second line: extract the node that contains the table you want from the full page, then pass that particular node to html_table. – ulfelder
Also, it looks like that table is rendered with JavaScript, which makes scraping harder. You'll probably need to dig into that topic to get where you want to go. – ulfelder

1 Answer

The table you see is built dynamically using JavaScript. The page sends an XHR request for a JSON file that contains all the data in the table (plus a lot more data that you can't see).

What you need to do is request the JSON file, parse it, and extract the elements you want. The following script will do it for you:

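library(tidyverse)
library(httr)
library(rjson)

# Build the URL of the JSON endpoint that the page requests via XHR
"https://highschoolsports.nj.com/siteapi/games/schedule" %>%
  modify_url(query = list(viewStart      = "1/9/2020",
                          sportId        = "15",
                          schoolId       = "",
                          scheduleYearId = ""))            %>%
  GET()                                                    %>%  # fetch the JSON
  content("text")                                          %>%  # response body as a string
  fromJSON()                                               %>%  # parse into nested lists
  `[[`("games")                                            %>%  # keep the list of games
  lapply(function(x) data.frame(x$gameDate, x$name))       %>%  # one-row data frame per game
  {do.call("rbind", .)}                                    %>%  # stack the rows
  as_tibble()                                              ->
  result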

print(result)
#> # A tibble: 324 x 2
#>    x.gameDate          x.name                                            
#>    <fct>               <fct>                                             
#>  1 2020-01-09T00:00:00 Manville (43) at Pingry (77)                      
#>  2 2020-01-09T00:00:00 Eastern (41) at Cherokee (54)                     
#>  3 2020-01-09T00:00:00 Woodbridge (31) at Colonia (54)                   
#>  4 2020-01-09T00:00:00 Phillipsburg (64) at Bridgewater-Raritan (71)     
#>  5 2020-01-09T05:30:00 Asbury Park (44) at Point Pleasant Beach (50)     
#>  6 2020-01-09T07:00:00 Montclair Immaculate (78) at Newark East Side (49)
#>  7 2020-01-09T15:45:00 Christian Brothers (67) at Howell (62)            
#>  8 2020-01-09T16:00:00 West Caldwell Tech (59) at Weequahic (60)         
#>  9 2020-01-09T16:00:00 Scotch Plains-Fanwood (20) at Westfield (55)      
#> 10 2020-01-09T16:00:00 Summit (59) at Cranford (44)                      
#> # ... with 314 more rows

If you dig around in the JSON, it is easy to get the individual scores and other fields. To get them as separate data frame columns, change the function passed to lapply so that it selects the elements you want.
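
If you would rather not explore the JSON structure, one possible alternative is to split the name strings shown above into teams and scores. This is only a sketch: it assumes every finished game follows the "Away (n) at Home (n)" layout seen in the printed output, and the variable and column names (game_names, away, away_score, home, home_score) are made up for illustration. The sample strings are copied from the output above so the snippet runs without a network request.

```r
# Sample values copied from the printed output above
game_names <- c("Manville (43) at Pingry (77)",
                "Eastern (41) at Cherokee (54)",
                "Scotch Plains-Fanwood (20) at Westfield (55)")

# Assumed layout: "<away team> (<score>) at <home team> (<score>)"
pattern <- "^(.*) \\(([0-9]+)\\) at (.*) \\(([0-9]+)\\)$"

# regexec() returns the full match plus the four capture groups per string
parts <- regmatches(game_names, regexec(pattern, game_names))

# Build one row per game; strings that do not match the pattern yield NA
scores <- do.call(rbind, lapply(parts, function(p) {
  data.frame(away       = p[2],
             away_score = as.integer(p[3]),
             home       = p[4],
             home_score = as.integer(p[5]),
             stringsAsFactors = FALSE)
}))

print(scores)
```

To apply this to the real data, you could replace game_names with as.character(result$x.name). Games that have not been played yet, or whose names are formatted differently, would come out as rows of NA and may need separate handling.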