0
votes

Our professor keeps giving us assignments to work on in R, but instead of giving us clean data, we normally have to pull it from the web.

This block of code does that:

library(rvest)
url <- "https://www.supremecourt.gov/opinions/slipopinion/18"
page <- read_html(url)
table <- html_table(page, fill = FALSE, trim = TRUE)

However, this also gets included in the table data:

table
[[1]]
                                                                                  X1
1 SEARCH TIPS\r\n Search term too short \r\n Invalid text in search term. Try again
                            X2
1 ADVANCED SEARCHDOCKET SEARCH

So I am having a hard time understanding how to turn this data into a single data frame, because doing something like as.data.frame(table) gives me this error:

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 11, 8, 7, 2

2
Your professor is doing you a favour: real-world data is messy :) Assuming that you want the table for each month, it may be better to select the tables first using html_nodes("table") with a selector for the desired tables, and only then call html_table. - neilfws

2 Answers

1
votes

You can use a selector to distinguish the tables with the data from other tables on the page, such as the search box. In this case, the data tables are of class table-bordered:

page %>% 
  html_nodes("table.table-bordered") %>% 
  html_table()
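
To see what the selector changes without hitting the live site, here is a minimal sketch on a made-up page. The HTML string, the docket numbers, and the case names are all hypothetical stand-ins; only the `table-bordered` class is taken from the answer above. It assumes the data tables share the same columns, so they can be stacked with `rbind`.

```r
library(rvest)

# Hypothetical miniature of the page: one layout table for the search box
# and two data tables carrying the "table-bordered" class.
html <- '
<html><body>
  <table><tr><td>SEARCH TIPS</td><td>ADVANCED SEARCH</td></tr></table>
  <table class="table table-bordered">
    <tr><th>Docket</th><th>Name</th></tr>
    <tr><td>18-1</td><td>Case A</td></tr>
  </table>
  <table class="table table-bordered">
    <tr><th>Docket</th><th>Name</th></tr>
    <tr><td>18-2</td><td>Case B</td></tr>
    <tr><td>18-3</td><td>Case C</td></tr>
  </table>
</body></html>'

page <- read_html(html)

# Without a selector, all three tables come back, including the search box.
all_tables <- html_table(page)

# With the class selector, only the two data tables are kept.
data_tables <- page %>%
  html_nodes("table.table-bordered") %>%
  html_table()

# The data tables share column names, so they can be stacked into one data frame.
opinions <- do.call(rbind, data_tables)
```

On the real page the result of `html_table()` is likewise a list with one element per month, so the same stacking step should apply as long as the column names and types line up.
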
0
votes

I think we can approach it in two ways. If you are sure that the search box is the only unwanted table, you can use grepl to look for the text "Search term too short" and drop any element of the list table that contains it before calling bind_rows.

library(dplyr)
table[unlist(lapply(
  table, 
  function(x) sum(grepl("Search term too short", x))
)) < 1] %>% 
  bind_rows()
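
The same idea can be tried on a toy list first. This is only a sketch: `tables` and its contents are hypothetical stand-ins for the scraped list, and `vapply` with `any()` is swapped in for the `lapply`/`sum` combination above so that the result is a plain logical vector.

```r
library(dplyr)

# Hypothetical stand-in for the scraped list: one search-box table, two data tables.
tables <- list(
  data.frame(X1 = "SEARCH TIPS\r\n Search term too short", X2 = "ADVANCED SEARCH",
             stringsAsFactors = FALSE),
  data.frame(Docket = "18-1", Name = "Case A", stringsAsFactors = FALSE),
  data.frame(Docket = c("18-2", "18-3"), Name = c("Case B", "Case C"),
             stringsAsFactors = FALSE)
)

# TRUE for any table in which some cell contains the search-box text.
has_search_text <- vapply(
  tables,
  function(x) any(grepl("Search term too short", unlist(x), fixed = TRUE)),
  logical(1)
)

# Keep the remaining tables and stack them into one data frame.
opinions <- bind_rows(tables[!has_search_text])
```
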

Otherwise, because the 'nice' elements of the list all share the same column names and format, you could keep only the elements that have a column name containing "Docket":

table[unlist(lapply(
  table, 
  function(x) sum(grepl("Docket", names(x))) > 0
))] %>% 
  bind_rows()
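
A base R variant of the column-name filter, as a rough sketch: the list `tables` and the helper `has_docket` are hypothetical names, and the data is made up. Note that `%in%` requires a column named exactly "Docket", whereas the `grepl` version above also accepts names that merely contain it.

```r
# Hypothetical stand-in for the scraped list: one search-box table, two data tables.
tables <- list(
  data.frame(X1 = "SEARCH TIPS", X2 = "ADVANCED SEARCH", stringsAsFactors = FALSE),
  data.frame(Docket = "18-1", Name = "Case A", stringsAsFactors = FALSE),
  data.frame(Docket = c("18-2", "18-3"), Name = c("Case B", "Case C"),
             stringsAsFactors = FALSE)
)

# Hypothetical helper: TRUE when the table has a column named exactly "Docket".
has_docket <- function(x) "Docket" %in% names(x)

# Filter() keeps only the list elements for which the predicate is TRUE.
data_tables <- Filter(has_docket, tables)

# The kept tables share the same columns, so rbind can stack them.
opinions <- do.call(rbind, data_tables)
```
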