
How should I get the scraped output text into a table with columns?

library(rvest)
base_url <- c("https://www.sec.gov/Archives/edgar/data/1409916/000162828017002570/exhibit211nobilishealthcor.htm",
              "https://www.sec.gov/Archives/edgar/data/1320695/000156459018002405/ths-ex211_71.htm")


df <- lapply(base_url,function(u){

  html_obj <- read_html(u)
  temp <- html_nodes(html_obj,'text')
  draft1 <- html_text(temp)
  draft1 <- as.data.frame(draft1)
  require(data.table)
  setDT(draft1)

})

I want the output in a table with column names, like below:

Sl    Subsidiary               Region
1.    Bay Valley Foods, LLC    Delaware limited liability company
2.    Sturm Foods              Wisconsin corporation
3.    S.T. Specialty Foods     Minnesota corporation
1
Error in FUN(X[[i]], ...) : could not find function "read_html" - Andre Elrico
That error shouldn't occur - Gautam Biswas
Of course it should. read_html comes from a package that you haven't specified. The same is true for the other functions in that anonymous function... - clemens
You should mention the packages you are using here - David Arenburg
I am using the rvest package - Gautam Biswas

1 Answer


I used an rvest-based solution:

To gather the data from the first URL:

base_url <- c("https://www.sec.gov/Archives/edgar/data/1409916/000162828017002570/exhibit211nobilishealthcor.htm",
          "https://www.sec.gov/Archives/edgar/data/1320695/000156459018002405/ths-ex211_71.htm")

#SCRAPE FIRST URL
library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
u <- base_url[1]
html_obj <- read_html(u)
# one node per table row inside the centered container div
tr <- html_obj %>% html_nodes('div[style="line-height:120%;text-align:center;font-size:10pt;"] tr')
loc <- NULL
interest <- NULL
for (bal in tr) {
  # text of the left-aligned cell and of the centered cell in this row
  val1 <- bal %>% html_nodes('div[style="text-align:left;font-size:10pt;"] font[style="font-family:inherit;font-size:10pt;"]') %>% html_text()
  val2 <- bal %>% html_nodes('div[style="text-align:center;font-size:10pt;"] font[style="font-family:inherit;font-size:10pt;"]') %>% html_text()
  # rows without exactly one match get a missing value
  if (length(val1) != 1) val1 <- NA_character_
  if (length(val2) != 1) val2 <- NA_character_
  interest <- c(interest, val1)
  loc <- c(loc, val2)
}

#GET THE RESULTS IN A DF
res1 <- data.frame(interest,loc)
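
The selectors above depend on the exact inline styles of that one filing, so they may not carry over to other pages. As a rough sketch of the same row-by-row idea on plainer markup, here is a version that runs on a small inline HTML string. The `sample_html` snippet and the `cell_text` helper are made up for illustration; they are not taken from the SEC pages.

```r
library(rvest)

# Hypothetical stand-in for a filing: a plain table, the last row has one cell only
sample_html <- '<table>
  <tr><td>Bay Valley Foods, LLC</td><td>Delaware limited liability company</td></tr>
  <tr><td>Sturm Foods</td><td>Wisconsin corporation</td></tr>
  <tr><td>Header only</td></tr>
</table>'

page <- read_html(sample_html)
rows <- html_nodes(page, "tr")

# Hypothetical helper: text of the i-th cell in a row, or NA if the row is too short
cell_text <- function(row, i) {
  cells <- html_text(html_nodes(row, "td"), trim = TRUE)
  if (length(cells) >= i) cells[i] else NA_character_
}

# Build both columns row by row so they always have the same length
res <- data.frame(
  interest = vapply(seq_along(rows), function(k) cell_text(rows[[k]], 1), character(1)),
  loc      = vapply(seq_along(rows), function(k) cell_text(rows[[k]], 2), character(1)),
  stringsAsFactors = FALSE
)
print(res)
```

Filling short rows with `NA_character_` keeps the two columns aligned, which is the same purpose the `length(...) != 1` checks serve in the loop above.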

Then the script to gather the data from the second URL:

#SCRAPE 2ND URL
u <- base_url[2]
html_obj <- read_html(u)
text <- html_obj %>% html_nodes("p[style='margin-bottom:0pt;margin-top:12pt;text-indent:0%;color:#000000;font-size:10pt;font-family:Times New Roman;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;']") %>% html_text()
# split each paragraph at ", a ", ". a ", ", an " or ". an "
res2 <- strsplit(text,split = ", a |\\. a |, an |\\. an ")
# assumes every paragraph splits into exactly two pieces, so the two fields alternate
parts <- unlist(res2)
res2 <- data.frame(interest = parts[seq(1,length(parts),2)],loc = parts[seq(2,length(parts),2)])
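
To see what the `strsplit` step does without hitting the network, here is a minimal sketch on three hand-written lines. The names are the ones from the question's desired output, but the full sentences in `lines` are assumed, not copied from the filing. It also adds the `Sl` column the question asks for and takes the pieces per element, so a line that fails to split gives `NA` instead of shifting the columns.

```r
# Assumed sample paragraphs, shaped like "name, a <jurisdiction>"
lines <- c("Bay Valley Foods, LLC, a Delaware limited liability company",
           "Sturm Foods, a Wisconsin corporation",
           "S.T. Specialty Foods, a Minnesota corporation")

# Same separator pattern as in the answer
pattern <- ", a |\\. a |, an |\\. an "
parts <- strsplit(lines, split = pattern)

# First piece is the name; second piece (if present) is the jurisdiction
subsidiary <- vapply(parts, function(p) p[1], character(1))
region <- vapply(parts,
                 function(p) if (length(p) >= 2) p[2] else NA_character_,
                 character(1))

out <- data.frame(Sl = seq_along(lines),
                  Subsidiary = subsidiary,
                  Region = region,
                  stringsAsFactors = FALSE)
print(out)
```

Note that ", LLC" in the first line is left alone because the pattern only matches a comma or period followed by " a " or " an ".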

Hope this helps.

Gottavianoni