I am parsing multiple HTML files with rvest to build a data frame with one column holding a body of text and one column holding a date. The problem is that while the position of the date is consistent from file to file, the position of the body text is not. The result is a data frame with an inconsistent text column (a mix of article bodies and titles) and a consistent year column, plus a separate list, of a different length than the data frame, holding the "missing" bodies that I would like to substitute in.
Here's a simplification of what I have:
df <- data.frame(text=c("body1", "title1", "body3", "body4", "title2"),
year=c("2016", "2016", "2016", "2017", "2017"))
missing <- list(c("body2", "body5"))
The elements of the list are in the order in which I'd like them to replace the unwanted values in the data frame. That is, I'd like to find every value in the text column that begins with the string "title" and replace it with the next "missing" text, in order. The result I'm looking for is:
> df
   text year
1 body1 2016
2 body2 2016
3 body3 2016
4 body4 2017
5 body5 2017
I can easily identify the values I want to replace with the following:
df$text[grepl("^title", df$text)] <- NA
Where I get lost is filling these values in the right order. I tried the following, which replaces each NA with all of the values from the list:
> df$text <- as.character(df$text)
> df$text[is.na(df$text)] <- missing
> df
          text year
1        body1 2016
2 body2, body5 2016
3        body3 2016
4        body4 2017
5 body2, body5 2017
I'm not sure what the next step is. Any help is appreciated.
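For reference, here is a minimal sketch of one way the replacement might be done. It rests on two assumptions: the list holds a single character vector, as in the example above, and the number of "title" rows equals the number of replacement values. The names replacements and is_title are only illustrative. The likely reason the attempt above puts both values in each row is that missing is a list of length one, so its single element (the whole vector) is recycled into every NA position; flattening it with unlist() first lets the values be assigned one per row.

```r
# Rebuild the example; stringsAsFactors = FALSE keeps text as character
# (the default in R >= 4.0, but explicit here for older versions)
df <- data.frame(text = c("body1", "title1", "body3", "body4", "title2"),
                 year = c("2016", "2016", "2016", "2017", "2017"),
                 stringsAsFactors = FALSE)
missing <- list(c("body2", "body5"))

# Flatten the one-element list into a plain character vector
replacements <- unlist(missing)

# Logical index of rows whose text begins with "title"
is_title <- grepl("^title", df$text)

# Assumption: exactly one replacement per matched row; stop otherwise
stopifnot(sum(is_title) == length(replacements))

# Assign the replacements positionally, in the order they appear
df$text[is_title] <- replacements
df
```

Indexing with the logical vector directly avoids the intermediate NA step, and the stopifnot() check guards against silent recycling if the counts ever differ.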