I have a dataset with some repeated values in the Date variable, and I would like to filter those rows based on several conditions. As an example, the data frame looks like this:
df <- read.table(text =
"Date column_A column_B column_C Column_D
1 2020-01-01 10 15 15 20
2 2020-01-02 10 15 15 20
3 2020-01-03 10 13 15 20
4 2020-01-04 10 15 15 20
5 2020-01-05 NA 14 15 20
6 2020-01-05 7 NA NA 28
7 2020-01-06 10 15 15 20
8 2020-01-07 10 15 15 20
9 2020-01-07 10 NA NA 20
10 2020-01-08 10 15 15 20", header=TRUE)
df$Date <- as.Date(df$Date)
The conditions should apply ONLY to rows with a duplicated Date:
- If column_A is NA in one row and numeric in the other, select the row where it is numeric.
- If both values are of the same kind (both NA or both numeric), select the row with fewer NAs.
My best attempt, after trying several options, is:
library(dplyr)
# Count the NAs in each row across the four value columns
df$cnt_na <- rowSums(is.na(df[, 2:5]))
df <- df %>% group_by(Date) %>% slice(which.min(cnt_na)) %>% ungroup() %>% select(-cnt_na)
However, this does not apply the first condition; it only keeps the row with the fewest NAs per date. The main problem is that if I filter by !is.na(Date), I also remove other rows that are not duplicated.
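To make the intended behaviour concrete, here is a minimal sketch of one way the two conditions could be expressed: as a sort order within each Date group, followed by keeping the first row. It assumes dplyr is available and that sorting first on is.na(column_A) and then on the NA count matches the priority described above. The names cnt_na and result are only illustrative, and this is not necessarily the best or only approach.

```r
library(dplyr)

df <- read.table(text =
"Date column_A column_B column_C Column_D
1 2020-01-01 10 15 15 20
2 2020-01-02 10 15 15 20
3 2020-01-03 10 13 15 20
4 2020-01-04 10 15 15 20
5 2020-01-05 NA 14 15 20
6 2020-01-05 7 NA NA 28
7 2020-01-06 10 15 15 20
8 2020-01-07 10 15 15 20
9 2020-01-07 10 NA NA 20
10 2020-01-08 10 15 15 20", header = TRUE)
df$Date <- as.Date(df$Date)

# Helper column (illustrative name): number of NAs per row in the value columns
df$cnt_na <- rowSums(is.na(df[, 2:5]))

result <- df %>%
  group_by(Date) %>%
  # Within each date: rows with a non-NA column_A come first (FALSE sorts before TRUE),
  # then rows with fewer NAs
  arrange(is.na(column_A), cnt_na, .by_group = TRUE) %>%
  # Keep the top-ranked row; dates with a single row are left as they are
  slice(1) %>%
  ungroup() %>%
  select(-cnt_na)
```

Because the ranking happens inside each Date group, dates that appear only once keep their single row, so no separate filter on duplicated dates is needed. Rows that tie on both criteria are not handled specially in this sketch.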
Thanks in advance