I have a csv file, which can be imported into R. It is a dataframe with many columns in the "long form", i.e. there are multiple entries for the same ID. I am reproducing an example of the dataset and the resulting dataset I am trying to obtain, using only the first 5 columns (I actually have a lot more columns in my real data). The original dataset can be reproduced in R with this:
df <- data.frame(id = c("id1","id1","id2","id2","id2","id2","id3","id3"), date = c("30/10/20010 from Steve.","30/16/2005 from Anna. 09/08/2008 from Steve. 09/10/2009 from Steve.","06/05/2004 from Allen.","08/09/2005 from Anna.","08/05/2008 from Allen. 30/10/2010 from Bobby.","14/03/2002 from Steve. 23/07/2003 from Anna.","08/08/2002 from Steve.", "08/08/2002 from Anna. 08/08/2002 from Steve."), v1 = c(1,NA,1,1,2,NA,1,2), v2 = c(2,NA,2,NA,NA,NA,2,NA), v3 = c(1,NA,NA,2,NA,1,1,NA), v4 = c("Y","N","N","Y","NA","NA","Y","Y"), v5 = c(0,0,NA,0,0,NA,0,NA))
id date v1 v2 v3 v4 v5
1 id1 30/10/20010 from Steve. 1 2 1 Y 0
2 id1 30/16/2005 from Anna. 09/08/2008 from Steve. 09/10/2009 from Steve. NA NA NA N 0
3 id2 06/05/2004 from Allen. 1 2 NA N NA
4 id2 08/09/2005 from Anna. 1 NA 2 Y 0
5 id2 08/05/2008 from Allen. 30/10/2010 from Bobby. 2 NA NA NA 0
6 id2 14/03/2002 from Steve. 23/07/2003 from Anna. NA NA 1 NA NA
7 id3 08/08/2002 from Steve. 1 2 1 Y 0
8 id3 08/08/2002 from Anna. 08/08/2002 from Steve. 2 NA NA Y NA
What I am trying to get is a dataset selecting only the row with the latest date from the column date for each id, and if the dates are all the same, select the row with the least number of NAs (of missing values):
id date v1 v2 v3 v4 v5
1 id1 30/10/20010 1 2 1 Y 0
5 id2 30/10/2010 2 NA NA NA 0
7 id3 08/08/2002 1 2 1 Y 0
I thought of splitting the columns in R using "from" as a separator, but this doesn't really take me anywhere:
try<- strsplit(df$date, "from", fixed=TRUE)
Then I thought that it may be better to select the dates in BASH: but this also does not take me anywhere:
grep "[0-9][0-9]\/[0-9][0-9]\/20[0-9][0-9]" file.csv | less -S
Basically, I am lost in how to approach this. I would be VERY grateful if someone could please suggest the correct approach, hopefully just using BASH or R? Thank you!