Select only date and related row based on condition from a long format file

Question

I have a csv file, which can be imported into R. It is a dataframe with many columns in the "long form", i.e. there are multiple entries for the same ID. I am reproducing an example of the dataset and the resulting dataset I am trying to obtain, using only the first 5 columns (I actually have a lot more columns in my real data). The original dataset can be reproduced in R with this:

df <- data.frame(id = c("id1","id1","id2","id2","id2","id2","id3","id3"), date = c("30/10/20010 from Steve.","30/16/2005 from Anna. 09/08/2008 from Steve. 09/10/2009 from Steve.","06/05/2004 from Allen.","08/09/2005 from Anna.","08/05/2008 from Allen. 30/10/2010 from Bobby.","14/03/2002 from Steve. 23/07/2003 from Anna.","08/08/2002 from Steve.", "08/08/2002 from Anna. 08/08/2002 from Steve."), v1 = c(1,NA,1,1,2,NA,1,2), v2 = c(2,NA,2,NA,NA,NA,2,NA), v3 = c(1,NA,NA,2,NA,1,1,NA), v4 = c("Y","N","N","Y","NA","NA","Y","Y"), v5 = c(0,0,NA,0,0,NA,0,NA))

id date v1 v2 v3 v4 v5

1 id1 30/10/20010 from Steve. 1 2 1 Y 0

2 id1 30/16/2005 from Anna. 09/08/2008 from Steve. 09/10/2009 from Steve. NA NA NA N 0

3 id2 06/05/2004 from Allen. 1 2 NA N NA

4 id2 08/09/2005 from Anna. 1 NA 2 Y 0

5 id2 08/05/2008 from Allen. 30/10/2010 from Bobby. 2 NA NA NA 0

6 id2 14/03/2002 from Steve. 23/07/2003 from Anna. NA NA 1 NA NA

7 id3 08/08/2002 from Steve. 1 2 1 Y 0

8 id3 08/08/2002 from Anna. 08/08/2002 from Steve. 2 NA NA Y NA

What I am trying to get is a dataset selecting only the row with the latest date from the column date for each id, and if the dates are all the same, select the row with the least number of NAs (of missing values):

id date v1 v2 v3 v4 v5

1 id1 30/10/20010 1 2 1 Y 0

5 id2 30/10/2010 2 NA NA NA 0

7 id3 08/08/2002 1 2 1 Y 0

I thought of splitting the columns in R using "from" as a separator, but this doesn't really take me anywhere:

try<- strsplit(df$date, "from", fixed=TRUE)

Then I thought that it may be better to select the dates in BASH: but this also does not take me anywhere:

grep "[0-9][0-9]\/[0-9][0-9]\/20[0-9][0-9]" file.csv | less -S

Basically, I am lost in how to approach this. I would be VERY grateful if someone could please suggest the correct approach, hopefully just using BASH or R? Thank you!

yellowcap yellowcap · Accepted Answer · 2012-03-19T12:30:14

Here is a script that works on your example data

(after correcting two dates: 30/10/20010 --> 30/10/2010 and 30/16/2005 --> 30/06/2005)

df <- data.frame(id = c("id1","id1","id2","id2","id2","id2","id3","id3"), date = c("30/10/2010 from Steve.","30/06/2005 from Anna. 09/08/2008 from Steve. 09/10/2009 from Steve.","06/05/2004 from Allen.","08/09/2005 from Anna.","08/05/2008 from Allen. 30/10/2010 from Bobby.","14/03/2002 from Steve. 23/07/2003 from Anna.","08/08/2002 from Steve.", "08/08/2002 from Anna. 08/08/2002 from Steve."), v1 = c(1,NA,1,1,2,NA,1,2), v2 = c(2,NA,2,NA,NA,NA,2,NA), v3 = c(1,NA,NA,2,NA,1,1,NA), v4 = c("Y","N","N","Y","NA","NA","Y","Y"), v5 = c(0,0,NA,0,0,NA,0,NA))

# Convert string to list of dates, extract maximum (earliest date)
df.dates <- sapply(as.character(df$date), function(x) strsplit(x, "\\."))
df.dates <- lapply(df.dates, function(x) as.Date(x, format='%d/%m/%Y'))
df.dates <- lapply(df.dates, max)

# Add to dataframe
df$latest <- unlist(df.dates)

# Count the number of NA values per row
df$naCount <- apply(df, 1, function(x) sum(is.na(x[3:7])))

# Split data and select either maximum (latest) date or minimal NA count
df.split <- split(df, df$id)

df.select <- lapply(df.split, function(x){

    # Select by given criterion
    if(length(unique(x$latest)) == 1){
        y <- x[which(min(x$naCount) == x$naCount),]
        }
    else{
        y <- x[which(max(x$latest) == x$latest),]
    }

    # Check if selection was successful
    if(nrow(y) != 1) cat('Warning: non-unique choice, returning more than one line')

    # Return result
    y
})

# Combine into output dataframe
df.select <- do.call(rbind, df.select)

The result of the script looks like this:

> df.select

     id                                          date v1 v2 v3 v4 v5 latest naCount
id1 id1                        30/10/2010 from Steve.  1  2  1  Y  0  14912       0
id2 id2 08/05/2008 from Allen. 30/10/2010 from Bobby.  2 NA NA NA  0  14912       2
id3 id3                        08/08/2002 from Steve.  1  2  1  Y  0  11907       0

Select only date and related row based on condition from a long format file

1 Answers