Sample a single row, per column, with substantial missing data

Question

As an example of my data frame, which I will call df1, I have GROUP1 with three rows of data, and GROUP2 with two rows of data. I have three variables, X1, X2, and X3:

GROUP          X1    X2   X3
GROUP1         A     NA   NA
GROUP1         NA    NA   T
GROUP1         C     T    G   
GROUP2         NA    NA   C
GROUP2         G     NA   T

I am halfway to my answer, based on a previous question and answer ("Sample a single row, per column, within a subset of a data frame in R, while following conditions"), but I am having issues with character data.

I would like to sample a single value per column from GROUP1, to make a new row representing GROUP1. I do not want to sample one single and complete row from GROUP1; rather, the sampling needs to occur individually for each column. I would like to do the same for GROUP2. Also, the sampling should not consider/include NA's, unless all rows for that group's variable have NA's (such as GROUP2, variable X2, above).

For example, after sampling, I could have as a result:

GROUP         X1    X2   X3
GROUP1        A     T    T
GROUP2        G     NA   C

Only GROUP2, variable X2, can result in NA here. I actually have 300 taxa, 40 groups, 160000 variables, and a substantial number of NA's.

When I use:

library(data.table)

setDT(df1)[,lapply(.SD, function(x)
if(all(is.na(x))) NA_character_ else sample(na.omit(x),1)) , by = GROUP]

I end up with a warning:

Column 2 of result for group 2 is type 'character' but expecting type    
'integer'. Column types must be consistent for each group.

However, this warning does not seem to be limited to those group variables that are composed entirely of NA's.

If I instead replace NA_character_ with NA_integer_, some columns result in the sum of non-NA rows for the group's variable, rather than a sample from across the rows.

Your code seems to work fine for me with NA_character_ when the X1-3 columns are character data. Are you sure you don't have factors (stored as integers) stuffing this up? — thelatemail

Jota · Accepted Answer · 2016-01-11T00:00:43

You can use this data.table call:

setDT(df1)[ , lapply(.SD, 
  function(x) x[!is.na(x)][sample(sum(!is.na(x)), 1)]), by = GROUP]

Or you can tweak your original call:

setDT(df1)[,lapply(.SD, function(x)
  if(all(is.na(x))) NA_character_ 
    else as.character(na.omit(x))[sample(length(na.omit(x)), 1)]) , by = GROUP]

Or using aggregate from base R:

aggregate(df1[ , names(df1) != "GROUP"], by=list(df1$GROUP), 
  function(ii) ifelse(length(na.omit(ii)) == 0, 
    NA,
    as.character(na.omit(ii))[sample(length(na.omit(ii)), 1)])) 
    # Note use of as.character in case of factors
#  Group.1 X1   X2 X3
#1  GROUP1  A    T  T
#2  GROUP2  G <NA>  C

As thelatemail mentioned, the issue you are encountering is most likely due to the variables being factors (which are stored as integers), since your code works when X1-X3 are character columns. Any of the above solutions should work with factors.
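To check whether factors are the culprit, here is a minimal base R sketch. It assumes df1 looks like the example in the question, rebuilt here with factor columns; the helper name sample_non_na is made up for illustration. It shows that factor columns report an integer storage type, converts them to character, and then samples one non-NA value per column within each group. Indexing with sample.int() avoids calling sample() directly on the values.

```r
# Hypothetical reconstruction of df1 from the question, with factor columns.
df1 <- data.frame(
  GROUP = c("GROUP1", "GROUP1", "GROUP1", "GROUP2", "GROUP2"),
  X1 = c("A", NA, "C", NA, "G"),
  X2 = c(NA, NA, "T", NA, NA),
  X3 = c(NA, "T", "G", "C", "T"),
  stringsAsFactors = TRUE
)

# Factors are stored as integer codes, which matches the 'integer'
# mentioned in the data.table warning.
typeof(df1$X1)  # "integer"
class(df1$X1)   # "factor"

# Convert every factor column to character before sampling.
is_fac <- vapply(df1, is.factor, logical(1))
df1[is_fac] <- lapply(df1[is_fac], as.character)

# Illustrative helper (not from any package): return one random non-NA
# value, or NA_character_ when every value in the group is NA.
sample_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) == 0L) return(NA_character_)
  keep[sample.int(length(keep), 1L)]
}

# Apply the helper to each value column within each group.
value_cols <- setdiff(names(df1), "GROUP")
res <- aggregate(df1[value_cols], by = list(GROUP = df1$GROUP),
                 FUN = sample_non_na)
res
```

With the columns converted up front, the same helper should also drop into the data.table call as lapply(.SD, sample_non_na), since every group then returns a character value. For 160000 columns the one-off conversion is probably cheaper than calling as.character inside every group, though I have not timed it.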
