I would like to replace missing values in my data.table for each group and fill in value depending on whether all values in group are missing, or just some values in group are missing.
I can solve the problem but am open to a better code(in terms of speed/memory/readability/flexibility).
I'm stubborn and i'd prefer the data.table solution :)
example:
Starting dataset:
it is a data.table with this structure:
dt = data.table(
grouping_1 = sort(rep(c('a', 'b', 'c'), 4)),
grouping_2 = c(1,1,2,2,1,1,2,2,1,1,2,2),
value_1 = c(NA, NA, NA, NA, NA, 1, 2, NA, 3, 2,4,NA),
value_2 = c(NA, 2, NA, NA, 2, 5, 2, 7, 10, 5,NA, NA)
)
That looks like this:
grouping_1 grouping_2 value_1 value_2
1: a 1 NA NA
2: a 1 NA 2
3: a 2 NA NA
4: a 2 NA NA
5: b 1 NA 2
6: b 1 1 5
7: b 2 2 2
8: b 2 NA 7
9: c 1 3 10
10: c 1 2 5
11: c 2 4 NA
12: c 2 NA NA
What i want to do with is:
I want to group it by columns grouping_1
and grouping_2
and replace missing values in in columns value_1
and value_2
.
If there is no non-missing value for given group (e.g. group grrouping_1==a & grouping_2==1
), i want to replace all NAs for this group by the value of 9000.
If there are some non-missing values for given group, i want to replace missing values by 800 if grouping_2==1
and replace by -800 (negative 800) if grouping_2==2
. If the value is not missing, i do not want to change it.
How can i do it now:
I wrote the following function, which i then apply to each column where i want to fill in missing values. The function changes the original dataset by reference:
filler_so = function(
data, # the dataset that we will be changing
column, # the column we will be filling in
placeholder_col ='drop_at_the_end', # some temporary column that will disappear in the end
missing_fully = 9000, # value to fill in when all values in group missing
missing_partially_g2_1 = 800, # value to fill when grouping_2 = 1
missing_partially_g2_2 = -800, # value to fill when grouping_2 = 2
g2_col = 'grouping_2', # name of column corresponding to grouping_2 from my example
group_cols = c('grouping_1', 'grouping_2') # names of columns to group by
){
# identify for given column whether all values in group are missing,
# or only some are misisng. The value will be either Infinity (all missig),
# or a real number (none or some missing).
# this info is put in a placeholder column
data[, (placeholder_col) := min(get(column), na.rm = T), by = group_cols]
# if value on a given row is missing, but not all missing in group,
# then fill in the values based on what group is in 2nd grouping column
data[
is.na(get(column)) & (get(placeholder_col) != Inf),
(placeholder_col) := (get(g2_col) == 2) * missing_partially_g2_2 +
(get(g2_col) ==1) * missing_partially_g2_1]
# if all values in group are missing, fill in the "missing_fully" value
data[get(placeholder_col) == Inf, (placeholder_col) := missing_fully]
# put into placeholder column the values that were originally not missing
data[!is.na(get(column)), (placeholder_col) := get(column)]
# drop the original column
data[, (column):=NULL]
# rename the placeholder column to the name of original column
setnames(data, placeholder_col, column)
# if i don't put this here,
# then sometimes the function doesn't return results properly.
# i have no clue why.
data
}
To apply this function i need to identify the columns to be filled, which i do like this:
cols_to_fill = colnames(dt)[grep('^value', colnames(dt))]
And lapply like so:
lapply(cols_to_fill, function(x) filler_so(dt, x))
The result:
> dt
grouping_1 grouping_2 value_1 value_2
1: a 1 9000 800
2: a 1 9000 2
3: a 2 9000 9000
4: a 2 9000 9000
5: b 1 800 2
6: b 1 1 5
7: b 2 2 2
8: b 2 -800 7
9: c 1 3 10
10: c 1 2 5
11: c 2 4 9000
12: c 2 -800 9000
What i would like to improve:
- My function works but is quite verbose, i was hoping i could turn the code into less lines
- The function if not very flexible - it would be better to pass something like a named vector to specify the replacing logic and values to fill based on
grouping_2
- I'm looking for speed and memory gains. (for instance, there might be faster way how to identify groups with all values missing, then to run `min(..., na.rm = TRUE) and then check when it is Infinity.
- lapply prints out the changed data.table for each column i'm filling in, which spams the console quite a bit.
- even if the problem at 4 is fixed, i would like to know if there is a way how instead of lapply i could do this using something like
dt[..., (some_column_names) := lapply(.SD, ...), .SDcols = cols_to_fill]
- i am open to any other improvements i haven't thought of.