For example, I have a dataset (my real dataset has hundreds of columns and millions of observations):
> data
time_period country var1 var2 var3
1: 2000_Q2 USA april spring ABS
2: 2000_Q2 USA may spring CND
3: 2000_Q3 USA june summer ABS
4: 2005_Q2 Italy april spring ABS
5: 2010_Q2 Italy may spring ABS
6: 2010_Q2 Italy may spring ABS
7: 2012_Q2 Italy may spring ABS
8: 2018_Q4 China october autumn ABS
9: 2018_Q4 China september autumn ARF
10: 2018_Q4 China october autumn ABS
11: 2018_Q4 China october autumn ABS
12: 2000_Q2 USA march spring ABS
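For reproducibility, here is a sketch that reconstructs this sample as a data.table (column types are assumptions read off the print-out above):

library(data.table)

# Reconstruction of the sample printed above
data <- data.table(
  time_period = c("2000_Q2", "2000_Q2", "2000_Q3", "2005_Q2", "2010_Q2",
                  "2010_Q2", "2012_Q2", "2018_Q4", "2018_Q4", "2018_Q4",
                  "2018_Q4", "2000_Q2"),
  country     = c("USA", "USA", "USA", "Italy", "Italy", "Italy", "Italy",
                  "China", "China", "China", "China", "USA"),
  var1        = c("april", "may", "june", "april", "may", "may", "may",
                  "october", "september", "october", "october", "march"),
  var2        = c("spring", "spring", "summer", "spring", "spring", "spring",
                  "spring", "autumn", "autumn", "autumn", "autumn", "spring"),
  var3        = c("ABS", "CND", "ABS", "ABS", "ABS", "ABS", "ABS", "ABS",
                  "ARF", "ABS", "ABS", "ABS")
)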
I need to group the data by time_period and country and find all variables that have more than one unique value per group (for at least one group). In my dataset, all of these variables are categorical, with type chr.
I did this in the following way:
colnames_to_group <- c("time_period", "country", "var1", "var2", "var3")

# Count the unique values of every column within each (time_period, country) group
data_count <- data[, lapply(.SD, uniqueN),
                   by = c("time_period", "country"),
                   .SDcols = colnames_to_group]
> data_count
time_period country time_period country var1 var2 var3
1: 2000_Q2 USA 1 1 3 1 2
2: 2000_Q3 USA 1 1 1 1 1
3: 2005_Q2 Italy 1 1 1 1 1
4: 2010_Q2 Italy 1 1 1 1 1
5: 2012_Q2 Italy 1 1 1 1 1
6: 2018_Q4 China 1 1 2 1 2
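As a side note, the duplicated time_period and country columns in data_count appear because the grouping keys are also listed in .SDcols. Since the keys always have exactly one unique value per group, they can be dropped from .SDcols entirely (a minor tweak, shown here as a sketch):

# Count unique values only for the non-key columns; the grouping keys
# always have exactly one unique value per group anyway
value_cols <- setdiff(colnames_to_group, c("time_period", "country"))
data_count <- data[, lapply(.SD, uniqueN),
                   by = c("time_period", "country"),
                   .SDcols = value_cols]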
# Keep only the numeric count columns, then flag those where any group
# has more than one unique value
tokeep <- which(sapply(data_count, is.numeric))
multiple_values <- sapply(data_count[, tokeep, with = FALSE], function(x) any(x > 1))
> multiple_values
time_period country var1 var2 var3
FALSE FALSE TRUE FALSE TRUE
colnames_with_multiple_values <- names(multiple_values)[multiple_values]
> colnames_with_multiple_values
[1] "var1" "var3"
This works and gives exactly the result I want. But on huge datasets it takes a very long time.
Is there a way to implement this more efficiently?
Thanks for any help!
Comments:
– markus: uniqueN() is very slow compared to length(unique()).
– red_quark: The n_distinct() function works much faster.
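Following the comments above, which suggest that length(unique()) (and dplyr's n_distinct()) can be considerably faster than uniqueN(), one possible rewrite is the sketch below. It also skips the grouping keys, which by construction have exactly one unique value per group. This is a sketch, not benchmarked on the real data:

# Count distinct values per group with length(unique()) instead of uniqueN(),
# and only for the non-key columns
value_cols <- setdiff(names(data), c("time_period", "country"))
data_count <- data[, lapply(.SD, function(x) length(unique(x))),
                   by = c("time_period", "country"),
                   .SDcols = value_cols]

# Flag the columns where at least one group has more than one unique value
multiple_values <- sapply(data_count[, ..value_cols], function(x) any(x > 1))
colnames_with_multiple_values <- names(multiple_values)[multiple_values]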