R delete rows with least count based on multiple rows

Question

I have the below dataset:

   Var1  Var2  Var3 Var4
1 Rank 1 Sub 1     0   10
2 Rank 1 Sub 1     0   20
3 Rank 2 Sub 2     0   30
4 Rank 1     0 Sub 1   40
5 Rank 2 Sub 2     0   50
6 Rank 2     0 Sub 2   10

For each value of Var1, I want to compare how many rows have a value in Var2 with how many have a value in Var3, and remove the rows belonging to the column with the smaller count. For example, Rank 1 (in Var1) has 2 values in Var2 and 1 value in Var3, so I want to remove all Rank 1 entries that have a value in Var3 and keep all those that have a value in Var2. The same applies to all other Var1 values.

So the final result will be:

       Var1  Var2  Var3 Var4
    1 Rank 1 Sub 1     0   10
    2 Rank 1 Sub 1     0   20
    3 Rank 2 Sub 2     0   30
    4 Rank 2 Sub 2     0   50

Is there a way to do that? The code to build the above table is below:

Var1 = c("Rank 1", "Rank 1", "Rank 2", "Rank 1", "Rank 2")
Var2 = c("Sub 1", "Sub 1", "Sub 2","0", "Sub 2")
Var3 = c(0, "Sub 1", 0, "Sub 1", "0" )
Var4 = c(10,20, 30, 40,50)
df <- data.frame(Var1,Var2,Var3,Var4)

PS: This will be a very large dataset with multiple entries in both Var2 and Var3

Thanks

So for every Var1 you want to keep the rows from whichever of Var2 or Var3 has more non-zero values? - Ronak Shah
@AndreasAvgousti It seems that the data you have shown at the beginning and the data created as a data.frame are not the same. One has 6 rows and the other has 5 rows. - MKR

MKR · Accepted Answer · 2018-05-19T11:59:56

Use the dplyr package: group by Var1 and count the non-zero values in the Var2 and Var3 columns. Depending on which count is greater, the filter is applied to the corresponding column. case_when keeps the logic simple and clean.

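library(dplyr)

df %>%
  # Convert factor columns to character (relevant if df was built with stringsAsFactors = TRUE)
  mutate_if(is.factor, as.character) %>%
  group_by(Var1) %>%
  # Within each Var1 group, keep the rows of whichever column has more
  # non-zero values; on a tie the Var2 rows are kept
  filter(case_when(
    sum(Var2 != "0") >= sum(Var3 != "0") ~ Var2 != "0",
    sum(Var2 != "0") <  sum(Var3 != "0") ~ Var3 != "0"
  ))
# # A tibble: 4 x 4
# # Groups: Var1 [2]
#   Var1   Var2  Var3   Var4
#   <chr>  <chr> <chr> <int>
# 1 Rank 1 Sub 1 0        10
# 2 Rank 1 Sub 1 0        20
# 3 Rank 2 Sub 2 0        30
# 4 Rank 2 Sub 2 0        50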

Data:

df <- read.table(text = 
"Var1  Var2  Var3 Var4
1 'Rank 1' 'Sub 1'     0   10
2 'Rank 1' 'Sub 1'     0   20
3 'Rank 2' 'Sub 2'     0   30
4 'Rank 1'     0 'Sub 1'   40
5 'Rank 2' 'Sub 2'     0   50
6 'Rank 2'     0 'Sub 2'   10",
stringsAsFactors = FALSE, header = TRUE)
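
For anyone who wants to see the per-group counts that drive the filter, here is a minimal base R sketch of the same idea. It is not part of the dplyr solution above: it assumes ave() is an acceptable stand-in for group_by(), and the names n2, n3, keep and result are made up for illustration. As in the dplyr version, a tie keeps the Var2 rows.

```r
# Same sample data as in the answer
df <- read.table(text =
"Var1  Var2  Var3 Var4
1 'Rank 1' 'Sub 1'     0   10
2 'Rank 1' 'Sub 1'     0   20
3 'Rank 2' 'Sub 2'     0   30
4 'Rank 1'     0 'Sub 1'   40
5 'Rank 2' 'Sub 2'     0   50
6 'Rank 2'     0 'Sub 2'   10",
stringsAsFactors = FALSE, header = TRUE)

# For every row, the number of non-zero Var2 / Var3 values in that row's Var1 group
n2 <- ave(as.integer(df$Var2 != "0"), df$Var1, FUN = sum)
n3 <- ave(as.integer(df$Var3 != "0"), df$Var1, FUN = sum)

# Keep the non-zero rows of whichever column "wins" in the group (ties go to Var2)
keep <- ifelse(n2 >= n3, df$Var2 != "0", df$Var3 != "0")
result <- df[keep, ]
print(result)
```

Printing n2 and n3 next to df$Var1 is a quick way to check why a given group kept its Var2 or its Var3 rows.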
