1
votes

I have a dataset with two columns, metro, state. I give the following command in dplyr,

data %>% group_by(metro, State) %>% summarise(count = n())

I get the following output,

metro           State         count 
A                OH            703
A                NJ              3
B                GA           1453
B                CA            456
B                WA            123

I now want to filter out the rows in the dataframe which are only the maximum counts and leave out the remaining. I need to filter out the corresponding rows. The output after filtering out the rows for the following command should be,

data %>% group_by(metro, State) %>% summarise(count = n())

   metro           State         count 
    A                OH            703
    B                GA           1453

Where every metro has only state which is the state with maximum counts and remaining are removed.

The following is my trying,

data %>% group_by(metro, State) %>% filter(n() == max(n()))

But this is again giving out the same dataframe as input.

Can anybody help me in doing this? My output should be every metro should have a unique state which has the maximum counts and the remaining state entries should be removed.

Thanks

2
data %>% group_by(metro) %>% filter(count == max(count))Psidom
@Psidom This still gives me the same output because when we group by metro, the count adds up and we can't filter out maximum entries.haimen
@Psidom I am able to filter out after summarizing.. My question is to filter out the original dataframe with the corresponding rows. If previously the dataframe had 2738 rows, I need it to have 2156 rows only after filtering. THe rows corresponding to (3, 456, 123) counts should be removed .haimen
Please add a reproducible example.mkt

2 Answers

4
votes

You need a double stage groupby, firstly groupby metro and state get the count and then groupby metro and filter out count that is not equal to the max count within each metro:

data1 <- data %>% group_by(metro, State) %>% mutate(count = n()) %>% 
                  group_by(metro) %>% filter(count == max(count))

nrow(data1)
0
votes

We can also use data.table

library(data.table)
setDT(data)[,  count := .N , .(metro, state)][,  .SD[count == max(count)] , .(metro)]