2
votes

I have a question about check which level is the most.
Here is my data:

Year  Area    V1    V2    V3
2014  USA    100    25    50
2014  USA    200    50    60
2014  USA    200    50    50
2014  USA    200    50    50
2014  USA    300    75    40
2014  ASIA   100    25    60
2014  ASIA   100    25    70
2014  ASIA   300    75    60
2014  ASIA   400    100   60
2014  ASIA   500    125   70
2015  USA    100    25    80
2015  USA    300    75    80
2015  USA    300    75    70
2015  USA    300    75    90
2015  USA    500    125   40
2015  ASIA   400    100   90
2015  ASIA   400    100   80
2015  ASIA   300    75    80
2016  USA    500    125   60
2016  USA    500    125   60
2016  ASIA   100    25    50

What I want is:

Year  Area   V1    V2    V3   Count
2014  USA    200   50    50   5
2015  USA    300   75    80   5
2016  USA    500   125   60   2
2014  ASIA   100   25    60   5
2015  ASIA   400   100   80   3
2016  ASIA   100   25    50   1

In V1, there are 5 levels(100, 200, 300, 400, and 500).
In V2, there are also 5 levels, which are 0.25*V1.
In V3, there are 6 levels.
The result I want is group by Year and Area. Moreover, V1 is the max count of levels. For example, in Year == 2014 and Area == USA, V1 contains 1 level 100, 3 level 200, and 1 level 300. So, the result should be 200 because it is the most one. V2 and V3 are the same.

Any idea?

DATA

dt <- fread("Year  Area    V1    V2    V3
2014  USA    100    25    50
2014  USA    200    50    60
2014  USA    200    50    50
2014  USA    200    50    50
2014  USA    300    75    40
2014  ASIA   100    25    60
2014  ASIA   100    25    70
2014  ASIA   300    75    60
2014  ASIA   400    100   60
2014  ASIA   500    125   70
2015  USA    100    25    80
2015  USA    300    75    80
2015  USA    300    75    70
2015  USA    300    75    90
2015  USA    500    125   40
2015  ASIA   400    100   90
2015  ASIA   400    100   80
2015  ASIA   300    75    80
2016  USA    500    125   60
2016  USA    500    125   60
2016  ASIA   100    25    50")
2
Particularly for V2 and V3! Why aren't they removed if we only retain the Mode of V1 within Year/Area?Vincent Bonhomme

2 Answers

2
votes

Using dplyr we can:

dt %>% 
  # we count the number of item within Year x Area groups
  group_by(Year, Area) %>% 
  mutate(Count=n()) %>% 
  # now we count the number of V1 levels within the Year x Area groups
  group_by(Year, Area, V1) %>% 
  mutate(Count_V1=n()) %>% 
  arrange(desc(Count_V1)) %>% 
  # now we come back to Year x Area grouping and retain the most abundant entry
  group_by(Year, Area) %>% 
  slice(1) %>% 
  ungroup() %>% 
  # cosmetics
  arrange(Area, Year) -> dt2

Which results in:

dt2
# A tibble: 6 x 7
    Year   Area    V1    V2    V3 Count Count_V1
<int> <fctr> <int> <int> <int> <int>    <int>
1  2014   ASIA   100    25    60     5        2
2  2015   ASIA   400   100    90     3        2
3  2016   ASIA   100    25    50     1        1
4  2014    USA   200    50    60     5        3
5  2015    USA   300    75    80     5        3
6  2016    USA   500   125    60     2        2

But as stated in the comments, V2 and V3 values are confusing since they're not based on a clear criterium. From what I understand they can be removed, just like Count_V1:

dt2 %>% select(-Count_V1, -V2, -V3)

# A tibble: 6 x 4
   Year   Area    V1 Count
<int> <fctr> <int> <int>
1  2014   ASIA   100     5
2  2015   ASIA   400     3
3  2016   ASIA   100     1
4  2014    USA   200     5
5  2015    USA   300     5
6  2016    USA   500     2
2
votes

We can get the Mode function from @KenWilliam's answer here.

library(data.table)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
 }

Grouped by 'Year', 'Area' (using data.table syntax as it is already a data.table from reading with fread), loop through the Subset of data.table (.SD), get the element that has the highest frequency using Mode, similarly get the frequency of columns using table, find the sum, collapse the multiple columns to a single one by taking the maximum of each row (pmax) and concatenate (c) with the list of columns of 'Mode'

dt[, c(lapply(.SD, Mode), 
     Count = do.call(pmax, lapply(.SD, function(x) sum(table(x))))), by = .(Year, Area)]
#    Year Area  V1  V2 V3 Count
#1: 2014  USA 200  50 50     5
#2: 2014 ASIA 100  25 60     5
#3: 2015  USA 300  75 80     5
#4: 2015 ASIA 400 100 80     3
#5: 2016  USA 500 125 60     2
#6: 2016 ASIA 100  25 50     1