4
votes

I'm struggling to apply head(1) to a column, aggregation to another column, and logical expression to another column, after groupby.

I have a dataframe df like this:

df <- data.frame(ref = c(rep("123", 3), rep("456", 3), rep("789", 4)),
                 carrier = c("A", "A", "B", "C", "C", "C", "D", "E", "F", "A"), 
                 distance = c(20, 10, 40, 20, 90, 30, 20, 20, 30, 70),
                 stringsAsFactors = FALSE)


>df
ref    carrier    distance
123          A          20
123          A          10
123          B          40
456          C          20
456          C          90
456          C          30
789          D          20
789          E          20
789          F          30
789          A          70

I want to do these things below.

  1. group by ref

  2. mutate a column first_carrier where the first value of the carrier column of each group is returned

  3. mutate a column agg_distance where the aggregated value of distance column in each group is returned

  4. mutate a column plus_100 where TRUE is returned if the agg_distance is greater than 350, FALSE is smaller than 100.

So the result should look like this.

ref  first_carrier  agg_distance  plus_100
123              A            70     FALSE
456              C           140      TRUE  
789              D           140      TRUE

My attempt:

  df_new <- df %>%
    group_by(ref) %>%
    mutate("agg_distance" = summarise(sum(distance)) %>%
    mutate("plus_100" = ifelse(agg_distance >= 100, T, F))

But I'm just not sure how to take the first carrier in each group.

2

2 Answers

2
votes

You almost had it on your own. summarise is used without mutate. To get the first carrier, just call on the first row of the carrier column after grouping.

library(dplyr)

df_new <- df %>%
  group_by(ref) %>%
  summarise(first_carrier = carrier[1],
            agg_distance = sum(distance),
            plus_100 = ifelse(agg_distance >= 100, T, F))

# A tibble: 3 x 4
    ref first_carrier agg_distance plus_100
  <chr>         <chr>        <dbl>    <lgl>
1   123             A           70    FALSE
2   456             C          140     TRUE
3   789             D          140     TRUE
1
votes

This is a data.table verion made from @LAP input:

Please hand LAP the likes

df<-
setDT(df)[,.(first_carrier = carrier[1],
             agg_distance  = sum(distance)),by="ref"][,plus_100 := ifelse(agg_distance >= 100, T, F)]

#> df
#   ref first_carrier agg_distance plus_100
#1: 123             A           70    FALSE
#2: 456             C          140     TRUE
#3: 789             D          140     TRUE