Creating a new column that has continuous sequence and rep based on multiple column values

Question

I'm currently in a bit of a rut when it comes to R coding. I have been trying to use the mutate, seq, and rep functions to generate a new column that depends on the values of multiple columns and on several conditions, but the result has not come out correctly. A small snippet of my data is below:

library(tidyverse)
library(data.table)
library(stringr)

lipidData <- data.frame("Type"=c(rep("LDL",5),rep("HDL",5)),
                        "featureID"=c(12,12,12,12,13,13,14,15,16,17),
                        "featureID2"=c(21,22,23,26,31,31,31,31,38,40))
lipidWrong <- lipidData %>%
group_by(Type,featureID) %>% 
group_by(Type,featureID2) %>% 
mutate(lipidName=paste0(rep("lipid",n()),"_",seq(1,n())))
lipidWrong
  Type  featureID featureID2 lipidName
   <fct>     <dbl>      <dbl> <chr>    
 1 LDL          12         21 lipid_1  
 2 LDL          12         22 lipid_1  
 3 LDL          12         23 lipid_1  
 4 LDL          12         26 lipid_1  
 5 LDL          13         31 lipid_1  
 6 HDL          13         31 lipid_1  
 7 HDL          14         31 lipid_2  
 8 HDL          15         31 lipid_3  
 9 HDL          16         38 lipid_1  
10 HDL          17         40 lipid_1

Instead of that incorrect table, I would like lipidName to be assigned by grouping on Type and featureID first, and then on Type and featureID2. If rows have the same Type and featureID, count them as the same lipid for lipidName. If rows have the same Type and featureID2, count them as the same lipid as well. Since my real dataset has more than 100,000 rows, it would be great to know how to number the lipids consecutively over the entire dataset, rather than restarting the sequence inside each group_by() group with n().

I would like to see my results as:

lipidCorrect
   Type featureID featureID2 lipidName
1   LDL        12         21   lipid_1 # same type and featureID
2   LDL        12         22   lipid_1 # same type and featureID
3   LDL        12         23   lipid_1 # same type and featureID
4   LDL        12         26   lipid_1 # same type and featureID
5   LDL        13         31   lipid_2 # featureID is the same as in row 6, but the type is different
6   HDL        13         31   lipid_3 # same type and featureID2
7   HDL        14         31   lipid_3 # same type and featureID2
8   HDL        15         31   lipid_3 # same type and featureID2
9   HDL        16         38   lipid_4 
10  HDL        17         40   lipid_5

Please let me know if I'm doing anything wrong with my group_by() and mutate(), and also please let me know of a better way to produce the desired results.

Thanks!

Making sure I understand: Two rows will have the same lipidName if they (a) have the same type AND (b) either have the same featureID or the same featureID2. Is that correct? — Gregor Thomas
Note: The second group_by() will override your first grouping. — TTS
Your example data shows all lipids of the same name sharing either featureID 1 or 2, but are chains possible? E.g., (all within the same type, and abbreviating featureID as "fID"), X has fID = 30, fID2 = 50, Y has fID = 31, fID2 = 50, Z has fID = 31, fID2 = 51, do they all have the same lipid name even though X is only connected to Z via Y? — Gregor Thomas
@gregorthomas What I'm trying to say is that chaining is not allowed. — Harper Fauni
Great, nice and clear about the chains. My current understanding: First, group by type and fID1, assign lipidNames to all groups with more than 1 row, with all rows within each group getting the same name, and the names iterate the number after the _ between groups. THEN group by type and fID2 and repeat the process only for those rows that don't already have lipidNames. Does this sound right? — Gregor Thomas
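
TTS's note about the second group_by() can be checked directly. The following is a minimal sketch on the example data from the question; the object names `replaced` and `combined` are made up for illustration, and `.add = TRUE` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

lipidData <- data.frame(
  Type       = c(rep("LDL", 5), rep("HDL", 5)),
  featureID  = c(12, 12, 12, 12, 13, 13, 14, 15, 16, 17),
  featureID2 = c(21, 22, 23, 26, 31, 31, 31, 31, 38, 40)
)

# By default, a second group_by() replaces the existing grouping
replaced <- lipidData %>%
  group_by(Type, featureID) %>%
  group_by(Type, featureID2)
group_vars(replaced)
# "Type" "featureID2"

# .add = TRUE keeps the existing grouping and appends the new variable
combined <- lipidData %>%
  group_by(Type, featureID) %>%
  group_by(featureID2, .add = TRUE)
group_vars(combined)
# "Type" "featureID" "featureID2"
```

Note that appending the grouping does not solve the question either: it groups rows that match on featureID and featureID2 together, whereas the question asks for rows that match on one or the other.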

rjen · Accepted Answer · 2020-10-21T21:11:58

If I understand the question correctly (going by the nice clarifying questions and comments from @Gregor Thomas), a (clumsy) solution based on the tidyverse could look as follows. It relies on the rows of each group being adjacent, as they are in the example data.

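library(dplyr)
library(stringr)

lipidData %>%
  # Flag rows whose (Type, featureID) group has more than one row
  group_by(Type, featureID) %>%
  mutate(lipidGroup1 = +(n() > 1)) %>%
  # Flag rows whose (Type, featureID2) group has more than one row
  group_by(Type, featureID2) %>%
  mutate(lipidGroup2 = +(n() > 1)) %>%
  ungroup() %>%
  # Rows that belong to neither kind of multi-row group are lipids on their own
  mutate(lipidGroup3 = +(lipidGroup1 == 0 & lipidGroup2 == 0)) %>%
  # Keep the flag only on the first row of each multi-row group
  group_by(Type, featureID) %>%
  mutate(lipidGroup1 = if_else(n() > 1 & row_number() == 1, 1, 0)) %>%
  group_by(Type, featureID2) %>%
  mutate(lipidGroup2 = if_else(n() > 1 & row_number() == 1, 1, 0)) %>%
  ungroup() %>%
  # Every flagged row starts a new lipid, so the running total is the lipid number
  mutate(lipidName = str_c('lipid_', cumsum(lipidGroup1 + lipidGroup2 + lipidGroup3))) %>%
  select(-starts_with('lipidGroup'))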

#    Type  featureID featureID2 lipidName
#    <chr>     <dbl>      <dbl> <chr>    
#  1 LDL          12         21 lipid_1  
#  2 LDL          12         22 lipid_1  
#  3 LDL          12         23 lipid_1  
#  4 LDL          12         26 lipid_1  
#  5 LDL          13         31 lipid_2  
#  6 HDL          13         31 lipid_3  
#  7 HDL          14         31 lipid_3  
#  8 HDL          15         31 lipid_3  
#  9 HDL          16         38 lipid_4  
# 10 HDL          17         40 lipid_5
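
For comparison, here is a rough sketch of the two-pass rule described in the comments (name by Type and featureID first, then by Type and featureID2 for rows not yet named), written so that it should not depend on grouped rows being adjacent. It is not the approach above: the helper columns `n1`, `n2`, and `key` and the result name `lipidSketch` are made up for illustration, and lipids are numbered in order of first appearance.

```r
library(dplyr)

lipidData <- data.frame(
  Type       = c(rep("LDL", 5), rep("HDL", 5)),
  featureID  = c(12, 12, 12, 12, 13, 13, 14, 15, 16, 17),
  featureID2 = c(21, 22, 23, 26, 31, 31, 31, 31, 38, 40)
)

lipidSketch <- lipidData %>%
  # Group sizes for both candidate groupings, added as ordinary columns
  add_count(Type, featureID, name = "n1") %>%
  add_count(Type, featureID2, name = "n2") %>%
  mutate(
    # One key per lipid: a shared featureID takes priority, then a shared
    # featureID2; any remaining row gets a key of its own
    key = case_when(
      n1 > 1 ~ paste("fID", Type, featureID),
      n2 > 1 ~ paste("fID2", Type, featureID2),
      TRUE   ~ paste("row", row_number())
    ),
    # Number the keys in order of first appearance across the whole data set
    lipidName = paste0("lipid_", match(key, unique(key)))
  ) %>%
  select(-n1, -n2, -key)

lipidSketch
```

On the example data this gives the same lipidName column as lipidCorrect in the question. Because the numbering comes from match() on unique keys rather than from n() within groups, it runs over the entire data set in a single pass.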
