2
votes

I have data where I have one element of interest (111) in a column. My data looks something like this:

pcp2 <- data.frame(A = c(rep(111, 4), rep(222, 5), rep(111,5), 
        rep(222,5)),B= c(rep(1,9), rep(2,10))) 

     A  B
1   111 1
2   111 1
3   111 1
4   111 1
5   222 1
6   222 1
7   222 1
8   222 1
9   222 1
10  111 2
11  111 2
12  111 2
13  111 2
14  111 2
15  222 2
16  222 2
17  222 2
18  222 2
19  222 2

I want to collapse all of only variable 222 and leave all 111 untouched throughout my data based upon the IDs in column B like so:

     A  B
1   111 1
2   111 1
3   111 1
4   111 1
5   222 1
6   111 2
7   111 2
8   111 2
9   111 2
10  111 2
11  222 2

All I have been able to find that is close reduces all the variables (both 111 and 222).

library(data.table)
dat <- as.data.table(pcp2, key = "B")
data <- dat[, by = key(dat)][!duplicated(A == "222")]

as follows:

   A    B
1 111   1
2 222   2

I've played around with various iterations of this code and others, but other things I've tried as well have just gotten my data down to two rows or collapsed both 111 and 222. I.e. this wouldn't be sufficient:

   A  B
1 111 1
2 222 2
3 111 2
4 222 2

Does anyone have any guidance for how you would maintain the 111 and collapse the 222 within one column based on another column like the example (B in this case)? I know other examples are similar but none seem to give the option to specify not collapsing a particular variable within one column while collapsing the other(s).

3
if you change pcp2[7, 2] = 2 - do you expect to get 3 rows with A = 222 at the end, or just 2?eddi
Just 2. I would expect one value of 222 PER each unique identifier in column B, no more though. So for this one 222 for when column B is 1 and one 222 for when column B is 2.cebola

3 Answers

3
votes

You can add a counter for the uncollapsed rows:

dat[, r := rowid(A)*(A == 111)]
unique(dat, by=c("B","r"))[, !"r"]

      A B
 1: 111 1
 2: 111 1
 3: 111 1
 4: 111 1
 5: 222 1
 6: 111 2
 7: 111 2
 8: 111 2
 9: 111 2
10: 111 2
11: 222 2

(This requires data.table, while Gregor's answer works in base R as well.)

2
votes

I would phrase your goal as "keep rows that are not duplicated or where A is 111", with the understanding that duplicated is FALSE for the first occurrence.

duplicated(pcp2) looks at all columns, so we don't need to worry about the "within each B group" stuff. If you have other columns hanging around, only give duplicated the relevant columns, something like duplicated(pcp2[c("A", "B")]).

pcp2[!duplicated(pcp2) | pcp2$A == 111, ]
#      A B
# 1  111 1
# 2  111 1
# 3  111 1
# 4  111 1
# 5  222 1
# 10 111 2
# 11 111 2
# 12 111 2
# 13 111 2
# 14 111 2
# 15 222 2
0
votes

Here's a dplyr solution using slice. This solution does not assume that the other values of A other than 111 are 222:

library(dplyr)

pcp2 %>%
  group_by(B, A) %>%
  slice(if(any(A == "111")) 1:n() else 1)

Alternatively using data.table:

library(data.table)

setDT(pcp2)[, .I[if(any(A=="111")) 1:.N else 1], by = c("B", "A")][,.SD, .SDcols = A:B]

Result:

# A tibble: 12 x 2
# Groups:   B, A [5]
       A     B
   <dbl> <dbl>
 1   111     1
 2   111     1
 3   111     1
 4   111     1
 5   222     1
 6   111     2
 7   111     2
 8   111     2
 9   111     2
10   111     2
11   222     2
12   333     2

      A B
 1: 111 1
 2: 111 1
 3: 111 1
 4: 111 1
 5: 222 1
 6: 111 2
 7: 111 2
 8: 111 2
 9: 111 2
10: 111 2
11: 222 2
12: 333 2

Data:

pcp2 <- data.frame(A = c(rep(111, 4), rep(222, 5), rep(111,5), 
                         rep(222,5), rep(333, 5)), B= c(rep(1,9), rep(2,10), rep(2, 5)))