8
votes

I have some data which looks like this (fake data for example's sake):

dressId        color 
6              yellow 
9              red
10             green 
10             purple 
10             yellow 
12             purple 
12             red 

where color is a factor vector. It is not guaranteed that all possible levels of the factor actually appear in the data (e.g. the color "blue" could also be one of the levels).

I need a list of vectors which groups the available colors of each dress:

[[1]]
yellow  

[[2]] 
red    

[[3]] 
green purple yellow 

[[4]] 
purple red 

Preserving the IDs of the dresses would be nice (e.g. a dataframe where this list is the second column and the IDs are the first), but not necessary.

I wrote a loop that goes through the data frame row by row and, while the next ID is the same, adds the color to a vector (I am sure that the data is sorted by ID). When the ID in the first column changes, it adds the vector to a list:

result <- NULL 
while(blah blah) 
{
    some code which creates the vector called "colors" 
    result[[dressCounter]] <- colors 
    dressCounter <- dressCounter + 1
}

After wrestling with getting all the necessary counting variables correct, I found out to my dismay that it doesn't work. The first time, colors is

[1] yellow
Levels: green yellow purple red blue

and it gets coerced into an integer, so result becomes 2.

In the second iteration, colors contains only red, and result becomes a simple integer vector, [1] 2 4.

In the third iteration, colors has more than one element,

[1] green  purple yellow
Levels: green yellow purple red blue 

and I get

result[[3]] <- colors

Error in result[[3]] <- colors :
more elements supplied than there are to replace

What am I doing wrong? Is there a way to initialize result so it doesn't get converted into a numeric vector, but becomes a list of vectors?

Also, is there a way to do the whole thing other than rolling my own loop?


4 Answers

9
votes

Splitting the data frame with split.data.frame is a good way to organize this; you can then extract the color component. Sample data:

d <- data.frame(dressId=c(6,9,10,10,10,12,12),
               color=factor(c("yellow","red","green",
                              "purple","yellow",
                              "purple","red"),
                 levels=c("red","orange","yellow",
                          "green","blue","purple")))

Split the data frame by dressId, which gives one data frame per dress:

ss <- split.data.frame(d,d$dressId)

You can get something more like the list you requested by extracting the color component:

lapply(ss,"[[","color")
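As for the loop in the question: a list can hold a whole vector in each slot, whereas the atomic vector that result turned into there can hold only one value per slot, which is what the "more elements supplied" error is complaining about. The following is only a sketch, assuming the sample data frame d from above; the names colors_by_dress, ids, and result are illustrative. It shows a shortcut that splits the color column directly, and a hand-rolled loop that starts from a preallocated list.

```r
# Sample data as defined earlier in this answer.
d <- data.frame(dressId = c(6, 9, 10, 10, 10, 12, 12),
                color = factor(c("yellow", "red", "green",
                                 "purple", "yellow",
                                 "purple", "red"),
                               levels = c("red", "orange", "yellow",
                                          "green", "blue", "purple")))

# Shortcut: split the color column itself; the list names carry the dress IDs.
colors_by_dress <- split(d$color, d$dressId)

# Hand-rolled version: preallocate a list (not NULL) so that each slot
# can store a vector of any length.
ids <- unique(d$dressId)
result <- vector("list", length(ids))
names(result) <- ids
for (i in seq_along(ids)) {
  result[[i]] <- d$color[d$dressId == ids[i]]
}

# Optional: drop the factor levels that a given dress does not use.
used_only <- lapply(colors_by_dress, droplevels)
```

Starting from list() and growing it inside the loop should work as well; preallocating with vector("list", n) just avoids the repeated resizing.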
6
votes

In addition to split, you should consider aggregate. Use c or I as the aggregation function to get your list column (here mydf holds the question's data with color stored as character):

out <- aggregate(color ~ dressId, mydf, c)
out
#   dressId                 color
# 1       6                yellow
# 2       9                   red
# 3      10 green, purple, yellow
# 4      12           purple, red
str(out)
# 'data.frame': 4 obs. of  2 variables:
#  $ dressId: int  6 9 10 12
#  $ color  :List of 4
#   ..$ 0: chr "yellow"
#   ..$ 1: chr "red"
#   ..$ 2: chr  "green" "purple" "yellow"
#   ..$ 3: chr  "purple" "red"
out$color
# $`0`
# [1] "yellow"
# 
# $`1`
# [1] "red"
# 
# $`2`
# [1] "green"  "purple" "yellow"
# 
# $`3`
# [1] "purple" "red" 

Note: This also works if the "color" variable is a factor, as in Ben's sample data (a point I missed when I first posted this answer), but then you need to use I instead of c as the aggregation function:

out <- aggregate(color ~ dressId, d, I)
str(out)
# 'data.frame': 4 obs. of  2 variables:
#  $ dressId: num  6 9 10 12
#  $ color  :List of 4
#   ..$ 0: Factor w/ 6 levels "red","orange",..: 3
#   ..$ 1: Factor w/ 6 levels "red","orange",..: 1
#   ..$ 2: Factor w/ 6 levels "red","orange",..: 4 6 3
#   ..$ 3: Factor w/ 6 levels "red","orange",..: 6 1
out$color
# $`0`
# [1] yellow
# Levels: red orange yellow green blue purple
# 
# $`1`
# [1] red
# Levels: red orange yellow green blue purple
# 
# $`2`
# [1] green  purple yellow
# Levels: red orange yellow green blue purple
# 
# $`3`
# [1] purple red   
# Levels: red orange yellow green blue purple

Strangely, however, the default display shows the integer values:

out
#   dressId   color
# 1       6       3
# 2       9       1
# 3      10 4, 6, 3
# 4      12    6, 1
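If the integer display gets in the way, one possible workaround is to build the list column by hand from character values. This is only a sketch and swaps aggregate for split; it assumes Ben's sample data d, and the names by_dress and out2 are made up for illustration.

```r
# Ben's sample data, repeated so the snippet runs on its own.
d <- data.frame(dressId = c(6, 9, 10, 10, 10, 12, 12),
                color = factor(c("yellow", "red", "green",
                                 "purple", "yellow",
                                 "purple", "red"),
                               levels = c("red", "orange", "yellow",
                                          "green", "blue", "purple")))

# Convert to character first so the labels, not the integer codes, are stored.
by_dress <- split(as.character(d$color), d$dressId)

# One row per dress; the list is attached as a list column.
out2 <- data.frame(dressId = as.numeric(names(by_dress)))
out2$color <- unname(by_dress)

str(out2)
```

The factor levels are lost this way, so it only fits when the labels are all that is needed downstream.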
5
votes

Assuming your data frame is stored in a variable called df, you can simply use group_by and summarize from the dplyr package, wrapping the column in list():

library('dplyr')

df %>%
  group_by(dressId) %>%
  summarize(colors = list(color))

Applied to your example:

df <- tribble(
  ~dressId, ~color,
         6, 'yellow',
         9, 'red',
        10, 'green',
        10, 'purple',
        10, 'yellow',
        12, 'purple',
        12, 'red'
)

df %>%
  group_by(dressId) %>%
  summarize(colors = list(color))

# dressId                colors
#       6                yellow
#       9                   red
#      10 green, purple, yellow
#      12           purple, red
0
votes

I am afraid the answer should be a little different. Use the following code to get the requested behaviour; note that toString collapses each dress's distinct colors into a single comma-separated string rather than a vector:

df %>%
  group_by(dressId) %>%
  summarize(colors = toString(unique(color)))