
I'm new to R, so I would be grateful for help understanding what is going on. I have a very large data frame whose structure looks like this:

Var1 Var2
(chr) (int)
A , 1
A , 2
A , 3
B , 4
B , 5
C , 6
C , 7
C , 8
C , 9
...

I want to create a new data frame with one row per unique Var1 value, where all the corresponding integer Var2 values are collected into a numeric vector.

I am hoping it looks like:

Var1 Var2_Combined
(chr) (int)
A , vector[1, 2, 3]
B , vector[4, 5]
C , vector[6, 7, 8, 9]
etc.

Because the dataset is large, I don't want to assign each vector manually; I want to do it with a function. I've tried the following, but none of it has worked.

1. Convert to string

write.csv(aggregate(df$Var2 ~ df$Var1, FUN = toString), file = "Test_file")    

but I couldn't convert the string back into usable numerics using as.numeric(), as.integer(), or any similar command.

2. Concatenate
I tried to do it with c():

write.csv(aggregate(df$Var2 ~ df$Var1, FUN = c), file = "Test_file")    

While it matched all the Var2 values to the unique values in Var1, it created a bunch of new columns rather than a single column combining those values into vectors:

Var1 Var2 Var3 Var4 Var5 etc
(chr) (int) (int) (int) etc
A , 1 , 2 , 3 etc
B , 1 , 2 , 3 etc

3. a for loop

I tried to use unique() together with filter() in a for loop, but it just returned unusable numbers:

Var1_Unique <- unique(df$Var1)    
Var2_Combined <- numeric(length = length(Var1_Unique))    

for (i in seq(1, length(Var1_Unique))) {         
   Var2_Combined[i] <- df %>% filter(Var2 == Var1_Unique[i]) 
   }   

I only have dplyr attached at the moment.

Thank you


1 Answer


There are two options:

1. Store the data in a list.

A. Using base R :

df1 <- aggregate(Var2~Var1, df, list)
df1

#  Var1       Var2
#1    A    1, 2, 3
#2    B       4, 5
#3    C 6, 7, 8, 9

str(df1)
#'data.frame':  3 obs. of  2 variables:
# $ Var1: chr  "A" "B" "C"
# $ Var2:List of 3
#  ..$ : int  1 2 3
#  ..$ : int  4 5
#  ..$ : int  6 7 8 9
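
Once the values sit in a list column, each element is an ordinary integer vector. A minimal sketch of how they might be used, assuming the sample data from the question (the name `vecs` is only illustrative):

```r
# Sample data matching the structure shown in the question
df <- data.frame(Var1 = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                 Var2 = 1:9, stringsAsFactors = FALSE)

df1 <- aggregate(Var2 ~ Var1, df, list)

# Name each list element after its group so it can be looked up by Var1
vecs <- setNames(df1$Var2, df1$Var1)

vecs[["B"]]          # 4 5
sum(vecs[["C"]])     # 30
sapply(vecs, mean)   # one mean per group: A = 2, B = 4.5, C = 7.5
```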

To get the data back into its original long format:

df2 <- transform(df1[rep(1:nrow(df1), lengths(df1$Var2)), ], 
                 Var2 = unlist(df1$Var2))
str(df2)
#'data.frame':  9 obs. of  2 variables:
# $ Var1: chr  "A" "A" "A" "B" ...
# $ Var2: int  1 2 3 4 5 6 7 8 9
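
The indexing step above may be easier to follow when split into pieces. This is only a sketch on the question's sample data; `n_per_group` and `row_idx` are illustrative names, not part of the original code:

```r
# Sample data matching the structure shown in the question
df <- data.frame(Var1 = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                 Var2 = 1:9, stringsAsFactors = FALSE)

df1 <- aggregate(Var2 ~ Var1, df, list)

# How many values each group's vector holds
n_per_group <- lengths(df1$Var2)                     # 3 2 4

# Repeat each row number of df1 once per value in that row's vector
row_idx <- rep(seq_len(nrow(df1)), n_per_group)      # 1 1 1 2 2 3 3 3 3

# Pair the repeated group labels with the flattened values
df2 <- data.frame(Var1 = df1$Var1[row_idx],
                  Var2 = unlist(df1$Var2),
                  stringsAsFactors = FALSE)
```

`seq_len(nrow(df1))` is used here instead of `1:nrow(df1)` because it also behaves sensibly when the data frame has zero rows.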

B. Using tidyverse.

library(dplyr)
library(tidyr)
df1 <- df %>% group_by(Var1) %>% summarise(Var2 = list(Var2))
df2 <- df1 %>% unnest(Var2)

2. Store the data as a string.

A. Using base R

df1 <- aggregate(Var2~Var1, df, toString)
str(df1)
#'data.frame':  3 obs. of  2 variables:
# $ Var1: chr  "A" "B" "C"
# $ Var2: chr  "1, 2, 3" "4, 5" "6, 7, 8, 9"

To get it back into the original format:

tmp <- strsplit(df1$Var2, ', ')
df2 <- transform(df1[rep(1:nrow(df1), lengths(tmp)),], 
                 Var2 = as.numeric(unlist(tmp)))
str(df2)
#'data.frame':  9 obs. of  2 variables:
# $ Var1: chr  "A" "A" "A" "B" ...
# $ Var2: num  1 2 3 4 5 6 7 8 9
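
This is probably also why calling as.numeric() directly on the combined string did not work in the question: the whole string is not a single number, so it has to be split first. A small sketch with a hand-written string (`s` and `parts` are illustrative names):

```r
s <- "1, 2, 3"

# Converting the whole string at once gives NA (with a coercion warning)
whole <- suppressWarnings(as.numeric(s))
whole                                    # NA

# Split on the ", " separator first, then convert each piece
parts <- strsplit(s, ", ", fixed = TRUE)[[1]]
as.integer(parts)                        # 1 2 3
```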

B. Using tidyverse :

df1 <- df %>% group_by(Var1) %>% summarise(Var2 = toString(Var2))
df2 <- df1 %>% separate_rows(Var2, sep = ', ', convert = TRUE)

Both options work if you only want to keep the data in R. If you want to write the intermediate df1 to a CSV file, you cannot use option 1, because write.csv is not able to write list columns; in that case you need option 2.
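
As a rough sketch of that CSV route, the string version can be written out, read back, and turned into a list column again. The temporary file and the name `df_back` are assumptions made for the example:

```r
# Sample data matching the structure shown in the question
df <- data.frame(Var1 = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                 Var2 = 1:9, stringsAsFactors = FALSE)

# Option 2: one comma-separated string per group, which write.csv can handle
df1 <- aggregate(Var2 ~ Var1, df, toString)

tmp <- tempfile(fileext = ".csv")   # stand-in for a real file path
write.csv(df1, tmp, row.names = FALSE)

# Read the file back; Var2 arrives as character strings such as "1, 2, 3"
df_back <- read.csv(tmp, stringsAsFactors = FALSE)

# Split each string and convert the pieces, giving one integer vector per row
df_back$Var2 <- lapply(strsplit(df_back$Var2, ", ", fixed = TRUE), as.integer)

df_back$Var2[[3]]                   # 6 7 8 9
unlink(tmp)
```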