1
votes

Question:

I have the following data frame that I want to simplify

Fruit <-  c("Apple","Apple","Orange","Orange","Banana","Banana")
Farmer <- c("Bob","Ben","Bill","Bob","George","Bob")
Tons.Jan <- c(20,40,10,20,35,15)
Tons.Feb <- c(30,40,20,15,25,30)
Tons.Mar <- c(10,10,15,10,20,30)
Tons.Apr <- c(15,20,15,30,30,30)
Tons.May <- c(20,5,20,20,20,10)

df <- cbind(Fruit,Farmer)
df <- cbind(df,Tons.Jan)
df <- cbind(df,Tons.Feb)
df <- cbind(df,Tons.Mar)
df <- cbind(df,Tons.Apr)
df <- tbl_df(cbind(df,Tons.May))

I want to be able to summarise the Farmers to a single comma delimited strong and sum the Tons to the observations so that it looks like the following

I want to get to the following

Fruit2 <- c("Apple","Orange","Banana")
Farmer2 <- c("Bob,Ben","Bill,Bob","George,Bob")
Tons.Jan2 <- c(60,30,50)
Tons.Feb2 <- c(70,35,55)
Tons.Mar2 <- c(20,25,50)
Tons.Apr2 <- c(35,45,60)
Tons.May2 <- c(25,40,30)

df2 <- cbind(Fruit2,Farmer2)
df2 <- cbind(df2,Tons.Jan2)
df2 <- cbind(df2,Tons.Feb2)
df2 <- cbind(df2,Tons.Mar2)
df2 <- cbind(df2,Tons.Apr2)
df2 <- tbl_df(cbind(df2,Tons.May2))

What I have tried:

I have tried using dplyr function group_by and summarise_each below

df <- df %>% group_by(Fruit) %>%
   summarise_each_(funs(toString))

However I am not sure how to integrate summing the numeric values without specifically calling out each column using the summarise function,

Any Help is appreciated.

2

2 Answers

2
votes
library(dplyr)

# Convert the relevant columns to numeric
df <- mutate_each(df, funs(as.numeric), -Fruit, -Farmer)

# or as mentioned in the comments by jazzurro
df <- mutate_at(df, vars(starts_with("Tons")), as.numeric)

df %>% 
    group_by(Fruit) %>% 
    mutate(Farmer = toString(Farmer)) %>%
    group_by(Fruit, Farmer) %>%
    summarise_all(funs(sum))


#Source: local data frame [3 x 7]
#Groups: Fruit [?]
#
#   Fruit      Farmer Tons.Jan Tons.Feb Tons.Mar Tons.Apr Tons.May
#   <chr>       <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#1  Apple    Bob, Ben       60       70       20       35       25
#2 Banana George, Bob       50       55       50       60       30
#3 Orange   Bill, Bob       30       35       25       45       40
2
votes

It is better not to do the data.frame(cbind( or tbl_df(cbind as the cbind binds the vectors to a matrix and matrix can hold only a single class, so while we change matrix to data.frame (with the default option i.e. stringsAsFactors=TRUE), if there are any character vector, the matrix will be all character class columns and this gets worsened as the columns are now factor class with data.frame conversion. So, we unnecessarily have to do as.numeric(as.character( for changing the type of numeric columns. It is better to construct the 'data.frame' as

data.frame(Fruit, Farmer, Tons.Jan, ...)

A data.table solution would be

library(data.table)
setDT(df)[,  Farmer :=  toString(Farmer), by = Fruit][ , 
     lapply(.SD, function(x) sum(as.numeric(as.character(x)))) , .(Fruit, Farmer)]
#    Fruit                   Farmer Tons.Jan Tons.Feb Tons.Mar Tons.Apr Tons.May
#1:  Apple       Bob, Ben, Bob, Ben       60       70       20       35       25
#2: Orange     Bill, Bob, Bill, Bob       30       35       25       45       40
#3: Banana George, Bob, George, Bob       50       55       50       60       30

Also, this can be done in a single step with grouping by 'Fruit' (based on the OP's output)

setDT(df)[, c(Farmer = toString(Farmer), lapply(.SD[, 
   setdiff(names(.SD), "Farmer"), with = FALSE], 
       function(x) sum(as.numeric(as.character(x))))), .(Fruit)]
#    Fruit      Farmer Tons.Jan Tons.Feb Tons.Mar Tons.Apr Tons.May
#1:  Apple    Bob, Ben       60       70       20       35       25
#2: Orange   Bill, Bob       30       35       25       45       40
#3: Banana George, Bob       50       55       50       60       30