1
votes

How would one solve the following toy problem using dplyr:

Take a data frame where each row contains at least two iris species separated by spaces:

mySpecies <- data.frame(
  Species=c("lazica uniflora setosa", 
        "virginica setosa uniflora loczyi",
        "versicolor virginica"))

I'd like to add two columns to mySpecies, where each row holds the mean Sepal.Length and Sepal.Width over only those species that appear in a separate lookup table, the iris dataset (see unique(iris$Species)).

The output of this example should be the mySpecies data frame with additional 'Sepal.Length.mean' and 'Sepal.Width.mean' columns containing the mean of those variables across each species that appear in iris$Species.

So the first row would contain just the mean Sepal.Length and Sepal.Width for 'setosa', because the other species names don't appear in iris. The second row, however, would contain the means of Sepal.Length and Sepal.Width across 'virginica' and 'setosa', because both appear in the lookup table (i.e. iris).
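To make the expected numbers concrete, here is a minimal base R sketch of the calculation for the second row. It assumes only the built-in iris data; the names `wanted`, `matched`, and `expected` are illustrative.

```r
# Species named in the second row of mySpecies
wanted <- c("virginica", "setosa", "uniflora", "loczyi")

# Only names that exist in the lookup table survive the filter
matched <- iris[iris$Species %in% wanted, ]

# Pooled means over all matching rows (setosa and virginica here)
expected <- colMeans(matched[, c("Sepal.Length", "Sepal.Width")])
print(expected)
```

The means are pooled over every matching iris row, so a species with more rows would carry more weight; in iris each species has 50 rows, so this coincides with averaging the per-species means.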

Note that this is a toy example but my actual dataframes are quite large.

2
So what's the desired output for your example? - lukeA
It is not clear how you want the output. - akrun
Did you mean iris %>% group_by(Species) %>% summarise_each(funs(mean), Sepal.Length:Sepal.Width) %>% bind_cols(., mySpecies)? - akrun
I've elaborated on the desired output. - BDA

2 Answers

1
votes

Here you go. First, split up your string into individual species; then for each group: filter the rows that match, and compute the mean.

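library(dplyr)

mySpecies %>%
    group_by(Species) %>%
    do({
        # Split the space-separated string into individual species names
        spec <- strsplit(as.character(.$Species), " ", fixed=TRUE)[[1]]
        # Keep only the iris rows for those species, then average both columns
        filter(iris, Species %in% spec) %>%
            summarise(Sepal.Length.mean = mean(Sepal.Length),
                      Sepal.Width.mean = mean(Sepal.Width))
    })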
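Since the real data frames are large, one possible variant avoids filtering iris once per row: summarise the lookup table once, split the strings into a long table, and join. This is only a sketch under assumptions: the helper names (`lookup`, `long`, `means`, `result`, `name`, `n`, `length_sum`, `width_sum`) are made up for illustration, and the Species column is assumed to be character.

```r
library(dplyr)

mySpecies <- data.frame(
  Species = c("lazica uniflora setosa",
              "virginica setosa uniflora loczyi",
              "versicolor virginica"),
  stringsAsFactors = FALSE)

# Per-species row counts and column totals from the lookup table
lookup <- iris %>%
  mutate(name = as.character(Species)) %>%
  group_by(name) %>%
  summarise(n = n(),
            length_sum = sum(Sepal.Length),
            width_sum = sum(Sepal.Width))

# One row per (original string, individual species name) pair
parts <- strsplit(mySpecies$Species, " ", fixed = TRUE)
long <- data.frame(Species = rep(mySpecies$Species, lengths(parts)),
                   name = unlist(parts),
                   stringsAsFactors = FALSE)

# Names absent from iris drop out in the join; totals / counts give pooled means
means <- long %>%
  inner_join(lookup, by = "name") %>%
  group_by(Species) %>%
  summarise(Sepal.Length.mean = sum(length_sum) / sum(n),
            Sepal.Width.mean = sum(width_sum) / sum(n))

# Attach the two new columns to the original data frame
result <- left_join(mySpecies, means, by = "Species")
print(result)
```

Dividing summed totals by summed counts reproduces the pooled mean that the do() version computes, while touching iris only once. A row whose names all miss the lookup table would end up with NA in both new columns after the left join.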
0
votes
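library(dplyr)

# Individual species names to look up; "loczyi" is not in iris and is ignored
mySpecies <- c("setosa", "loczyi", "virginica")

# The piped data is already the first argument, so group_by() takes only Species
filter(iris, Species %in% mySpecies) %>%
    group_by(Species) %>%
    summarise(mean_width = mean(Sepal.Width),
              mean_length = mean(Sepal.Length))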