Elegant way of creating subsets from data.frame and binary matrix

Question

I have a data frame with columns Price and Material, and a true/false matrix with N columns (each column is a particular type of material), and the T/F value denotes if the 'material' string appears in the data matrix

Data

Price    Material
2.33     Metal nickel linen cotton
3.45     silver emerald steel
7.45     cotton silk wood

Matrix

Metal Nickel Linen Cotton Silver Emerald Steel Cotton Silk Wood
T     T      T     T      0      0       0     0      0    0
0     0      0     0      T      T       T     0      0    0

...etc.

How do I create a subset the prices based on the material? So I can calculate the mean, range mode etc of prices which have the material 'Metal'.

My initial solution was to multiply the

newMat<- data$price * materialmatrix.

and then perform column operations on newMat (mean, quantile etc.)

But this seems like a brutal way of doing things, as I would like to combine the subsets (e.g average price for something with Metal && Cotton).

I also tried

split(data, wsearch, drop=TRUE)

but got the warning.

Warning message:
  In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...)  
  data length is not a multiple of split variable

Was trying to use lapply, split, ddply, and subset, but my understanding of R is not strong enough to execute.

I know this is probably very simple, but I am stuck on how to use the matrix to create multiple subsets, rather than create one subset at a time.

Any help would be great.

I have looked at the following

Subsetting a data.frame with an integer matrix

subsetting matrix with id from another matrix

Select observations from a subset to create a new subset based on a large dataframe in R

R Selecting column in a data frame by column in another data frame

It is unclear what the structure of Data is. Best would be if you could please provide a reproducible example, e.g. give us the outputs of dput(Data) and dput(Matrix). — flodel
@flodel dput(head(data$Price)) c(17, 35, 12, 26, 1.35, 10) c("linen","wax string metal cube", "brass", "linen") — conr404
dput(head(wsearch)) structure(c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, ..........FALSE), .Dim = c(6L, 140L)) — conr404

Henrik Henrik · Accepted Answer · 2013-10-25T19:53:34

Is this what you want?

library(reshape2)
library(splitstackshape)

# sample data
df <- data.frame(price = c(17, 35, 12, 26, 1.35, 10),
                 material = c("linen",
                              "wax string metal cube",
                              "Metal nickel linen cotton",
                              "brass",
                              "linen",
                              "cotton silk wood"))

# split the concatenated material variable
df2 <- concat.split(data = df, split.col = "material", sep = " ", drop = TRUE)

# replace blanks with NA
df2[df2 == ""] <- NA

# melt data to long format
df3 <- melt(df2, id.vars = "price", na.rm = TRUE)

# calculate summary stats by material (= 'value' variable)
df4 <- aggregate(price ~ value, data = df3, summary)

#     value price.Min. price.1st Qu. price.Median price.Mean price.3rd Qu. price.Max.
# 1   brass     26.000        26.000       26.000     26.000        26.000     26.000
# 2  cotton     10.000        10.500       11.000     11.000        11.500     12.000
# 3    cube     35.000        35.000       35.000     35.000        35.000     35.000
# 4   linen      1.350         6.675       12.000     10.120        14.500     17.000
# 5   metal     35.000        35.000       35.000     35.000        35.000     35.000
# 6   Metal     12.000        12.000       12.000     12.000        12.000     12.000
# 7  nickel     12.000        12.000       12.000     12.000        12.000     12.000
# 8    silk     10.000        10.000       10.000     10.000        10.000     10.000
# 9  string     35.000        35.000       35.000     35.000        35.000     35.000
# 10    wax     35.000        35.000       35.000     35.000        35.000     35.000
# 11   wood     10.000        10.000       10.000     10.000        10.000     10.000

Elegant way of creating subsets from data.frame and binary matrix

1 Answers