1
votes

I have a similar dataset:

val <- c("Y", "N")
test <- data.frame(age = rnorm(n = 100, mean = 50, sd = 10),
                   var1 = sample(val, 100, TRUE),
                   var2 = sample(val, 100, TRUE),
                   var3 = sample(val, 100, TRUE),
                   sex = sample(c("F", "M"), 100, TRUE))

I'd like to create a summary reporting the mean age for each category using Hmisc.

library(Hmisc)
summary.formula(age~sex+var1+var2+var3,data=test)

However, var1-var3 actually belong under the same categorical variable, with levels var1, var2 and var3 instead of Y/N. Furthermore, these levels are not mutually exclusive. So, is it possible to create a variable var4 with these non-exclusive levels and type

summary.formula(age~sex+var4,data=test)

and have an output like:

+-------+----+---+----+
|       |    |N  |age |
+-------+----+---+----+
|sex    |F   | 44|48.0|
|       |M   | 56|50.8|
+-------+----+---+----+
|var4   |var1| xx|  yy|
|       |var2| xx|  yy|
|       |var3| xx|  yy|
+-------+----+---+----+
|Overall|    |100|49.6|
+-------+----+---+----+

Any help would be much appreciated...

2
I don't understand what you want. It's not clear to me what var4 would be or how the subsetting is supposed to work for var4. - Dason

2 Answers

1
votes

How about paste0? (Or paste(..., sep='') if you're on a version of R older than 2.15.)

> test$var4 <- paste0(test$var1, test$var2, test$var3)
> summary.formula(age~sex+var4, data=test)
age    N=100

+-------+---+---+--------+
|       |   |  N|     age|
+-------+---+---+--------+
|    sex|  F| 50|50.25440|
|       |  M| 50|51.32134|
+-------+---+---+--------+
|   var4|NNN| 13|46.64417|
|       |NNY| 17|51.34456|
|       |NYN| 15|52.92185|
|       |NYY| 17|47.35685|
|       |YNN|  9|50.91647|
|       |YNY|  7|48.04489|
|       |YYN| 10|53.23713|
|       |YYY| 12|56.14394|
+-------+---+---+--------+
|Overall|   |100|50.78787|
+-------+---+---+--------+
> 
0
votes

I think the problem lies in that you are trying to combine statistics for two different data sets:

  1. data indexed by person:

    summary.formula(age~sex, test)
    
    # age    N=100
    # 
    # +-------+-+---+--------+
    # |       | |N  |age     |
    # +-------+-+---+--------+
    # |sex    |F| 35|49.99930|
    # |       |M| 65|48.96266|
    # +-------+-+---+--------+
    # |Overall| |100|49.32548|
    # +-------+-+---+--------+
    
  2. data indexed by var

Here you need one row per var, that is, one row for each of var1, var2 and var3 that a person has set to "Y". Here is one way to create the data, though there are probably nicer ways:

    var1 <- subset(test, var1 == "Y", c("age", "sex"))
    var2 <- subset(test, var2 == "Y", c("age", "sex"))
    var3 <- subset(test, var3 == "Y", c("age", "sex"))
    var1$var <- "var1"
    var2$var <- "var2"
    var3$var <- "var3"
    vars <- rbind(var1, var2, var3)
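
A possibly tidier way to build the same stacked data is to loop over the flag columns instead of repeating the subset call three times. This is only a sketch: the names `flags` and `pieces` and the `set.seed(1)` call are assumptions added for illustration, and the `aggregate` call at the end is just a quick base-R check, not a replacement for the Hmisc table.

```r
# Rebuild the question's example data (seed added only so the sketch is reproducible).
set.seed(1)
val <- c("Y", "N")
test <- data.frame(age = rnorm(n = 100, mean = 50, sd = 10),
                   var1 = sample(val, 100, TRUE),
                   var2 = sample(val, 100, TRUE),
                   var3 = sample(val, 100, TRUE),
                   sex = sample(c("F", "M"), 100, TRUE))

# Hypothetical helper vector: the Y/N columns that should become levels of one variable.
flags <- c("var1", "var2", "var3")

# For each flag column, keep the rows marked "Y" and label them with the column name.
pieces <- lapply(flags, function(v) {
  hit <- test[test[[v]] == "Y", c("age", "sex")]
  hit$var <- rep(v, nrow(hit))
  hit
})

# Stack the pieces: a person with several "Y" flags appears once per flag.
vars <- do.call(rbind, pieces)

# Quick check of per-level counts and mean age using base R only.
aggregate(age ~ var, data = vars,
          FUN = function(x) c(N = length(x), mean = mean(x)))
```

Because a person can appear in more than one piece, `nrow(vars)` equals the total number of "Y" flags rather than the number of people.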

Then, the summary statistics:

    summary.formula(age~var, data=vars)
    # age    N=147
    # 
    # +-------+----+---+--------+
    # |       |    |N  |age     |
    # +-------+----+---+--------+
    # |var    |var1| 47|48.91983|
    # |       |var2| 43|46.31811|
    # |       |var3| 57|49.35292|
    # +-------+----+---+--------+
    # |Overall|    |147|48.32672|
    # +-------+----+---+--------+

As you can see, the Overall sections of the two summaries do not match, as they come from two different data sets. (And it is not possible to combine them the way you are asking.)