3
votes

I have a dataset that has the column names Gender, IQ, and Brain_Mass. Only Gender is a categorical variable of course, so I assigned it a dummy variable by setting it as gender=factor(Gender).

However, I want to find the covariance matrix and the correlation matrix. I know that I can use cov2cor(V) to get the correlation matrix from a covariance matrix V, but how do I get the covariance matrix from this data? I don't think I can just take var(data), since a dummy variable exists.

I would really appreciate it if anyone could help out. Thanks.

5
For the sake of future observers, I do not believe using cov (even with method = "spearman") is correct. Spearman correlation is for continuous and ordinal data, and gender doesn't belong in either of these! - Suren

5 Answers

2
votes

If you have a legitimate reason for calculating the correlation matrix on a combination of continuous and categorical data (such as needing it for input into another function), then one approach is to use the model.matrix function to convert the factors to their dummy variable encoding, then pass the result to the cor or other function for calculating the correlations or covariances:

> cor(model.matrix(~.-1,data=iris[,3:5]))
                  Petal.Length Petal.Width Speciessetosa Speciesversicolor Speciesvirginica
Petal.Length         1.0000000   0.9628654    -0.9227654         0.2017545        0.7210109
Petal.Width          0.9628654   1.0000000    -0.8873437         0.1178988        0.7694449
Speciessetosa       -0.9227654  -0.8873437     1.0000000        -0.5000000       -0.5000000
Speciesversicolor    0.2017545   0.1178988    -0.5000000         1.0000000       -0.5000000
Speciesvirginica     0.7210109   0.7694449    -0.5000000        -0.5000000        1.0000000
> 
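The same idea carries over to the covariance matrix the question asks about. The following is only a sketch on simulated data: the data frame `dat` and its values are made up to mirror the question's column names, and whether the resulting numbers are meaningful for a nominal variable is a separate matter.

```r
# Simulated stand-in for the question's data (values are arbitrary)
set.seed(42)
dat <- data.frame(
  Gender     = factor(sample(c("female", "male"), 50, replace = TRUE)),
  IQ         = rnorm(50, mean = 100, sd = 10),
  Brain_Mass = rnorm(50, mean = 5000, sd = 500)
)

# Expand the factor into 0/1 indicator columns; "- 1" drops the intercept,
# so every level of Gender gets its own column
X <- model.matrix(~ . - 1, data = dat)

V <- cov(X)       # covariance matrix, indicator columns included
R <- cov2cor(V)   # the same matrix rescaled to correlations

all.equal(R, cor(X))  # cov2cor(V) and cor(X) should agree
```

Because the two indicator columns for a two-level factor always sum to 1, they are perfectly negatively correlated; dropping one of them (for example, by keeping the intercept and removing its column afterwards) avoids the redundancy.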
1
votes

I don't see why you would want to include a factor variable in the calculation of correlation. I would recommend removing that variable and only calculating cor for the smaller data.frame:

set.seed(1)
n <- 100
df <- data.frame(Gender=sample(c("male", "female"), n, replace=TRUE),
                 IQ=rnorm(n, mean=100, sd=10),
                 Brain_Mass=rnorm(n, mean=5000, sd=500))
head(df)

COV <- cov(df[,2:3])
COR <- cor(df[,2:3])
COV; COR

You could technically convert Gender to numeric and then do the same:

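# Go through factor() first: as.numeric() on a character column returns NA,
# and data.frame() no longer creates factors by default since R 4.0
df$Gender <- as.numeric(factor(df$Gender))
cor(df)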
1
votes

Even though there is nothing (technical) preventing you from computing Pearson's or Spearman's correlations between continuous and dichotomous variables, I would also take a look at what is called "point-biserial correlation", a rather exotic name for what is in fact Pearson's correlation computed with the dichotomous variable coded as 0/1!

There is an R package for that ;)
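No package is strictly needed to see the connection. As a minimal sketch on simulated data (the variables `gender` and `iq` are made up for illustration), the textbook point-biserial formula and `cor()` on a 0/1-coded variable give the same number:

```r
set.seed(1)
n <- 100
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
iq     <- rnorm(n, mean = 100, sd = 10)

# Code the dichotomous variable as 0/1 and use ordinary Pearson correlation
g <- as.numeric(gender == "male")
r_pearson <- cor(iq, g)

# Textbook point-biserial formula: (M1 - M0) / s * sqrt(p * (1 - p)),
# where s is the population SD (dividing by n, not n - 1)
m1 <- mean(iq[g == 1])
m0 <- mean(iq[g == 0])
p  <- mean(g)
s_pop <- sqrt(mean((iq - mean(iq))^2))
r_pb  <- (m1 - m0) / s_pop * sqrt(p * (1 - p))

all.equal(r_pearson, r_pb)
```

Swapping which level is coded 1 only flips the sign of the result.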

0
votes

It is not ideal to use the same correlation (or covariance) calculation for both categorical and continuous data. You should use Pearson's correlation for continuous data and Spearman's correlation for categorical data. These two methods might produce similar results in some cases.

For covariance, try:

cov(data_set,method='spearman') 

or

cov(data_set,method='pearson') #this is the default 

depending on the method you want to choose according to your data type.

For the correlation replace the cov() function with cor().

The factor variable you have needs to be converted to numeric beforehand:

gender <- as.numeric(gender)

UPDATE:

To make sure you are calculating the correlations consistently, I believe you should convert all of your variables to one type, i.e. all continuous or all categorical. The typical way is to bin your continuous data into categories (yes, you lose some information, but in general you will get what you want) and then use the Spearman correlation/covariance matrix. This way at least your calculations are consistent and you can do everything in one go using cov() or cor().
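A rough sketch of the binning approach described above, on simulated data. The data frame `dat`, the helper `to_quartile`, and the choice of four quantile-based bins are all assumptions made for illustration, not something prescribed by this answer:

```r
set.seed(1)
n <- 100
dat <- data.frame(
  Gender     = factor(sample(c("female", "male"), n, replace = TRUE)),
  IQ         = rnorm(n, mean = 100, sd = 10),
  Brain_Mass = rnorm(n, mean = 5000, sd = 500)
)

# Hypothetical helper: bin a continuous vector into quartile groups 1..4
to_quartile <- function(x) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
      include.lowest = TRUE, labels = FALSE)
}

binned <- data.frame(
  Gender     = as.numeric(dat$Gender),   # integer level codes (1 or 2)
  IQ         = to_quartile(dat$IQ),
  Brain_Mass = to_quartile(dat$Brain_Mass)
)

cov(binned, method = "spearman")
cor(binned, method = "spearman")
```

Note that the Gender codes here carry no natural order, so only the magnitude of its entries is interpretable; the sign depends on which level happens to be coded 1.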

0
votes

Use the hetcor function of the polycor package. It uses the appropriate method for each pair of variables according to their types (continuous or categorical). Works like a charm!

https://www.rdocumentation.org/packages/polycor/versions/0.7-10/topics/hetcor

Below is the output of the example code from the documentation page linked above.

Correlations/Type of Correlation:
       x1      x2         y1         y2
x1      1 Pearson Polyserial Polyserial
x2 0.5577       1 Polyserial Polyserial
y1 0.5537  0.7484          1 Polychoric
y2 0.6301  0.6274     0.6052          1
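Applied to data shaped like the question's, a call might look like the sketch below. The data frame `dat` is simulated for illustration, and the block assumes the polycor package is installed (it does nothing otherwise):

```r
if (requireNamespace("polycor", quietly = TRUE)) {
  set.seed(1)
  n <- 200
  dat <- data.frame(
    Gender     = factor(sample(c("female", "male"), n, replace = TRUE)),
    IQ         = rnorm(n, mean = 100, sd = 10),
    Brain_Mass = rnorm(n, mean = 5000, sd = 500)
  )

  # hetcor() picks the correlation type per pair of columns:
  # Pearson for numeric/numeric, polyserial for numeric/factor
  h <- polycor::hetcor(dat)
  print(h)

  h$correlations   # the correlation matrix itself
  h$type           # which kind of correlation was used for each pair
}
```

hetcor returns correlations only; it does not produce a covariance matrix for the mixed data.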