I am hoping someone can help me understand correlation matrices. Specifically, I am wondering about the output and why it comes out the way it does.

My intent is to understand the correlation between two categorical (unordered nominal) variables. The data (below) are cleaned by converting the nominal variables to factors before using other methods to get counts.

For example, I created a correlation matrix in R utilizing dummy data:

set.seed(1234)
randomCities<-c("Washington","Boston","Seattle","Portland","Oakland","Dallas","Miami")
randomYachts<-c("BigOl Yacht","Notsobig Yacht","Fancy Yacht","SuperFancy Yacht")
randomYears<-c(2019,2017,2016,2015,2018)
randomQuarters<-c(1,2,3,4)

dat1<-data.frame(city=sample(randomCities,400,replace = T),
                 yachts=sample(randomYachts,400,replace = T),
                 year = sample(randomYears,400,replace=T),
                 qtr = sample(randomQuarters,400,replace = T),
                 stringsAsFactors = F)

I then subset the data, converting the variables I want to examine to factors:

#store the vars as factors
fac.Yachts<-as.factor(dat1$yachts)
fac.City<-as.factor(dat1$city)

Using the gmodels package, I created a contingency table:

#Create contingency table
library(gmodels)
joint = CrossTable(fac.City,fac.Yachts,prop.chisq = F)
joint_counts = joint$t

joint_counts
            y
x            BigOl Yacht Fancy Yacht Notsobig Yacht SuperFancy Yacht
  Boston              19          12             10               17
  Dallas              12          18             15               11
  Miami               16          16             11               12
  Oakland              6          12             11               20
  Portland            14          16             14               20
  Seattle             12          19              9               24
  Washington          13          15             16               10

Lastly, I created a correlation matrix using the cor() function and the Hmisc package:

cor1<-cor(joint_counts)

#cor() function

>cor(joint_counts)
                  BigOl Yacht  Fancy Yacht Notsobig Yacht SuperFancy Yacht
BigOl Yacht       1.000000000 -0.006586363    -0.09691724      -0.25682171
Fancy Yacht      -0.006586363  1.000000000     0.14098436       0.01312562
Notsobig Yacht   -0.096917240  0.140984364     1.00000000      -0.66337471
SuperFancy Yacht -0.256821708  0.013125623    -0.66337471       1.00000000


#Output from Hmisc
library(Hmisc)
res2<-rcorr(as.matrix(joint_counts))
>res2$r
                  BigOl Yacht  Fancy Yacht Notsobig Yacht SuperFancy Yacht
BigOl Yacht       1.000000000 -0.006586363    -0.09691724      -0.25682171
Fancy Yacht      -0.006586363  1.000000000     0.14098436       0.01312562
Notsobig Yacht   -0.096917240  0.140984364     1.00000000      -0.66337471
SuperFancy Yacht -0.256821708  0.013125623    -0.66337471       1.00000000

Now, my question is: why do correlation matrices result in this output? My intent is to see how Yacht may be related to City, but the matrix seems to tell me only how the levels of Yacht are correlated with each other.

Note: Using the joint object returned by CrossTable() (shown below), I can somewhat get this information; however, when I create a correlation matrix from it, I seem to get only the relationships among the yacht types. Am I just reading correlation matrices wrong?

joint = CrossTable(fac.City,fac.Yachts,prop.chisq = F)
$prop.row
            y
x            BigOl Yacht Fancy Yacht Notsobig Yacht SuperFancy Yacht
  Boston       0.3275862   0.2068966      0.1724138        0.2931034
  Dallas       0.2142857   0.3214286      0.2678571        0.1964286
  Miami        0.2909091   0.2909091      0.2000000        0.2181818
  Oakland      0.1224490   0.2448980      0.2244898        0.4081633
  Portland     0.2187500   0.2500000      0.2187500        0.3125000
  Seattle      0.1875000   0.2968750      0.1406250        0.3750000
  Washington   0.2407407   0.2777778      0.2962963        0.1851852

Your question is confusing ... "Yacht" is not a variable in your data. You seem to have some sort of idea of Yacht as an ordered factor, but Pearson's r is based on two interval variables. Potentially you could use Spearman's rho, but you would have to create one variable. - Elin
Also, you could probably treat your joint distribution as a Freq format and then use code to change it to individual observations. - Elin
Yachts refers to the variable yachts from dat1. You can see it was converted to an unordered factor above, as was city. - OctoCatKnows
A correlation matrix is not appropriate for this. I'm not saying this to be harsh, but you may be in over your head in understanding how factors are introduced to linear models. Perhaps just stick with the contingency table and calculate some of the measures of strength for that, as mentioned below. - Elin
Not harsh at all! It was more 'exploratory', in reference to this: r-bloggers.com/… The data structure is the same (nominal), and I was wondering whether corrplot could be used as with traditional data or whether, as in the blog post, Goodman and Kruskal's tau measure would be better. - OctoCatKnows

1 Answer

Correlation is only meaningful for quantitative variables. Your code computes the correlations between the numbers of yachts of each type, i.e., the correlation between the columns of the frequency matrix.
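
To make that concrete, here is a minimal sketch. It assumes the counts for the first three yacht columns of the question's table, typed in by hand, and the object names (counts, col_cor, pairwise) are only illustrative. It suggests that cor() on a matrix treats each row (city) as an observation and each column (yacht type) as a variable, so city never enters as a variable itself.

```r
# Counts copied by hand from the first three yacht columns of the question's table
# (rows are cities, columns are yacht types).
counts <- matrix(
  c(19, 12, 10,
    12, 18, 15,
    16, 16, 11,
     6, 12, 11,
    14, 16, 14,
    12, 19,  9,
    13, 15, 16),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    city  = c("Boston", "Dallas", "Miami", "Oakland",
              "Portland", "Seattle", "Washington"),
    yacht = c("BigOl Yacht", "Fancy Yacht", "Notsobig Yacht")
  )
)

# cor() on a matrix: one variable per column, one observation per row
col_cor <- cor(counts)

# The same quantity computed directly from two column vectors of counts
pairwise <- cor(counts[, "BigOl Yacht"], counts[, "Fancy Yacht"])

# TRUE: the matrix entry is just the Pearson correlation of two count columns
all.equal(col_cor["BigOl Yacht", "Fancy Yacht"], pairwise)
```

In other words, each entry answers "across the seven cities, do the counts of one yacht type rise and fall with the counts of another?", which is a different question from whether yacht type is associated with city.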

There are analogues of correlation for qualitative variables: Cramer's V, Phi, etc.

library(DescTools) 
counts <- table(dat1[,1:2])
CramerV(counts)  # 0.15
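
If it helps to see what the statistic measures, the following is a rough base-R sketch of the textbook formula V = sqrt(chi^2 / (n * (min(rows, cols) - 1))), without any continuity or bias correction. The helper name cramers_v and the small simulated data are assumptions for illustration only; this is not a claim about how DescTools computes it internally, and it assumes no row or column of the table is entirely zero.

```r
# Cramer's V from the Pearson chi-square statistic (illustrative helper, base R only).
cramers_v <- function(tab) {
  n <- sum(tab)
  # Expected counts under independence: row total * column total / n
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi2 <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}

# Hypothetical dummy data in the spirit of the question: two independent nominal variables
set.seed(1234)
city  <- sample(c("Washington", "Boston", "Seattle"), 300, replace = TRUE)
yacht <- sample(c("BigOl Yacht", "Fancy Yacht"), 300, replace = TRUE)

# Independent draws should give a value close to 0
v_independent <- cramers_v(table(city, yacht))

# A variable cross-tabulated with itself is perfectly associated, so V is 1
v_perfect <- cramers_v(table(city, city))
```

V ranges from 0 (no association) to 1 (perfect association) and, unlike Pearson's r, does not require the categories to have any order.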