I have a dataset in which some observations are highly correlated. I am running a cluster analysis on a distance matrix derived from the correlation matrix. Some elements of this dataset are redundant, and I want to select representative elements with minimal mutual correlation. A simple approach is to pick one element from each cluster, but I would like to know whether there are more formal methods for this kind of dimensionality reduction in R. For example, I cluster the mtcars dataset as follows:

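> m <- cor(t(mtcars))                       # correlations between cars (rows of mtcars)
> hc <- hclust(as.dist(1 - m), "average")   # use 1 - r so that correlated cars are close
> plot(hc)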

This produces the following dendrogram:

[Image: dendrogram of hc]

How can I extract the essential elements from this dendrogram, that is, the elements that are minimally correlated with one another?

1 Answer

One option is to use the pre-processing functions in the caret package.

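Using your example, the code below uses findCorrelation to flag the columns that should be removed to reduce pairwise absolute correlations above the 0.95 cutoff, and then drops them. Note that it removes one column from each highly correlated pair, not both.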

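library(caret)
m <- cor(t(mtcars))                             # correlations between cars
highlyCor <- findCorrelation(m, cutoff = 0.95)  # indices of columns to drop
# Guard: x[, -integer(0)] would drop every column when nothing is flagged
if (length(highlyCor) > 0) t(mtcars)[, -highlyCor] else t(mtcars)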

The code above is adapted from Max Kuhn's excellent book. See it and the caret documentation for more background.
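
If you would rather stay with the dendrogram, the "one element per cluster" idea from the question can be sketched in base R as follows. This is only an illustration under a few assumptions that are not in the original post: the number of clusters `k = 4` is arbitrary, the distance is taken as `1 - correlation`, and `pick_representative` is a hypothetical helper that keeps the most central member of each cluster. It does not guarantee that the selected elements are minimally correlated, so it is worth inspecting the resulting correlation sub-matrix.

```r
# Correlations between cars (rows of mtcars), as in the question
m <- cor(t(mtcars))

# Convert correlation to a dissimilarity so that correlated cars are close
hc <- hclust(as.dist(1 - m), method = "average")

# Cut the tree into k groups (k = 4 is an arbitrary choice for illustration)
k <- 4
groups <- cutree(hc, k = k)

# Hypothetical helper: within one cluster, keep the member with the highest
# mean correlation to the members of that cluster (a medoid-like choice)
pick_representative <- function(members) {
  sub <- m[members, members, drop = FALSE]
  members[which.max(rowMeans(sub))]
}

# One representative car per cluster
representatives <- unname(sapply(split(names(groups), groups), pick_representative))
representatives

# Pairwise correlations among the selected cars, to check how redundant they still are
round(m[representatives, representatives], 3)
```

Choosing `k` (or a cut height via `cutree(hc, h = ...)`) controls how aggressive the reduction is: fewer clusters means fewer, less similar representatives.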