
Reproducible data:

Data <- data.frame(
    X = sample(c(0,1), 10, replace = TRUE),
    Y = sample(c(0,1), 10, replace = TRUE),
    Z = sample(c(0,1), 10, replace = TRUE)
)

Convert the data frame to a matrix:

Matrix_from_Data <- data.matrix(Data)

Check the structure:

str(Matrix_from_Data)

num [1:10, 1:3] 1 0 0 1 0 1 0 1 1 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:3] "X" "Y" "Z"

The question: I have a data frame of binary, symmetric variables (larger than this example), and I'd like to do some hierarchical clustering, which I've never tried before. There are no missing or NA values.

I convert the data frame into a matrix before running the daisy function from the 'cluster' package to get a dissimilarity matrix. I'd like to explore the options for calculating different dissimilarity metrics, but I'm running into a warning (not an error):

library(cluster)
Dissim_Euc_Matrix_from_Data <- daisy(Matrix_from_Data, metric = "euclidean",
                                     type = list(symm = c(1:ncol(Matrix_from_Data))))

Warning message:
In daisy(Matrix_from_Data, metric = "euclidean", type = list(symm = c(1:ncol(Matrix_from_Data)))) :
  with mixed variables, metric "gower" is used automatically

...which seems weird to me, since Matrix_from_Data contains only numeric variables, not mixed variables. Gower might be a fine metric, but I'd like to see how the others affect the clustering. What am I missing?


1 Answer


Great question.

First, that message is a warning and not an error. I'm not personally familiar with daisy, but reading ?daisy suggests an explanation: once you declare columns as symm via the type argument, daisy no longer treats them as interval-scaled, so it considers the input "mixed" and falls back to Gower's coefficient regardless of the metric you requested, which is what the warning is telling you.
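As a quick check of that guess (based on my reading of ?daisy, not verified against your actual data), dropping the type declaration entirely should let daisy honor the requested metric on all-numeric input:

library(cluster)

# With all-numeric input and no 'type' declaration, daisy() should
# treat every column as interval-scaled and honor the requested
# metric, with no Gower warning (my reading of ?daisy)
d_euc <- daisy(Matrix_from_Data, metric = "euclidean")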

Regardless of why that warning appears, one simple way to compare the clustering produced by different distance measures in hierarchical clustering is to plot the dendrograms. For simplicity, let's compare the "euclidean" and "binary" distance metrics built into dist. You can use ?dist to read up on what the "binary" distance means here.

# When generating random data, always set a seed if you want your data to be reproducible
set.seed(1)
Data <- data.frame(
  X = sample(c(0,1), 10, replace = TRUE),
  Y = sample(c(0,1), 10, replace = TRUE),
  Z = sample(c(0,1), 10, replace = TRUE)
)

# Create distance matrices
mat_euc <- dist(Data, method="euclidean")
mat_bin <- dist(Data, method="binary")

# Plot the dendrograms side-by-side
par(mfrow=c(1,2))
plot(hclust(mat_euc))
plot(hclust(mat_bin))

[Plot: side-by-side dendrograms, Euclidean distance (left) and binary distance (right)]

I generally read dendrograms from the bottom up, since points lower on the vertical axis are more similar (i.e. less distant) to one another than points higher up.

We can pick up a few things from these plots:

  • 4/6, 5/10, and 7/8 are grouped together using both metrics. We should hope this is true if the rows are identical :)
  • 3 is most strongly associated with 7/8 for both distance metrics, although the degree of association is a bit stronger in the binary distance as opposed to the Euclidean distance.
  • 1, 2, and 9 have notably different relationships under the two distance metrics (e.g. 1 is most strongly associated with 2 in Euclidean distance but with 9 in binary distance). It is in situations like this that the choice of distance metric can have a significant impact on the resulting clusters. At that point it pays to go back to your data and understand why the metrics disagree for these three points; the sketch after this list shows one way to compare the groupings numerically.
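If you'd rather compare the groupings numerically than by eye, one option is to cut each tree into the same number of clusters and cross-tabulate the assignments (k = 3 is an arbitrary choice here, just for illustration):

# Cut each dendrogram into 3 clusters and cross-tabulate the
# assignments to see where the two metrics agree or disagree
clusters_euc <- cutree(hclust(mat_euc), k = 3)
clusters_bin <- cutree(hclust(mat_bin), k = 3)
table(Euclidean = clusters_euc, Binary = clusters_bin)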

Also remember that there are different methods of hierarchical clustering (e.g. complete linkage and single linkage), but you can use this same approach to compare the differences between methods as well. See ?hclust for a complete list of methods provided by hclust.
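For example, holding the distance fixed and varying only the linkage method might look like this (same plotting idea as above):

# Same comparison approach, but varying the linkage method instead
# of the distance metric (both trees use Euclidean distance here)
par(mfrow=c(1,2))
plot(hclust(mat_euc, method = "complete"), main = "Complete linkage")
plot(hclust(mat_euc, method = "single"), main = "Single linkage")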

Hope that helps!