Mahalanobis distance with multiple observations per group

Question

I'd like to calculate the Mahalanobis distance among groups of species where:

i) there are more than two groups (more than two species).
ii) there are multiple variables (features of such species) to be taken into account.
iii) there are multiple observations per group (in the data frame, this means there is more than one row per species).

I am trying to understand how to run the mahalanobis function in R under these conditions. This question is similar to:

Mahalanobis distance on R for more than 2 groups

but there, only one variable was used. How can it be done with more than one variable?

Below is an example that I believe is representative of my actual data.

Sp.  X1   X2    X3
A    0.7  11    215
B    0.8  7     214
B    0.8  6.5   187
C    0.3  4     456
D    0.4  3     111
A    0.1  7     205
A    0.2  7     196
C    0.1  9.3   77
D    0.6  8     135
D    0.8  4     167
B    0.4  6     228
C    0.1  5     214
A    0.4  7     156
C    0.5  2     344

Sp. = species; X1, X2 and X3 are observed variables.

In the real dataset, there are more than 50 species, and the number of observations varies among them (from 100 to 1000 rows per species).

Andre Silva · Accepted Answer · 2015-04-13T20:19:47

This is what I've got, using the pairwise.mahalanobis function from the HDMD package:

#data
a = structure(list(Sp = structure(c(1L, 2L, 2L, 3L, 4L, 1L, 1L, 3L,4L, 4L, 2L, 3L, 1L, 3L), .Label = c("A", "B", "C", "D"), class = "factor"), 
                   X1 = c(0.7, 0.8, 0.8, 0.3, 0.4, 0.1, 0.2, 0.1, 0.6, 0.8,0.4, 0.1, 0.4, 0.5), 
                   X2 = c(11, 7, 6.5, 4, 3, 7, 7, 9.3,8, 4, 6, 5, 7, 2), 
                   X3 = c(215L, 214L, 187L, 456L, 111L, 205L,196L, 77L, 135L, 167L, 228L, 214L, 156L, 344L)),
              .Names = c("Sp","X1", "X2", "X3"), 
              row.names = c(NA, -14L),
              class = "data.frame")

library(HDMD) #pairwise.mahalanobis function
library(cluster) #agnes function

group = matrix(a$Sp) #what is being compared
group = t(group[,1]) #prepare for pairwise.mahalanobis function

variables = c("X1","X2","X3") #variables (what is being used for comparison)
variables = as.matrix(a[,variables]) #prepare for pairwise.mahalanobis function

mahala_sq = pairwise.mahalanobis(x=variables, grouping=group) #get squared Mahalanobis distances (see mahala_sq$distance)
sp_labels = rownames(mahala_sq$means) #capture group labels (named sp_labels to avoid masking base::names)

mahala = sqrt(mahala_sq$distance) #Mahalanobis distance
rownames(mahala) = sp_labels #set rownames in the dissimilarity matrix
colnames(mahala) = sp_labels #set colnames in the dissimilarity matrix

mahala #this is the mahalanobis dissimilarity matrix 

         A        B         C         D
A  0.00000 17.78689  86.83294  62.65437
B 17.78689  0.00000  69.07937  80.31577
C 86.83294 69.07937   0.00000 149.36579
D 62.65437 80.31577 149.36579   0.00000

#This is how I used the dissimilarity matrix to find clusters.
cluster = agnes(mahala,diss=TRUE,keep.diss=FALSE,method="complete") #hierarchical clustering
plot(cluster,which.plots=2) #plot dendrogram
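
As a cross-check, here is a minimal base-R sketch of the same idea: pairwise Mahalanobis distances between group means, scaled by the pooled within-group covariance. The helper name pairwise_mahal_pooled is hypothetical (it is not part of HDMD), and using the pooled within-group covariance is an assumption on my part. Its numbers need not match the matrix above, because they depend on which covariance matrix is used for scaling. Note that the distances above are of the same order as the raw differences in mean X3 between groups (C and D differ by about 135), so it may be worth checking which matrix the default call uses.

```r
# Example data from the question
a <- data.frame(
  Sp = c("A", "B", "B", "C", "D", "A", "A", "C", "D", "D", "B", "C", "A", "C"),
  X1 = c(0.7, 0.8, 0.8, 0.3, 0.4, 0.1, 0.2, 0.1, 0.6, 0.8, 0.4, 0.1, 0.4, 0.5),
  X2 = c(11, 7, 6.5, 4, 3, 7, 7, 9.3, 8, 4, 6, 5, 7, 2),
  X3 = c(215, 214, 187, 456, 111, 205, 196, 77, 135, 167, 228, 214, 156, 344)
)

# Hypothetical helper (not part of HDMD): Mahalanobis distances between
# group means, assuming a pooled within-group covariance matrix.
pairwise_mahal_pooled <- function(x, grouping) {
  x <- as.matrix(x)
  g <- factor(grouping)
  lev <- levels(g)

  # Group means: one row per group, one column per variable
  means <- rowsum(x, g) / as.vector(table(g))

  # Pooled within-group covariance: within-group cross-products / (n - k)
  centered <- x - means[as.character(g), , drop = FALSE]
  pooled <- crossprod(centered) / (nrow(x) - length(lev))

  # stats::mahalanobis() returns squared distances, hence the sqrt()
  d <- sapply(lev, function(l) sqrt(mahalanobis(means, means[l, ], pooled)))
  dimnames(d) <- list(lev, lev)
  d
}

mahala_pooled <- pairwise_mahal_pooled(a[, c("X1", "X2", "X3")], a$Sp)
mahala_pooled # symmetric dissimilarity matrix with a zero diagonal
```

The result can be passed to agnes(mahala_pooled, diss=TRUE, ...) in the same way as mahala above. With more than 50 species and 100 to 1000 rows each, the pooled covariance is estimated from many observations, so it should be better conditioned than in this 14-row toy example.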
