1
votes

I'm performing hierarchical cluster analysis using Ward's method on a dataset containing 1000 observations and 37 variables (all are 5-point likert-scales).

First, I ran the analysis in SPSS via

CLUSTER  Var01 to Var37
/METHOD WARD
/MEASURE=SEUCLID
/ID=ID
/PRINT CLUSTER(2,10) SCHEDULE
/PLOT DENDROGRAM
/SAVE CLUSTER(2,10).
    
FREQUENCIES CLU2_1.

I additionaly performed the analysis in R:

datA <- subset(dat, select = Var01:Var37)
dist <- dist(datA, method = "euclidean")
hc   <- hclust(d = dist, method = "ward.D2")
table(cutree(hc, k = 2))

The resulting cluster sizes are:

      1    2
SPSS  712  288
R     610  390

These results are obviously confusing to me, as they differ substentially (which becomes highly visible when observing the dendrograms; also applies for the 3-10 clusters solutions). "ward.D2" takes into account the squared distance, if I'm not mistaken, so I included the simple distance matrix here. However, I tried several (combinations) of distance and clustering methods, e.g. EUCLID instead of SEUCLID, squaring the distance matrix in R, applying "ward.D" method,.... I also looked at the distance matrices generated by SPSS and R, which are identical (when applying the same method). Ultimately, I excluded duplicate cases (N=29) from my data, guessing that those might have caused differences when being allocated (randomly) at a certain point. All this did not result in matching outputs in R and SPSS.

I tried running the analysis with the agnes() function from the cluster package, which resulted in - again - different results compared to SPSS and even hclust() (But that's a topic for another post, I guess).

Are the underlying clustering procedures that different between the programs/packages? Or did I overlook a crucial detail? Is there a "correct" procedure that replicates the results yielded in SPSS?

1
Perhaps you could try cross-validated: stats.stackexchange.comLuizZ

1 Answers

1
votes

If the distance matrices are identical and the merging methods are identical, the only thing that should create different outcomes is having tied distances handled differently in two algorithms. Tied distances might be present with the original full distance matrix, or might occur during the joining process. If one program searches the matrix and finds two or more distances tied at the minimum value at that step, and it selects the first one, while another program selects the last one, or one or both select one at random from among the ties, different results could occur.

I'd suggest starting with a small example with some data with randomness added to values to make tied distances unlikely and see if the two programs produce matching results on those data. If not, there's a deeper problem. If so, then tie handling might be the issue.