CLARA with Gower for mixed data type

Question

I have a fairly large dataset (11.4 million records and 9 variables). The variables are of mixed types: ordinal, nominal, and continuous. I therefore chose the Gower method to compute a dissimilarity matrix for the mixed data. However, the dataset is too large for that computation. I then found another interesting method called CLARA, which clusters a sample and then assigns the remaining data points to the resulting clusters. The problem is that I cannot find an appropriate metric for computing distances on mixed data types. In other words, there is no Gower option in either clara in the cluster package or clara_medoids in the ClusterR package (these are the only CLARA implementations I can find in R).

Why is there no Gower option in CLARA? What should I do?

Seymour Seymour · Accepted Answer · 2018-05-28T09:54:25

CLARA is described in Kaufman and Rousseeuw (1990).

The key characteristic of this algorithm is that it can handle much larger datasets, because its memory and computation requirements grow linearly with the number of observations.

Gower distance requires a full dissimilarity matrix, whose memory complexity is quadratic, O(n^2), not linear. For your data that means a matrix with 11.4 million rows and 11.4 million columns, which is clearly not feasible.
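To put a rough number on that matrix size, here is a small back-of-the-envelope sketch. It assumes 8 bytes per stored dissimilarity (a double); the function name is only illustrative and does not come from any package.

```python
def dissimilarity_memory_bytes(n, bytes_per_value=8, lower_triangle_only=True):
    """Bytes needed to store the pairwise dissimilarities between n records."""
    # A symmetric matrix only needs its lower triangle: n * (n - 1) / 2 values.
    entries = n * (n - 1) // 2 if lower_triangle_only else n * n
    return entries * bytes_per_value


n = 11_400_000
full = dissimilarity_memory_bytes(n, lower_triangle_only=False)
tri = dissimilarity_memory_bytes(n)
print(f"full matrix:    {full / 1e12:,.0f} TB")  # about 1,040 TB
print(f"lower triangle: {tri / 1e12:,.0f} TB")   # about 520 TB
```

Even storing only the lower triangle, the result is hundreds of terabytes, which is why the full Gower matrix is out of reach at this scale.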

If you want to use Gower's distance, you should try working on smaller subsamples and use a bottom-up (agglomerative) clustering approach.
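As a rough illustration of the subsampling idea, the sketch below applies a CLARA-style procedure with a Gower dissimilarity: it clusters small samples, then assigns every record to its nearest medoid, so no n-by-n matrix is ever built. It is written in plain Python rather than R, and it is not what clara or clara_medoids do internally. All function names are hypothetical. Two simplifications are assumptions of this sketch: ordinal variables are coded as integer ranks and treated like range-scaled numeric variables, and a simple alternating k-medoids heuristic with farthest-first initialisation stands in for PAM.

```python
import random


def column_ranges(data, kinds):
    """One pass over the data: range of each non-nominal column."""
    ranges = []
    for j, kind in enumerate(kinds):
        if kind == "nominal":
            ranges.append(None)
        else:
            col = [row[j] for row in data]
            ranges.append(max(col) - min(col))
    return ranges


def gower(a, b, kinds, ranges):
    """Gower dissimilarity: mean of per-variable dissimilarities in [0, 1]."""
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "nominal":
            total += 0.0 if x == y else 1.0
        else:  # numeric, or ordinal coded as integer ranks (assumption)
            total += abs(x - y) / rng if rng else 0.0
    return total / len(kinds)


def k_medoids(points, k, dist, rng, max_iter=20):
    """Alternating k-medoids heuristic (a stand-in for PAM, not PAM itself)."""
    # Farthest-first initialisation: spread the starting medoids out.
    medoids = [rng.choice(points)]
    while len(medoids) < k:
        medoids.append(max(points, key=lambda p: min(dist(p, m) for m in medoids)))
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[nearest].append(p)
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, q) for q in members))
            if members else medoids[i]
            for i, members in enumerate(clusters)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


def clara_gower(data, kinds, k, n_samples=5, sample_size=40, seed=0):
    """Cluster several samples, keep the medoids with the lowest total cost."""
    rng = random.Random(seed)
    ranges = column_ranges(data, kinds)

    def dist(a, b):
        return gower(a, b, kinds, ranges)

    best_cost, best_medoids = float("inf"), None
    for _ in range(n_samples):
        sample = rng.sample(data, min(sample_size, len(data)))
        medoids = k_medoids(sample, k, dist, rng)
        # Cost over ALL records is linear in n: k distances per record.
        cost = sum(min(dist(p, m) for m in medoids) for p in data)
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    labels = [min(range(k), key=lambda i: dist(p, best_medoids[i])) for p in data]
    return best_medoids, labels


# Toy data (made up): (continuous value, ordinal rank 1-3, nominal label).
data = [(20.0 + i, 1, "red") for i in range(30)] + \
       [(90.0 + i, 3, "blue") for i in range(30)]
kinds = ["numeric", "ordinal", "nominal"]
medoids, labels = clara_gower(data, kinds, k=2, sample_size=20)
print(medoids)
```

Only sample_size squared distances are computed per sample, plus k distances per record for the assignment step, so the cost grows linearly with the number of records. The sample size and the number of samples are tuning choices; the values here are arbitrary.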
