0
votes

I have a fairly large dataset (11.4 million records and 9 variables). The variables are of mixed type: ordinal, nominal and continuous. I therefore chose Gower's method to compute the dissimilarity matrix, since it handles mixed data types. However, the data are too large for the full matrix to be computed. I then found another interesting method called CLARA, which clusters a sample and then assigns the remaining data points to the resulting clusters. The problem is that I cannot find an appropriate metric for computing distances on mixed data types. In other words, there is no Gower option in either clara in the cluster package or Clara_Medoids in the ClusterR package (these are the only CLARA implementations I could find in R).

Why is there no Gower option in CLARA? What should I do?


2 Answers

1
votes

CLARA is described in Kaufman and Rousseeuw (1990).

The key characteristic of this algorithm is that it can deal with much larger datasets, because both its memory and its computation requirements grow linearly with the number of observations.

Gower's distance, by contrast, produces a full dissimilarity matrix whose memory complexity is quadratic, O(n^2), which means that you would obtain a matrix with 11.4 million rows and 11.4 million columns. Clearly not feasible.
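To put a number on "not feasible", here is a rough back-of-the-envelope calculation. It assumes 8-byte doubles and that only the lower triangle of the symmetric matrix is stored. The function name and the 10,000-row subsample size are only illustrative.

```python
def pairwise_storage_bytes(n, bytes_per_value=8):
    """Bytes needed for the lower triangle of an n x n dissimilarity matrix."""
    return n * (n - 1) // 2 * bytes_per_value

full = pairwise_storage_bytes(11_400_000)   # the whole data set
sample = pairwise_storage_bytes(10_000)     # a hypothetical subsample

print(f"full data:            {full / 1e12:.0f} TB")
print(f"10,000-row subsample: {sample / 1e6:.0f} MB")
```

Under these assumptions the full matrix needs roughly 520 TB, while the subsample needs about 400 MB, which is why sampling-based methods are the practical route here.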

If you want to use Gower's distance, you should try working on smaller subsamples and use a bottom-up (agglomerative) clustering approach.

0
votes

Get the source code of CLARA.

Modify it, and add Gower distance.

Because Gower uses some data-dependent normalization factors (and you can't afford to precompute a distance matrix), you'll need to integrate this directly into CLARA.

Run the modified CLARA.

Make your source code publicly available as open source, so that others don't have to do the same. It will also make it easier to extend CLARA with further distance functions in the future.
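As a rough illustration of the idea above, here is a minimal pure-Python sketch, not the code of the cluster or ClusterR packages. Assumptions: all function names are made up for this example, ordinal variables are coded as numeric ranks, there are no missing values, and a naive greedy swap stands in for the real PAM build/swap phases. The normalization ranges are computed once in a single pass over the data, and distances are then computed on the fly, so no n x n matrix is ever stored.

```python
import random

def column_ranges(rows, kinds):
    """Single pass over the data: range (max - min) of every numeric column."""
    lo = [float("inf")] * len(kinds)
    hi = [float("-inf")] * len(kinds)
    for row in rows:
        for j, kind in enumerate(kinds):
            if kind == "num":
                lo[j] = min(lo[j], row[j])
                hi[j] = max(hi[j], row[j])
    return [h - l if kind == "num" else None for l, h, kind in zip(lo, hi, kinds)]

def gower(a, b, kinds, ranges):
    """Gower dissimilarity between two records (equal weights, no missing values)."""
    total = 0.0
    for x, y, kind, span in zip(a, b, kinds, ranges):
        if kind == "num":
            # Range-normalised absolute difference; constant columns contribute 0.
            total += abs(x - y) / span if span else 0.0
        else:
            # Nominal variable: simple mismatch.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)

def total_cost(points, medoids, dist):
    """Sum of each point's dissimilarity to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_swap(sample, k, dist, rng):
    """Naive k-medoids on a small sample: greedy swaps until nothing improves."""
    medoids = rng.sample(sample, k)
    best = total_cost(sample, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(sample, trial, dist)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara_gower(rows, kinds, k, n_samples=5, sample_size=40, seed=0):
    """CLARA-style loop: cluster several samples, keep the best medoids overall."""
    rng = random.Random(seed)
    ranges = column_ranges(rows, kinds)           # data-dependent normalisers
    dist = lambda a, b: gower(a, b, kinds, ranges)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(rows, min(sample_size, len(rows)))
        medoids = pam_swap(sample, k, dist, rng)
        cost = total_cost(rows, medoids, dist)    # evaluated on the full data
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # Assign every record to its nearest medoid.
    labels = [min(range(k), key=lambda i: dist(r, best_medoids[i])) for r in rows]
    return best_medoids, labels

# Toy data: (continuous, nominal, ordinal coded as rank)
rows = [(1.0, "a", 1), (1.2, "a", 1), (0.9, "a", 2),
        (9.0, "b", 5), (9.5, "b", 4), (8.7, "b", 5)]
kinds = ["num", "cat", "num"]
medoids, labels = clara_gower(rows, kinds, k=2)
print(medoids, labels)
```

Each sample's medoids are scored against the whole data set, so the cost per sample stays linear in the number of records. A real implementation would do this in compiled code and would also need to handle missing values and variable weights.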