I have been stuck on the same issue. For computing the distances you may want to use the Gower transformation. If your data are not continuous, you could use an overlap (simple matching) function instead, which I have not managed to find in R yet (this paper). Here is what I found for the computation problem:
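For the Gower part, the cluster package's daisy() computes Gower dissimilarities on mixed data. A minimal sketch with toy data of my own (not from the question):

```r
library(cluster)  # daisy() implements Gower's coefficient for mixed data

toy <- data.frame(
  num = c(1.2, 3.4, 0.5),               # numeric attribute
  bin = factor(c("yes", "no", "yes")),  # binary attribute
  cat = factor(c("a", "b", "a"))        # nominal attribute
)
d_gower <- daisy(toy, metric = "gower")  # dissimilarity object
round(as.matrix(d_gower), 3)
```

Rows 1 and 3 agree on both factors and are close numerically, so their Gower distance comes out much smaller than the distance between rows 1 and 2.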
To compute distances on a very large dataset, with too many N observations for the distance matrix to be computationally feasible, it is possible to apply the solution used in this recent paper (this one). They propose a smart way to proceed: build a new dataset in which each row is a distinct combination of values over the d attributes of the original dataset. This gives a new matrix with M < N observations, for which the distance matrix becomes computationally feasible. They "create a grid of all possible cases, with their corresponding distances (of each from each other) and used this grid to create our clusters, to which we subsequently assigned our observations".

I tried to reproduce that in R, making use of this answer and library(plyr). In the following I use just 4 observations, but it works with N observations, as long as the number of distinct combinations reduces the memory requirement enough.
id <- c(1,2,3,4)
a <- c(1,1,0,1)
b <- c(0,1,0,0)
c <- c(3,2,1,3)
d <- c(1,0,1,1)
Mydata <- data.frame(id, a, b, c, d)
Mydata
  id a b c d
1  1 1 0 3 1
2  2 1 1 2 0
3  3 0 0 1 1
4  4 1 0 3 1
require(plyr)
Mydata_grid <- count(Mydata[,-1])
Mydata_grid
  a b c d freq
1 0 0 1 1    1
2 1 0 3 1    2
3 1 1 2 0    1
where freq
is the frequency of each combination in the original Mydata
. Then I apply my preferred distance measure to the attribute columns of Mydata_grid
(note that the freq column must be excluded, or it will distort the distances). In this case the data are categorical, so I apply Jaccard (I am not sure it is the right choice for the data in the example; maybe a simple matching / overlap
function would fit better, but I have not found one in R yet)
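For completeness, such an overlap (simple matching) distance is easy to hand-roll: the distance between two rows is the proportion of attributes on which they disagree. A minimal sketch (smd is my own helper, not a library function):

```r
# Simple matching ("overlap") distance: proportion of mismatching attributes.
smd <- function(df) {
  x <- as.matrix(df)
  n <- nrow(x)
  m <- matrix(0, n, n)
  for (i in seq_len(n))
    for (j in seq_len(n))
      m[i, j] <- mean(x[i, ] != x[j, ])
  as.dist(m)
}

smd(data.frame(a = c(1, 1), b = c(0, 1), c = c(3, 2), d = c(1, 0)))
# rows disagree on b, c, d -> distance 3/4 = 0.75
```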
require(vegan)
dist_grid <- vegdist(Mydata_grid[, c("a", "b", "c", "d")], method="jaccard") # exclude freq
d_matrix <- as.matrix(dist_grid)
d_matrix
    1   2   3
1 0.0 0.6 0.8
2 0.6 0.0 0.5
3 0.8 0.5 0.0
which is our distance matrix. Now it is sufficient to cluster dist_grid directly
clusters_d <- hclust(dist_grid, method="ward.D2")
cluster <- cutree(clusters_d, k = 2) # k= number of clusters
cluster
1 2 3 
1 2 2 
which is the vector assigning each combination to a cluster. Now it is enough to go back to the original sample. First attach the cluster labels to the grid
Mydata_cluster <- cbind(Mydata_grid, cluster)
and then expand the grid to the original dimension using rep and the freq column
Mydata_cluster_full <- Mydata_cluster[rep(row.names(Mydata_cluster), Mydata_cluster$freq), ]
Mydata_cluster_full
    a b c d freq cluster
1   0 0 1 1    1       1
2   1 0 3 1    2       2
2.1 1 0 3 1    2       2
3   1 1 2 0    1       2
Note that the expanded rows are in grid order, not in the original row order, so you cannot simply attach the original id
vector and drop the freq
column. To map the clusters back to the original observations, join the labels to Mydata on the attribute columns instead (plyr's join preserves the row order of its first argument)
Mydata_final <- join(Mydata, cbind(Mydata_grid, cluster), by = c("a", "b", "c", "d"))
Mydata_final$freq <- NULL
Mydata_final
  id a b c d cluster
1  1 1 0 3 1       2
2  2 1 1 2 0       2
3  3 0 0 1 1       1
4  4 1 0 3 1       2
Unless you are unlucky (that is, unless almost every observation is a distinct combination), this process will reduce the amount of memory needed to compute your distance matrix to a feasible level.
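As a rough illustration of the savings (a toy simulation of my own, not from the paper): with 100,000 observations on 4 binary attributes there are at most 2^4 = 16 distinct combinations, so the grid distance matrix is at most 16 x 16 instead of 100,000 x 100,000:

```r
library(plyr)

set.seed(1)
N <- 100000
big <- data.frame(a = rbinom(N, 1, 0.5),
                  b = rbinom(N, 1, 0.5),
                  c = rbinom(N, 1, 0.5),
                  d = rbinom(N, 1, 0.5))

big_grid <- count(big)  # collapse to distinct combinations + freq
nrow(big_grid)          # at most 2^4 = 16, regardless of N
sum(big_grid$freq)      # the freq column still accounts for all N observations
```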