I have been stuck on the same issue. For computing the distances you may want to use the Gower transformation. If your data are not continuous, you could use an overlap (simple matching) function instead, which I have not managed to find in R yet (this paper). Here is what I found for the computation problem:
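For the Gower part, the cluster package's daisy() computes Gower dissimilarities on mixed data. A minimal sketch with toy data of my own (not from the question):

```r
library(cluster)  # daisy() implements Gower's coefficient for mixed data

toy <- data.frame(
  num = c(1.2, 3.4, 0.5),               # numeric attribute
  bin = factor(c("yes", "no", "yes")),  # binary attribute
  cat = factor(c("a", "b", "a"))        # nominal attribute
)
d_gower <- daisy(toy, metric = "gower")  # dissimilarity object
round(as.matrix(d_gower), 3)
```

Rows 1 and 3 agree on both factors and are close numerically, so their Gower distance comes out much smaller than the distance between rows 1 and 2.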
To compute distances on a very large dataset, with too many N observations for the distance matrix to be computationally feasible, it is possible to apply the solution used in this recent paper (this one). They propose a smart way to proceed: build a new dataset in which each row is a distinct combination of values over the d attributes of the original dataset. This gives a new matrix with M < N observations, for which the distance matrix becomes computationally feasible. They "create a grid of all possible cases, with their corresponding distances (of each from each other) and used this grid to create our clusters, to which we subsequently assigned our observations".

I tried to reproduce that in R, making use of this answer and library(plyr). In the following I use just 4 observations, but it works with N observations, as long as the number of distinct combinations reduces the memory requirement enough.
id <- c(1,2,3,4)
a <- c(1,1,0,1)
b <- c(0,1,0,0)
c <- c(3,2,1,3)
d <- c(1,0,1,1)
Mydata <- data.frame(id, a, b, c, d)
Mydata
  id a b c d
1  1 1 0 3 1
2  2 1 1 2 0
3  3 0 0 1 1
4  4 1 0 3 1
require(plyr)
Mydata_grid <- count(Mydata[,-1])
Mydata_grid
  a b c d freq
1 0 0 1 1    1
2 1 0 3 1    2
3 1 1 2 0    1
where freq
is the frequency of each combination in the original Mydata
. Then I apply my preferred distance measure to the attribute columns of Mydata_grid
(note that the freq column must be excluded, or it will distort the distances). In this case the data are categorical, so I apply Jaccard (I am not sure it is the right choice for the data in the example; maybe a simple matching / overlap
function would fit better, but I have not found one in R yet)
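For completeness, such an overlap (simple matching) distance is easy to hand-roll: the distance between two rows is the proportion of attributes on which they disagree. A minimal sketch (smd is my own helper, not a library function):

```r
# Simple matching ("overlap") distance: proportion of mismatching attributes.
smd <- function(df) {
  x <- as.matrix(df)
  n <- nrow(x)
  m <- matrix(0, n, n)
  for (i in seq_len(n))
    for (j in seq_len(n))
      m[i, j] <- mean(x[i, ] != x[j, ])
  as.dist(m)
}

smd(data.frame(a = c(1, 1), b = c(0, 1), c = c(3, 2), d = c(1, 0)))
# rows disagree on b, c, d -> distance 3/4 = 0.75
```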
require(vegan)
dist_grid <- vegdist(Mydata_grid[, c("a", "b", "c", "d")], method="jaccard") # exclude freq
d_matrix <- as.matrix(dist_grid)
d_matrix
    1   2   3
1 0.0 0.6 0.8
2 0.6 0.0 0.5
3 0.8 0.5 0.0
which is our distance matrix. Now it is sufficient to cluster dist_grid directly
clusters_d <- hclust(dist_grid, method="ward.D2")
cluster <- cutree(clusters_d, k = 2) # k= number of clusters
cluster
1 2 3 
1 2 2 
which is the vector assigning each combination to a cluster. Now it is enough to go back to the original sample. First attach the cluster labels to the grid
Mydata_cluster <- cbind(Mydata_grid, cluster)
and then expand the grid to the original dimension using rep and the freq column
Mydata_cluster_full <- Mydata_cluster[rep(row.names(Mydata_cluster), Mydata_cluster$freq), ]
Mydata_cluster_full
    a b c d freq cluster
1   0 0 1 1    1       1
2   1 0 3 1    2       2
2.1 1 0 3 1    2       2
3   1 1 2 0    1       2
Note that the expanded rows are in grid order, not in the original row order, so you cannot simply attach the original id
vector and drop the freq
column. To map the clusters back to the original observations, join the labels to Mydata on the attribute columns instead (plyr's join preserves the row order of its first argument)
Mydata_final <- join(Mydata, cbind(Mydata_grid, cluster), by = c("a", "b", "c", "d"))
Mydata_final$freq <- NULL
Mydata_final
  id a b c d cluster
1  1 1 0 3 1       2
2  2 1 1 2 0       2
3  3 0 0 1 1       1
4  4 1 0 3 1       2
Unless you are unlucky (that is, unless almost every observation is a distinct combination), this process will reduce the amount of memory needed to compute your distance matrix to a feasible level.
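As a rough illustration of the savings (a toy simulation of my own, not from the paper): with 100,000 observations on 4 binary attributes there are at most 2^4 = 16 distinct combinations, so the grid distance matrix is at most 16 x 16 instead of 100,000 x 100,000:

```r
library(plyr)

set.seed(1)
N <- 100000
big <- data.frame(a = rbinom(N, 1, 0.5),
                  b = rbinom(N, 1, 0.5),
                  c = rbinom(N, 1, 0.5),
                  d = rbinom(N, 1, 0.5))

big_grid <- count(big)  # collapse to distinct combinations + freq
nrow(big_grid)          # at most 2^4 = 16, regardless of N
sum(big_grid$freq)      # the freq column still accounts for all N observations
```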