1
votes

I'm having issue with predicting cluster labeling for a test data, based on a dbscan clustering model on the training data. I used gower distance matrix when creating the model:

> gowerdist_train <- daisy(analdata_train,
                   metric = "gower",
                   stand = FALSE,
                   type = list(asymm = c(5,6)))

Using this gowerdist matrix, the dbscan clustering model created was:

> sb <- dbscan(gowerdist_train, eps = .23, minPts = 50)

Then I try to use predict to label a test dataset using the above dbscan object:

> predict(sb, newdata = analdata_test, data = analdata_train)

But I receive the following error:

Error in frNN(rbind(data, newdata), eps = object$eps, sort = TRUE, ...) : x has to be a numeric matrix

I can take a guess on where this error might be coming from, which is probably due to the absence of the gower distance matrix that hasn't been created for the test data. My question is, should I create a gower distance matrix for all data (datanal_train + datanal_test) separately and feed it into predict? how else would the algorithm know what the distance of test data from the train data is, in order to label?

In that case, would the newdata parameter be the new gower distance matrix that contains ALL (train + test) data? and the data parameter in predict would be the training distance matrix, gowerdist_train?

What I am not quite sure about is how would the predict algorithm distinguish between the test and train data set in the newly created gowerdist_all matrix?

The two matrices (new gowerdist for all data and the gowerdist_train) would obviously not have the same dimensions. Also, it doesn't make sense to me to create a gower distance matrix only for test data because distances must be relative to the test data, not the test data itself.


Edit:

I tried using gower distance matrix for all data (train + test) as my new data and received an error when fed to predict:

> gowerdist_all <- daisy(rbind(analdata_train, analdata_test),
                         metric = "gower",
                         stand = FALSE,
                         type = list(asymm = c(5,6)))
> test_sb_label <- predict(sb, newdata = gowerdist_all, data = gowerdist_train)

ERROR: Error in 1:nrow(data) : argument of length 0 In addition: Warning message: In rbind(data, newdata) : number of columns of result is not a multiple of vector length (arg 1)

So, my suggested solution doesn't work.

1
predict on DBSCAN is not really well defined. Usually, it indicates you are using clustering when you should be doing classification.Has QUIT--Anony-Mousse
OK, so I contacted the person responsible for maintaining dbscan R package and was told that there is a problem with the implementation of the predict in dbscan package (it doesn't work with distance matrices) and that I should raise the issue at the corresponding forum so it's fixed in the package. In the meantime i decided to try building an algorithm to do the job.Ankhnesmerira
I am talking about the theoretical support for doing this, rather than the implementation (as I don't use R, I don't care much about the package). A single point could cause clusters to merge, so the "predict" function would need to be able to return "it would cause clusters 1+2 to merge", or "it would cause a new cluster here" etc. The idea of "prediction" with DBSCAN is problematic, and usually indicates a bad idea earlier in the approach.Has QUIT--Anony-Mousse

1 Answers

1
votes

I decided to create a code that would use KNN algorithm in dbscan to predict cluster labeling using gower distance matrix. The code is not very pretty and definitely not programmaticaly efficient but it works. Happy for any suggestions that would improve it.

The pseydocode is: 1) calculate new gower distance matrix for all data, including test and train 2) use the above distance matrix in kNN function (dbscan package) to determine the k nearest neighbours to each test data point. 3) determine the cluster labels for all those nearest points for each test point. Some of them will have no cluster labeling because they are test points themselves 4) create a count matrix to count the frequency of clusters for the k nearest points for each test point 5) use very simple likelihood calculation to choose the cluster for the test point based on its neighbours clusters (the maximum frequency). this part also considers the neighbouring test points. That is, the cluster for the test point is chosen only when the maximum frequency is largest when you add the number of neighbouring test points to the other clusters. Otherwise, it doesn't decide the cluster for that test point and waits for the next iteration when hopefully more of its neighboring test points have had their cluster label decided based on their neighbours. 6) repeat above (steps 2-5) until you've decided all clusters

** Note: this algorithm doesn't converge all the time. (once you do the math, it's obvious why that is) so, in the code i break out of the algorithm when the number of unclustered test points doesn't change after a while. then i repeat 2-6 again with new knn (change the number of nearest neighbours and then run the code again). This will ensure more points are involved in deciding in th enext round. I've tried both larger and smaller knn's and both work. Would be good to know which one is better. I haven't had to run the code more than twice so far to decide the clusters for the test data point.

Here is the code:

#calculate gower distance for all data (test + train)
gowerdist_test <- daisy(all_data[rangeofdataforgowerdist],
                        metric = "gower",
                        stand = FALSE,
                        type = list(asymm = listofasymmvars),
                        weights = Weights)
summary(gowerdist_test) 

Then use the code below to label clusters for test data.

#library(dbscan)
# find the k nearest neibours for each point and order them with distance
iteration_MAX <- 50
iteration_current <- 0
maxUnclusterRepeatNum <- 10
repeatedUnclustNum <- 0
unclusteredNum <- sum(is.na(all_data$Cluster))
previousUnclustereNum <- sum(is.na(all_data$Cluster))
nn_k = 30 #number of neighbourhoods

while (anyNA(all_data$Cluster) & iteration_current < iteration_MAX) 
{
  if (repeatedUnclustNum >= maxUnclusterRepeatNum) {
    print(paste("Max number of repetition (", maxUnclusterRepeatNum ,") for same unclustered data has reached. Clustering terminated unsuccessfully."))
    invisible(gc())
    break;
  }

      nn_test <- kNN(gowerdist_test, k = nn_k, sort = TRUE)

    # for the TEST points in all data, find the closets TRAIN points and decide statistically which cluster they could belong to, based on the clusters of the nearest TRAIN points
    test_matrix <- nn_test$id[1: nrow(analdata_test),] #create matrix of test data knn id's
    numClusts <- nlevels(as.factor(sb_train$cluster))
    NameClusts <- as.character(levels(as.factor(sb_train$cluster)))
    count_clusters <- matrix(0, nrow = nrow(analdata_test), ncol = numClusts + 1)  #create a count matrix that would count number of clusters + NA
    colnames(count_clusters) <- c("NA", NameClusts) #name each column of the count matrix to cluster numbers

    # get the cluster number of each k nearest neibhour of each test point
    for (i in 1:nrow(analdata_test)) 
      for (j in 1:nn_k)
      {  
        test_matrix[i,j] <- all_data[nn_test$id[i,j], "Cluster"]
      }
    # populate the count matrix for the total clusters of the neighbours for each test point
    for (i in 1:nrow(analdata_test))
      for (j in 1:nn_k)
      {  
       if (!is.na(test_matrix[i,j])) 
           count_clusters[i, c(as.character(test_matrix[i,j]))] <- count_clusters[i, c(as.character(test_matrix[i,j]))] + 1
       else 
          count_clusters[i, c("NA")] <- count_clusters[i, c("NA")] + 1
      }
    # add NA's (TEST points) to the other clusters for comparison
    count_clusters_withNA <- count_clusters
    for (i in 2:ncol(count_clusters))
      {  
      count_clusters_withNA[,i] <- t(rowSums(count_clusters[,c(1,i)]))
    }

    # This block of code decides the maximum count of cluster for each row considering the number other test points (NA clusters) in the neighbourhood
    max_col_countclusters <- apply(count_clusters,1,which.max) #get the column that corresponds to the maximum value of each row
    for (i in 1:length(max_col_countclusters)) #insert the maximum value of each row in its associated column in count_clusters_withNA
      count_clusters_withNA[i, max_col_countclusters[i]] <- count_clusters[i, max_col_countclusters[i]]
    max_col_countclusters_withNA <- apply(count_clusters_withNA,1,which.max) #get the column that corresponds to the maximum value of each row with NA added 
    compareCountClust <- max_col_countclusters_withNA == max_col_countclusters  #compare the two count matrices
    all_data$Cluster[1:nrow(analdata_test)] <- ifelse(compareCountClust, NameClusts[max_col_countclusters - 1], all_data$Cluster) #you subtract one because of additional NA column


    iteration_current <- iteration_current + 1

    unclusteredNum <- sum(is.na(all_data$Cluster))
    if (previousUnclustereNum == unclusteredNum)
      repeatedUnclustNum <- repeatedUnclustNum + 1
    else {
      repeatedUnclustNum <- 0
      previousUnclustereNum <- unclusteredNum
    }

    print(paste("Iteration: ", iteration_current, " - Number of remaining unclustered:", sum(is.na(all_data$Cluster))))
    if (unclusteredNum == 0)
      print("Cluster labeling successfully Completed.")

    invisible(gc())
}

I guess you can use this for any other type of clustering algorithm, it doesn't matter how you decided the cluster labels for the train data, as long as they are in your all_data before running the code. Hope this help. Not the most efficient or rigorous code. So, happy to see suggestions how to improve it.

*Note: I used t-SNE to compare the clustering of train with the test data and looks impressively clean. so, it seems it is working.