Clustering leads to very concentrated clusters

Question

To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8

Pre-processing: I had a list of orders with 696 unique item numbers and wanted to cluster the items based on how frequently each pair of items is ordered together. For each pair of items, I counted the number of orders in which both items appear. The highest count for any pair was 489. I then "calculated" the similarity/correlation as: frequency / maximum frequency over all pairs (489). The result is the dataset that I have uploaded.
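For readers who want to reproduce this step, here is a minimal sketch of the pair counting and scaling described above. The `orders` table and its column names are hypothetical toy data, not my real orders, and the column `sim` is just an illustrative name for the scaled value.

```r
library(data.table)

# Hypothetical toy order lines: one row per (order, item).
orders <- data.table(
  order = c(1, 1, 1, 2, 2, 3, 3),
  item  = c("A", "B", "C", "A", "B", "B", "C")
)

# Self-join on the order id to get every pair of items within an order,
# then keep each unordered pair once (item.x < item.y) and count it.
pairs <- merge(orders, orders, by = "order", allow.cartesian = TRUE)
pairs <- pairs[item.x < item.y, .N, by = .(V1 = item.x, V2 = item.y)]

# Scale by the largest pair count, as in frequency / max frequency.
pairs[, sim := N / max(N)]
print(pairs)
```

With this scaling only the single most frequent pair reaches 1; every other pair is measured relative to it.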

Similarity/correlation: I don't know if my similarity approach is the best in this case. I also tried Jaccard's coefficient (the Jaccard index), but got almost the same results.
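For comparison, a sketch of the Jaccard index for item pairs: the number of orders containing both items divided by the number of orders containing at least one of them. Again, `orders` is hypothetical toy data and the column names are assumptions for illustration.

```r
library(data.table)

# Hypothetical toy order lines: one row per (order, item), duplicates removed.
orders <- unique(data.table(
  order = c(1, 1, 1, 2, 2, 3, 3),
  item  = c("A", "B", "C", "A", "B", "B", "C")
))

# Number of orders that contain each item.
n_item <- orders[, .N, by = item]

# Number of orders that contain both items of a pair.
pairs <- merge(orders, orders, by = "order", allow.cartesian = TRUE)
pairs <- pairs[item.x < item.y, .(both = .N), by = .(V1 = item.x, V2 = item.y)]

# Jaccard = |orders with both| / |orders with either|.
pairs[, n1 := n_item$N[match(V1, n_item$item)]]
pairs[, n2 := n_item$N[match(V2, n_item$item)]]
pairs[, jaccard := both / (n1 + n2 - both)]
print(pairs)
```

Unlike dividing by the global maximum, this normalises each pair by how often its own two items are ordered at all.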

The dataset: The dataset contains the material numbers V1 and V2, and N, the correlation between the two material numbers (between 0 and 1).

With help from someone else, I managed to create a distance matrix and run PAM clustering on it.

Why PAM clustering? A data scientist suggested this: more than 95% of the pairs carry no information, which puts all of these materials at the same distance and produces a single, very dispersed cluster. This problem can be solved using a PAM algorithm, but you will still have a very concentrated group. Another solution is to increase the weight of the distances other than one.

Problem 1: The matrix is only 567x567. I think for clustering I need the full 696x696 matrix, even though a lot of its entries are zeros. But I'm not sure.
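To make Problem 1 concrete, this is a sketch of how a full matrix could be built if the complete item list were available. The vector `all_items` is hypothetical (it stands in for my 696 item numbers) and the `pairs` data frame is toy data, not my dataset.

```r
# Hypothetical full item list; item "D" never appears in any pair.
all_items <- c("A", "B", "C", "D")

# Toy pair table in the same shape as the uploaded dataset (V1, V2, N).
pairs <- data.frame(
  V1 = c("A", "A"),
  V2 = c("B", "C"),
  N  = c(1, 0.5),
  stringsAsFactors = FALSE
)

# Start from an all-zero square matrix over every item.
ss <- matrix(0, length(all_items), length(all_items),
             dimnames = list(all_items, all_items))

# Fill both triangles so the matrix is symmetric.
idx <- cbind(match(pairs$V1, all_items), match(pairs$V2, all_items))
ss[idx] <- pairs$N
ss[idx[, 2:1, drop = FALSE]] <- pairs$N
diag(ss) <- 1

# Items that occur in no pair at all end up with an all-zero row.
missing_items <- setdiff(all_items, union(pairs$V1, pairs$V2))
print(missing_items)
```

Any item that appears in no pair has similarity 0 to everything else, so after converting to distances it sits at the maximum distance from every other item.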

Problem 2: The clustering does not work very well. I get very concentrated clusters: a lot of items end up in the first cluster. Also, judging by the usual checks for PAM clusters, my clustering results are poor. Is this due to the similarity measure? What else should I use? Is it due to 95% of the data being zeros? Should I change the zeros to something else?

The whole code and results:

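library(data.table)

# Suppose X is the dataset (columns V1, V2, N)
df <- data.table(X)
# Mirror every pair (swap V1 and V2) so the matrix is symmetric, then cast to wide format
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1~V2, value.var = "N")[, -1]
ss <- ss/max(ss, na.rm = TRUE)  # rescale so the largest value is 1
ss[is.na(ss)] <- 0              # pairs that never occur together get similarity 0
diag(ss) <- 1                   # every item is fully similar to itself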

Now using the PAM clustering

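library(cluster)

dd2 <- as.dist(1 - sqrt(ss))  # turn similarities into distances
pam2 <- pam(dd2, 4)           # PAM with 4 clusters
summary(as.factor(pam2$clustering))  # number of items per cluster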

But I get very concentrated clusters, as:

1   2   3   4 
382 100  23  62
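For anyone who wants to see the kind of check I mean, here is a minimal sketch of looking at the average silhouette width of a PAM fit, using the `cluster` package. The similarity matrix is hypothetical toy data with two clean groups, so the numbers say nothing about my real dataset.

```r
library(cluster)

# Hypothetical toy similarity matrix: two groups of three items,
# similarity 0.8 within a group and 0 between groups.
ss <- matrix(0, 6, 6)
ss[1:3, 1:3] <- 0.8
ss[4:6, 4:6] <- 0.8
diag(ss) <- 1

# Same similarity-to-distance transformation as in the code above.
dd <- as.dist(1 - sqrt(ss))

pam_fit <- pam(dd, k = 2)

# Cluster sizes and average silhouette width (closer to 1 is better).
print(table(pam_fit$clustering))
print(pam_fit$silinfo$avg.width)

# Per-cluster averages help spot one large, poorly separated cluster.
print(pam_fit$silinfo$clus.avg.widths)
```

On well-separated toy data like this the average width is high; a value near zero would indicate that items sit about as close to a neighbouring cluster as to their own.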

Your code fails at ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1~V2, value.var = "N")[, -1] for me. — AidanGawronski
ncol(df) gives 3, but ncol(df[, .(V1 = V2, V2 = V1, N)]) gives 4. The two tables in rbind have different numbers of columns, so they cannot be put together. — AidanGawronski
Updated my post with new pastebin, can you try again? Thank you — Mayo

AidanGawronski · Accepted Answer · 2018-05-25T17:32:39

I'm not sure where you get the 696 number from. After the rbind, you have a data table with 567 unique values in each of V1 and V2. The dcast then gives you a 567 x 567 matrix, as expected. Clustering-wise, I see no issue with your clusters.

dim(df) # [1] 7659    3

test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318     3

length(unique(test$V1)) # 567
length(unique(test$V2)) # 567

test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
