R, SOM, Kohonen Package, Outlier Detection

Question

With SOM I experimented a little. First I used MiniSOM in Python but I was not impressed and changed to the kohonen package in R, which offers more features than the previous one. Basically, I applied SOM for three use cases: (1) clustering in 2D with generated data, (2) clustering with more-dimensional data: built-in wine data set, and (3) outlier detection. I solved all the three use cases but I would like to raise a question in connection with the outlier detection I applied. For this purpose I used the vector som$distances, which contains a distance for each row of the input data set. The values with excelling distances can be outliers. However, I do not know how this distance is computed. The package description (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf) states for this metric : "distance to the closest unit".

Could you please tell how this distance is computed?
Could you please comment the outlier detection I used? How would you have done it? (In the generated data set it really finds the outliers. In the real wine data set there are four relatively excelling values among the 177 wine sorts. See the charts below. The idea that crossed my mind to use bar charts for depicting this I really like.)

Charts:

Generated data, 100 point in 2D in 5 distinct clusters and 2 outliers (Category 6 shows the outliers):
Distances shown for all the 102 data points, the last two ones are the outliers which were correctly identified. I repeated the test with 500, and 1000 data points and added solely 2 outliers. The outliers were also found in those cases.
Distances for the real wine data set with potential outliers:

The row id of the potential outliers:

# print the row id of the outliers
# the threshold 10 can be taken from the bar chart,
# below which the vast majority of the values fall
df_wine[df_wine$value > 10, ]

it produces the following output:
    index    value
59     59 12.22916
110   110 13.41211
121   121 15.86576
158   158 11.50079

My annotated code snippet:

        data(wines)

        scaled_wines <- scale(wines)

        # creating and training SOM
        som.wines <- som(scaled_wines, grid = somgrid(5, 5, "hexagonal"))
        summary(som.wines)

        #looking for outliers, dist = distance to the closest unit
        som.wines$distances

        len <- length(som.wines$distances)
        index_in_vector <- c(1:len)
        df_wine<-data.frame(cbind(index_in_vector, som.wines$distances))
        colnames(df_wine) <-c("index", "value")

        po <-ggplot(df_wine, aes(index, value)) + geom_bar(stat = "identity") 
        po <- po + ggtitle("Outliers?") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Distances in som.wines$distances") + xlab("Number of Rows in the Data Set")
        plot(po)

        # print the row id of the outliers
        # the threshold 10 can be taken from the bar chart,
        # below which the vast majority of the values fall
        df_wine[df_wine$value > 10, ]

Further Code Samples

With regard to the discussion in the comments I also post the code snippets asked for. As far as I remember, the code lines responsible for clustering I constructed based on samples I found in the description of the Kohonen package (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf). However, I am not completely sure, it was more than a year ago. The code is provided as is without any warranty :-). Please bear in mind that a particular clustering approach may perform with different accuracy on different data. I would also recommend to compare it with t-SNE on the wine data set (data(wines) available in R). Moreover, implement the heat-maps to demonstrate how the data with regard to individual variables are located. (In the case of the above example with 2 variables it is not important but it would be nice for the wine data set).

Data Generation with Five Clusters and 2 Outliers and Plotting

            library(stats)
            library(ggplot2)

            library(kohonen)


            generate_data <- function(num_of_points, num_of_clusters, outliers=TRUE){
              num_of_points_per_cluster <- num_of_points/num_of_clusters
              cat(sprintf("#### num_of_points_per_cluster = %s, num_of_clusters = %s \n", num_of_points_per_cluster, num_of_clusters))
              arr<-array()
              
              standard_dev_y <- 6000
              standard_dev_x <- 2
              
              # for reproducibility setting the random generator
              set.seed(10)
              
              for (i in 1:num_of_clusters){
                centroid_y <- runif(1, min=10000, max=200000)
                centroid_x <- runif(1, min=20, max=70)
                cat(sprintf("centroid_x = %s \n, centroid_y = %s", centroid_x, centroid_y ))
                
                vector_y <- rnorm(num_of_points_per_cluster, mean=centroid_y, sd=standard_dev_y)
                vector_x <- rnorm(num_of_points_per_cluster, mean=centroid_x, sd=standard_dev_x)
                cluster <- array(c(vector_y, vector_x), dim=c(num_of_points_per_cluster, 2))
                cluster <- cbind(cluster, i)
                
                arr <- rbind(arr, cluster)
              }
              
              if(outliers){
                #adding two outliers
                arr <- rbind(arr, c(10000, 30, 6))
                arr <- rbind(arr, c(150000, 70, 6))
              }
              
              colnames(arr) <-c("y", "x", "Cluster")
              # WA to remove the first NA row
              arr <- na.omit(arr)
              return(arr)
            }

            scatter_plot_data <- function(data_in, couloring_base_indx, main_label){
              
              df <- data.frame(data_in)
              colnames(df) <-c("y", "x", "Cluster")

              pl <- ggplot(data=df, aes(x = x,y=y)) + geom_point(aes(color=factor(df[, couloring_base_indx]))) 
              pl <- pl + ggtitle(main_label) + theme(plot.title = element_text(hjust = 0.5))
              print(pl)
              
            }

            ##################
            # generating data
            data <- generate_data(100, 5, TRUE)
            print(data)
            scatter_plot_data(data, couloring_base_indx<-3, "Original Clusters without Outliers \n 102 Points")

Preparation, Clustering and Plotting

I used the hierarchical clustering approach with the Kohonen Map (SOM).

            normalising_data <- function(data){
              # normalizing data points not the cluster identifiers
              mtrx <- data.matrix(data)
              umtrx <- scale(mtrx[,1:2])
              umtrx <- cbind(umtrx, factor(mtrx[,3]))
              colnames(umtrx) <-c("y", "x", "Cluster")
              return(umtrx)
            }

            train_som <- function(umtrx){
              # unsupervised learning
              set.seed(7)
              g <- somgrid(xdim=5, ydim=5, topo="hexagonal")
              #map<-som(umtrx[, 1:2], grid=g, alpha=c(0.005, 0.01), radius=1, rlen=1000)
              map<-som(umtrx[, 1:2], grid=g)
              summary(map)
              
              return(map)
            }

            plot_som_data <- function(map){
              par(mfrow=c(3,2))
              # to plot some charactristics of the SOM map
              plot(map, type='changes')
              plot(map, type='codes', main="Mapping Data")
              plot(map, type='count')
              plot(map, type='mapping') # how many data points are held by each neuron
              plot(map, type='dist.neighbours') # the darker the colours are, the closer the point are; the lighter the colours are, the more distant the points are
              
              #to switch the plot config to the normal
              par(mfrow=c(1,1))
            }

            plot_disstances_to_the_closest_point <- function(map){
              
              # to see which neuron is assigned to which value 
              map$unit.classif
              
              #looking for outliers, dist = distance to the closest unit
              map$distances
              max(map$distances)
              
              len <- length(map$distances)
              index_in_vector <- c(1:len)
              df<-data.frame(cbind(index_in_vector, map$distances))
              colnames(df) <-c("index", "value")
              
              po <-ggplot(df, aes(index, value)) + geom_bar(stat = "identity") 
              po <- po + ggtitle("Outliers?") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Distances in som$distances") + xlab("Number of Rows in the Data Set")
              plot(po)
              
              return(df)
              
            }


            ###################
            # unsupervised learning

            umtrx <- normalising_data(data)

            map<-train_som(umtrx)
            plot_som_data(map)

            #####################
            # creating the dendogram and then the clusters for the neurons
            dendogram <- hclust(object.distances(map, "codes"), method = 'ward.D')
            plot(dendogram)

            clusters <- cutree(dendogram, 7)
            clusters
            length(clusters)

            #visualising the clusters on the map
            par(mfrow = c(1,1))
            plot(map, type='dist.neighbours', main="Mapping Data")
            add.cluster.boundaries(map, clusters)

Plots with the Clusters

You can also create nice heat-maps for selected variables but I had not implemented them for clustering with 2 variables it does not really make sense. If you implement it for the wine data set, please add the code and the charts to this post.

            #see the predicted clusters with the data set
            # 1. add the vector of the neuron ids to the data
            mapped_neurons <- map$unit.classif
            umtrx <- cbind(umtrx, mapped_neurons)

            # 2. taking the predicted clusters and adding them the the original matrix
            # very good description of the apply functions:
            # https://www.guru99.com/r-apply-sapply-tapply.html
            get_cluster_for_the_row <- function(x, cltrs){
              return(cltrs[x])
            }

            predicted_clusters <- sapply (umtrx[,4], get_cluster_for_the_row, cltrs<-clusters)

            mtrx <- cbind(mtrx, predicted_clusters)
            scatter_plot_data(mtrx, couloring_base_indx<-4, "Predicted Clusters with Outliers \n 100 points")

See the predicted clusters below in case there were outliers and in case there were not.

Great question Tamas! How did you make the "5 clusters with 2 outliers plot"? — Noob
Hi Noob, As far as I remember, I was selecting randomly the x and y coordinates for a 2D space in a given range for the centroids of the clusters. Then around each centroid (x,y) I generated random x' and y' values from a normal distribution with the mean x and mean y. The standard deviation I have chosen as I liked :-). When I had the points in the 5 clusters, I arbitrarily added 2 points which did not belong to any cluster generated. If you need, I will try to look for the code I have written.... — Tamas
I am also working on a similar question here: stackoverflow.com/questions/65864333/… — Noob
I will look for the code I think I have it somewhere :-). I will get back to you in 1..2 days. — Tamas

h45 h45 · Accepted Answer · 2019-08-24T18:37:21

I am not quite sure though, but I often find that the distance measurement of two objects reside in a particular dimensional space uses mostly Euclidean distance. For example, two points A and B in a two dimensional space having location of A(x=3, y=4) and B(x=6, y=8) are 5 distance unit apart. It is a result of performing calculation of squareroot((3-6)^2 + (4-8)^2). This is also applied to the data whose greater dimension, by adding trailing power of two of the difference of the two point's value in a particular dimension. If A(x=3, y=4, z=5) and B(x=6, y=8, z=7) then the distance is squareroot((3-6)^2 + (4-8)^2 + (5-7)^2), and so on. In kohonen, I think that after the model has finished the training phase, the algorithm then calculates the distances of each datum to all nodes and then assign it to the nearest node (a node which has the smallest distance to it). Eventually, the values inside the variable 'distances' returned by the model is the distance of every datum to its nearest node. One thing to note from your script is that the algorithm does not measure the distance directly from the original property values that the data have, because they have been scaled prior to feeding the data to the model. The distance measurement is applied to the scaled version of the data. The scaling is a standard procedure to eliminate the dominance of a variable on top of another.
I believe that your method is acceptable, because the values inside the 'distances' variable are the distance of each datum to its nearest node. So if a value of the distance between a datum and its nearest node is high, then this also means: the distance of the datum to other nodes are obviously much much higher.