Hello Community,

I'm new (as a member) to the site, so if you think this question would be better posted on http://datascience.stackexchange.com, let me know.

I am tackling a machine learning problem that requires calculating the distance between elements whose attributes are NxM matrices, in order to implement certain classification algorithms.

Each element's attribute is a 2D matrix (Matr), so I'm searching for the best algorithm to calculate the distance between two 2D matrices. As you will see below, the "easy" solution is to convert the 2D matrix into a 1D vector and then apply any standard distance algorithm, but I'm searching for something more suitable, if it exists.

So far I have used the following approaches:

  1. Euclidean distance between each pair of corresponding cells (equivalently, the Frobenius norm of the difference; see the vectorized sketch after this list).

    import numpy as np

    def dist_euclidean(elem1, elem2):
        # Sum squared differences over every cell, then take the square
        # root: the Frobenius norm of (elem1.Matr - elem2.Matr).
        t_sum = 0
        for i in range(len(elem1.Matr)):
            for j in range(len(elem1.Matr[0])):
                t_sum += np.square(elem1.Matr[i][j] - elem2.Matr[i][j])
        return np.sqrt(t_sum)
    
  2. Cosine similarity, for which I had to flatten the (NxM) 2D matrix into a (1xNM) vector (also covered in the sketch after this list).

    from scipy.spatial import distance

    def dist_cosine(elem1, elem2):
        # Flatten both matrices row by row into 1D lists.
        temp1 = []
        temp2 = []
        for i in range(len(elem1.Matr)):
            temp1.extend(elem1.Matr[i])
            temp2.extend(elem2.Matr[i])
        # Note: scipy returns the cosine *distance*, 1 - cosine similarity.
        return distance.cosine(temp1, temp2)
    
  3. KL divergence (wiki); I only found implementations for 1D vectors, so I used the following conversions (see also the smoothing note after this list):

    • Computed the KL divergence between each pair of corresponding rows and then averaged the results.

      import numpy as np
      from scipy.stats import entropy

      def dist_KL_row_avg(elem1, elem2):
          # entropy(p, q) computes the KL divergence KL(p || q); take it
          # per corresponding row pair, then average.
          Y = []
          for i in range(len(elem1.Matr)):
              Y.append(entropy(elem1.Matr[i], elem2.Matr[i]))
          return np.average(Y)
      
    • Flattened the (NxM) 2D matrix into a (1xNM) vector by appending the rows and then computed a single KL divergence over the whole vectors.

      import numpy as np
      from scipy.stats import entropy

      def dist_KL_1d_total(elem1, elem2):
          # Flatten both matrices row by row, then take a single KL
          # divergence over the full (1xNM) vectors.
          temp1 = []
          temp2 = []
          for i in range(len(elem1.Matr)):
              temp1.extend(elem1.Matr[i])
              temp2.extend(elem2.Matr[i])
          return entropy(temp1, temp2)
      
  4. KS test (wiki); again I only found implementations for 1D vectors, so I used the same conversions as in the KL implementation:

    • Ran the two-sample KS test on each pair of corresponding rows and then averaged the results.

      import numpy as np
      from scipy.stats import ks_2samp

      def dist_KS_row_avg(elem1, elem2):
          # ks_2samp returns (KS statistic, p-value) for each row pair.
          Y = []
          for i in range(len(elem1.Matr)):
              Y.append(ks_2samp(elem1.Matr[i], elem2.Matr[i]))
          # Collapse each (statistic, p-value) pair into one score.
          Z = [x[0] / x[1] for x in Y]
          return np.average(Z)
      
    • Flattened the (NxM) 2D matrix into a (1xNM) vector by appending the rows and then ran a single KS test on the whole vectors.

      import numpy as np
      from scipy.stats import ks_2samp

      def dist_KS_1d_total(elem1, elem2):
          # Flatten both matrices row by row, then run one KS test over
          # the full (1xNM) vectors.
          temp1 = []
          temp2 = []
          for i in range(len(elem1.Matr)):
              temp1.extend(elem1.Matr[i])
              temp2.extend(elem2.Matr[i])
          Y = ks_2samp(temp1, temp2)
          return Y[0] / Y[1]  # KS statistic divided by its p-value
      

All of the above work for my problem, but I remain curious, since I couldn't find anything more specific that satisfied me.
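
For reference, approaches 1 and 2 can be written much more compactly with NumPy: the per-cell Euclidean distance in approach 1 is exactly the Frobenius norm of the difference, and the row-appending loops in approach 2 are a plain flatten. A minimal sketch, assuming Matr is (or can be converted to) a NumPy array:

    import numpy as np
    from scipy.spatial import distance

    def dist_euclidean_vec(elem1, elem2):
        # Frobenius norm of the element-wise difference: the same value as
        # summing squared differences over all cells and taking the root.
        A = np.asarray(elem1.Matr)
        B = np.asarray(elem2.Matr)
        return np.linalg.norm(A - B)  # Frobenius is the default for 2D input

    def dist_cosine_vec(elem1, elem2):
        # ravel() flattens the (NxM) matrix into a length-NM vector,
        # exactly what the row-appending loops do by hand.
        a = np.asarray(elem1.Matr).ravel()
        b = np.asarray(elem2.Matr).ravel()
        return distance.cosine(a, b)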
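
One caveat for the KL-based approaches: scipy.stats.entropy(p, q) returns inf whenever q has a zero in a position where p does not, which is likely with sparse transition rows such as the example matrix in Edit 1 below. A common workaround, sketched here with an arbitrarily chosen smoothing constant, is to smooth the rows before comparing:

    import numpy as np
    from scipy.stats import entropy

    def dist_KL_row_avg_smoothed(elem1, elem2, eps=1e-9):
        # Add a tiny constant to every cell; entropy() renormalizes its
        # inputs, so the result is still a valid KL divergence but stays
        # finite even when some cells are zero.
        A = np.asarray(elem1.Matr) + eps
        B = np.asarray(elem2.Matr) + eps
        return np.average([entropy(a, b) for a, b in zip(A, B)])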


Edit 1. As pltrdy suggested, here is some more information regarding the problem.

The initial data for each element is a series of codes, e.g. (C->B->D->B->A), which is then converted to a transition matrix whose rows are normalized. Thus each cell in the matrix represents the probability of transitioning from code [i] to code [j]. For example:

IN: A->C->B->B->A->C->C->A
OUT: 
    A     B     C
 A  0     0     1
 B  0.5   0.5   0
 C  0.33  0.33  0.33
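
For concreteness, here is a minimal sketch of how such a row-normalized transition matrix can be built (the function and the explicit code-to-index mapping are illustrative, not part of my actual pipeline):

    import numpy as np

    def transition_matrix(series, codes):
        # Map each code to a row/column index, e.g. {'A': 0, 'B': 1, 'C': 2}.
        idx = {c: i for i, c in enumerate(codes)}
        counts = np.zeros((len(codes), len(codes)))
        # Count transitions between consecutive codes in the series.
        for cur, nxt in zip(series, series[1:]):
            counts[idx[cur], idx[nxt]] += 1
        # Normalize each row so cell [i][j] is P(next = j | current = i);
        # rows with no outgoing transitions stay all zeros.
        sums = counts.sum(axis=1, keepdims=True)
        return np.divide(counts, sums, out=np.zeros_like(counts), where=sums > 0)

    # Reproduces the example above:
    # transition_matrix(list('ACBBACCA'), ['A', 'B', 'C'])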

With that in mind, the final goal is to classify the different code series. The series do not have the same length, but they are built from the same set of codes, so the transition probability matrix has the same dimensions in every case. I asked the initial question in order to find the distance algorithm most likely to produce the best classification results.

You should give more information about the context/objective. I mean, to my mind, it's quite impossible to suggest a good distance function without any idea of the objective. It's like saying "if you have two points, use Manhattan/Euclidean (etc.) distance". We can answer with the distance function most commonly used in this kind of case (e.g., answering "go for Euclidean" for your 2D points), but that would not be really accurate and may not fit your needs. – pltrdy
Thanks for the advice. I didn't post much info in the beginning in order not to confuse the reader. I hope the edit helps; let me know if any more clarification is needed. – Haris Michailidis
Just to be sure: is the classification task to predict the probability matrix (the OUT in your example) from the series of codes? I'm not sure this is, strictly speaking, a classification task. I mean, I've never seen a matrix as an output, to be honest. – pltrdy
Probably I wasn't clear; I will edit my question ASAP. The classification task is to classify the code series into classes. Because they are not fixed-length, I built a transition probability matrix for each one (the possible codes in a series are the same for all, let's say 10 different codes). Since all the matrices have the same size (10x10), it is easier to compare them. Thus I'm looking for a distance between matrices. – Haris Michailidis
Honestly, I would go for 2; it looks fine, and I'm not sure what to expect from a better solution. I guess cosine would be significantly better than Euclidean, wouldn't it? This problem is interesting though; I think I'll experiment with it. (Did you look at Markov chains? I'm asking because your problem kinda looks like a Markov process.) – pltrdy

1 Answer


Given two different transition matrices A and B and a probability distribution x as a row vector, the distribution after one step according to A is xA, and the distribution after one step according to B is xB. You could take (twice) the maximum, over all x, of the statistical distance between these two distributions with

    numpy.linalg.norm(A - B, numpy.inf)
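
To see why: for a basis distribution x = e_i, the L1 distance between xA and xB is the absolute sum of row i of A - B, and maximizing over i gives exactly the induced infinity norm, i.e. twice the worst-case total variation distance. A quick numerical check (A is the example matrix from the question; B is an arbitrary second transition matrix made up for illustration):

    import numpy as np

    A = np.array([[0.0, 0.0, 1.0],
                  [0.5, 0.5, 0.0],
                  [1/3, 1/3, 1/3]])
    B = np.array([[0.0, 0.5, 0.5],
                  [0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5]])

    # Maximum absolute row sum of A - B (the induced infinity norm) ...
    print(np.linalg.norm(A - B, np.inf))       # 1.0
    # ... equals the largest L1 distance between the one-step
    # distributions e_i @ A and e_i @ B over all basis vectors e_i.
    print(max(np.sum(np.abs(A - B), axis=1)))  # 1.0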