12 votes

I am trying to implement the following (divisive) clustering algorithm (a short form of the algorithm is presented below; the full description is available here):

Start with a sample x_i, i = 1, ..., n, regarded as a single cluster of n data points, and a dissimilarity matrix D defined for all pairs of points. Fix a threshold T for deciding whether or not to split a cluster.

  1. First, determine the distances between all pairs of data points and choose the pair with the largest distance (Dmax) between them.

  2. Compare Dmax to T. If Dmax > T, divide the single cluster in two by using the selected pair as the first elements of two new clusters. The remaining n - 2 data points are put into one of the two new clusters: x_l is added to the new cluster containing x_i if D(x_i, x_l) < D(x_j, x_l); otherwise it is added to the new cluster containing x_j.

  3. At the second stage, the values D(x_i, x_j) are found within one of the two new clusters to find the pair with the largest distance Dmax between them. If Dmax < T, the division of the cluster stops and the other cluster is considered. The procedure then repeats on the clusters generated at each iteration.

The output is a hierarchy of clustered data records. I kindly ask for advice on how to implement this clustering algorithm.
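To make the steps concrete, here is how I currently read steps 1-3 as a minimal recursive sketch (the names divisive, threshold and dist are my own, and it returns only the final partition, not the intermediate hierarchy; whether this recursion is what the description intends is exactly what I am unsure about):

# Minimal recursive sketch of steps 1-3 (my own reading of the
# description; works on scalars with |a - b| as the distance)
def divisive(points, threshold, dist=lambda a, b: abs(a - b)):
  # Step 1: distances for all pairs; pick the pair with Dmax
  pairs = [(dist(points[i], points[j]), i, j)
           for i in range(len(points))
           for j in range(i + 1, len(points))]
  if not pairs:
    return [points]          # a single point cannot be split
  d_max, i, j = max(pairs)
  # Step 2: split only if Dmax > T
  if d_max <= threshold:
    return [points]
  left, right = [points[i]], [points[j]]
  for k, p in enumerate(points):
    if k not in (i, j):
      # assign each remaining point to the closer seed
      (left if dist(p, points[i]) < dist(p, points[j]) else right).append(p)
  # Step 3: repeat the procedure on both new clusters
  return divisive(left, threshold, dist) + divisive(right, threshold, dist)

print(divisive([1, 2, 2.5, 5], 3))  # [[1, 2, 2.5], [5]]

My actual attempts with a real distance function follow in the edits below.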

EDIT 1: I attach a Python function which defines the distance (correlation coefficient) and a function which finds the maximal distance in the data matrix.

# Read data from GitHub
import pandas as pd
from math import sqrt

df = pd.read_csv('https://raw.githubusercontent.com/nico/collectiveintelligence-book/master/blogdata.txt', sep = '\t', index_col = 0)
data = df.values.tolist()
data = data[1:10]  # use a small subset of rows

# Define correlation coefficient as distance of choice
def pearson(v1, v2):
  n = float(len(v1))  # float so the divisions below work on Python 2 as well
  # Simple sums
  sum1 = sum(v1)
  sum2 = sum(v2)
  # Sums of the squares
  sum1Sq = sum([pow(v, 2) for v in v1])
  sum2Sq = sum([pow(v, 2) for v in v2])
  # Sum of the products
  pSum = sum([v1[i] * v2[i] for i in range(len(v1))])
  # Calculate r (Pearson score)
  num = pSum - (sum1 * sum2 / n)
  den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
  if den == 0: return 0
  return num / den


# Find largest distance
dist = {}
max_dist = float('-inf')  # pearson(x, x) is 1, the largest possible r, so start below any value
max_pair = None
# Loop over upper triangle of data matrix
for i in range(len(data)):
  for j in range(i + 1, len(data)):
     # Compute distance for each pair
     dist_curr = pearson(data[i], data[j])
     # Store distance in dict
     dist[(i, j)] = dist_curr
     # Track max distance and the pair that attains it
     if dist_curr > max_dist:
       max_dist = dist_curr
       max_pair = (i, j)
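As a sanity check (my own addition, not part of the algorithm): pearson returns the raw correlation r, which numpy.corrcoef should reproduce up to floating-point error. Note that r is a similarity, so the pair with the largest r is the most alike; 1 - r is the usual way to turn it into a dissimilarity:

import numpy as np

# Check pearson above against numpy's implementation (assumption:
# they agree up to floating-point error) and convert r to a dissimilarity
v1, v2 = data[0], data[1]
r = pearson(v1, v2)
assert abs(r - np.corrcoef(v1, v2)[0, 1]) < 1e-6
print(1 - r)  # 0 means perfectly correlated, 2 means anti-correlated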

EDIT 2: Pasted below are functions from Dschoni's answer.

# Euclidean distance
import numpy

def euclidean(x, y):
  x = numpy.array(x)
  y = numpy.array(y)
  return numpy.sqrt(numpy.sum((x - y)**2))

# Create matrix
def dist_mat(data):
  dist = {}
  for i in range(len(data)):
    for j in range(i + 1, len(data)):
      dist[(i, j)] = euclidean(data[i], data[j])
  return dist


# Returns i & k for max distance
def my_max(d):
    # max(d) alone would return the largest *key*;
    # we want the key of the largest value
    return max(d, key=d.get)

# Sort function
list1 = []
list2 = []
def sort(rcd, i, k):
  list1.append(i)
  list2.append(k)
  for j in range(len(rcd)):
    if j == i or j == k:
      continue  # the seed points are already placed
    if euclidean(rcd[j], rcd[i]) < euclidean(rcd[j], rcd[k]):
      list1.append(j)
    else:
      list2.append(j)
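A quick usage sketch of these three functions together (my own toy data; note that list1/list2 are globals, so they accumulate across calls):

# Toy usage of the functions above: one split of three 2-D points
rcd = [[1.0, 1.0], [2.0, 1.5], [8.0, 9.0]]
d = dist_mat(rcd)    # {(0, 1): 1.118..., (0, 2): 10.63..., (1, 2): 9.60...}
i, k = my_max(d)     # (0, 2), the most distant pair
sort(rcd, i, k)      # assign the remaining point to the closer seed
print(list1, list2)  # [0, 1] [2]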

EDIT 3: When I run the code provided by @Dschoni, the algorithm works as expected. I then modified the create_distance_list function so that we can compute the distance between multivariate data points, using Euclidean distance. As a toy example I load the iris data and cluster roughly the first 50 instances of the dataset (the slice df[1:50] actually keeps 49 rows).

import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None, sep = ',')
df = df.drop(4, axis = 1)  # drop the class label column
df = df[1:50]              # keeps rows 1..49
data = df.values.tolist()

idl = list(range(len(data)))  # a list, because sort() mutates it
dist = create_distance_list(data)
print(sort(dist, idl))

The result is as follows:

[[24], [17], [4], [7], [40], [13], [14], [15], [26, 27, 38], [3, 16, 39], [25], [42], [18, 20, 45], [43], [1, 2, 11, 46], [12, 37, 41], [5], [21], [22], [10, 23, 28, 29], [6, 34, 48], [0, 8, 33, 36, 44], [31], [32], [19], [30], [35], [9, 47]]

Some data points are still clustered together. I solved this by adding a small amount of noise to the actual distance dictionary in the sort function:

# Add small random noise to break ties (requires numpy imported as np)
for key in actual:
  actual[key] += np.random.normal(0, 0.005)

Any idea how to solve this problem properly?

I'm not entirely sure how to answer the question as asked. Could you expand on what you mean by "programming structure"? If you're referring to data structures, what part of your algorithm needs structures, and what are the requirements for the data structures? If all you want is to repeatedly split a collection of elements without caring about how the data is ultimately represented, an array or linked list might work. Alternatively, you could consider some form of tree (like a binary search tree). – skeggse

I just need to repeatedly split the data points. I suspect that an array should be enough. Do you have any idea where/how to start? – Andrej

Do you have a programming language in mind? You have an algorithm, so you just need to convert the steps of the algorithm into some code. I'd say start by writing a function. – skeggse

I program in Python, but whichever language should be OK. I just need help with steps 2 and 3. – Andrej

I attach the code from my initial attempts. – Andrej

2 Answers

5 votes

A proper working example for the Euclidean distance:

import numpy as np
# For random number generation


def create_distance_list(l):
    '''Create a distance list for every
    unique tuple of pairs'''
    dist = {}
    for i in range(len(l)):
        for k in range(i + 1, len(l)):
            dist[(i, k)] = abs(l[i] - l[k])
    return dist

def maximum(distance_dict):
    '''Returns the key of the maximum value if unique
    or a random key with the maximum value.'''
    maximum = max(distance_dict.values())
    max_key = [key for key, value in distance_dict.items() if value == maximum]
    if len(max_key) > 1:
        random_key = np.random.randint(0, len(max_key))  # high is exclusive
        return (max_key[random_key],)
    else:
        return max_key

def construct_new_dict(distance_dict, index_list):
    '''Helper function to create a distance map for a subset
    of data points.'''
    new = {}
    for i in range(len(index_list)):
        for k in range(i + 1, len(index_list)):
            m = index_list[i]
            n = index_list[k]
            new[(m, n)] = distance_dict[(m, n)]
    return new

def sort(distance_dict, idl, threshold=4):
    result = [idl]
    i = 0
    try:
        while True:
            if len(result[i]) >= 2:
                actual = construct_new_dict(distance_dict, result[i])
                act_max = maximum(actual)
                if distance_dict[act_max[0]] > threshold:
                    j = act_max[0][0]
                    k = act_max[0][1]
                    result[i].remove(j)
                    result[i].remove(k)
                    l1 = [j]
                    l2 = [k]
                    for iterr in range(len(result[i])):
                        s = result[i][iterr]
                        if s > j:
                            c1 = (j, s)
                        else:
                            c1 = (s, j)
                        if s > k:
                            c2 = (k, s)
                        else:
                            c2 = (s, k)
                        if actual[c1] < actual[c2]:
                            l1.append(s)
                        else:
                            l2.append(s)
                    result.remove(result[i])
                    # What to do if distance is equal?
                    l1.sort()
                    l2.sort()
                    result.append(l1)
                    result.append(l2)
                else:
                    i += 1
            else:
                i += 1
    except IndexError:
        # i has run past the last cluster: every cluster is final
        return result


# This is the dataset
a = [1, 2, 2.5, 5]
# Giving each entry a unique ID
idl = list(range(len(a)))
dist = create_distance_list(a)
print(sort(dist, idl))
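With the default threshold of 4, Dmax = 4 is not strictly greater than T, so the call above should simply return the single cluster [[0, 1, 2, 3]]; lowering the threshold, e.g. sort(dist, idl, threshold=3), should split the data into [[0, 1, 2], [3]].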

I wrote the code for readability; there is a lot that could be made faster, more reliable, and prettier. This is just to give you an idea of how it can be done.

1 vote

Some data points are still clustered together. I solved this by adding a small amount of noise to the actual distance dictionary in the sort function.

If Dmax > T, divide the single cluster in two

Your description doesn't necessarily create n clusters.
If a cluster has two records whose distance is less than T,
they will stay clustered together (am I missing something?)
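That behaviour is easy to check with the functions from the answer above (my own toy data):

# Three points within distance 4 of each other plus one outlier:
# the close points are never separated, so we get 2 clusters, not 4
a = [1, 1.5, 2, 10]
idl = list(range(len(a)))
dist = create_distance_list(a)
print(sort(dist, idl))  # expected: [[0, 1, 2], [3]]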