I have a list of addresses for many people (1-8 addresses each) and I'm trying to identify the number of unique addresses each person has.
Here is a sample address dataset for one person:
# df[df['ID'] == '12345'][['address', 'zip']].values
addresses = [['PULMONARY MED ASSOC MED GROUP INC 1485 RIVER PARK DR STE 200',
'95815'],
['1485 RIVER PARK DRIVE SUITE 200', '95815'],
['1485 RIVER PARK DR SUITE 200', '95815'],
['3637 MISSION AVE SUITE 7', '95608']]
I've got an address parser that separates the different portions of the address (the "attn", house number, street name, PO box, etc.) so that I can compare them individually (code found here).
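For example, for the first address above, the parsed output looks roughly like this (the keys match what my comparison code uses; the real parser's exact output may differ):

```python
# roughly what the parser produces for address 1 (illustrative only;
# the real parser linked above may use different keys/values)
parsed = {
    'attn': 'PULMONARY MED ASSOC MED GROUP INC',  # organization / attention line
    'house': '1485',                              # house number
    'street_name': 'RIVER PARK',                  # street name without suffix
    'suite_num': '200',                           # suite / unit number
}
print(parsed['street_name'])  # RIVER PARK
```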
As you can see from the data above, addresses 1-3 are probably the same, and address 4 is different.
I wrote the following distance calculation method; there's no magic to the weights, just what my intuition said should be most important:
import distance  # pip install distance; provides distance.levenshtein

def calcDistance(a1, a2, z1, z2, parser):
    """Weighted Levenshtein distance between two address/zip pairs."""
    z1 = str(z1)
    z2 = str(z2)
    add1 = parser.parse(a1)
    add2 = parser.parse(a2)

    # each component only contributes when both addresses have it
    zip_dist = 0 if z1 == z2 else distance.levenshtein(z1, z2)
    zip_weight = .4

    attn_dist = distance.levenshtein(add1['attn'], add2['attn']) if add1['attn'] and add2['attn'] else 0
    attn_weight = .1 if add1['attn'] and add2['attn'] else 0

    suite_dist = distance.levenshtein(add1['suite_num'], add2['suite_num']) if add1['suite_num'] and add2['suite_num'] else 0
    suite_weight = .1 if add1['suite_num'] and add2['suite_num'] else 0

    street_dist = distance.levenshtein(add1['street_name'], add2['street_name']) if add1['street_name'] and add2['street_name'] else 0
    street_weight = .3 if add1['street_name'] and add2['street_name'] else 0

    house_dist = distance.levenshtein(add1['house'], add2['house']) if add1['house'] and add2['house'] else 0
    house_weight = .1 if add1['house'] and add2['house'] else 0

    # weighted average over the components that were actually compared
    return ((zip_dist * zip_weight + attn_dist * attn_weight + suite_dist * suite_weight
             + street_dist * street_weight + house_dist * house_weight)
            / (zip_weight + attn_weight + suite_weight + street_weight + house_weight))
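To make the normalization concrete, here is one hypothetical component breakdown that reproduces the 5.11 off-diagonal value for addresses 1 vs 4 (the actual per-component distances depend on what the parser returns):

```python
# hypothetical per-component edit distances for addresses 1 vs 4;
# "attn" is absent from address 4, so it drops out of both the
# numerator and the denominator
dists   = {'zip': 3, 'street': 9, 'suite': 3, 'house': 4}
weights = {'zip': .4, 'street': .3, 'suite': .1, 'house': .1}

score = sum(dists[k] * weights[k] for k in dists) / sum(weights.values())
print(score)  # 5.111..., matching the off-diagonal entries in the matrix below
```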
Applying this function to each pair of addresses, you can see that addresses 1-3 are correctly treated as identical (distance 0), and address 4 stands apart.
# negate: AffinityPropagation expects similarities (larger = more alike)
similarity = -1 * np.array([[calcDistance(a1[0], a2[0], a1[1], a2[1], addr_parser)
                             for a1 in addresses] for a2 in addresses])
print(similarity)
array([[-0. , -0. , -0. , -5.11111111],
[-0. , -0. , -0. , -5.11111111],
[-0. , -0. , -0. , -5.11111111],
[-5.11111111, -5.11111111, -5.11111111, -0. ]])
To then cluster these, I thought affinity propagation might be the best fit: the cluster count is variable, it works with precomputed similarities, and it identifies a prototypical exemplar, which I could use as the "best" address to represent the cluster. However, I'm getting some strange results: the AffinityPropagation clusterer produces 3 clusters for this data instead of 2.
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=.5)
affprop.fit(similarity)
print(affprop.labels_)
array([0, 0, 1, 2], dtype=int64)
Conversely, DBSCAN correctly finds two clusters:
dbscan = sklearn.cluster.DBSCAN(min_samples=1)
dbscan.fit(similarity)
print(dbscan.labels_)
array([0, 0, 0, 1], dtype=int64)
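(Note: as written, DBSCAN is treating each row of the matrix as a 4-dimensional feature vector rather than as pairwise distances, which happens to give the right answer here. To cluster on the pairwise distances themselves, I'd need metric="precomputed" on the non-negated matrix; a sketch, with eps chosen between 0 and 5.11:)

```python
import numpy as np
from sklearn.cluster import DBSCAN

# pairwise distances (the non-negated matrix from above, rounded)
dist = np.array([[0.  , 0.  , 0.  , 5.11],
                 [0.  , 0.  , 0.  , 5.11],
                 [0.  , 0.  , 0.  , 5.11],
                 [5.11, 5.11, 5.11, 0.  ]])

# precomputed => entries are read directly as distances
db = DBSCAN(eps=2.0, min_samples=1, metric='precomputed').fit(dist)
print(db.labels_)  # [0 0 0 1]
```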
Looking at this question, it seems the issue is that the clusterer adds small random noise to its starting points and treats the perfectly similar records as degenerate.
Is there a way around this, or should I just give up on affinity propagation and stick with DBSCAN?
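(For what it's worth, one workaround I've considered is collapsing exactly identical rows before clustering, so the degeneracy never arises, then broadcasting the labels back. A sketch, using the rounded matrix from above; `random_state` assumes a reasonably recent scikit-learn:)

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

sim = -1 * np.array([[0.  , 0.  , 0.  , 5.11],
                     [0.  , 0.  , 0.  , 5.11],
                     [0.  , 0.  , 0.  , 5.11],
                     [5.11, 5.11, 5.11, 0.  ]])

# keep one representative of each exactly-identical row
uniq, first, inverse = np.unique(sim, axis=0,
                                 return_index=True, return_inverse=True)
sub = sim[np.ix_(first, first)]        # similarities among the unique records

affprop = AffinityPropagation(affinity="precomputed", damping=.5,
                              random_state=0)
labels_u = affprop.fit_predict(sub)    # cluster only the unique rows
labels = labels_u[inverse.ravel()]     # broadcast labels back to all records
print(labels)                          # rows 0-2 share a label, row 3 differs
```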