I have a list of addresses for many people (1-8 addresses each) and I'm trying to identify the number of unique addresses each person has.
Here is a sample address dataset for one person:
# df[df['ID'] == '12345'][['address', 'zip']].values
addresses = [['PULMONARY MED ASSOC MED GROUP INC 1485 RIVER PARK DR STE 200',
'95815'],
['1485 RIVER PARK DRIVE SUITE 200', '95815'],
['1485 RIVER PARK DR SUITE 200', '95815'],
['3637 MISSION AVE SUITE 7', '95608']]
I've got an address parser that separates the different portions of the address (the "attn", house number, street name, PO box, etc.) so that I can compare them individually (code found here).
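For example, for the first address above, the parsed output looks roughly like this (the keys match what my comparison code uses; the real parser's exact output may differ):

```python
# roughly what the parser produces for address 1 (illustrative only;
# the real parser linked above may use different keys/values)
parsed = {
    'attn': 'PULMONARY MED ASSOC MED GROUP INC',  # organization / attention line
    'house': '1485',                              # house number
    'street_name': 'RIVER PARK',                  # street name without suffix
    'suite_num': '200',                           # suite / unit number
}
print(parsed['street_name'])  # RIVER PARK
```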
As you can see from the data above, addresses 1-3 are probably the same, and address 4 is different.
I wrote the following distance calculation method; there's no magic to the weights, just what my intuition said should be most important:
import distance  # pip install distance; provides distance.levenshtein

def calcDistance(a1, a2, z1, z2, parser):
    """Weighted Levenshtein distance between two address/zip pairs."""
    z1 = str(z1)
    z2 = str(z2)
    add1 = parser.parse(a1)
    add2 = parser.parse(a2)

    # each component only contributes when both addresses have it
    zip_dist = 0 if z1 == z2 else distance.levenshtein(z1, z2)
    zip_weight = .4

    attn_dist = distance.levenshtein(add1['attn'], add2['attn']) if add1['attn'] and add2['attn'] else 0
    attn_weight = .1 if add1['attn'] and add2['attn'] else 0

    suite_dist = distance.levenshtein(add1['suite_num'], add2['suite_num']) if add1['suite_num'] and add2['suite_num'] else 0
    suite_weight = .1 if add1['suite_num'] and add2['suite_num'] else 0

    street_dist = distance.levenshtein(add1['street_name'], add2['street_name']) if add1['street_name'] and add2['street_name'] else 0
    street_weight = .3 if add1['street_name'] and add2['street_name'] else 0

    house_dist = distance.levenshtein(add1['house'], add2['house']) if add1['house'] and add2['house'] else 0
    house_weight = .1 if add1['house'] and add2['house'] else 0

    # weighted average over the components that were actually compared
    return ((zip_dist * zip_weight + attn_dist * attn_weight + suite_dist * suite_weight
             + street_dist * street_weight + house_dist * house_weight)
            / (zip_weight + attn_weight + suite_weight + street_weight + house_weight))
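To make the normalization concrete, here is one hypothetical component breakdown that reproduces the 5.11 off-diagonal value for addresses 1 vs 4 (the actual per-component distances depend on what the parser returns):

```python
# hypothetical per-component edit distances for addresses 1 vs 4;
# "attn" is absent from address 4, so it drops out of both the
# numerator and the denominator
dists   = {'zip': 3, 'street': 9, 'suite': 3, 'house': 4}
weights = {'zip': .4, 'street': .3, 'suite': .1, 'house': .1}

score = sum(dists[k] * weights[k] for k in dists) / sum(weights.values())
print(score)  # 5.111..., matching the off-diagonal entries in the matrix below
```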
Applying this function to each pair of addresses, you can see that addresses 1-3 are correctly treated as identical (distance 0), and address 4 stands apart.
# negate: AffinityPropagation expects similarities (larger = more alike)
similarity = -1 * np.array([[calcDistance(a1[0], a2[0], a1[1], a2[1], addr_parser)
                             for a1 in addresses] for a2 in addresses])
print(similarity)
array([[-0. , -0. , -0. , -5.11111111],
[-0. , -0. , -0. , -5.11111111],
[-0. , -0. , -0. , -5.11111111],
[-5.11111111, -5.11111111, -5.11111111, -0. ]])
To then cluster these, I thought affinity propagation might be the best fit: the cluster count is variable, it works with precomputed similarities, and it identifies a prototypical exemplar, which I could use as the "best" address to represent the cluster. However, I'm getting some strange results: the AffinityPropagation clusterer produces 3 clusters for this data instead of 2.
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=.5)
affprop.fit(similarity)
print(affprop.labels_)
array([0, 0, 1, 2], dtype=int64)
Conversely, DBSCAN correctly finds two clusters:
dbscan = sklearn.cluster.DBSCAN(min_samples=1)
dbscan.fit(similarity)
print(dbscan.labels_)
array([0, 0, 0, 1], dtype=int64)
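(Note: as written, DBSCAN is treating each row of the matrix as a 4-dimensional feature vector rather than as pairwise distances, which happens to give the right answer here. To cluster on the pairwise distances themselves, I'd need metric="precomputed" on the non-negated matrix; a sketch, with eps chosen between 0 and 5.11:)

```python
import numpy as np
from sklearn.cluster import DBSCAN

# pairwise distances (the non-negated matrix from above, rounded)
dist = np.array([[0.  , 0.  , 0.  , 5.11],
                 [0.  , 0.  , 0.  , 5.11],
                 [0.  , 0.  , 0.  , 5.11],
                 [5.11, 5.11, 5.11, 0.  ]])

# precomputed => entries are read directly as distances
db = DBSCAN(eps=2.0, min_samples=1, metric='precomputed').fit(dist)
print(db.labels_)  # [0 0 0 1]
```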
Looking at this question, it seems the issue is that the clusterer adds small random noise to its starting points and treats the perfectly similar records as degenerate.
Is there a way around this, or should I just give up on affinity propagation and stick with DBSCAN?
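(For what it's worth, one workaround I've considered is collapsing exactly identical rows before clustering, so the degeneracy never arises, then broadcasting the labels back. A sketch, using the rounded matrix from above; `random_state` assumes a reasonably recent scikit-learn:)

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

sim = -1 * np.array([[0.  , 0.  , 0.  , 5.11],
                     [0.  , 0.  , 0.  , 5.11],
                     [0.  , 0.  , 0.  , 5.11],
                     [5.11, 5.11, 5.11, 0.  ]])

# keep one representative of each exactly-identical row
uniq, first, inverse = np.unique(sim, axis=0,
                                 return_index=True, return_inverse=True)
sub = sim[np.ix_(first, first)]        # similarities among the unique records

affprop = AffinityPropagation(affinity="precomputed", damping=.5,
                              random_state=0)
labels_u = affprop.fit_predict(sub)    # cluster only the unique rows
labels = labels_u[inverse.ravel()]     # broadcast labels back to all records
print(labels)                          # rows 0-2 share a label, row 3 differs
```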