33
votes

I need help selecting or creating a clustering algorithm according to certain criteria.

Imagine you are managing newspaper delivery persons.

  • You have a set of street addresses, each of which is geocoded.
  • You want to cluster the addresses so that each cluster is assigned to a delivery person.
  • The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
  • Each cluster should have about the same number of addresses. However, a cluster may have less addresses if a cluster's addresses are more spread out. (Worded another way: minimum number of clusters where each cluster contains a maximum number of addresses, and any address within cluster must be separated by a maximum distance.)
  • For bonus points, when the data set is altered (address added or removed), and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (ie. this rules out simple k-means clustering which is random in nature). Otherwise the delivery persons will go crazy.

So... ideas?

UPDATE

The street network graph, as described in Arachnid's answer, is not available.

17
So are you really trying to equalize delivery time for each cluster (which presumably corresponds to travel time)?ctd
I was thinking homework until the "crazy" line. That made it smell like "overworked coder" :)alphadogg
@alphadogg which is the crazy line?carrier
@carrier: yeah, the last one. Teachers would not be concerned with hypothetical delivery persons... :)alphadogg
@Alphadog Dunno about your teachers but mine would have been (Esp. as extra credit)... Then again mine were a tad saddistic...Omar Kooheji

17 Answers

10
votes

I think you want a hierarchical agglomeration technique rather than k-means. If you get your algorithm right you can stop it when you have the right number of clusters. As someone else mentioned you can seed subsequent clusterings with previous solutions which may give you a siginificant performance improvement.

You may want to look closely at the distance function you use, especially if your problem has high dimension. Euclidean distance is the easiest to understand but may not be the best, look at alternatives such as Mahalanobis.

I'm presuming that your real problem has nothing to do with delivering newspapers...

22
votes

I've written an inefficient but simple algorithm in Java to see how close I could get to doing some basic clustering on a set of points, more or less as described in the question.

The algorithm works on a list if (x,y) coords ps that are specified as ints. It takes three other parameters as well:

  1. radius (r): given a point, what is the radius for scanning for nearby points
  2. max addresses (maxA): what are the maximum number of addresses (points) per cluster?
  3. min addresses (minA): minimum addresses per cluster

Set limitA=maxA. Main iteration: Initialize empty list possibleSolutions. Outer iteration: for every point p in ps. Initialize empty list pclusters. A worklist of points wps=copy(ps) is defined. Workpoint wp=p. Inner iteration: while wps is not empty. Remove the point wp in wps. Determine all the points wpsInRadius in wps that are at a distance < r from wp. Sort wpsInRadius ascendingly according to the distance from wp. Keep the first min(limitA, sizeOf(wpsInRadius)) points in wpsInRadius. These points form a new cluster (list of points) pcluster. Add pcluster to pclusters. Remove points in pcluster from wps. If wps is not empty, wp=wps[0] and continue inner iteration. End inner iteration. A list of clusters pclusters is obtained. Add this to possibleSolutions. End outer iteration.

We have for each p in ps a list of clusters pclusters in possibleSolutions. Every pclusters is then weighted. If avgPC is the average number of points per cluster in possibleSolutions (global) and avgCSize is the average number of clusters per pclusters (global), then this is the function that uses both these variables to determine the weight:

  private static WeightedPClusters weigh(List<Cluster> pclusters, double avgPC, double avgCSize)
  {
    double weight = 0;
    for (Cluster cluster : pclusters)
    {
      int ps = cluster.getPoints().size();
      double psAvgPC = ps - avgPC;
      weight += psAvgPC * psAvgPC / avgCSize;
      weight += cluster.getSurface() / ps;
    }
    return new WeightedPClusters(pclusters, weight);
  }

The best solution is now the pclusters with the least weight. We repeat the main iteration as long as we can find a better solution (less weight) than the previous best one with limitA=max(minA,(int)avgPC). End main iteration.

Note that for the same input data this algorithm will always produce the same results. Lists are used to preserve order and there is no random involved.

To see how this algorithm behaves, this is an image of the result on a test pattern of 32 points. If maxA=minA=16, then we find 2 clusters of 16 addresses.

alt text
(source: paperboyalgorithm at sites.google.com)

Next, if we decrease the minimum number of addresses per cluster by setting minA=12, we find 3 clusters of 12/12/8 points.

alt text
(source: paperboyalgorithm at sites.google.com)

And to demonstrate that the algorithm is far from perfect, here is the output with maxA=7, yet we get 6 clusters, some of them small. So you still have to guess too much when determining the parameters. Note that r here is only 5.

alt text
(source: paperboyalgorithm at sites.google.com)

Just out of curiosity, I tried the algorithm on a larger set of randomly chosen points. I added the images below.

Conclusion? This took me half a day, it is inefficient, the code looks ugly, and it is relatively slow. But it shows that it is possible to produce some result in a short period of time. Of course, this was just for fun; turning this into something that is actually useful is the hard part.

alt text
(source: paperboyalgorithm at sites.google.com)

alt text
(source: paperboyalgorithm at sites.google.com)

10
votes

What you are describing is a (Multi)-Vehicle-Routing-Problem (VRP). There's quite a lot of academic literature on different variants of this problem, using a large variety of techniques (heuristics, off-the-shelf solvers etc.). Usually the authors try to find good or optimal solutions for a concrete instance, which then also implies a clustering of the sites (all sites on the route of one vehicle).

However, the clusters may be subject to major changes with only slightly different instances, which is what you want to avoid. Still, something in the VRP-Papers may inspire you...

If you decide to stick with the explicit clustering step, don't forget to include your distribution in all clusters, as it is part of each route.

For evaluating the clusters using a graph representation of the street grid will probably yield more realistic results than connecting the dots on a white map (although both are TSP-variants). If a graph model is not available, you can use the taxicab-metric (|x_1 - x_2| + |y_1 - y_2|) as an approximation for the distances.

6
votes

Have you thought about using an economic/market based solution? Divide the set up by an arbitrary (but constant to avoid randomness effects) split into even subsets (as determined by the number of delivery persons).

Assign a cost function to each point by how much it adds to the graph, and give each extra point an economic value.

Iterate allowing each person in turn to auction their worst point, and give each person a maximum budget.

This probably matches fairly well how the delivery people would think in real life, as people will find swaps, or will say "my life would be so much easier if I didn't do this one or two. It is also pretty flexible (for example, would allow one point miles away from any others to be given a premium fairly easily).

4
votes

I would approach it differently: Considering the street network as a graph, with an edge for each side of each street, find a partitioning of the graph into n segments, each no more than a given length, such that each paperboy can ride a single continuous path from the start to the end of their route. This way, you avoid giving people routes that require them to ride the same segments repeatedly (eg, when asked to cover both sides of a street without covering all the surrounding streets).

4
votes

This is a very quick and dirty method of discovering where your "clusters" lie. This was inspired by the game "Minesweeper."

Divide your entire delivery space up into a grid of squares. Note - it will take some tweaking of the size of the grid before this will work nicely. My intuition tells me that a square size roughly the size of a physical neighbourhood block will be a good starting point.

Loop through each square and store the number of delivery locations (houses) within each block. Use a second loop (or some clever method on the first pass) to store the number of delivery points for each neighbouring block.

Now you can operate on this grid in a similar way to photo manipulation software. You can detect the edges of clusters by finding blocks where some neighbouring blocks have no delivery points in them.

Finally you need a system that combines number of deliveries made as well as total distance travelled to create and assign routes. There may be some isolated clusters with just a few deliveries to be made, and one or two super clusters with many homes very close to each other, requiring multiple delivery people in the same cluster. Every home must be visited, so that is your first constraint.

Derive a maximum allowable distance to be travelled by any one delivery person on a single run. Next do the same for the number of deliveries made per person.

The first ever run of the routing algorithm would assign a single delivery person, send them to any random cluster with not all deliveries completed, let them deliver until they hit their delivery limit or they have delivered to all the homes in the cluster. If they have hit the delivery limit, end the route by sending them back to home base. If they could safely travel to the nearest cluster and then home without hitting their max travel distance, do so and repeat as above.

Once the route is finished for the current delivery person, check if there are homes that have not yet had a delivery. If so, assign another delivery person, and repeat the above algorithm.

This will generate initial routes. I would store all the info - the location and dimensions of each square, the number of homes within a square and all of its direct neighbours, the cluster to which each square belongs, the delivery people and their routes - I would store all of these in a database.

I'll leave the recalc procedure up to you - but having all the current routes, clusters, etc in a database will enable you to keep all historic routes, and also try various scenarios to see how to best to adapt to changes creating the least possible changes to existing routes.

3
votes

This is a classic example of a problem that deserves an optimized solution rather than trying to solve for "The OPTIMUM". It's similar in some ways to the "Travelling Salesman Problem", but you also need to segment the locations during the optimization.

I've used three different optimization algorithms to good effect on problems like this:

  1. Simulated Annealing
  2. Great Deluge Algorithm
  3. Genetic Algoritms

Using an optimization algorithm, I think you've described the following "goals":

  1. The geographic area for each paper boy should be minimized.
  2. The number of subscribers served by each should be approximately equal.
  3. The distance travelled by each should be about equal.
  4. (And one you didn't state, but might matter) The route should end where it began.

Hope this gets you started!

* Edit *

If you don't care about the routes themselves, that eliminates goals 3 and 4 above, and perhaps allows the problem to be more tailored to your bonus requirements.

If you take demographic information into account (such as population density, subscription adoption rate and subscription cancellation rate) you could probably use the optimization techniques above to eliminate the need to rerun the algorithm at all as subscribers adopted or dropped your service. Once the clusters were optimized, they would stay in balance because the rates of each for an individual cluster matched the rates for the other clusters.

The only time you'd have to rerun the algorithm was when and external factor (such as a recession/depression) caused changes in the behavior of a demographic group.

2
votes

Rather than a clustering model, I think you really want some variant of the Set Covering location model, with an additional constraint to cover the number of addresses covered by each facility. I can't really find a good explanation of it online. You can take a look at this page, but they're solving it using areal units and you probably want to solve it in either euclidean or network space. If you're willing to dig up something in dead tree format, check out chapter 4 of Network and Discrete Location by Daskin.

2
votes

Good survey of simple clustering algos. There is more though: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html

2
votes

Perhaps a minimum spanning tree of the customers, broken into set based on locality to the paper boy. Prims or Kruskal to get the MST with the distance between houses for the weight.

2
votes

I know of a pretty novel approach to this problem that I have seen applied to Bioinformatics, though it is valid for any sort of clustering problem. It's certainly not the simplest solution but one that I think is very interesting. The basic premise is that clustering involves multiple objectives. For one you want to minimise the number of clusters, the trival solution being a single cluster with all the data. The second standard objective is to minimise the amount of variance within a cluster, the trivial solution being many clusters each with only a single data point. The interesting solutions come about when you try to include both of these objectives and optimise the trade-off.

At the core of the proposed approach is something called a memetic algorithm that is a little like a genetic algorithm, which steve mentioned, however it not only explores the solution space well but also has the ability to focus in on interesting regions, i.e. solutions. At the very least I recommend reading some of the papers on this subject as memetic algorithms are an unusual approach, though a word of warning; it may lead you to read The Selfish Gene and I still haven't decided whether that was a good thing... If algorithms don't interest you then maybe you can just try and express your problem as the format requires and use the source code provided. Related papers and code can be found here: Multi Objective Clustering

2
votes

This is not directly related to the problem, but something I've heard and which should be considered if this is truly a route-planning problem you have. This would affect the ordering of the addresses within the set assigned to each driver.

UPS has software which generates optimum routes for their delivery people to follow. The software tries to maximize the number of right turns that are taken during the route. This saves them a lot of time on deliveries.

For people that don't live in the USA the reason for doing this may not be immediately obvious. In the US people drive on the right side of the road, so when making a right turn you don't have to wait for oncoming traffic if the light is green. Also, in the US, when turning right at a red light you (usually) don't have to wait for green before you can go. If you're always turning right then you never have to wait for lights.

There's an article about it here: http://abcnews.go.com/wnt/story?id=3005890

1
votes

You can have K means or expected maximization remain as unchanged as possible by using the previous cluster as a clustering feature. Getting each cluster to have the same amount of items seems bit trickier. I can think of how to do it as a post clustering step by doing k means and then shuffling some points until things balance but that doesn't seem very efficient.

1
votes

A trivial answer which does not get any bonus points:

One delivery person for each address.

1
votes
  • You have a set of street addresses, each of which is geocoded.
    • You want to cluster the addresses so that each cluster is assigned to a delivery person.
    • The number of delivery persons, or clusters, is not fixed. If needed, I can always hire more delivery persons, or lay them off.
    • Each cluster should have about the same number of addresses. However, a cluster may have less addresses if a cluster's addresses are more spread out. (Worded another way: minimum number of clusters where each cluster contains a maximum number of addresses, and any address within cluster must be separated by a maximum distance.)
    • For bonus points, when the data set is altered (address added or removed), and the algorithm is re-run, it would be nice if the clusters remained as unchanged as possible (ie. this rules out simple k-means clustering which is random in nature). Otherwise the delivery persons will go crazy.

As has been mentioned a Vehicle Routing Problem is probably better suited... Although strictly isn't designed with clustering in mind, it will optimize to assign based on the nearest addresses. Therefore you're clusters will actually be the recommended routes.

If you provide a maximum number of deliverers then and try to reach the optimal solution this should tell you the min that you require. This deals with point 2.

The same number of addresses can be obtained by providing a limit on the number of addresses to be visited, basically assigning a stock value (now its a capcitated vehicle routing problem).

Adding time windows or hours that the delivery persons work helps reduce the load if addresses are more spread out (now a capcitated vehicle routing problem with time windows).

If you use a nearest neighbour algorithm then you can get identical results each time, removing a single address shouldn't have too much impact on your final result so should deal with the last point.

I'm actually working on a C# class library to achieve something like this, and think its probably the best route to go down, although not neccesairly easy to impelement.

1
votes

I acknowledge that this will not necessarily provide clusters of roughly equal size:

One of the best current techniques in data clustering is Evidence Accumulation. (Fred and Jain, 2005) What you do is:

Given a data set with n patterns.

  1. Use an algorithm like k-means over a range of k. Or use a set of different algorithms, the goal is to produce an ensemble of partitions.

  2. Create a co-association matrix C of size n x n.

  3. For each partition p in the ensemble:
    3.1 Update the co-association matrix: for each pattern pair (i, j) that belongs to the same cluster in p, set C(i, j) = C(i, j) + 1/N.

  4. Use a clustering algorihm such as Single Link and apply the matrix C as the proximity measure. Single Link gives a dendrogram as result in which we choose the clustering with the longest lifetime.

I'll provide descriptions of SL and k-means if you're interested.

-1
votes

I would use a basic algorithm to create a first set of paperboy routes according to where they live, and current locations of subscribers, then:

when paperboys are:

  • Added: They take locations from one or more paperboys working in the same general area from where the new guy lives.
  • Removed: His locations are given to the other paperboys, using the closest locations to their routes.

when locations are:

  • Added : Same thing, the location is added to the closest route.
  • Removed: just removed from that boy's route.

Once a quarter, you could re-calculate the whole thing and change all the routes.