3
votes

I've found various questions with solutions similar to this problem but nothing quite on the money so far. Very grateful for any help.

I have a mysql (v.5.6.10) database with a single table called POSTS that stores millions upon millions of rows of lat/long points of interest on a map. Each point is classified as one of several different types. Each row is structured as id, type, coords:

  • id: an unsigned bigint and the primary key, auto-incremented for each new row that is inserted.
  • type: an unsigned tinyint used to encode the type of the point of interest.
  • coords: a MySQL geospatial POINT datatype representing the lat/long of the point of interest.

There is a SPATIAL index on 'coords'.

I need to find an efficient way to query the table and return up to X of the most recently-inserted points within a radius ("R") of a specific lat/long position ("Position"). The database is very dynamic so please assume that the data is radically different each time the table is queried.

If X is infinite, the problem is trivial. I just need to execute a query something like:

SELECT id, type, AsText(coords) FROM POSTS WHERE MBRContains(GeomFromText(BoundingBox), coords)

Here 'BoundingBox' is the WKT of a POLYGON that tightly encloses a circle of radius R centred on Position. Using a bounding box is, of course, not a perfect solution, but this is not important for the particular problem that I'm trying to solve. I can order the results using "ORDER BY id DESC" to retrieve and process the most-recently-inserted points first.
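For reference, a minimal sketch of how such a bounding box could be built from Position and R. The function name, the kilometre units, the mean Earth radius and the "longitude latitude" ordering inside the WKT are all assumptions of this sketch; use whatever axis order the coords column actually stores. It sizes the box from the centre latitude, so it is only approximate away from the equator, and it ignores the poles and the antimeridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)

def bounding_box_wkt(lat, lon, radius_km):
    """Return a WKT POLYGON for a square around (lat, lon) that encloses
    a circle of radius_km. Vertices are written as 'lon lat' (x y)."""
    # Angular half-height of the box in degrees of latitude.
    d_lat = math.degrees(radius_km / EARTH_RADIUS_KM)
    # Degrees of longitude shrink with cos(latitude), so widen accordingly.
    d_lon = math.degrees(
        radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat)))
    )
    south, north = lat - d_lat, lat + d_lat
    west, east = lon - d_lon, lon + d_lon
    # A WKT ring must be closed: the first vertex is repeated at the end.
    ring = [(west, south), (east, south), (east, north),
            (west, north), (west, south)]
    coords = ", ".join(f"{x:.6f} {y:.6f}" for x, y in ring)
    return f"POLYGON(({coords}))"
```

The returned string would then be passed as a bound parameter to GeomFromText in the query above.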

If X is finite then I just need to modify the above to:

SELECT id, type, AsText(coords) FROM POSTS WHERE MBRContains(GeomFromText(BoundingBox), coords) ORDER BY id DESC LIMIT X

The problem that I am trying to solve is how to obtain a good representative set of results from a given region on the map when the points in that region are heavily clustered (for example, within cities inside the search region). For example:

[Image: example map of the search region]

In the example above, I am standing at X and searching for the 5 most-recently-inserted points of type black within the black-framed bounding box. If these points were all inserted in the cluster in the bottom right hand corner (let's assume that cluster is London) then my set of results will not include the black point that is near the top right of the search region. This is a problem for my application as I do not want users to be given the impression that there are no points of interest outside any areas where points are clustered.

I have considered a few potential solutions but I can't find one that works efficiently when the number of rows is huge (tens of millions). Approaches that I have tried so far include:

  1. Dividing the search region into S squares (i.e., turning it into a grid) and searching for up to X/S points within each square, i.e., executing a separate mysql query for each square in the grid. This works OK for a small number of rows but becomes inefficient when the number of rows is massive, as you need to divide the region into a large number of squares for the approach to work effectively. With only a small number of squares, you cannot guarantee that each square won't contain a densely populated cluster. A large number of squares means a large number of mysql searches, which causes things to chug.

  2. Adding a column to each row in the table that stores the distance to the nearest neighbour for each point. The nearest neighbour distance for a given point is calculated when the point is inserted into the table. With this structure, I can then order the search results by the nearest neighbour distance column so that any points that are in clusters are returned last. This solution only works when I'm searching for ALL points within the search region. For example, consider the situation in the diagram shown above. If I want to find the 5 most-recently-inserted points of type green, the nearest neighbour distance that is recorded for each point will not be correct. Recalculating these distances for each and every query is going to be far too expensive, even using efficient algorithms like KD trees.
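To make approach 1 concrete, here is a minimal sketch of the grid split, assuming the search region is given by its west/south/east/north edges and is cut into n x n equal cells (so S = n * n). The function names are hypothetical. Each returned cell would need its own bounding-box query, which is why the query count grows with the square of n.

```python
def grid_cells(west, south, east, north, n):
    """Split a bounding box into n x n equal cells.

    Returns a list of (west, south, east, north) tuples, row by row
    starting from the south-west corner."""
    dx = (east - west) / n
    dy = (north - south) / n
    cells = []
    for row in range(n):
        for col in range(n):
            cells.append((
                west + col * dx,
                south + row * dy,
                west + (col + 1) * dx,
                south + (row + 1) * dy,
            ))
    return cells

def per_cell_limit(x, n):
    """LIMIT to use per cell: X / S rounded up, so that the cells can
    supply at least X rows between them."""
    return -(-x // (n * n))  # ceiling division on integers
```

For example, a 10 x 10 grid already means 100 queries per search, each with its own small LIMIT.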

In fact, I can't see any approach that requires pre-processing of data in table rows (or, put another way, 'touching' every point in the relevant search region dataset) to be viable when the number of rows gets large. I have considered algorithms like k-means / DBSCAN, etc. and I can't find anything that will work with sufficient efficiency given the use case explained above.

Any pearls? My intuition tells me this CAN be solved but I'm stumped so far.

1
If you're supposed to only display 5 places, why do you feel the need to imply the existence of the top right one? What about using a more obvious color and cluster icon for the 5 first cases and cluster them first, then cluster the others after. I've done this approach in Google Maps. - Robin Castlin

1 Answer

1
votes

Post-processing seems more effective in this case. Fetch the last X points of the given type and check them for clustering, for example too many points too close together relative to the distance from your point of view. Drop the oldest of the clustered points (or those that are very close together, since your data may be referencing the same POI); how many to drop is up to you. Then fetch the next X points and see whether any of them fall outside the cluster. Alternatively, calculate a value for each point based on remoteness and recentness, and discard points according to that value.
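One way this post-processing could look is sketched below. It is only an illustration under stated assumptions: rows arrive newest first (ORDER BY id DESC) as (id, x, y) tuples, distance is plain planar distance in coordinate units, and min_sep is a hypothetical threshold you would derive from the search radius (say, a fraction of R). A point is kept only if it is far enough from every point already kept, so a dense cluster contributes just its newest member.

```python
import math

def thin_clusters(points, limit, min_sep):
    """Greedy thinning of clustered points.

    points  -- iterable of (id, x, y), newest first
    limit   -- maximum number of points to return (X)
    min_sep -- minimum allowed distance between two kept points
    """
    kept = []
    if limit <= 0:
        return kept
    for pid, x, y in points:
        # Keep the point only if no already-kept point is too close to it.
        far_enough = all(
            math.hypot(x - kx, y - ky) >= min_sep for _, kx, ky in kept
        )
        if far_enough:
            kept.append((pid, x, y))
            if len(kept) == limit:
                break  # stop early; no need to read further rows
    return kept
```

Because the function stops as soon as it has enough points, points can be a generator that lazily fetches the next batch of X rows (for example with an extra "id < last seen id" condition) only when the previous batch did not yield enough survivors.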