Combining a BTREE and Geographical / SPATIAL index to speed up my query?

Question

I have a mysql myisam table with 500,000 rows. In this table I have information of different type of places and latitude & longitude coordinates. Depending on the user I would like to select within a certain distance from a point defined by latitude and longitude a certain type of places.

I have a spatial index and a multi column index on latitude, longitude, type. Those indexes on itsel work well if the number of rows within a certain area is not too big.

The problem is that in some cases I need to use a very large radius from a certain point (defined by latitude, longitude coordinates) because there are very few places of the type required. However the problem is that when I search for a certain type, say "x" mysql searches for around 20,000 rows as my radius is large, say "200 km". However in the real world there are only 5 places with type "x" within 200 km from a certain point.

I read somewhere that BTREE and SPATIAL indexes cannot be combined. However I want to work towards a solution where I am able to select those 5 places very quickly based on the input of latitude, longitude and type.

I tried the below 2 approaches:

APPROACH 1 - spatial index:

SELECT * FROM destinations 
WHERE MBRWithin(lat_lng_point, GeomFromText('Polygon((49.8413216059 12.8478000082, 48.0426783941 12.8478000082, 48.0426783941 15.5861999918, 49.8413216059 15.5861999918, 49.8413216059 12.8478000082))')) 
AND destinations.type = 'x'

APPROACH 2 - multi column index on latitude, longitude, type:

SELECT * FROM destinations FORCE INDEX (lat_long_type_main)
WHERE latitude > 49.7786783941 AND latitude < 51.5773216059 
AND longitude > 10.0927907742 AND longitude < 12.9312092258 
AND type = 'x'

Approach 1 is still much faster than approach 2, as they take 2 to 5 seconds respectively. Also the number of rows that is scanned (by using explain) is larger with the second approach than with the first approach.

With approach 1 and approach 2 the number of rows in the explain is exactly the number of rows within the specified region by the geocoordinates, discarding the type. I can understand that for approach 1 the type is not in the index, but not for approach 2 I would not expect a large table scan for type as type is in the index.

If I could make an index that would directly return the 5 points using an index on latitude,longitude and type I expect this query to be much faster.

As I have a number of such queries it is very important to speed them up. I will be very thankfull for your help.

The spatial index will almost always be faster. Are you looking for a specific number or responses? - you seem to hint at 5 but not clarify that is what you want. Additionally, you do not want to use MBR_Within as it actually takes the bounding rectangle of your radius (therefore a square) - you should be able to use ST_Within instead to gain accuracy. — Jon Bellamy
@JonBellamy I'm not looking for a specific number or response, I only want to improve performance. The problem is that I have a WHERE clause that specifies the type and mysql seems to get all types of destinations first and then does a table scan to get only those results where type = x, and that takes a long time as there are many rows with other types. What I want is an index that takes into account geographical coordinates and the type (of a destination). — BastiaanWW
Many wrong assumptions here. You can only make a range query on ONE field in an index. If you have index like (lat, long, type), MySQL will stop using the index already after "lat". The rest of the index is useless. If you index like (type, lat, long). MySQL can first filter on the type and then lat, but till not be able to use the long range. In this case, you can make 2 indexes. One with (type,lat) and one with (type,long) and let MySQL pick the one that hits the least amount of rows. But it will still not be as fast as a SPATIAL index, as those are optimized for ranged queries. — Daniele Testa

O. Jones O. Jones · Accepted Answer · 2014-06-08T22:15:33

If all you need is a bounding-rectangle search the spatial index will yield best performance.

But that's not what you need. You need, I believe, to search for a certain single value in a type column, and a lat/long bounding box range. It is not possible to create a compound index that has a spatial component and also indexes some other column.

Use FLOAT or DOUBLE data for latitude and longitude

Use the FLOAT or DOUBLE data type for the latitude and longitude columns for fastest searches. FLOAT has plenty of precision for GPS-resolution place finder applications. DOUBLE will also work well. Because FLOAT data items take four bytes each and DOUBLE take eight, you'll find that FLOAT is slightly faster to look up. But it's a marginal improvement.

You could use DECIMAL(8,4) or some similar data type for lat/long. But FLOAT is just as good and noticeably faster.

If your lat/long values are in varchar() columns you will either get mistakes in your results or very slow queries, because the range scanning operation won't work correctly.

Use a compound BTREE index

To do this, I believe your best solution is to create a compound BTREE index on (type, latitude, longitude). MySQL will random-access this index using the type value you specify and the lower bound latitude values you want, and then will range-scan the index until it gets to the upper bound latitude.

Explanation of index range scan

Here's an explanation of that. BTREE indexes can be randomly accessed looking for a particular value, or accessed in sequence from any starting point looking for the next value. Here's an example. Suppose you have an index on the column called data, and it contains rows with values

If you specify WHERE data BETWEEN 4 AND 9, MySQL will random-access the index to the first value greater than or equal to 4, then sequentially access it until it gets to the last value less than or equal to 9. This is called a range scan, and looks like this.

 1
 2
 3
 5    <-- random access to here.
 5    <-- scan to here
 6    <-- ... and here
 8    <-- ... and here
 9    <-- ... and here
11    <-- stop scanning right before this row.

This scanning is very fast.

Explanation of compound index range scan

Now let's consider a compound index, as in your question, on type and latitude. This index might have values like these in it.

type  latitude
 a    49.5
 a    49.8
 a    49.9 
 a    52.0
 b    58.3
 x    49.5
 x    49.8   <-- random access to here 
 x    51.2   <-- ... scan to here
 x    51.8   <-- stop scanning right before this row
 y    49.0
 y    49.5

A query that looks like WHERE type='x' AND latitude BETWEEN 49.7 AND 51.5 can use this same range scan trick. It looks for the first row to capture, then scans to the last row. The order of columns in the compound index matters because the sequential ordering is on the concatenation of the column values.

Looking up lat/long locations of a single type

You can use the second query in your question, or some variant on it, to exploit the index I'm suggesting.

SELECT * 
  FROM destinations
 WHERE latitude  BETWEEN 49.7786783941 AND 51.5773216059 
   AND longitude BETWEEN 10.0927907742 AND 12.9312092258 
   AND type = 'x'

I am not sure whether you're better off with longitude included in the index. That's worth an experiment.

Improve performance by avoiding SELECT *

Pro tip: Avoid SELECT * in queries like this. If you enumerate the fields you need from the query, you may be able to create a covering index that can satisfy the query directly. That will be very fast. For example, if your query is

SELECT airport_code, name, latitude, longitude
  FROM destinations
 WHERE latitude  BETWEEN 49.7786783941 AND 51.5773216059 
   AND longitude BETWEEN 10.0927907742 AND 12.9312092258 
   AND type = 'x'

Then your query can be satisfied directly by a range scan on this compound BTREE index.

(type, latitude, longitude, airport_code, name)

Note: you don't have to do anything special to create a BTREE index. That's the default.

Don't overstate your lat/lon precision

Pro tip: you may be deceiving yourself by giving coordinates with precision such as 51.5773216059. That's an apparent precision of about 11 micrometers. GPS is only good to about 5 meters, and the not-quite-spherical shape of the earth causes simple lat-long based distance computations to break down at that same level.

Edit I just ran an experiment with my zipcode test data, and creating the compound index helps a great deal.