2
votes

I use Python:

I have 2 arrays of GPS points - lon and lat (more than 500 000 points).

I have 1 array of date-time.

lon = numpy.array(lon)
lat = numpy.array(lat)
dt = numpy.array(dt)

I have a location error (GPS sensor error). For example 15 meters.

GPS_sensor_error = 0.015    

I need to exclude a GPS_sensor_error from coordinates that there were no asterisks on a track.

enter image description here

(I don't draw a point with identical coordinates)

enter image description here

How I am able to do it?

Now:

  1. I calculate distance between points.

  2. I find the minimum distance, if it less GPS_sensor_error, then I average lon, lat.

  3. repeat 1.

  4. repeat 2.

  5. repeat until all distances won't be more GPS_sensor_error

Update:

     lon = numpy.array()
     lat = numpy.array()

     flag = True
     while flag:
        lon1 = lon[:-1]
        lon2 = lon[1:]
        lat1 = lat[:-1]
        lat2 = lat[1:]

        '''distance'''
        x = (lon2 - lon1)
        y = (lat2 - lat1)
        d = numpy.sqrt(x * x + y * y)

        min = numpy.min(d)
        if min < GPS_sensor_error:
            j = numpy.where(d == min)[0][0]

            lon[j] = (lon[j] + lon[j + 1]) / 2
            lat[j] = (lat[j] + lat[j + 1]) / 2

            lon = numpy.delete(lon, j + 1)
            lat = numpy.delete(lat, j + 1)

        else:
            flag = False

Bypass on all points works at a pure python very long... Prompt please, how to implement it using scipy, numpy?


Thanks

P.s. probably already there is a GPS filter in scipy, numpy?

2
I think I understand in principle what you want to achieve, however, details are unclear from your question. Could you give a simple example (3 or 4 points), that shows what you mean by distance, what means average lon,lat exactly in this context?Theodros Zelleke
Why don't you include your slow Python code? The way in which you want to conduct the averaging is not very clear from your deescription.Jaime
I added the slow Python code to the description.Olga
@Olga, in your slow algorithm lon and lat are shrinking with each iteration: lon = numpy.delete(lon, j + 1); lat = numpy.delete(lat, j + 1). Now in a comment on @spam_eggs answer you explicitely state, that the sizes of lon and lat should be preserved????Theodros Zelleke
As I said in my answer: Dont use pythagoras for the distance. It gets terribly inaccurate at higher and lower latitudes.RickyA

2 Answers

5
votes

From a data science perspective what you are doing is not correct. You cant just use the average error distance as a cutoff and think your data will be more correct. The two points you are comparing can have an error more or less than 15 m they can shift toward each other or move away of each other. And if you don't have another exact dataset there is no way of telling what would be the correct point. You can't make this dataset more precise.

However I think you goal is to simplify your dataset, not to make it more accurate. For that you can use the Douglas–Peucker algorithm. I would suggest you load your data in an Postgis enabled database (Postgresql + postgis) and then use the simplify function. This will require some db setup time, but then it will speed you up greatly. However if you want it in pure python this SO question has a very nice snippet.

BTW. If your are doing distance calculations with lat,lon do not use Pythagoras. It is not valid since lat,lon is not Euclidean. Use the haversine algorithm.

2
votes

You can easily do all your calculations using numpy primitives only and no python looping.

First define your distance function as a function that operates on numpy arrays (I assume you did that already ..):

def dist(lon1, lat1, lon2, lat2):
    """Compute the distance between (lon1, lat1) and (lon2, lat2). 
       Both may be numpy arrays."""
    ...

Then apply it to your data as so:

d = dist(lon[:-1], lat[:-1], lon[1:], lat[1:])

This notation means you will compare the ith point to the i+1th.

Next find the indices where d is greater than your threshold:

I = d > GPS_sensor_error

Now keep only those and the first point!

lon_out = numpy.hstack([[lon[0]], lon[1:][I]]) # could also use numpy.where
lat_out = numpy.hstack([[lat[0]], lat[1:][I]])

Update:

If you want to keep the same number of points, ie set lon[i] to the last good value, use the following trick instead of the previous two lines:

idx, = numpy.where(I)
idx = numpy.hstack([[0], idx])
J = numpy.cumsum(I) # the trick
lon_out = lon[idx[J]]
lat_out = lat[idx[J]]