TL;DR: I have a water-flow time series that needs cleaning, and I can't figure out a way to remove outlier peaks.
I'm currently working on a project where I receive a .csv dataset containing two columns:
- date, a datetime timestamp
- value, a water flow reading
This dataset is usually one year of measurements from a water flow sensor belonging to a management entity with automatic irrigation systems, containing around 402,000 raw values. Sometimes it has peaks that don't correspond to any watering period: a single spurious value sitting between normal values, like in the image.
So far I've tried calculating the percentage difference between consecutive points (and their spacing) and computing the median absolute deviation (MAD), but both approaches catch false positives.
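For reference, this is roughly what my MAD attempt looks like (a minimal sketch; the function name and the 3.5 cutoff on the modified z-score are just the common defaults, not anything special from my repo):

```python
import numpy as np

def mad_outliers(values, threshold=3.5):
    """Flag points whose modified z-score (based on the median
    absolute deviation) exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # flat signal: nothing can be flagged robustly
        return np.zeros(len(values), dtype=bool)
    # 0.6745 rescales the MAD to be comparable to a standard deviation
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold

flow = [10, 11, 10, 12, 11, 95, 10, 11]  # 95 is a spurious spike
print(mad_outliers(flow))
```

This works on a toy series, but on the real data the global median/MAD is dominated by the seasonal level, which is where the false positives come from.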
The core issue is that I need an algorithm that identifies a spontaneous peak lasting only 1 or 2 measurements, because it's physically impossible to have a 300% increase in flow for just 2 minutes.
The other issue is on the coding side: the detection needs to be dynamic, because looking at the whole dataset the reason is clear: in the summer the flow more than doubles, which makes a fixed .95 percentile cutoff unusable.
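One direction I've been considering (sketch only; window size, threshold, and the epsilon floor on the local MAD are assumptions, not tuned values) is to combine a centered rolling median/MAD, which adapts to the seasonal level, with a run-length check that keeps only deviations lasting at most 2 samples:

```python
import numpy as np
import pandas as pd

def short_spike_mask(series, window=31, threshold=5.0, max_len=2):
    """Flag runs of at most `max_len` consecutive points that deviate
    strongly from a centered rolling median (robust to seasonal level)."""
    med = series.rolling(window, center=True, min_periods=1).median()
    abs_dev = (series - med).abs()
    # local MAD; the 1e-9 floor avoids division by zero on flat stretches
    mad = abs_dev.rolling(window, center=True, min_periods=1).median()
    score = abs_dev / (mad + 1e-9)
    candidate = score > threshold
    # label consecutive runs of equal values, then measure each run's length
    run_id = (candidate != candidate.shift()).cumsum()
    run_len = candidate.groupby(run_id).transform("size")
    # a real seasonal rise deviates for many samples; a glitch for 1-2
    return candidate & (run_len <= max_len)

# toy example: steady flow with one 2-sample spurious spike
flow = pd.Series([10.0] * 20 + [80.0, 85.0] + [10.0] * 20)
mask = short_spike_mask(flow)
print(flow[mask])  # only the two spike samples
```

The run-length filter is what encodes the physical constraint: a genuine summer increase stays elevated for far more than 2 samples, so it survives even though it deviates from the winter level.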
I've prepared a GitHub repo with the techniques stated above and 1 day of the dataset, the one I'm currently working with (around 1,000 values).