TL;DR: I have a water-flow time series that needs cleaning, and I can't figure out a way to remove outlier peaks.
I'm currently working on a project where I receive a .csv dataset containing two columns:
- date, a datetime timestamp
- value, a water flow reading
This dataset is usually one year of measurements from a water flow sensor belonging to a management entity with automatic irrigation systems, containing around 402,000 raw values. Sometimes it has peaks that don't correspond to any watering period: a single spurious value sitting between normal values, like in the image.
So far I've tried calculating the percentage difference between consecutive points (and their spacing) and computing the median absolute deviation (MAD), but both approaches catch false positives.
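For reference, this is roughly what my MAD attempt looks like (a minimal sketch; the function name and the 3.5 cutoff on the modified z-score are just the common defaults, not anything special from my repo):

```python
import numpy as np

def mad_outliers(values, threshold=3.5):
    """Flag points whose modified z-score (based on the median
    absolute deviation) exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # flat signal: nothing can be flagged robustly
        return np.zeros(len(values), dtype=bool)
    # 0.6745 rescales the MAD to be comparable to a standard deviation
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold

flow = [10, 11, 10, 12, 11, 95, 10, 11]  # 95 is a spurious spike
print(mad_outliers(flow))
```

This works on a toy series, but on the real data the global median/MAD is dominated by the seasonal level, which is where the false positives come from.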
The core issue is that I need an algorithm that identifies a spontaneous peak lasting only 1 or 2 measurements, because it's physically impossible to have a 300% increase in flow for just 2 minutes.
The other issue is on the coding side: the detection needs to be dynamic, because looking at the whole dataset the reason is clear: in the summer the flow more than doubles, which makes a fixed .95 percentile cutoff unusable.
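One direction I've been considering (sketch only; window size, threshold, and the epsilon floor on the local MAD are assumptions, not tuned values) is to combine a centered rolling median/MAD, which adapts to the seasonal level, with a run-length check that keeps only deviations lasting at most 2 samples:

```python
import numpy as np
import pandas as pd

def short_spike_mask(series, window=31, threshold=5.0, max_len=2):
    """Flag runs of at most `max_len` consecutive points that deviate
    strongly from a centered rolling median (robust to seasonal level)."""
    med = series.rolling(window, center=True, min_periods=1).median()
    abs_dev = (series - med).abs()
    # local MAD; the 1e-9 floor avoids division by zero on flat stretches
    mad = abs_dev.rolling(window, center=True, min_periods=1).median()
    score = abs_dev / (mad + 1e-9)
    candidate = score > threshold
    # label consecutive runs of equal values, then measure each run's length
    run_id = (candidate != candidate.shift()).cumsum()
    run_len = candidate.groupby(run_id).transform("size")
    # a real seasonal rise deviates for many samples; a glitch for 1-2
    return candidate & (run_len <= max_len)

# toy example: steady flow with one 2-sample spurious spike
flow = pd.Series([10.0] * 20 + [80.0, 85.0] + [10.0] * 20)
mask = short_spike_mask(flow)
print(flow[mask])  # only the two spike samples
```

The run-length filter is what encodes the physical constraint: a genuine summer increase stays elevated for far more than 2 samples, so it survives even though it deviates from the winter level.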
I've prepared a GitHub repo with the techniques stated above and 1 day of the dataset, the one I'm currently working with (around 1,000 values).