
I'm working with a dataset of measurements paired with datetimes, like:

datetime,value
2017-01-01 00:01:00,32.7
2017-01-01 00:03:00,37.8
2017-01-01 00:04:05,35.0
2017-01-01 00:05:37,101.1
2017-01-01 00:07:00,39.1
2017-01-01 00:09:00,38.9

I'm trying to detect and remove potential peaks that might appear, like the 2017-01-01 00:05:37,101.1 measurement.

Some things I've found so far:

  • This dataset has time spacing that ranges from 15 seconds all the way to 25 minutes, making it extremely uneven;
  • The width of the peaks cannot be determined beforehand;
  • The height of the peaks clearly and significantly deviates from the other values;
  • Normalization of the time step should only happen after the removal of the outliers, since they would interfere with the results;
  • It's "impossible" to make the spacing even because of other anomalies (e.g., negative values, flat lines), and even without those, resampling would create wrong values because of the peaks;
  • find_peaks expects an evenly spaced time series, so that earlier solution didn't work for the irregular time series we have (a short sketch below illustrates this);
    • In that question I forgot to mention the critical point: the time series is unevenly spaced.
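
A minimal sketch of that last point (my illustration, using the sample rows above): scipy.signal.find_peaks looks only at the sequence of values, so the uneven time gaps are invisible to it:

import numpy as np
from scipy.signal import find_peaks

values = np.array([32.7, 37.8, 35.0, 101.1, 39.1, 38.9])
peaks, _ = find_peaks(values)
print(peaks)  # [1 3] -- index positions only; the timestamps never enter the computation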

I've searched everywhere and couldn't find anything. The implementation is going to be in Python, but I'm willing to dig through other languages to get the logic.

You need to define what makes a reading an outlier. That said, I don't see how the unevenness is relevant (let alone critical). – user58697

By creating a rolling window? In water-flow time series a peak is defined as an abnormal value among 3 consecutive measures, but those 3 measures need to happen within less than, say, 5 minutes, because it's physically impossible to have a flow of 25 m^3 in one minute and then 110 m^3 the very next minute. [...] – MigasTigas

[...] Sadly the sensors don't time the measurements reliably: the gap can be 50 seconds or go all the way to 25 minutes, as stated. If a rolling window holds 6 measures but the timings are like [56, 62, 64, 353, 64, 67] seconds, and a peak sits in the 4th position, those 5 lost minutes could hide something else that justifies the high value. – MigasTigas

Ah. These tiny details make all the difference. If I now understand you correctly, you have a priori knowledge of how fast the measured value may change. I would start with something along the lines of if ((flow[i+1] - flow[i]) / (time[i+1] - time[i]) > threshold). – user58697

This is something only you (as the one who possesses the domain knowledge) may answer. – user58697

1 Answer


I've posted this code on GitHub for anyone who runs into this problem, or a similar one, in the future.

After a lot of trial and error, I think I've created something that works. Using what @user58697 told me, I managed to write code that detects every peak that crosses a threshold.

Using the logic he/she explained, if ((flow[i+1] - flow[i]) / (time[i+1] - time[i]) > threshold, I wrote the following:

I started by reading the .csv and parsing the dates, then split the data into two numpy arrays:

import datetime

import numpy as np
import pandas as pd

dataset = pd.read_csv('https://raw.githubusercontent.com/MigasTigas/peak_removal/master/dataset_simple_example.csv', parse_dates=['date'])

dataset = dataset.sort_values(by=['date']).reset_index(drop=True).to_numpy()  # Sort by date and convert to a numpy array

# Split into 2 arrays
values = [float(i[1]) for i in dataset]  # Flow values, as floats
values = np.array(values)

dates = [i[0].to_pydatetime() for i in dataset]  # Timestamps as datetime objects
dates = np.array(dates)

Then I applied (flow[i+1] - flow[i]) / (time[i+1] - time[i]) to the whole dataset:

flow = np.diff(values)  # flow[i+1] - flow[i]
time = np.diff(dates)   # Gaps between timestamps, as datetime.timedelta
time = np.array([td.total_seconds() for td in time])  # Gaps in seconds

slopes = np.divide(flow, time)  # (flow[i+1] - flow[i]) / (time[i+1] - time[i])
slopes = np.insert(slopes, 0, 0, axis=0)  # np.diff drops one index, so pad with a leading 0 to keep alignment with values
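
As a sanity check of the slope test on the sample rows from the question (my own arithmetic, using the 0.3 threshold chosen further down):

rise = (101.1 - 35.0) / 92   # 00:04:05 -> 00:05:37 is 92 seconds; ~0.72
fall = (39.1 - 101.1) / 83   # 00:05:37 -> 00:07:00 is 83 seconds; ~-0.75
# Both exceed the 0.3 threshold in magnitude, so the spike is flagged.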

And finally, to detect the peaks, I reduced the data to rolling windows of x seconds each. That way we can detect them easily:

# ROLLING WINDOW
size = len(dataset)
RW = []   # Rolling windows of slope values
RWi = []  # Indexes of the respective values
window_size = 240  # Seconds

# Create the rolling windows
for line in range(size):
    rolling_window = []  # Values of the slopes
    rolling_window_indexes = []  # Indexes of the respective values
    limit_stamp = dates[line] + datetime.timedelta(seconds=window_size)

    for subline in range(line, size):
        if dates[subline] <= limit_stamp:
            rolling_window.append(slopes[subline])
            rolling_window_indexes.append(subline)
        else:
            RW.append(rolling_window)
            RWi.append(rolling_window_indexes)
            break
    else:
        # The inner loop ran off the end of the data without breaking,
        # so this (trailing) window still has to be appended
        RW.append(rolling_window)
        RWi.append(rolling_window_indexes)
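
As a side note (my own variant, not part of the original answer): because dates is sorted, the inner scan can be replaced with a binary search via np.searchsorted, which builds the same windows without the nested loop:

# The end of each window is the first timestamp past that line's limit
limits = [d + datetime.timedelta(seconds=window_size) for d in dates]
window_ends = np.searchsorted(dates, limits, side='right')

RW = [list(slopes[i:end]) for i, end in enumerate(window_ends)]
RWi = [list(range(i, end)) for i, end in enumerate(window_ends)]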

After getting all the rolling windows, we start the fun:

t = 0.3  # Threshold
peaks = []

for index, rollWin in enumerate(RW):
    if rollWin[0] > t:  # If the first slope exceeds the threshold...
        top = rollWin[0]  # ...treat it as the start of a possible peak
        bottom = np.min(rollWin)  # Steepest falling slope in the window

        if bottom < -t:  # If it drops below the negative threshold
            bottomIndex = int(np.argmin(rollWin))  # Find its index

            # Flag every point from the start of the window up to bottomIndex
            for peak in range(0, bottomIndex):
                peaks.append(RWi[index][peak])
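
Since the goal is to detect and remove the peaks, a short follow-up (my addition, not part of the original answer) drops the flagged indexes from both arrays:

peaks = sorted(set(peaks))  # De-duplicate indexes flagged by overlapping windows
clean_values = np.delete(values, peaks)
clean_dates = np.delete(dates, peaks)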

The idea behind this code is that every peak has a rise and a fall; if both exceed the stated threshold in magnitude, it's an outlier peak, together with all the points between them:

(Plots omitted: the detected peaks on the simple example above, and the same detection translated to the real dataset posted on GitHub.)