I have a data frame with a series of values, some of which are anomalous readings that I want to identify. I would like to add a third column to my data frame marking those readings as "anomaly" and the rest as "normal". On a plot of my data, the odd dips are pretty obvious by eye, but I am having trouble getting R to recognize them because the baseline average changes over time. The best I can come up with is three rules for classifying a reading as an anomaly.

1: Starting with the second value, if a value is within a close range of the previous value, mark it as "N" for normal in the third column, and so on through the rest of the data set.

2: If a value represents a large increase or decrease from the previous value, mark it as "A" for anomaly in the third column.

3: If a value is marked as "A", the following value is marked as "A" as well if it is within a small range of the previous anomalous value. If the following value represents a large increase or decrease from the previous anomalous value, it is marked as "N".

This was the best logic I could come up with, but if you can think of a better idea after looking at the data below, I'm all for it.

So given a dummy data set:

SampleNum<-1:50
Value <- c(1, 2, 2, 2, 23, 22, 2, 3, 2, -23, -23, 4, 4, 5, 5, 25, 24,
           6, 7, 6, 35, 38, 20, 21, 22, -22, 2, 2, 6, 7, 7, 6, 30, 31, 
           6, 6, 6, 5, 22, 22, 4, 5, 4, 5, 30, 39, 18, 18, 19, 18)

DF<-data.frame(SampleNum,Value)

This is how I might see the final data, with a third column identifying which values are anomalous.

SampleNum Value Name
     1     1    N
     2     2    N
     3     2    N    
     4     2    N
     5    23    A
     6    22    A
     7     2    N
     8     3    N
     9     2    N
    10   -23    A
    11   -23    A
    12     4    N
    13     4    N
    14     5    N
    15     5    N
    16    25    A
    17    24    A
    18     6    N
    19     7    N
    20     6    N
    21    35    A
    22    38    A
    23    20    N
    24    21    N
    25    22    N
    26   -22    A
    27     2    N
    28     2    N
    29     6    N
    30     7    N
    31     7    N
    32     6    N
    33    30    A
    34    31    A
    35     6    N
    36     6    N
    37     6    N
    38     5    N
    39    22    A
    40    22    A
    41     4    N
    42     5    N
    43     4    N
    44     5    N
    45    30    A
    46    39    A
    47    18    N
    48    18    N
    49    19    N
    50    18    N
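
One way to sketch the three rules above is as a two-state toggle: each large step away from the previous value flips the label between "N" and "A", and a small step keeps the current label. The threshold `jump <- 10` is an assumption picked by eye for this dummy data (the question does not define "large"), and the sketch also assumes the first reading is normal.

```r
# Dummy data from the question
SampleNum <- 1:50
Value <- c(1, 2, 2, 2, 23, 22, 2, 3, 2, -23, -23, 4, 4, 5, 5, 25, 24,
           6, 7, 6, 35, 38, 20, 21, 22, -22, 2, 2, 6, 7, 7, 6, 30, 31,
           6, 6, 6, 5, 22, 22, 4, 5, 4, 5, 30, 39, 18, 18, 19, 18)
DF <- data.frame(SampleNum, Value)

# Assumed threshold for a "large" step; not given in the question
jump <- 10

# TRUE where a value differs from the previous one by more than `jump`;
# the first value has no predecessor, so it never counts as a step
big_step <- c(FALSE, abs(diff(DF$Value)) > jump)

# Every big step flips the state, so an odd number of flips so far means "A"
DF$Name <- ifelse(cumsum(big_step) %% 2 == 1, "A", "N")
```

With this threshold the labels match the table above for the dummy data. A lasting baseline shift that arrives as a single large step would flip every later label, so the rule only holds while readings return from each excursion with another large step.
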
Going by your rules, row 2 should be an anomaly, since a value of 2 represents (at least) a 50% increase/decrease from a value of 1 (i.e., it's a 100% increase). - jbaums
You are correct. Perhaps % increase or decrease is not the best way to characterize large jumps in the data. The question has been edited to remove references to %. - Vinterwoo

1 Answer

You need to distinguish anomalies from mixtures of different distributions. This is usually NOT a statistical question but rather something that comes from domain-specific knowledge. If you plot the density estimate from your data you get:

png(); plot( density(DF$Value)) ; dev.off()

[Figure: kernel density estimate of DF$Value]

So how are we supposed to know that the three values below zero are not real? They are 6% of your sample, so applying a rule such as "anomalies == items outside the 99% confidence interval" would not define them as "anomalies". Are these activity measurements of some sort, where the instrument should have given a positive value? The much larger bump peaking at 20 is surely not an anomaly by any reasonable definition.
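
As a rough check of that point, the sketch below reads "99% confidence interval" as the empirical 0.5% and 99.5% quantiles of the sample. That reading is an assumption; the rule does not say how the interval would be computed. `Value` is the dummy vector from the question.

```r
# Dummy data from the question
Value <- c(1, 2, 2, 2, 23, 22, 2, 3, 2, -23, -23, 4, 4, 5, 5, 25, 24,
           6, 7, 6, 35, 38, 20, 21, 22, -22, 2, 2, 6, 7, 7, 6, 30, 31,
           6, 6, 6, 5, 22, 22, 4, 5, 4, 5, 30, 39, 18, 18, 19, 18)

# Empirical central 99% range (assumed interpretation of the rule)
limits <- quantile(Value, probs = c(0.005, 0.995))

# Readings that fall outside that range
outside <- Value < limits[[1]] | Value > limits[[2]]

# Is any negative reading flagged by this rule?
any(outside & Value < 0)
```

With only 50 points the lower quantile sits at the most negative reading itself, so no negative value falls below it and the rule does not flag them.
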

You should do some searching on the topic of statistical process control. There are R packages with SPC-oriented functions in them.
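
As a starting point before reaching for a package, here is a minimal base-R sketch of one common SPC tool, a Shewhart individuals chart with limits estimated from the average moving range. The choice of chart and the 3-sigma limits are assumptions for illustration, not something prescribed above; `Value` is the dummy vector from the question.

```r
# Dummy data from the question
Value <- c(1, 2, 2, 2, 23, 22, 2, 3, 2, -23, -23, 4, 4, 5, 5, 25, 24,
           6, 7, 6, 35, 38, 20, 21, 22, -22, 2, 2, 6, 7, 7, 6, 30, 31,
           6, 6, 6, 5, 22, 22, 4, 5, 4, 5, 30, 39, 18, 18, 19, 18)

# Center line: overall mean of the readings
center <- mean(Value)

# Average moving range between consecutive readings
mr_bar <- mean(abs(diff(Value)))

# d2 = 1.128 is the standard constant for moving ranges of size 2
sigma_hat <- mr_bar / 1.128

# Assumed 3-sigma control limits
lcl <- center - 3 * sigma_hat
ucl <- center + 3 * sigma_hat

# Sample numbers of readings outside the control limits
out_of_control <- Value < lcl | Value > ucl
which(out_of_control)
```

On the dummy data these limits catch only the most extreme readings and leave several of the plateaus labeled "A" in the question inside the limits, which fits the point that the definition of an anomaly has to come from domain knowledge.
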