I have written my own implementation of LOF and I'm trying to compare results with the implementations in ELKI and RapidMiner, but all 3 give different results! I'm trying to work out why.
My reference dataset is one-dimensional, 102 real values with many duplicates. I'll try and post it below.
First, the RapidMiner implementation. The LOF scores are wildly different from ELKI and from my results; many come back with a LOF of infinity. Has this implementation been validated as being correct?
My results are similar to ELKI, but I don't get exactly the same LOF values. From a quick scan of the comments in the ELKI source code, I think this may be because of differences in the way the k-neighbourhood is calculated.
In the LOF paper, the MinPts parameter (elsewhere called k) specifies the minimum no. of points to be included in the k-neighbourhood. In the ELKI implementation, I think they are defining the k-neighbourhood as exactly k points rather than all the points within the k-distance or k-distinct distance. Can anyone confirm exactly how ELKI constructs the k-neighbourhood? Also there is a private variable which allows the point itself to be included in its own neighbourhood, but it looks like the default is not to include it.
Does anyone know of a public reference dataset which has the LOF scores attached for validation purposes?
--- more details follow ---
Reference: ELKI source code is here:
http://elki.dbs.ifi.lmu.de/browser/elki/trunk/src/de/lmu/ifi/dbs/elki/algorithm/outlier/lof/LOF.java
RapidMiner source code is here:
Here is my test dataset:
4.32323 5.12595 5.12595 5.12595 5.12595 5.7457 5.7457 5.7457 5.7457 5.7457 5.7457 5.97766 5.97766 6.07352 6.07352 6.12015 6.12015 6.12015 6.44797 6.44797 6.48131 6.48131 6.48131 6.48131 6.48131 6.48131 6.6333 6.6333 6.6333 6.70872 6.70872 6.70872 6.70872 6.70872 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 6.77579 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.03654 7.10361 7.10361 7.10361 7.10361 7.10361 7.10361 7.10361 7.10361 7.15651 7.15651 7.15651 7.15651 7.15651 7.15651 7.15651 7.15651 8.22598 8.22598 8.22598 8.22598 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538 8.5538
For example, I get the following LOF score for the first number (4.32323):
- RapidMiner: infinity (with MinPts lower/upper bounds set to 10,100)
- ELKI: 2.6774 (with k = 10 and distfunction/reachdistfunction set to default)
- My implementation: 1.9531
Some more details on what my implementation is doing:
- MinPts is 10, so I'm finding the 10 distinct neighbours of the point. So the neighbourhood of 4.32323 is actually 48 points, from 5.12595 up to 6.77579.
- That gives me a k-distinct distance of 2.45256
- I'm calculating the reachability distance of the first neighbour as 1.58277
- I'm calculating the LRD of the sample as 1/(99.9103/48)
- Sum of lrd(o)/lrd(p) for all 48 neighbours is 93.748939
- Divide by 48 to get a LOF of 1.9531