4 votes

I am trying to use sklearn's DBSCAN implementation for anomaly detection. It works fine for a small dataset (500 x 6), but it runs into memory issues when I try to use a large dataset (180000 x 24). Is there something I can do to overcome this issue?

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

data = pd.read_csv("dataset.csv")
# Drop non-continuous variables
data.drop(["x1", "x2"], axis=1, inplace=True)
df = data

data = df.to_numpy().astype("float32", copy=False)

# Standardize features to zero mean and unit variance
stscaler = StandardScaler().fit(data)
data = stscaler.transform(data)

print("Dataset size:", df.shape)

dbsc = DBSCAN(eps=3, min_samples=30).fit(data)

labels = dbsc.labels_
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[dbsc.core_sample_indices_] = True

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)

df['Labels'] = labels.tolist()

# print(df.head(10))

# Noise points are labeled -1, so counting them gives the anomaly count
print("Number of anomalies:", (df.Labels == -1).sum())
Unfortunately, the sklearn implementation is worst-case O(n^2) in memory (this is not inherent to DBSCAN but due to sklearn's vectorization; e.g. ELKI uses only O(n) memory). You can either use a low-memory implementation, add more memory, or try a smaller eps. An eps of 3 on standardized data looks much too large! – Anony-Mousse
Okay, let me try different parameters. Thanks for the response. I am hoping there is an efficient Python implementation before I try ELKI or R. – Nira
I changed the parameters to dbsc = DBSCAN(eps = 1, min_samples = 15).fit(data). It takes 10 GB of memory and 25 min, but works fine. Thanks again. – Nira

1 Answer

2 votes

Depending on the type of problem you are tackling, you could play around with this parameter in the DBSCAN constructor:

leaf_size : int, optional (default = 30) Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
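For example, you could benchmark different leaf sizes against your data. The sketch below uses synthetic data as a stand-in for your dataset, and the eps/min_samples values are illustrative, not tuned:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the real dataset (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)).astype("float32")

# leaf_size controls the granularity of the BallTree/KDTree;
# try values above and below the default of 30 and measure
# memory use and runtime on your own data.
dbsc = DBSCAN(eps=1.0, min_samples=15,
              algorithm="ball_tree", leaf_size=10).fit(X)

n_clusters = len(set(dbsc.labels_)) - (1 if -1 in dbsc.labels_ else 0)
print("Estimated number of clusters:", n_clusters)
```

Note that leaf_size mainly affects tree construction and query cost; the dominant memory cost in sklearn's DBSCAN is storing each point's eps-neighborhood, so a smaller eps usually helps more.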

If that does not suit your needs, this question has already been addressed here: you can try to use ELKI's DBSCAN implementation.
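Another memory-saving route, before leaving sklearn, is to precompute a sparse radius-neighbors graph and pass it to DBSCAN with metric="precomputed", so that only distances within eps are ever stored. A minimal sketch on synthetic data (parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import radius_neighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)).astype("float32")
eps = 1.0

# Sparse matrix holding only pairwise distances <= eps,
# instead of materializing full neighborhoods inside DBSCAN.
D = radius_neighbors_graph(X, radius=eps, mode="distance")

dbsc = DBSCAN(eps=eps, min_samples=15, metric="precomputed").fit(D)
print("Noise points:", int((dbsc.labels_ == -1).sum()))
```

The clustering should match a direct DBSCAN fit on X; the benefit is that the neighbor graph can be built in chunks (or with a tighter eps) when the dataset is too large to process at once.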